Storing Binaries Externally (Archived Pattern)

Warning: This article references Sufia, a deprecated application similar to Hyrax. The details of this document should be used as a reference, but not as a direct solution to storing binaries externally. Additionally, there are mentions of Fedora v4, which has reached end-of-life.

Summary: A guide to some custom development in Sufia, related to storing binary content in a filesystem rather than Fedora.

Storing binaries externally

Some large repositories have found it desirable to store binary content in a filesystem instead of in Fedora 4. Reasons might include ease of migration, preservation strategy, performance, and horizontal scalability. Unfortunately, this is not an out-of-the-box feature and requires some custom development. This guide was written during the migration of ScholarSphere, a Sufia 7 repository at Penn State University Library. Code examples can be found in the ScholarSphere GitHub repository.

Goal: In a Sufia 7 / Fedora 4 application, store binary content externally to Fedora. The application should continue to function as usual. This general pattern should be applicable to Hyrax applications as well, but has not been tested there.

Summary of this solution:

  1. Binary content is stored on a filesystem to which the self-deposit application can write, in this case /opt/heracles/binaries. This location is set via an environment variable, REPOSITORY_FILESTORE.

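  2. Files are stored in a pairtree directory structure: the first four character pairs of the id become nested directories, followed by a directory named with the full id. For example, a FileSet with id ht722h861h would be stored on the filesystem at /opt/heracles/binaries/ht/72/2h/86/ht722h861h/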

  3. Within the pairtree, files are stored in BagIt format with a SHA-256 checksum.

  4. Binary content in the expected directory is available via a web server. The example above would be available at https://dce-fedora.vmhost.psu.edu/binaries/ht/72/2h/86/ht722h861h/data/world.png. The address and port of the web server are set via an environment variable, REPOSITORY_FILESTORE_HOST.

  5. External filestore functionality is controlled by an environment variable, REPOSITORY_EXTERNAL_FILES, and only enabled if that variable is set to ‘true’.

  6. Fedora objects use Fedora’s External Content feature. When the Fedora object is created, the URL of the external file is stored in the object's MIME type (Content-Type) field instead of binary content. When the object is retrieved, Fedora delivers a 307 redirect to the file’s URL.

  7. A one-time data migration is required. To store all content (including previous versions) externally, leaving no binary content in Fedora, we loop through all objects, store the binary content of each file version in a local tempfile, delete each file version in Fedora, and re-create each file version with external content.

  8. SHA1 checksums (as calculated by Fedora) are recorded before migration and, after migration, are compared to re-calculated checksums of the files as written to disk. Note that the Fedora checksum service will not work against externally managed files, so once you’ve converted to external binary storage you need another way of tracking fixity.

  9. We do not attempt to migrate objects that have already been migrated.

  10. We rescue any errors that happen in the migration process and add them to a log that can be re-processed.
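
The storage layout in items 1, 2, 4, 5, and 6 can be sketched in a few lines of Ruby. This is a minimal illustration, not the ScholarSphere code: the module and method names are hypothetical, and it assumes that REPOSITORY_FILESTORE_HOST holds a full base URL including the scheme and the /binaries path prefix (the source only says it holds the address and port of the web server). The Content-Type string follows the message/external-body form used by Fedora 4 external content; check it against your Fedora version before relying on it.

```ruby
require "erb"

# Hypothetical helper module; the names are illustrative and are not
# taken from Sufia, Hyrax, or ScholarSphere.
module ExternalFilestore
  # The example path in this guide uses the first four character pairs.
  PAIR_COUNT = 4

  # The feature is on only when REPOSITORY_EXTERNAL_FILES is exactly "true".
  def self.enabled?(env = ENV)
    env["REPOSITORY_EXTERNAL_FILES"] == "true"
  end

  # "ht722h861h" becomes "ht/72/2h/86/ht722h861h".
  def self.pairtree(id)
    pairs = id.scan(/../).first(PAIR_COUNT)
    File.join(*pairs, id)
  end

  # Directory on disk that holds the bag for one FileSet.
  def self.bag_dir(id, env = ENV)
    File.join(env.fetch("REPOSITORY_FILESTORE"), pairtree(id))
  end

  # Public URL of a payload file inside the bag's data/ directory.
  # Assumption: REPOSITORY_FILESTORE_HOST is a base URL such as
  # "https://host/binaries"; adjust if yours holds only host:port.
  def self.url_for(id, filename, env = ENV)
    base = env.fetch("REPOSITORY_FILESTORE_HOST").chomp("/")
    [base, pairtree(id), "data", ERB::Util.url_encode(filename)].join("/")
  end

  # Content-Type value that tells Fedora 4 the binary lives at +url+.
  def self.external_content_type(url)
    %(message/external-body; access-type=URL; URL="#{url}")
  end
end
```

Each method takes an optional env hash so the layout can be checked without touching the real environment, for example ExternalFilestore.bag_dir("ht722h861h", "REPOSITORY_FILESTORE" => "/opt/heracles/binaries").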

How to do it

A writeup of this work is available at https://docs.google.com/document/d/13RXoWPvBfsaKsI-miFXjcbduVQ-SbKANN0yHGBVatB8
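
As a rough illustration of the migration control flow summarized above (write a bag, verify fixity, skip finished objects, log failures), the following Ruby sketch may help. It is not the ScholarSphere migration: all method names and the record hash shape are hypothetical, the bag directory is simplified to root/id rather than a pairtree, and the Fedora calls that delete each version and re-create it with external content are omitted. It also assumes the recorded Fedora digest may carry a urn:sha1: prefix.

```ruby
require "digest"
require "fileutils"
require "json"

# All names below are hypothetical and for illustration only.

# Write one payload file into a minimal BagIt bag with a SHA-256 manifest.
# Returns the path of the payload file on disk.
def write_bag(bag_dir, filename, content)
  data_dir = File.join(bag_dir, "data")
  FileUtils.mkdir_p(data_dir)
  payload = File.join(data_dir, filename)
  File.binwrite(payload, content)
  File.write(File.join(bag_dir, "bagit.txt"),
             "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
  File.write(File.join(bag_dir, "manifest-sha256.txt"),
             "#{Digest::SHA256.file(payload).hexdigest}  data/#{filename}\n")
  payload
end

# Compare the SHA1 recorded before migration with the file as written to disk.
# Assumption: the recorded value may look like "urn:sha1:<hex>".
def fixity_ok?(recorded_sha1, path)
  Digest::SHA1.file(path).hexdigest == recorded_sha1.sub(/\Aurn:sha1:/, "")
end

# Migrate each record, skipping ids already in +migrated+ and writing one
# JSON line per failure to +error_log+ so failures can be re-processed.
# Each record is a hash with :id, :filename, :content, and :sha1 keys.
def migrate_all(records, root:, migrated:, error_log:)
  records.each do |rec|
    next if migrated.include?(rec[:id])

    begin
      path = write_bag(File.join(root, rec[:id]), rec[:filename], rec[:content])
      raise "checksum mismatch" unless fixity_ok?(rec[:sha1], path)

      # A real migration would now replace the Fedora binary with
      # external content pointing at the file's URL.
      migrated << rec[:id]
    rescue StandardError => e
      error_log.puts(JSON.generate(id: rec[:id], error: e.message))
    end
  end
end
```

Writing failures as one JSON object per line keeps the log easy to feed back into a second run, and recording migrated ids separately is what lets the loop be restarted safely.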

Related Links