Big Files Breakout Notes

Friday Jan. 24, 2014, 3:30-5:00

Pain point #1: Storing big files

  • Files that are too big for Fedora 3 are currently stored outside the repository, which is not ideal

  • Fedora4 should accommodate single files up to 1.4TB

  • Some Rock Hall files are bigger; they're currently using GPFS, a distributed file system with an HSM

  • Indiana U also has bigger files; their current solution is HBSS

  • They want to move to Fedora 4 and need either a Fedora 4 connector that can read the SDA file system or a way to remote-mount HBSS onto the server that's running Fedora

Fedora 4 large-file testing: it has been tested with large numbers of moderately big files (10K x 1GB files)

  • Stores 10K x 1GB files with no degradation, but is slow if many large files are stored in a single directory

  • The solution is to create a 2-3-level directory hierarchy that distributes files evenly (see the sketch after this list)

  • Fedora4 has not been tested on multiple huge files
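
A minimal sketch of one way to build such a hierarchy, assuming files are sharded into subdirectories derived from a hash of their identifier (the digest-based layout is just an illustration, not Fedora 4's actual on-disk scheme):

    import hashlib
    from pathlib import Path

    def shard_path(root: str, identifier: str, levels: int = 2, width: int = 2) -> Path:
        """Spread files across a shallow directory tree so that no single
        directory accumulates thousands of entries."""
        digest = hashlib.sha1(identifier.encode("utf-8")).hexdigest()
        parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
        return Path(root, *parts, identifier)

    # e.g. /data/xx/yy/big-video-0001.mov, where xx/yy come from the hash
    print(shard_path("/data", "big-video-0001.mov"))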

Pain point #2: Moving big files around

  • Getting files in: the HTTP transfer limit is 2GB - that is the pipeline limitation

  • Moving large files within the system as we ingest/process them

  • FC4 can chunk large files (the same approach as Amazon Glacier and OpenStack); see the sketch after this list

  • So Fedora 4 will allow us to incorporate large files in Hydra

  • DropBox, BitTorrent, or chunking are options for transferring large files

  • What effect will the loss of net neutrality have on this issue? We don't know - we don't know how providers will throttle bandwidth.
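
A minimal sketch of the chunked-transfer idea on the client side, assuming a hypothetical server endpoint that accepts one PUT per chunk followed by a commit request; the URLs, the requests library usage, and the 100MB chunk size are assumptions for illustration, not Fedora 4's actual API:

    import requests

    CHUNK_SIZE = 100 * 1024 * 1024  # 100MB per request, well under a 2GB HTTP limit

    def upload_in_chunks(path: str, base_url: str) -> None:
        """Send a large file as a sequence of smaller HTTP requests,
        then ask the server to reassemble them."""
        with open(path, "rb") as f:
            index = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                # hypothetical per-chunk endpoint
                requests.put(f"{base_url}/chunks/{index}", data=chunk).raise_for_status()
                index += 1
        # hypothetical commit call telling the server how many chunks to join
        requests.post(f"{base_url}/commit", json={"chunks": index}).raise_for_status()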

Pain point #3: HIPAA compliance in big data

  • Not restricted to big files - can be an issue for small files too

Pain point #4: User expectations

  • We don't understand users' expectations - for Google crawling of large text files, indexing just the metadata is probably reasonable

  • Download/delivery - how long will people wait? We need approaches to managing expectations

  • Put the file size next to download buttons; prompt the user with an estimated download time and an "are you sure?" confirmation

  • Server-side data visualization/proxy usage - give users a faster experience

  • Do as much of the work on the server side (GIS visualization, etc.) rather than making the client side do it

  • Let users drill down, see the manifest of the tar file, and download only a component part (see the sketch after this list)

  • Add a README file and direct users to use curl to access the 100GB+ files - this involves a backend solution, not going through Hydra
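
A minimal sketch of the drill-down idea using Python's standard tarfile module; the archive path and member name are placeholders:

    import tarfile

    ARCHIVE = "dataset.tar"        # placeholder path to a large tar file
    WANTED = "samples/readme.txt"  # placeholder member the user actually wants

    with tarfile.open(ARCHIVE) as tar:
        # Show the user a manifest instead of forcing a full download
        for member in tar.getmembers():
            print(f"{member.size:>12}  {member.name}")

        # Extract only the requested component, not the whole archive
        tar.extract(WANTED, path="extracted")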

Pain point #5: checksumming large files - how long it takes, best practices for scaling up

  • Fedora 4 generates a SHA1 of the path for large files

  • Chronopolis, Tripwire, and other file integrity scanning systems

  • Constant checksumming can contribute to degradation of media

  • Maybe just don't checksum as often? How much paranoia is appropriate for preservation?

  • Scott Turnbull at APTrust has a library that processes MD5 checksums more efficiently

  • Use SHA512, which is significantly faster than SHA256 or SHA1 - about the same as MD5 (see the timing sketch below)
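
A minimal sketch of streaming checksums for large files, with a simple timing loop so a site can check which algorithm is fastest on its own hardware; the file path and 64MB block size are placeholders:

    import hashlib
    import time

    def checksum(path: str, algorithm: str = "sha512", block_size: int = 64 * 1024 * 1024) -> str:
        """Hash a file in fixed-size blocks so memory use stays flat
        even for very large files."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            while block := f.read(block_size):
                h.update(block)
        return h.hexdigest()

    # Time each candidate algorithm on the same file (path is a placeholder)
    for algo in ("md5", "sha1", "sha256", "sha512"):
        start = time.monotonic()
        digest = checksum("big-file.bin", algo)
        print(f"{algo:7s} {time.monotonic() - start:8.1f}s  {digest[:16]}...")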

Pain point #6: Version control of large files

  • Option: store diffs instead of storing versions

  • Stanford has a solution for this

Pain point #7: Preservation of large files

  • Current best practice is triple-layered and assumes many large files will be accessed infrequently - the layers:

  • Preservation copy furthest from active use

  • Fedora-linked copy of full file that users can access if necessary

  • Hydra derivative/proxy copy that serves most use cases

Fedora 4 advantages:

  • FC4 allows you to define a storage policy per datastream, so you could always send files under a certain size to cheap storage (see the sketch after this list)

  • FC4 also keeps the most recently uploaded files in cache (because most users want to confirm that their upload was successful).

  • For these reasons, large files are a really good use case for Hydra on Fedora 4.
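
A minimal sketch of the size-based routing idea as a generic policy function, not Fedora 4's actual configuration mechanism; the tier paths and 1GB threshold are assumptions:

    from pathlib import Path

    # Illustrative tiers only; a real deployment would map these to
    # actual mount points or object stores
    CHEAP_TIER = Path("/storage/cheap")
    FAST_TIER = Path("/storage/fast")
    SIZE_THRESHOLD = 1 * 1024 ** 3  # 1GB cutoff (assumed value)

    def choose_tier(file_path: str) -> Path:
        """Route a file to a storage tier based on its size:
        files under the threshold go to cheap storage."""
        size = Path(file_path).stat().st_size
        return CHEAP_TIER if size < SIZE_THRESHOLD else FAST_TIER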

Action Items:

  • (Mark Bussey) Suggest large files as the driver of Fedora 4 development and testing for Hydra, and arrange for an HSM dev environment to be available to Fedora 4 developers