Big Files Breakout Notes
Friday Jan. 24, 2014, 3:30-5:00
Pain point #1: Storing big files
- Files that are too big for Fedora 3 are currently stored outside the repository, which is not ideal
- Fedora 4 should accommodate single files up to 1.4 TB
- Some Rock Hall files are bigger - they're currently using GPFS, a distributed file system with an HSM
- Indiana U also has bigger files - their current solution is HPSS
- They want to move to Fedora 4, and need either a Fedora 4 connector to read the SDA file system or a way to remote-mount HPSS on the server that's running Fedora
Fedora 4 large-file testing: Fedora 4 has been tested with large numbers of moderately big files (10K x 1 GB files)
- Stores 10K x 1 GB files with no degradation, but is slow if many large files sit in a single directory
- Solution is to create a 2-3-level directory hierarchy that distributes files evenly (see the sketch below)
- Fedora 4 has not been tested on multiple huge files
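A minimal sketch of the kind of hash-based fan-out that keeps any one directory small; the function and layout are illustrative, not Fedora 4 internals:

```python
import hashlib
from pathlib import Path

def distributed_path(root, filename, levels=2, width=2):
    """Spread files across a shallow hash-derived directory tree so that
    no single directory accumulates thousands of entries."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # First `levels` pairs of hex digits become intermediate directories,
    # e.g. "ab/cd/big-video.mxf" for a digest starting "abcd...".
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, filename)

print(distributed_path("/data/binaries", "big-video.mxf"))
```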
Pain point #2: Moving big files around
- Getting files in: the HTTP transfer limit is 2 GB - that is the pipeline limitation
- Moving large files within the system as we ingest/process them
- FC4 can chunk large files (the same approach used by Amazon Glacier and OpenStack) - see the chunking sketch after this list
- So Fedora 4 will allow us to incorporate large files in Hydra
- Dropbox, BitTorrent, or chunking are options for transferring large files
- What effect will the loss of net neutrality have on this issue? We don't know - it depends on how providers throttle bandwidth
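A rough sketch of client-side chunking under the 2 GB-per-request constraint; the part size and the multipart-upload flow in the comments are assumptions, not the actual FC4 interface:

```python
CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB parts keep each HTTP request well under 2 GB

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (part_number, offset, data) for a large file without ever
    holding more than one chunk in memory."""
    with open(path, "rb") as f:
        part = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield part, part * chunk_size, data
            part += 1

# Hypothetical usage: POST each part to a multipart-upload endpoint, then
# send a final "complete" request that stitches the parts back together
# (the same pattern Amazon Glacier and OpenStack use).
for part, offset, data in iter_chunks("/data/big-video.mxf"):
    print(f"part {part}: offset={offset}, size={len(data)}")
```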
Pain point #3: HIPAA compliance in big data
- Not restricted to big files - can be an issue for small files too
Pain point #4: User expectations
- We don't fully understand users' expectations - e.g., for Google crawling of large text files, indexing just the metadata is probably reasonable
- Download/delivery: how long will people wait? Approaches to managing expectations:
- Put the file size next to download buttons; prompt the user with a download-time estimate and an "are you sure?" confirmation
- Server-side data visualization/proxy usage - give users a faster experience
- Do as much of the work on the server side (GIS visualization, etc.) rather than making the client do the work
- Let users drill down, see the manifest of the tar file, and download only a component part (see the tar example after this list)
- Add a README file and direct users to use curl to access the 100GB+ files - a backend solution, not delivered via Hydra
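For the tar-manifest idea, Python's standard tarfile module already supports listing members and pulling out a single one; the archive and member names below are made up:

```python
import tarfile

ARCHIVE = "dataset.tar"  # illustrative name

with tarfile.open(ARCHIVE) as tar:
    # Show the manifest without unpacking the whole archive.
    for member in tar.getmembers():
        print(f"{member.size:>14}  {member.name}")

    # Extract only the component the user actually asked for.
    tar.extract("images/plate-042.tiff", path="downloads")
```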
Pain point #5: checksumming large files - how long it takes, best practices for scaling up
- Fedora 4 generates a SHA-1 of the path for large files
- Chronopolis, Tripwire, and other file-integrity scanning systems
- Constant checksumming can contribute to degradation of media
- Maybe just don't checksum as often? How much paranoia is appropriate for preservation?
- Scott Turnbull at APTrust has a library that processes MD5 checksums more efficiently
- Use SHA-512, which on 64-bit hardware can be significantly faster than SHA-256 or SHA-1 - about the same as MD5 (see the streaming example below)
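A streaming-checksum sketch: hashing in fixed-size blocks keeps memory flat regardless of file size, and swapping the algorithm name lets you benchmark SHA-512 against the alternatives; the paths are illustrative:

```python
import hashlib

def file_digest(path, algorithm="sha512", block_size=8 * 1024 * 1024):
    """Stream a file through the hash in 8 MB blocks so a 1.4 TB file
    needs no more memory than a 1 MB one."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

print(file_digest("/data/big-video.mxf"))         # SHA-512
print(file_digest("/data/big-video.mxf", "md5"))  # compare against legacy fixity records
```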
Pain point #6: Version control of large files
- Option: store diffs instead of full copies of each version (see the toy delta sketch below)
- Stanford has a solution for this
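A toy illustration of diff-based versioning using the stdlib difflib; a real system would use a binary delta tool such as xdelta, since SequenceMatcher is far too slow for genuinely large files:

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Record only the regions of `new` that differ from `old`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse bytes from the previous version
        else:
            ops.append(("insert", new[j1:j2]))  # store only the changed bytes
    return ops

def apply_delta(old, ops):
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

v1 = b"version one of a large datastream"
v2 = b"version two of a large datastream, lightly edited"
assert apply_delta(v1, make_delta(v1, v2)) == v2
```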
Pain point #7: Preservation of large files
- Current best practice is triple-layered and assumes many large files will be accessed infrequently; the layers:
- Preservation copy furthest from active use
- Fedora-linked copy of full file that users can access if necessary
- Hydra derivative/proxy copy that serves most use cases
Fedora 4 advantages:
- FC4 allows you to define a storage policy per datastream, so you could always send files under a certain size to cheap storage (see the routing sketch below)
- FC4 also keeps the most recently uploaded files in cache (because most users want to confirm that their upload was successful)
- For these reasons, large files are a really good use case for Hydra on Fedora 4.
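An illustrative size-based routing rule of the kind such a storage policy enables; the thresholds and tier names are invented, and this is not Fedora 4's actual policy configuration:

```python
GB = 1024 ** 3

# Invented tiers: (max size, tier name), checked smallest first.
TIERS = [
    (1 * GB, "cheap-disk"),      # files under a certain size go to cheap storage
    (100 * GB, "bulk-disk"),     # mid-size files on a high-capacity array
    (float("inf"), "hsm-tape"),  # huge, rarely-accessed files behind an HSM
]

def storage_tier(size_in_bytes):
    """Pick the first tier whose size threshold the file fits under."""
    for threshold, tier in TIERS:
        if size_in_bytes <= threshold:
            return tier

print(storage_tier(300 * 1024 * 1024))  # -> cheap-disk
print(storage_tier(500 * GB))           # -> hsm-tape
```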
Action Items:
- (Mark Bussey) Suggest large files as the driver of Fedora 4 development and testing for Hydra, and arrange for an HSM dev environment to be available to Fedora 4 developers