Big Files Breakout Notes
Friday Jan. 24, 2014, 3:30-5:00
Pain point #1: Storing big files
- Files that are too big for Fedora 3 are currently stored outside the repository, which is not ideal
- Fedora 4 should accommodate single files up to 1.4 TB
- Some Rock Hall files are bigger - they're currently using GPFS, a distributed file system with an HSM
- Indiana U also has bigger files - their current solution is HPSS
- They want to move to Fedora 4, and need either a Fedora 4 connector to read the SDA file system or a way to remote-mount HPSS on the server that's running Fedora
Fedora 4 large-file testing: Fedora 4 has been tested with large numbers of moderately big files (10K x 1 GB files)
- Stores 10K x 1 GB files with no degradation, but is slow if many large files sit in a single directory
- Solution is to create a 2-3-level directory hierarchy that distributes files evenly (see the sketch below)
- Fedora 4 has not been tested on multiple huge files
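A minimal sketch of the kind of hash-based fan-out that keeps any one directory small; the function and layout are illustrative, not Fedora 4 internals:

```python
import hashlib
from pathlib import Path

def distributed_path(root, filename, levels=2, width=2):
    """Spread files across a shallow hash-derived directory tree so that
    no single directory accumulates thousands of entries."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # First `levels` pairs of hex digits become intermediate directories,
    # e.g. "ab/cd/big-video.mxf" for a digest starting "abcd...".
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, filename)

print(distributed_path("/data/binaries", "big-video.mxf"))
```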
Pain point #2: Moving big files around
- Getting files in: the HTTP transfer limit is 2 GB - that is the pipeline limitation
- Moving large files within the system as we ingest/process them
- FC4 can chunk large files (the same approach used by Amazon Glacier and OpenStack) - see the chunking sketch after this list
- So Fedora 4 will allow us to incorporate large files in Hydra
- Dropbox, BitTorrent, or chunking are options for transferring large files
- What effect will the loss of net neutrality have on this issue? We don't know - it depends on how providers throttle bandwidth
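A rough sketch of client-side chunking under the 2 GB-per-request constraint; the part size and the multipart-upload flow in the comments are assumptions, not the actual FC4 interface:

```python
CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB parts keep each HTTP request well under 2 GB

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (part_number, offset, data) for a large file without ever
    holding more than one chunk in memory."""
    with open(path, "rb") as f:
        part = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield part, part * chunk_size, data
            part += 1

# Hypothetical usage: POST each part to a multipart-upload endpoint, then
# send a final "complete" request that stitches the parts back together
# (the same pattern Amazon Glacier and OpenStack use).
for part, offset, data in iter_chunks("/data/big-video.mxf"):
    print(f"part {part}: offset={offset}, size={len(data)}")
```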
Pain point #3: HIPAA compliance in big data
- Not restricted to big files - can be an issue for small files too
Pain point #4: User expectations
- We don't fully understand users' expectations - e.g., for Google crawling of large text files, indexing just the metadata is probably reasonable
- Download/delivery: how long will people wait? Approaches to managing expectations:
- Put the file size next to download buttons; prompt the user with a download-time estimate and an "are you sure?" confirmation
- Server-side data visualization/proxy usage - give users a faster experience
- Do as much of the work on the server side (GIS visualization, etc.) rather than making the client do the work
- Let users drill down, see the manifest of the tar file, and download only a component part (see the tar example after this list)
- Add a README file and direct users to use curl to access the 100GB+ files - a backend solution, not delivered via Hydra
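For the tar-manifest idea, Python's standard tarfile module already supports listing members and pulling out a single one; the archive and member names below are made up:

```python
import tarfile

ARCHIVE = "dataset.tar"  # illustrative name

with tarfile.open(ARCHIVE) as tar:
    # Show the manifest without unpacking the whole archive.
    for member in tar.getmembers():
        print(f"{member.size:>14}  {member.name}")

    # Extract only the component the user actually asked for.
    tar.extract("images/plate-042.tiff", path="downloads")
```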
Pain point #5: checksumming large files - how long it takes, best practices for scaling up
- Fedora 4 generates a SHA-1 of the path for large files
- Chronopolis, Tripwire, and other file-integrity scanning systems
- Constant checksumming can contribute to degradation of media
- Maybe just don't checksum as often? How much paranoia is appropriate for preservation?
- Scott Turnbull at APTrust has a library that processes MD5 checksums more efficiently
- Use SHA-512, which on 64-bit hardware can be significantly faster than SHA-256 or SHA-1 - about the same as MD5 (see the streaming example below)
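A streaming-checksum sketch: hashing in fixed-size blocks keeps memory flat regardless of file size, and swapping the algorithm name lets you benchmark SHA-512 against the alternatives; the paths are illustrative:

```python
import hashlib

def file_digest(path, algorithm="sha512", block_size=8 * 1024 * 1024):
    """Stream a file through the hash in 8 MB blocks so a 1.4 TB file
    needs no more memory than a 1 MB one."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

print(file_digest("/data/big-video.mxf"))         # SHA-512
print(file_digest("/data/big-video.mxf", "md5"))  # compare against legacy fixity records
```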
Pain point #6: Version control of large files
- Option: store diffs instead of full copies of each version (see the toy delta sketch below)
- Stanford has a solution for this
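A toy illustration of diff-based versioning using the stdlib difflib; a real system would use a binary delta tool such as xdelta, since SequenceMatcher is far too slow for genuinely large files:

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Record only the regions of `new` that differ from `old`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse bytes from the previous version
        else:
            ops.append(("insert", new[j1:j2]))  # store only the changed bytes
    return ops

def apply_delta(old, ops):
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

v1 = b"version one of a large datastream"
v2 = b"version two of a large datastream, lightly edited"
assert apply_delta(v1, make_delta(v1, v2)) == v2
```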
Pain point #7: Preservation of large files
- Current best practice is triple-layered and assumes many large files will be accessed infrequently; the layers:
- Preservation copy furthest from active use
- Fedora-linked copy of full file that users can access if necessary
- Hydra derivative/proxy copy that serves most use cases
Fedora 4 advantages:
- FC4 allows you to define a storage policy per datastream, so you could always send files under a certain size to cheap storage (see the routing sketch below)
- FC4 also keeps the most recently uploaded files in cache (because most users want to confirm that their upload was successful)
- For these reasons, large files are a really good use case for Hydra on Fedora 4.
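An illustrative size-based routing rule of the kind such a storage policy enables; the thresholds and tier names are invented, and this is not Fedora 4's actual policy configuration:

```python
GB = 1024 ** 3

# Invented tiers: (max size, tier name), checked smallest first.
TIERS = [
    (1 * GB, "cheap-disk"),      # files under a certain size go to cheap storage
    (100 * GB, "bulk-disk"),     # mid-size files on a high-capacity array
    (float("inf"), "hsm-tape"),  # huge, rarely-accessed files behind an HSM
]

def storage_tier(size_in_bytes):
    """Pick the first tier whose size threshold the file fits under."""
    for threshold, tier in TIERS:
        if size_in_bytes <= threshold:
            return tier

print(storage_tier(300 * 1024 * 1024))  # -> cheap-disk
print(storage_tier(500 * GB))           # -> hsm-tape
```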
Action Items:
- (Mark Bussey) Suggest large files as the driver of Fedora 4 development and testing for Hydra, and arrange for an HSM dev environment to be available to Fedora 4 developers