Big Files Breakout Notes

Friday Jan. 24, 2014, 3:30-5:00

Pain point #1: Storing big files

  • Files that are too big for Fedora 3 are currently stored outside the repository, which is not ideal
  • Fedora4 should accommodate single files up to 1.4TB
  • Some Rock Hall files are bigger - they're currently using GPFS, a distributed file system with an HSM (hierarchical storage management) layer
  • Indiana U also has bigger files - their current solution is HPSS
  • They want to move to Fedora 4 and need either a Fedora 4 connector that can read the SDA file system or a way to remote-mount HPSS on the server running Fedora

Fedora 4 large-file testing: Fedora 4 has been tested with large numbers of moderately big files (10K x 1GB)

  • Stores 10k x 1GB files with no degradation, but performance is slow when many large files sit in a single directory
  • The solution is a 2-3-level directory hierarchy that distributes files evenly (see the sketch after this list)
  • Fedora4 has not been tested on multiple huge files
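
A minimal sketch of that kind of sharding, assuming a hash-derived path (the shard function, root directory, and layout here are illustrative, not Fedora 4's actual on-disk scheme):

    import hashlib
    import os

    def shard_path(root, filename, levels=2, width=2):
        """Spread files across a 2-level directory tree derived from a hash of
        the filename, so no single directory accumulates too many entries."""
        digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
        shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
        return os.path.join(root, *shards, filename)

    # e.g. shard_path("/data", "scan-0001.tiff") -> "/data/ab/9f/scan-0001.tiff"
    # (the shard directories depend on the digest)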

Pain point #2: Moving big files around

  • Getting files in: HTTP transfers are limited to 2GB in practice (a 32-bit size limit in many clients and servers) - that is the pipeline limitation
  • Moving large files within the system as we ingest/process them
  • FC4 can chunk large files (same approach as Amazon Glacier and OpenStack) - see the upload sketch after this list
  • So Fedora 4 will allow us to incorporate large files in Hydra
  • DropBox, BitTorrent, or chunking are options for transferring large files
  • What effect will the loss of net neutrality have on this issue? Unknown - it depends on how providers throttle bandwidth
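
A rough sketch of the chunking idea on the client side, assuming the Python requests library; the per-part URL pattern below is purely hypothetical, not Fedora 4's actual API:

    import requests  # third-party; assumed available

    PART_SIZE = 500 * 1024 * 1024  # 500 MB parts stay well under the 2 GB limit

    def upload_in_parts(path, base_url):
        """PUT each part of a large file to its own URL, in the spirit of the
        Glacier/OpenStack multipart approach (part-NNNNN scheme is illustrative)."""
        part = 0
        with open(path, "rb") as f:
            while True:
                data = f.read(PART_SIZE)
                if not data:
                    break
                resp = requests.put("%s/part-%05d" % (base_url, part), data=data)
                resp.raise_for_status()
                part += 1
        return part  # number of parts uploaded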

Pain point #3: HIPAA compliance in big data

  • Not restricted to big files - can be an issue for small files too

Pain point #4: User expectations

  • We don't fully understand users' expectations - e.g., for Google crawling of large text files, indexing just the metadata is probably reasonable
  • Download/delivery - how long will people wait? We need approaches to managing expectations
  • Put file sizes next to download buttons; give users a download-time estimate and an "are you sure?" confirmation
  • Server-side data visualization/proxy usage - give users a faster experience
  • Do as much of the work on the server side (GIS visualization, etc.) rather than making the client do the work
  • Let users drill down, see the manifest of a tar file, and download only a component part
  • Add a README file and direct users to use curl to access the 100GB+ files - a backend solution, not via Hydra (example below)
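
For the curl route, the README could point at something like the line below; the -C - flag resumes an interrupted transfer where it left off, which matters at 100GB+ (the URL is a placeholder):

    curl -C - -O https://example.org/files/big-dataset.tar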

Pain point #5: checksumming large files - how long it takes, best practices for scaling up

  • Fedora4 generates SHA1 of the path for large files
  • Chronopolis, Tripwire, file integrity scanning systems
  • Constant checksumming can contribute to degradation of media
  • Maybe just don't checksum as often? How much paranoia is appropriate for preservation?
  • Scott Turnbull at APTrust has a library that processes MD5 checksums more efficiently
  • Use SHA-512: on 64-bit hardware it is significantly faster than SHA-256 and competitive with SHA-1 and MD5 (streaming example below)
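
A minimal streaming-checksum sketch: reading in fixed-size chunks keeps memory flat no matter how big the file is (the 1 MB chunk size is an arbitrary choice):

    import hashlib

    def sha512_of_file(path, chunk_size=1024 * 1024):
        """Compute SHA-512 without loading the whole file into memory."""
        h = hashlib.sha512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

Swapping in hashlib.sha256() or hashlib.md5() makes it easy to benchmark the relative-speed claim above on local hardware.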

Pain point #6: Version control of large files

  • Option: store diffs instead of full versions (sketch after this list)
  • Stanford has a solution for this
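
A sketch of the store-diffs pattern, assuming the third-party bsdiff4 package (filenames are placeholders, and whether binary diffing pays off at TB scale is untested here):

    import bsdiff4  # third-party; assumed available

    # Store only the binary delta between version 1 and version 2...
    bsdiff4.file_diff("dataset-v1.bin", "dataset-v2.bin", "v1-to-v2.patch")

    # ...then reconstruct version 2 on demand from version 1 plus the patch.
    bsdiff4.file_patch("dataset-v1.bin", "dataset-v2-restored.bin", "v1-to-v2.patch")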

Pain point #7: Preservation of large files

  • Current best practice is triple-layered and assumes many large files will be accessed infrequently - the layers:
  • Preservation copy furthest from active use
  • Fedora-linked copy of full file that users can access if necessary
  • Hydra derivative/proxy copy that serves most use cases

Fedora 4 advantages:

  • FC4 allows you to define a storage policy per datastream, so you could always send files under a certain size to cheap storage (trivial sketch after this list)
  • FC4 also keeps most recently uploaded files in cache (b/c most users want to confirm that their upload was successful).
  • For these reasons, large files are a really good use case for Hydra on Fedora 4.
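
The size-based routing rule itself is trivial; as an illustration only (Fedora 4 expresses this in repository configuration, not application code, and the tier names here are made up):

    GB = 1024 ** 3

    def storage_tier(size_bytes, threshold=1 * GB):
        """Send small files to cheap storage, big ones to the HSM-backed tier."""
        return "cheap-store" if size_bytes < threshold else "hsm-store"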

Action Items:

(Mark Bussey)

Suggest large files as the driver of Fedora 4 development and testing for Hydra, and arrange for an HSM dev environment to be available to Fedora 4 developers