Fedora 4 as the Hydra RDF store

Fedora 4 and Hydra
These are notes from the Fedora 4 discussion at the RDF in Hydra Summit at Stanford on Oct 21-22, 2013. The hour-long discussion explored the opportunities and implications of moving Hydra to RDF as it goes from Fedora 3 to Fedora 4.

Issues & Agenda Points
  • Fedora 3 was hybrid XML and RDF
  • Fedora 4 is native RDF
  • The Fedora 4 team would like users and use cases to guide development
  • Hydra could use Fedora 4 and its new capabilities, but needs to anticipate them in order to leverage them
Key Questions 
  • How RDFish is F4?
  • What does a data stream look like in Fedora 4?
  • How much of Fedora 4's native capability do we want to use?
    • store a single RDF statement in F4 for a descMD element? e.g., "this objectID hasTitle Catcher in the Rye"
    • or follow the Fedora 3 pattern: create a DescMD document of serialized RDF (or XML) that includes a title attribute, and store it as an F4 data stream
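The two options above can be contrasted in a minimal sketch. The object URI and the dcterms title predicate are assumptions made for illustration, not names from the discussion, and no Fedora API is called here; the point is only the difference between holding one statement and holding a serialized document.

```python
# Hypothetical identifiers: the URI and the dcterms:title predicate are
# illustrative choices, not names taken from the discussion.
OBJECT_URI = "http://example.org/objects/1"
TITLE_PREDICATE = "http://purl.org/dc/terms/title"

# Option 1: the title is a single statement held as a property of the object.
title_statement = (OBJECT_URI, TITLE_PREDICATE, "Catcher in the Rye")


def to_ntriples(statements):
    """Serialize (subject, predicate, literal) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, literal in statements:
        # Escape backslashes first, then double quotes, inside the literal.
        escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return "\n".join(lines) + "\n"


# Option 2: the same statement is serialized into a descMD document,
# which would then be stored as the content of a data stream.
descmd_document = to_ntriples([title_statement])
print(descmd_document, end="")
```

In the first option the repository can see the individual statement; in the second it only sees an opaque document that a client has to parse.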
Discussion

"RDFness"

  • Fedora 4 speaks and stores native RDF
    • but note that JCR is not a triplestore (no SPARQL endpoint)
  • use the event model if you want SPARQL: listen to the events, harvest the RDF and put it in a triplestore
  • the Fedora public API speaks RDF natively
  • But internal storage format is not RDF: data stored in a JCR-compatible Infinispan repo. 
    • Does this matter? 
    • Internal mechanics are an implementation detail. 
    • One concern is that for preservation, you probably want human-readable documents that can stand independent of Fedora
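The "listen to the events, harvest the RDF, put it in a triplestore" pattern might look roughly like the sketch below. Everything here is a stand-in: the event shape, the `harvest` stub and the in-memory `triplestore` set are hypothetical, and a real deployment would subscribe to the repository's event feed and write to an external triplestore that offers SPARQL.

```python
from collections import deque

# Hypothetical stand-ins for the repository's event feed and an
# external triplestore.
event_queue = deque()
triplestore = set()

# Pretend repository content, keyed by resource URI.
repository = {
    "http://example.org/objects/1": {
        ("http://example.org/objects/1",
         "http://purl.org/dc/terms/title",
         "Catcher in the Rye"),
    },
}


def harvest(resource_uri):
    """Return the current triples for a resource (stub for an API call)."""
    return repository.get(resource_uri, set())


def handle_event(event):
    """Replace the indexed triples for the resource named in the event."""
    uri = event["resource"]
    # Drop whatever was indexed for this subject before re-harvesting.
    stale = {t for t in triplestore if t[0] == uri}
    triplestore.difference_update(stale)
    if event["type"] != "removed":
        triplestore.update(harvest(uri))


event_queue.append({"type": "created", "resource": "http://example.org/objects/1"})
while event_queue:
    handle_event(event_queue.popleft())
```

Re-harvesting the whole resource on every event keeps the listener simple, at the cost of more reads against the repository.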

"Linked Data Platform"

Fedora 4 stores RDF

  • does not have a triplestore internally. No SPARQL endpoint
  • includes support for writing to an external index (Solr)
  • includes support for writing to an external triplestore with SPARQL

Fedora 4 doesn't include support for blank nodes on objects

  • blank nodes cause issues in indexing, in particular
  • SPARQL update doesn't get done. This is a long-standing, festering wound in the RDF community
  • lots of formats (MADS, MODS) use blank nodes for representing hierarchy 
  • could use this for structural relationships in particular
  • could model Fedora 4 blank nodes differently
    • use "Skolemization": take blank nodes, assign them formal URIs
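A minimal sketch of Skolemization, assuming triples are plain tuples of strings and that any term starting with "_:" is a blank node (so a literal that happens to start with "_:" would be wrongly converted). The base URI is an assumption; RDF 1.1 suggests minting Skolem IRIs under a `/.well-known/genid/` path. This is not how Fedora 4 itself handles blank nodes.

```python
import uuid

# Assumed base URI for minted identifiers.
SKOLEM_BASE = "http://example.org/.well-known/genid/"


def skolemize(triples, base=SKOLEM_BASE):
    """Replace blank nodes (terms starting with "_:") with minted URIs."""
    minted = {}

    def convert(term):
        if isinstance(term, str) and term.startswith("_:"):
            # Reuse the same URI for every occurrence of a given blank node.
            if term not in minted:
                minted[term] = base + uuid.uuid4().hex
            return minted[term]
        return term

    return [(convert(s), p, convert(o)) for s, p, o in triples]


# A MODS-like nested name, using hypothetical predicates.
triples = [
    ("http://example.org/objects/1", "http://example.org/ns#name", "_:b0"),
    ("_:b0", "http://example.org/ns#namePart", "Salinger"),
]
skolemized = skolemize(triples)
```

After this step the nested name has a URI of its own, so it can be indexed and updated like any other resource.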

You can pretend F4 is an F3 repo when you migrate. 

And what might you change, to leverage F4?

  • Get rid of RELS-EXT data stream
  • Kill DC data stream
  • Things you'd normally put into descMD: would you put that into triples too?
    • Sure. Why not? 
    • Good reason to do it: you don't have to serialize your descMD into an rdf.datastream 
    • much richer (external) triplestore index if you store as RDF rather than as an rdf.datastream in XML or N-Triples

What Makes Fedora Preservation-Friendly?

  • rebuild the repo by crawling the file system
  • encapsulate the metadata with the data
  • fixity services (bit auditing)
  • versioning
  • audit trails of actions
  • relate whole objects & all their versions, surrogates, metadata, and associated services
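Of the items above, fixity checking is the easiest to illustrate. The sketch below is a generic example of bit auditing with a stored digest, not Fedora's own fixity service; the choice of SHA-256 and the function names are assumptions.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_fixity(data: bytes, recorded_digest: str) -> bool:
    """Compare the content's current digest with the one recorded at ingest."""
    return checksum(data) == recorded_digest


# Record a digest at ingest time, then audit the stored bytes later.
original = b"Catcher in the Rye"
recorded = checksum(original)
intact = check_fixity(original, recorded)
corrupted = check_fixity(b"Catcher in the Rya", recorded)
```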

New Functionalities in F4

  • Hierarchy
    • Objects can contain objects
    • use it for administrative sets
    • use it to mimic the structure of things ingested to repo
      • file system that's been ingested from forensics lab
      • EAD...
  • Project other resources into the Repo. 
  • You can attach descriptions to binary objects
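As a rough illustration of using hierarchy to mimic the structure of ingested material, the sketch below turns a list of file paths into nested containers. The paths are made up and plain dictionaries stand in for repository objects; no Fedora call is involved.

```python
def build_hierarchy(paths):
    """Nest path segments so that each container holds its children."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            # Create the child container on first sight, then descend into it.
            node = node.setdefault(part, {})
    return root


# Hypothetical listing from an ingested disk image.
tree = build_hierarchy([
    "disk1/docs/a.txt",
    "disk1/docs/b.txt",
    "disk1/img/c.png",
])
```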

To Do
  • Have a Fedora 4 / Hydra modelling & use discussion at Worldwide Hydra Connect in January 2014.