Fedora 4 as the Hydra RDF store
Fedora 4 and Hydra
These are notes from the Fedora 4 discussion at the RDF in Hydra Summit at Stanford on Oct 21-22, 2013. The hour-long discussion explored the opportunities and implications of moving Hydra to RDF as it goes from Fedora 3 to Fedora 4.
Issues & Agenda Points
- Fedora 3 was a hybrid of XML and RDF
- Fedora 4 is native RDF
- The Fedora 4 team would like users and use cases to guide development
- Hydra could use Fedora 4 and its new capabilities, but needs to forecast/be aware of these to leverage them
Key Questions
- How RDFish is F4?
- What does a data stream look like in Fedora 4?
- How many of the native Fedora 4 capabilities do we want to use?
- store a single RDF statement in F4 for a descMD element? e.g., "this objectID hasTitle Catcher in the Rye"
- or follow the Fedora 3 pattern: create a DescMD document of serialized RDF (or XML) that includes a title attribute, and store it as an F4 data stream
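The two modelling options in the last question can be contrasted with a small sketch. This is only an illustration, not Fedora 4 or Hydra code: the object URI is hypothetical, the Dublin Core Terms title predicate stands in for "hasTitle", and the function names are made up for this example.

```python
# Hypothetical identifiers: the object URI is illustrative only, and
# dcterms:title is used as a stand-in for the "hasTitle" predicate.
OBJECT_URI = "http://example.org/objects/objectID"
HAS_TITLE = "http://purl.org/dc/terms/title"


def as_single_statement(subject, predicate, literal):
    """Option 1: one triple, held by the repository as a property of the object."""
    return (subject, predicate, literal)


def as_descmd_datastream(triples):
    """Option 2: serialize all descriptive triples into one N-Triples document
    that would be stored as an opaque datastream (the Fedora 3 pattern).
    Assumes every object is a plain string literal."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines) + "\n"


statement = as_single_statement(OBJECT_URI, HAS_TITLE, "Catcher in the Rye")
document = as_descmd_datastream([statement])
```

In the first option the repository itself can see the title; in the second it sees only a blob whose contents the application has to parse.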
Discussion
"RDFness"
- Fedora 4 speaks and stores native RDF
- but note that JCR is a non-triplestore repository (no SPARQL endpoint)
- use the event model if you want SPARQL; listen to the events, harvest the RDF and put in a triplestore
- the Fedora public API speaks RDF natively
- But the internal storage format is not RDF: data is stored in a JCR-compatible Infinispan repo.
- Does this matter?
- Internal mechanics are an implementation detail.
- One concern is that for preservation, you probably want human-readable documents that can stand independent of Fedora
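The "listen to the events, harvest the RDF and put in a triplestore" pattern might look roughly like the sketch below. Everything here is an assumption made for illustration: the event shape (a dict with "type" and "subject"), the fetch_rdf callback, and the in-memory ExternalIndex class, which stands in for a real triplestore with a SPARQL endpoint.

```python
class ExternalIndex:
    """In-memory stand-in for an external triplestore (hypothetical)."""

    def __init__(self):
        self.triples = set()

    def remove_subject(self, subject):
        # Drop every triple whose subject is the given resource.
        self.triples = {t for t in self.triples if t[0] != subject}

    def replace_subject(self, subject, triples):
        # Re-harvest: discard stale triples, then store the current ones.
        self.remove_subject(subject)
        self.triples.update(triples)

    def match(self, s=None, p=None, o=None):
        # Simple triple-pattern lookup; None acts as a wildcard.
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def handle_event(event, fetch_rdf, index):
    """React to one repository event (assumed shape: 'type' and 'subject')."""
    subject = event["subject"]
    if event["type"] == "removed":
        index.remove_subject(subject)
    else:
        index.replace_subject(subject, fetch_rdf(subject))


TITLE = "http://purl.org/dc/terms/title"
OBJ = "http://example.org/objects/1"  # hypothetical object URI
repository = {OBJ: [(OBJ, TITLE, "Catcher in the Rye")]}

index = ExternalIndex()
handle_event({"type": "updated", "subject": OBJ}, lambda s: repository[s], index)
titles = index.match(p=TITLE)
```

The repository stays the source of truth; the index can be thrown away and rebuilt by replaying events or re-harvesting.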
"Linked Data Platform"
- See W3C Linked Data Platform API 1.0 at https://dvcs.w3.org/hg/ldpwg/raw-file/default/ldp.html
- linked data plus binary files.
- Fedora 4 API conforms to this API (and has more calls)
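As a rough illustration of the LDP style of interaction, the sketch below builds (but does not send) an HTTP POST that asks a container to create a new resource from a Turtle body. The container URL and slug are hypothetical, and nothing here is specific to the Fedora 4 API beyond what LDP describes: POST to a container, a text/turtle body, and an optional Slug header as a naming hint.

```python
from urllib.request import Request


def build_ldp_create_request(container_url, turtle, slug):
    """Build an LDP-style 'create resource' request without sending it."""
    return Request(
        container_url,
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle", "Slug": slug},
        method="POST",
    )


# "<>" is a relative IRI that resolves to the resource being created.
turtle = '<> <http://purl.org/dc/terms/title> "Catcher in the Rye" .\n'

# Hypothetical container URL; a real deployment will differ.
request = build_ldp_create_request("http://localhost:8080/rest/", turtle, "catcher")
```

Sending the request would be a matter of passing it to urllib.request.urlopen against a running server.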
Fedora 4 stores RDF
- does not have a triplestore internally. No SPARQL endpoint
- includes support for writing to an external index (solr)
- includes support for writing to an external triplestore with SPARQL
Fedora 4 doesn't include support for blank nodes on objects
- blank nodes cause issues in indexing, in particular
- SPARQL update doesn't get done. This is a long-standing, festering wound with the RDF community
- lots of formats (MADS, MODS) use blank nodes for representing hierarchy
- could use this for structural relationships in particular
- could model Fedora 4 blank nodes differently
- use "Skolemization": take blank nodes, assign them formal URIs
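Skolemization as described above can be sketched in a few lines. The representation is an assumption for this example: triples are tuples of strings, blank nodes are written "_:label", and literals carry their own quotes so they are never mistaken for blank nodes. The base URI and vocabulary are hypothetical; the "/.well-known/genid/" path follows the suggestion in RDF 1.1 Concepts.

```python
import uuid


def skolemize(triples, base="http://example.org"):
    """Replace each blank node with a freshly minted URI.

    Returns the rewritten triples and the blank-node-to-URI mapping, so the
    same blank node always receives the same URI within one call.
    """
    mapping = {}

    def replace(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"{base}/.well-known/genid/{uuid.uuid4().hex}"
            return mapping[term]
        return term

    # Predicates are never blank nodes, so only subjects and objects change.
    return [(replace(s), p, replace(o)) for s, p, o in triples], mapping


# A MODS-like nested name, modelled with a blank node (hypothetical vocabulary).
triples = [
    ("http://example.org/objects/1", "http://example.org/vocab#name", "_:b0"),
    ("_:b0", "http://example.org/vocab#namePart", '"Salinger, J. D."'),
]
skolemized, mapping = skolemize(triples)
```

After this step every node has a URI, so the nested name can be addressed, indexed, and updated like any other resource.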
You can pretend F4 is an F3 repo when you migrate.
And what might you change, to leverage F4?
- Get rid of Rels-ext data stream
- Kill DC data stream
- Things you'd normally put into descMD: would you put those into triples too?
- Sure. Why not?
- Good reason to do it: you don't have to serialize your descMD into an rdf.datastream
- much richer (external) triple store index if you store as RDF rather than an rdf.datastream in xml or n-triples
What Makes Fedora Preservation-Friendly?
- rebuild the repo by crawling the file system
- encapsulate the metadata with the data
- fixity services (bit auditing)
- versioning
- audit trails of actions
- relate whole objects & all their versions, surrogates, metadata, and associated services
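The fixity (bit auditing) item in the list above comes down to recomputing a digest and comparing it with the one recorded at ingest. A minimal sketch, assuming SHA-1 as the algorithm and made-up function names; it does not reflect how Fedora implements its fixity service.

```python
import hashlib


def digest(content, algorithm="sha1"):
    """Return the hex digest of a byte string using the named algorithm."""
    return hashlib.new(algorithm, content).hexdigest()


def check_fixity(content, recorded_digest, algorithm="sha1"):
    """True if the stored bytes still match the digest recorded at ingest."""
    return digest(content, algorithm) == recorded_digest


original = b"binary content of a datastream"
recorded = digest(original)            # recorded when the content is ingested

intact = check_fixity(original, recorded)
corrupted = check_fixity(original + b"\x00", recorded)
```

Here intact is True and corrupted is False, which is the signal an audit would act on.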
New Functionalities in F4
- Hierarchy
- Objects can contain objects
- use it for administrative sets
- use it to mimic the structure of things ingested to repo
- file system that's been ingested from forensics lab
- EAD...
- Project other resources into the Repo.
- You can attach descriptions to binary objects
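The hierarchy idea (objects containing objects, used for administrative sets or to mirror an ingested file system) can be pictured as a simple tree. The class and the names below are hypothetical and only illustrate containment and path-like addressing; they are not the Fedora 4 object model.

```python
class RepoObject:
    """A repository object that may contain other objects (illustrative)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self

    def path(self):
        # Walk up to the root, collecting names, then join them top-down.
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))

    def descendants(self):
        # Depth-first traversal of everything contained in this object.
        for child in self.children.values():
            yield child
            yield from child.descendants()


root = RepoObject("")
admin_set = RepoObject("theses", root)        # an administrative set
thesis = RepoObject("thesis-1", admin_set)    # an object inside the set
pdf = RepoObject("thesis.pdf", thesis)        # a binary inside the object

pdf_path = pdf.path()
contained = [node.name for node in admin_set.descendants()]
```

With this shape, a policy attached to the administrative set could be applied to everything returned by descendants().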
To Do
- Have a Fedora 4 / Hydra modelling & use discussion at Worldwide Hydra Connect in January 2014.