Fedora 4 as the Hydra RDF store

Fedora 4 and Hydra
These are notes from the Fedora 4 discussion at the RDF in Hydra Summit at Stanford on Oct 21-22, 2013. The hour-long discussion explored the opportunities and implications of moving Hydra to RDF as it goes from Fedora 3 to Fedora 4.

Issues & Agenda Points

Fedora 3 was hybrid XML and RDF
Fedora 4 is native RDF
Fedora 4 team would like users and use cases to guide development
Hydra could use Fedora 4 and its new capabilities, but needs to anticipate them in order to leverage them

Key Questions

How RDFish is F4?
What does a data stream look like in Fedora 4?
How many native Fedora 4 capabilities do we want to use?
- store a single RDF statement in F4 for a descMD element? e.g., "this objectID hasTitle Catcher in the Rye"
- or follow the Fedora 3 pattern: create a descMD document of serialized RDF (or XML) that includes a title attribute, and store it as an F4 data stream
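
To make the two options concrete, here is a minimal sketch in plain Python. It does not call Fedora at all; the object URI, the choice of Dublin Core Terms predicates, and the `ntriple` helper are assumptions made for illustration only.

```python
# Hypothetical identifiers; none of these come from the discussion itself.
OBJECT_URI = "http://example.org/objects/obj1"
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"


def ntriple(subject, predicate, literal):
    """Serialize one triple with a plain literal object as an N-Triples line."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


# Option 1: the title is one statement, stored as a property of the object.
statement = ntriple(OBJECT_URI, DC_TITLE, "Catcher in the Rye")

# Option 2 (Fedora 3 pattern): the same statement is only one line inside
# a serialized descMD document that is stored whole, as an opaque data stream.
descmd_document = "\n".join([
    statement,
    ntriple(OBJECT_URI, DC_CREATOR, "J. D. Salinger"),
]).encode("utf-8")
```

In option 1 the repository can see and index the title directly; in option 2 it only sees a blob of bytes, and a client has to parse the document to get the title back.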

Discussion

"RDFness"

Fedora 4 speaks and stores native RDF
- but note that JCR is a non-triplestore repository (no SPARQL endpoint)
use the event model if you want SPARQL: listen to the events, harvest the RDF and put it in a triplestore
The Fedora public API speaks RDF natively
But the internal storage format is not RDF: data is stored in a JCR-compatible Infinispan repo.
- Does this matter?
- Internal mechanics are an implementation detail.
- One concern is that for preservation, you probably want human-readable documents that can stand independent of Fedora
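
The "listen to the events, harvest the RDF" approach noted above could look roughly like the sketch below. Everything here is a stand-in: the event dictionaries, the `fetch_rdf` callback, and the Python set used as a "triplestore" are hypothetical, and the real Fedora 4 event format is not described in these notes.

```python
from queue import Queue


def harvest(events, fetch_rdf, triplestore):
    """Drain repository events and mirror each resource's triples into a store."""
    while not events.empty():
        event = events.get()
        uri = event["uri"]
        # Drop stale triples about this resource before adding the current ones.
        stale = {triple for triple in triplestore if triple[0] == uri}
        triplestore -= stale
        if event["type"] != "removed":
            triplestore |= set(fetch_rdf(uri))
    return triplestore


# Usage with an in-memory stand-in for the repository.
OBJ = "http://example.org/objects/obj1"
repo = {OBJ: [(OBJ, "http://purl.org/dc/terms/title", "Catcher in the Rye")]}

events = Queue()
events.put({"type": "updated", "uri": OBJ})
store = harvest(events, lambda uri: repo.get(uri, []), set())
```

The point of the design is that the repository stays the source of truth and the triplestore is a disposable index that can be rebuilt by replaying events or re-crawling.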

"Linked Data Platform"

See W3C Linked Data Platform API 1.0 at https://dvcs.w3.org/hg/ldpwg/raw-file/default/ldp.html
- linked data plus binary files.
- Fedora 4 API conforms to this API (and has more calls)
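
As a rough illustration of what "linked data plus binary files" over HTTP means, the sketch below builds (but does not send) an LDP-style request that creates an RDF resource by POSTing Turtle to a container. The container URL and slug are hypothetical, and the exact paths and extra calls Fedora 4 offers are not covered here.

```python
from urllib.request import Request

# Hypothetical container URL; a real repository would have its own base path.
CONTAINER_URL = "http://example.org/repo/"


def build_ldp_create_request(container_url, slug, turtle):
    """Build an HTTP POST that asks an LDP container to create an RDF resource."""
    return Request(
        container_url,
        data=turtle.encode("utf-8"),
        method="POST",
        # Slug is only a naming hint; the server decides the final URI.
        headers={"Content-Type": "text/turtle", "Slug": slug},
    )


# "<>" is a relative IRI that resolves to the resource being created.
request = build_ldp_create_request(
    CONTAINER_URL,
    "catcher",
    '<> <http://purl.org/dc/terms/title> "Catcher in the Rye" .',
)
```

A binary file would be created the same way, with the file's own media type in place of `text/turtle`.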

Fedora 4 stores RDF

does not have a triplestore internally; no SPARQL endpoint
includes support for writing to an external index (Solr)
includes support for writing to an external triplestore with SPARQL

Fedora 4 doesn't include support for blank nodes on objects

blank nodes cause issues in indexing, in particular
SPARQL update doesn't get done when blank nodes are involved. This is a long-standing, festering wound in the RDF community
lots of formats (MADS, MODS) use blank nodes for representing hierarchy
could use this for structural relationships in particular
could model Fedora 4 blank nodes differently
- use "Skolemization": take blank nodes, assign them formal URIs
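
A minimal sketch of Skolemization, assuming triples are tuples of N-Triples-style strings (blank nodes start with `_:`, URIs are wrapped in `<>`, literals in quotes). The base URI and the predicate names are hypothetical; the `/.well-known/genid/` path follows the convention suggested in the W3C RDF 1.1 Concepts document.

```python
import uuid


def skolemize(triples, base="http://example.org/.well-known/genid/"):
    """Replace every blank node label with a freshly minted URI."""
    mapping = {}

    def term(value):
        if value.startswith("_:"):
            # The same blank node label always maps to the same new URI.
            if value not in mapping:
                mapping[value] = f"<{base}{uuid.uuid4().hex}>"
            return mapping[value]
        return value

    skolemized = [(term(s), p, term(o)) for s, p, o in triples]
    return skolemized, mapping


# A MODS-like nested title, modelled with a blank node for the hierarchy.
triples = [
    ("<http://example.org/objects/obj1>", "<http://example.org/ns#titleInfo>", "_:b0"),
    ("_:b0", "<http://example.org/ns#title>", '"Catcher in the Rye"'),
]
skolemized, mapping = skolemize(triples)
```

After this step the nested node has a real identifier, so it can be indexed and targeted by updates like any other resource.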

You can pretend F4 is an F3 repo when you migrate.

And what might you change, to leverage F4?

Get rid of the RELS-EXT data stream
Kill the DC data stream
Things you'd normally put into descMD: you'd put those into triples too?
- Sure. Why not?
- Good reason to do it: you don't have to serialize your descMD into an rdf.datastream
- much richer (external) triplestore index if you store as RDF rather than as an rdf.datastream in XML or N-Triples

What Makes Fedora Preservation-Friendly?

rebuild the repo by crawling the file system
encapsulate the metadata with the data
fixity services (bit auditing)
versioning
audit trails of actions
relate whole objects and all their versions, surrogates, metadata, and associated services
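
For the fixity (bit auditing) item above, the underlying idea can be sketched in a few lines: record a digest at ingest, recompute it later, and compare. This is a generic illustration using Python's `hashlib`, not Fedora's own fixity service; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib


def checksum(data, algorithm="sha256"):
    """Return the hex digest of some bytes using the named hash algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def fixity_ok(data, recorded_digest, algorithm="sha256"):
    """True if the bytes still match the digest recorded at ingest time."""
    return checksum(data, algorithm) == recorded_digest


# Record a digest at ingest, then audit the stored bytes later.
original = b"contents of a stored file"
recorded = checksum(original)
audit_passes = fixity_ok(original, recorded)
audit_detects_change = not fixity_ok(b"contents of a st0red file", recorded)
```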

New Functionalities in F4

Hierarchy
- Objects can contain objects
- use it for administrative sets
- use it to mimic the structure of things ingested to repo
  - file system that's been ingested from forensics lab
  - EAD...
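
As a sketch of "objects can contain objects" used to mimic an ingested file system, the snippet below turns a list of file paths into a nested structure where each level would become a containing object. The paths and the `build_hierarchy` helper are hypothetical, and nothing here talks to a repository.

```python
from pathlib import PurePosixPath


def build_hierarchy(paths):
    """Turn a list of file paths into nested dicts, one level per container."""
    root = {}
    for path in paths:
        node = root
        for part in PurePosixPath(path).parts:
            # Each path segment becomes an object contained in its parent.
            node = node.setdefault(part, {})
    return root


# Hypothetical paths from a disk image.
tree = build_hierarchy([
    "disk1/docs/a.txt",
    "disk1/docs/b.txt",
    "disk1/img/c.png",
])
```

The same shape would serve for administrative sets: the outer levels are the sets and the leaves are the member objects.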

Project other resources into the Repo.
You can attach descriptions to binary objects

To Do

Have a Fedora 4 / Hydra modelling & use discussion at Worldwide Hydra Connect in January 2014.