2016-02-26 State of RDF/Repository access to Triplestores

Time: 9:00am PST / Noon EST

WebEx Info: Join WebEx meeting - Meeting # 648 153 499, Meeting password: htIG0226  (hotel-tango-Igloo-Golf-zero-two-two-six  I'm not sure you need the password.)

Audio Connection:  Computer, or 1-855-244-8681 Call-in toll-free number (US/Canada), or 1-650-479-3207 Call-in toll number (US/Canada)

Moderator: E. Lynette Rayle (Cornell)

Notetaker:  cmharlow

Attendees:

(Please update/correct as needed)

Agenda:

  1. Next Call
    1. date/time:
    2. Moderator: 
    3. Notetaker: 
  2. Call for additional agenda items
  3. Status of Triplestore Implementations in Ruby
    1. Hector Correa - Blazegraph
    2. Stefano Cossu - Jena/Fuseki
    3. Tom Johnson - Apache Marmotta Overview
    4. Aaron Coburn - Rya (Apache project)

Meeting Notes:

Helpful Links: 

Notes based off Agenda:

  1. Next Call 
    1. The next call is scheduled in approximately one month - but this is during LDCX, and many folks here will be there. 
    2. Do we want to move this meeting to April 1st or March 18th?
    3. Preference is for April 1st, at the same time (noon Eastern).
  2. Call for additional agenda items:
    1. Topic for next meeting? This is held off until the end of the meeting.
    2. No other items brought up.
  3. Status of Triplestore Implementations in Ruby
    1. Blazegraph - Hector Correa
      1. Notes on his demo: https://github.com/hectorcorrea/blazegraph_demo 
      2. He explored it outside of Ruby, using Blazegraph's own tooling.
        1. When you install Blazegraph, you get a UI where you can submit queries, experiment, and see results.
        2. He did not try it with Ruby or any ActiveTriples scripting.
        3. Demo of the Blazegraph UI out of the box
          1. One can add triples in the SPARQL UPDATE tab.
          2. He is learning Blazegraph along with learning SPARQL in Fedora Semantic Web book club.
          3. Ran a sample SPARQL Query to pull all the data from the store:
            1. You can change the query to just see subjects, see all the predicates, etc. in the UI.
            2. Hector briefly compared Blazegraph SPARQL queries with SQL query structure and creation.
            3. This is better than using Fedora to query one resource at a time.
          4. Demo of Inferencing in Blazegraph:
            1. Showed example with animals, mammals. 
            2. Showed how querying was working with that inferencing.
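The animals/mammals demo relies on subclass inference. As a rough, store-independent sketch of the idea (this is not Blazegraph's implementation, and the `ex:` names and both helper functions are made up for illustration), RDFS-style `rdfs:subClassOf` reasoning can be simulated in plain Python:

```python
# Toy RDFS-style subclass inference over an in-memory set of triples.
# All ex: identifiers are hypothetical; this is not how Blazegraph works internally.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
    ("ex:Dog", SUBCLASS_OF, "ex:Mammal"),
    ("ex:rex", RDF_TYPE, "ex:Dog"),
}


def infer(asserted):
    """Return the asserted triples plus those entailed by two RDFS rules."""
    closed = set(asserted)
    while True:
        new = set()
        for s, p, o in closed:
            for s2, p2, o2 in closed:
                if p2 != SUBCLASS_OF or s2 != o:
                    continue
                if p == SUBCLASS_OF:
                    # subClassOf is transitive
                    new.add((s, SUBCLASS_OF, o2))
                elif p == RDF_TYPE:
                    # an instance of a subclass is an instance of the superclass
                    new.add((s, RDF_TYPE, o2))
        if new <= closed:
            return closed
        closed |= new


def instances_of(graph, cls):
    """Rough equivalent of: SELECT ?s WHERE { ?s rdf:type <cls> }."""
    return sorted(s for s, p, o in graph if p == RDF_TYPE and o == cls)


print(instances_of(triples, "ex:Animal"))         # without inference
print(instances_of(infer(triples), "ex:Animal"))  # with inference
```

Without inference the query for animals finds nothing, because only the `ex:Dog` type was asserted; with the inferred triples included, `ex:rex` is returned.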
        4. James Griffin: Hector, did you happen to explore clustering? 
          1. The documentation on the wiki appears to be a bit sparse: https://wiki.blazegraph.com/wiki/index.php/ScaleOutTripleStore
        5. Lynette Rayle is trying Blazegraph with ActiveTriples for a digital collection sample. It has only a small set of triples, but it is working well so far.
          1. Lynette Rayle will post notes of code for Blazegraph repository work.
        6. tamsin woo:
          1. There's a Ruby client for Blazegraph: https://github.com/ruby-rdf/rdf-blazegraph
          2. It allows you to work with ActiveTriples.
    2. Jena/Fuseki - scossu
      1. Stefano has implemented both Blazegraph and Jena Fuseki, and is now running on Jena Fuseki for a couple of reasons:
        1. Blazegraph showed heavy performance drops when deleting even single triples in a large dataset, so he reverted to the Fedora recommendation (Fedora officially supports Jena/Fuseki?).
        2. Stefano discussed ease of use and integration, scalability, features, and support.
        3. Lynette Rayle: are you using Jena Fuseki in conjunction with Fedora with Camel export?
          1. scossu: The whole Fedora index is in Jena, but it is not doing much more than sitting there and being used for analytics and monitoring input/output. 
          2. This is not a full production environment, but 300k metadata objects (~15M triples) were exported to the triplestore.
          3. No heavy SPARQL queries, and it is working.
        4. Apache Jena includes two relevant components:
          1. TDB: the storage back end where triples are stored.
          2. Fuseki: the front end/GUI for users and clients. It abstracts storage, implements queries, etc.
        5. TDB claims to support ~1.7 billion triples.
          1. This documentation will be shared on the wiki.
      2. Ease of Use:
        1. You just drop a WAR file into a running Tomcat or any other servlet container. Configuration is minimal.
        2. No file editing is needed for installation; you just set up a repository from the web interface.
        3. Administration: there is a nice query interface. It is not sophisticated; YASGUI is a separate open source project for querying triples in a web interface.
        4. Drawback: it is not easy to make SPARQL update queries. You can upload triples in batches, but there is no way to update individual triple statements in the interface, so you have to use the API for that.
      3. Integration with ruby:
        1. Java-heavy
        2. Jena Fuseki does support SPARQL over HTTP
        3. Stefano uses SPARQL over HTTP to do some querying, but no updates at present because there is no current use case for him.
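As a minimal sketch of what "SPARQL over HTTP" means on the wire, the snippet below builds (but does not send) a SPARQL 1.1 Protocol query request using only the Python standard library. The endpoint URL is an assumption (a local Fuseki with a dataset named `ds`), and `build_query_request` is a hypothetical helper, not part of any library mentioned in these notes.

```python
# Build (but do not send) a SPARQL 1.1 Protocol query request.
# The endpoint URL is an assumption: a local Fuseki with a dataset named "ds".
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:3030/ds/sparql"  # hypothetical
SELECT_ALL = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_query_request(endpoint, query):
    """Return a POST request carrying the query as a URL-encoded form field."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


request = build_query_request(ENDPOINT, SELECT_ALL)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run it against a live endpoint:
# import json
# from urllib.request import urlopen
# with urlopen(request) as response:
#     results = json.load(response)
```

Because only the endpoint URL is store-specific, the same helper should in principle work against any triplestore that exposes a standard SPARQL endpoint.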
      4. Support for inference:
        1. Jena has an inference engine. There is limited OWL inference.
        2. You can integrate Pellet, a powerful reasoner for complex OWL reasoning, but Stefano has not tried this yet.
        3. Also, the documentation there is not very clear; it needs further digging.
        4. Setting up inference for RDFS is not hard: just make a configuration file. It is enabled in Stefano’s setup but not really tested yet.
      5. Software:
        1. This is an Apache project. It is actively maintained and popular.
        2. It is the 'reference' triple store for Fedora, which is an advantage.
      6. Lynette Rayle: Are you updating Fedora through a Hydra app, then using Camel messaging to keep triplestore in sync with Fedora?
        1. scossu: Right now, the triple store is completely disjoint from Hydra.
        2. Camel updates triples as it receives updates from Fedora.
        3. An integration point would be the SPARQL over HTTP.
          1. It would be good to have a reference implementation with SPARQL over HTTP so that it is easy to swap out back ends.
      7. Lynette Rayle: Is it difficult to do SPARQL Updates?
        1. scossu: Yes, in the web interface/GUI. You can upload a file with triples for updates, but the SPARQL interface in Fuseki is not meant to be used for SPARQL updating, just querying.
      8. tamsin woo
        1. There is an RDF::Jena gem that runs on JRuby, up to date with RDF.rb 1.99: https://rubygems.org/gems/rdf-jena/versions/0.3.3-java
        2. Fairly new, not sure what status is, but could be interesting for working with Jena directly in Ruby.
        3. Stefano: Does it use the Java API then? Tom: Yeah, I think that’s right.
      9. Corey Harper: unrelated question, but do you know whether the repository adapter for Virtuoso works? It looks very out of date... 
        1. tamsin woo: Don’t know, but it probably requires significant work, and may not be the only one.
        2. Lynette Rayle: There is one under the RDF project as well, but it is not far enough along for her to use with ActiveTriples.
    3. Marmotta - tamsin woo
      1. Another Java-based triplestore.
        1. It is built on what used to be Sesame, now called Eclipse RDF4J.
        2. Sesame provides a SAIL interface, akin to Ruby RDF’s repository interface.
        3. This makes it possible to implement a number of triple stores that have the same (?) on the other side.
        4. Marmotta sits on top of those. It ships its own triplestore called Kiwi that sits on relational databases.
        5. DPLA uses Marmotta as an LDP server and runs the Kiwi back end on top of a PostgreSQL database that is fairly optimized specifically for this case, with indexes over the triples table.
      2. Can in theory swap out back ends on Marmotta...
        1. there are some who have, not sure if anyone has done this in production. 
        2. tamsin woo has tested running Marmotta over Blazegraph, back when it was called Bigdata, and also tried this over Titan.
        3. There is a more recent effort to implement a C++ back end called Ostrich, which is backed by LevelDB, the same interface Fedora is talking about leaving.
        4. TL;DR: back ends are complicated. DPLA runs with PostgreSQL and Kiwi, and tamsin thinks this is also used by Oregon Digital.
      3. Marmotta provides core SPARQL support and uses Squebi as its front-end editor and result visualizer.
        1. This UI is not as nice as Blazegraph's, but it has the same functionality to run queries and see the results in the browser.
        2. There is also an LDP server interface that uses the quad store to separate references and also works with non-RDF resources, which are saved directly to disk.
      4. Marmotta also supports transactions and versioning
        1. DPLA doesn’t currently use the Marmotta transactions interface, as they use LDP which is transactional.
        2. The versioning has some value. There is a Memento interface over a SPARQL DESCRIBE-style resource view.
        3. Over the Memento interface you can get the triples directly about the queried resource, plus any blank nodes that are directly connected, and you can then retrieve these for any timestamp.
        4. Versioning: you can’t get back an LDP resource specifically. DPLA has triples that are not directly about the resource in its LDP RDF Sources, and versions of those cannot be retrieved without complex handling logic.
          1. They have talked about figuring this out in LDP Next working group/community group.
          2. Want to try to align with conversations in Fedora about the same things
        5. Hector Correa: how does versioning work on the graph?
          1. Generally, graphs are versionable at a certain cost. There are persistent data structures that work for this.
          2. Marmotta versions the entire graph. When you make a Memento request, you get back a subset of that graph based on the request URI.
          3. The effect is that every triple written to repository stays. This is a fully persistent triplestore.
          4. In the Kiwi database, there is a versions table that points to the triples table, and the triples table contains a flag that marks a triple as deleted.
          5. You can turn versioning off, and deleted triples can then be cleaned up periodically. The versions table doesn’t have to be maintained.
            1. This makes write operations somewhat faster, because you are only writing to one table in that case.
            2. The effect is slightly larger than just the extra table write because, depending on the database, there is locking: if you query while writing, the cost grows.
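The "every triple stays" design described above can be sketched with an in-memory SQLite database. This is only an illustration of the pattern (a deleted flag on the triples table plus a versions table that points to it); the table names, column names, and helper functions are hypothetical and do not reproduce the real Kiwi schema.

```python
# Toy model of a "fully persistent" triple table: deletes only set a flag,
# and a versions table records every change. Table and column names are
# hypothetical and do not reproduce the real Kiwi schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE triples (
        id INTEGER PRIMARY KEY,
        s TEXT, p TEXT, o TEXT,
        deleted INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE versions (
        id INTEGER PRIMARY KEY,
        triple_id INTEGER REFERENCES triples(id),
        kind TEXT,   -- 'add' or 'delete'
        ts INTEGER   -- logical timestamp
    );
""")


def add_triple(ts, s, p, o):
    cur = db.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", (s, p, o))
    db.execute("INSERT INTO versions (triple_id, kind, ts) VALUES (?, 'add', ?)",
               (cur.lastrowid, ts))


def delete_triple(ts, s, p, o):
    rows = db.execute(
        "SELECT id FROM triples WHERE s=? AND p=? AND o=? AND deleted=0", (s, p, o)
    ).fetchall()
    for (triple_id,) in rows:
        # The row is flagged, never removed, so older versions stay reachable.
        db.execute("UPDATE triples SET deleted=1 WHERE id=?", (triple_id,))
        db.execute("INSERT INTO versions (triple_id, kind, ts) VALUES (?, 'delete', ?)",
                   (triple_id, ts))


def current():
    return db.execute(
        "SELECT s, p, o FROM triples WHERE deleted=0 ORDER BY id").fetchall()


def as_of(ts):
    """Triples whose latest recorded change at or before `ts` is an 'add'."""
    return db.execute("""
        SELECT t.s, t.p, t.o FROM triples t
        WHERE (SELECT v.kind FROM versions v
               WHERE v.triple_id = t.id AND v.ts <= ?
               ORDER BY v.ts DESC, v.id DESC LIMIT 1) = 'add'
        ORDER BY t.id
    """, (ts,)).fetchall()


add_triple(1, "ex:rex", "rdf:type", "ex:Dog")
add_triple(2, "ex:rex", "ex:name", "Rex")
delete_triple(3, "ex:rex", "ex:name", "Rex")
print(current())
print(as_of(2))
```

Each write touches two tables here, which matches the point in the notes that turning versioning off saves a table write per operation.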
      5. Marmotta reasoners + unique query interfaces:
        1. It has a transparent linked data cache: it fetches externally referenced resources and stores them in a special place, e.g. LCSH.
          1. DPLA doesn’t use that, in part because it is not very ‘smart’: when many non-RDF resources are referenced, Marmotta tries to fetch those and parse them as RDF, which is a big waste of time.
      6. On the Ruby side, there is an RDF library for Marmotta.
        1. It is a very loose subclass of the baseline RDF or SPARQL client repository.
        2. It does basically everything over SPARQL, which isn’t very efficient.
        3. DPLA uses this only in the test suite, to do entire clears of database.
        4. There are definitely some easy wins there if anyone interested in picking that work up.
      7. Lynette Rayle: looking at notes shared, it looks like with Marmotta, you can do LDP queries as well as SPARQL queries?
        1. tamsin woo: there are 2 different things to be called out there. 
          1. There is a linked data query interface in Marmotta called LDPath, a more direct graph traversal interface. It is similar to Gremlin: start at a node and navigate down. It is a Marmotta-specific query interface that may be of interest, and it is tied to the LD Cache functionality.
          2. Marmotta is an LDP server. If used that way, you can get back resources in a RESTful, LDP way. Those resources just contain arbitrary graphs.
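The "start at a node and navigate down" style of querying can be sketched in a few lines of plain Python. This only mimics the traversal idea; it is not LDPath syntax and not Marmotta's API, and the sample data and the `follow` helper are hypothetical.

```python
# Follow a sequence of predicates from a start node over in-memory triples.
# This mimics the "start at a node and navigate down" idea only; it is not
# LDPath syntax and not Marmotta's API. All identifiers are hypothetical.
triples = [
    ("ex:record1", "dc:subject", "ex:topicA"),
    ("ex:record1", "dc:subject", "ex:topicB"),
    ("ex:topicA", "skos:prefLabel", "Triplestores"),
    ("ex:topicB", "skos:prefLabel", "Linked data"),
    ("ex:record2", "dc:subject", "ex:topicB"),
]


def follow(graph, start, path):
    """Return the sorted values reached from `start` by walking `path` in order."""
    frontier = {start}
    for predicate in path:
        frontier = {o for s, p, o in graph if s in frontier and p == predicate}
    return sorted(frontier)


# Labels of all subjects attached to record1.
print(follow(triples, "ex:record1", ["dc:subject", "skos:prefLabel"]))
```

Compared with a SPARQL query, the caller names only a starting resource and a path, which is why this style pairs naturally with a cache of externally fetched resources.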
      8. Lynette Rayle: does it support HTTP SPARQL Queries?
        1. tamsin woo: yes
        2. Lynette Rayle: great. One of the limits of Fedora is that it does LDP quite well but can't use SPARQL queries for work.
      9. Lynette Rayle: Are you using it with ActiveTriples?
        1. tamsin woo: Yes, but not over the repository interface. DPLA uses ActiveTriples on top of a lightweight LDP client specifically in KriKri (KriKri LDP Resource).
        2. There is some interest in aligning that with Chris Beer’s LDP gem in Project Hydra, but at the time it was written, that gem had serious enough problems that they didn’t look into it.
      10. Hector Correa: Marmotta supports non-RDF resources?
        1. tamsin woo: It does. It doesn’t have the preservation concerns that Fedora does, but it writes them to disk, and you can do your own backup and hashing if you want to.
        2. They use this at DPLA as well. When harvesting records from partners, they save those records directly as bitstreams (non-RDF resources), then pull them later and try to parse the data out of them.
      11. Lynette Rayle: Integrating Marmotta through ActiveTriples into Hydra stack would be cool.
      12. scossu: Is Sesame the default back end?
        1. Sesame is the interface. The default back end is Kiwi.
      13. scossu: Are there implementations that use Blazegraph as a back end?
        1. There is an implementation of a Blazegraph back end specifically for Marmotta. It doesn’t support the versioning features, but it is believed to work.
        2. Not aware of anybody using it in production.
        3. There is a more recent effort to build a high-speed LevelDB SAIL back end for Marmotta with Ostrich. It was just recently merged into Marmotta core.
    4. Add ongoing notes on these as child pages of this meeting's wiki page; they will automatically show up on the tools page too: https://wiki.duraspace.org/display/hydra/Tools+for+Working+with+Triples+Stores
      1. Lynette will add Blazegraph notes too once that page is there.
  4. To discuss next time:
    • More discussions like the ones today?
    • Rya - Aaron Coburn (pushed back from presenting at this meeting)
    • Virtuoso
      • There is not a lot on it from the Ruby side, but someone could report back on Virtuoso in general and on the state of its Ruby support.
      • Corey Harper: not an expert, can demo/report on his experimentation with Virtuoso, and will reach out to David Lacy on this.
    • Other triple stores?
    • Corey Harper: What about the relationship between triplestores and non-triplestore graph databases like Neo4J? Experience with Neo4J?
      • hjc14: seconds interest in this question, because Neo4J is relatively well known as a graph database
      • tamsin woo: Can’t volunteer to demo, but some information:
        • There are a number of efforts to implement SPARQL support for Neo4J.
        • Not sure of the approach. Neo4J stores arbitrary graphs; nodes and edges can be any values you like. Not sure how they handle enforcing/mapping to arbitrary graphs.
      • Corey Harper: Mapping RDF to graphs in non-RDF language post: http://www.snee.com/bobdc.blog/2015/03/spark-and-sparql-rdf-graphs-an.html
        • Information about Spark and how it would connect to SPARQL.
        • Related to the question of whether you can do this RDF graph work in a tool like Neo4J as well.