Current modeling practices

A page for institutions to describe their current modeling practices for ordered multi-file objects in Fedora/Hydra.

Example: Boston Public Library

Fedora:

We create an object for a book that represents the intellectual entity, and has several associated datastreams for descriptive metadata, etc. Each individual page is its own ImageFile object, which is related to the book object via an "isImageOf" relationship in the page object's RELS-EXT datastream:

<commonwealth-rel:isImageOf rdf:resource="info:fedora/#{pid_of_parent_book_object}"/>

Each page object also has a relationship indicating its order in the page sequence via "isPrecedingImageOf" and "isFollowingImageOf" relationships in the RELS-EXT datastream. So, for example, the object representing the 2nd page would have:

<commonwealth-rel:isFollowingImageOf>info:fedora/#{pid_of_first-page_object}</commonwealth-rel:isFollowingImageOf>
<commonwealth-rel:isPrecedingImageOf>info:fedora/#{pid_of_third-page_object}</commonwealth-rel:isPrecedingImageOf>

Solr:

The Fedora relationships are also mirrored in the Solr record for each ImageFile object:

is_image_of_ssim: #{pid_of_parent_book_object}
is_following_image_of_ssim: #{pid_of_first-page_object}
is_preceding_image_of_ssim: #{pid_of_third-page_object}
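To make the linked-list shape of these fields concrete, here is a minimal sketch in plain Ruby. It is not part of the BPL codebase: the method name page_relationship_docs and the demo PIDs are hypothetical, and it assumes the field values are stored as arrays of "info:fedora/..." URIs, as the sorting code further down suggests.

```ruby
# Hypothetical helper (not BPL code): build the Solr-style relationship
# fields for each page from an already ordered list of page PIDs.
def page_relationship_docs(book_pid, ordered_page_pids)
  last_index = ordered_page_pids.length - 1
  ordered_page_pids.each_with_index.map do |pid, i|
    doc = { 'id' => pid, 'is_image_of_ssim' => ["info:fedora/#{book_pid}"] }
    # Every page except the first follows the page before it.
    if i > 0
      doc['is_following_image_of_ssim'] = ["info:fedora/#{ordered_page_pids[i - 1]}"]
    end
    # Every page except the last precedes the page after it.
    if i < last_index
      doc['is_preceding_image_of_ssim'] = ["info:fedora/#{ordered_page_pids[i + 1]}"]
    end
    doc
  end
end

docs = page_relationship_docs('demo:book1', %w[demo:p1 demo:p2 demo:p3])
docs.each { |doc| puts doc.inspect }
```

Note that the first page carries no "following" field and the last page carries no "preceding" field; the sorting code relies on exactly that to find the two ends of the sequence.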

Hydra:

The above Fedora RELS-EXT relationships are created using this code in the ImageFile model:

has_many :next_image, :class_name => "Bplmodels::ImageFile", :property=> :is_preceding_image_of
has_many :prev_image, :class_name => "Bplmodels::ImageFile", :property=> :is_following_image_of

In order to create an ordered list of objects for display, we use several functions like this:

# find the image files of an object
def self.getImageFiles(pid)
  return_list = []
  Bplmodels::ImageFile.find_in_batches('is_image_of_ssim'=>"info:fedora/#{pid}") do |group|
    group.each { |solr_object|
      return_list << solr_object
    }
  end
  return sort_files(return_list)
end

# return the objects in the correct order
def self.sort_files(file_list)
  return file_list if file_list.length <= 1

  following_key_final = nil
  preceding_key_final = nil

  ending_item_pid = nil
  next_item_pid = nil

  return_list = []
  file_list.each do |file|
    preceding_key = file.keys.select { |key| key.include?('preceding') }
    following_key = file.keys.select { |key| key.include?('following') }

    if following_key.blank?
      return_list.insert(0, file)
      preceding_key_final = preceding_key.first
      next_item_pid = file[preceding_key_final].first
    elsif preceding_key.blank?
      following_key_final = following_key.first
      return_list.insert(-1, file)
      ending_item_pid = "info:fedora/#{file['id']}"
    end
  end

  while next_item_pid != ending_item_pid
    next_item = file_list.select { |array| "info:fedora/#{array['id'].to_s}" == next_item_pid }.first
    return_list.insert(-2, next_item)
    next_item_pid = next_item[preceding_key_final].first.to_s
  end

  return return_list
end
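The sort_files method above rescans the whole list for every page it places. As a hedged alternative, the same walk along the "preceding" links can be sketched with a hash lookup. This is only an illustration, not the BPL implementation: sort_files_indexed is a hypothetical name, the documents are plain hashes, and it assumes the exact field names shown in the Solr section rather than matching keys by substring.

```ruby
# Sketch only: order Solr-style page documents by following the
# is_preceding_image_of_ssim links, using a hash for constant-time lookups.
def sort_files_indexed(file_list)
  return file_list if file_list.length <= 1

  preceding = 'is_preceding_image_of_ssim'
  following = 'is_following_image_of_ssim'

  # Index every document by its fully qualified Fedora URI.
  by_uri = file_list.each_with_object({}) do |doc, index|
    index["info:fedora/#{doc['id']}"] = doc
  end

  # The first page is the one that does not follow any other page.
  current = file_list.find { |doc| Array(doc[following]).empty? }
  raise ArgumentError, 'no first page found' if current.nil?

  ordered = []
  seen = {}
  while current
    uri = "info:fedora/#{current['id']}"
    # Guard against malformed data that would otherwise loop forever.
    raise ArgumentError, "cycle detected at #{uri}" if seen[uri]
    seen[uri] = true
    ordered << current
    next_uri = Array(current[preceding]).first
    current = next_uri && by_uri[next_uri]
  end
  ordered
end

pages = [
  { 'id' => 'demo:p3', 'is_following_image_of_ssim' => ['info:fedora/demo:p2'] },
  { 'id' => 'demo:p1', 'is_preceding_image_of_ssim' => ['info:fedora/demo:p2'] },
  { 'id' => 'demo:p2',
    'is_following_image_of_ssim' => ['info:fedora/demo:p1'],
    'is_preceding_image_of_ssim' => ['info:fedora/demo:p3'] }
]
puts sort_files_indexed(pages).map { |doc| doc['id'] }.inspect
# prints ["demo:p1", "demo:p2", "demo:p3"]
```

Unlike the original, this version stops quietly if a link points at a page that is not in the list, and raises if the links form a cycle; whether that is the right behaviour depends on how much you trust the data.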

Blacklight:

We use a helper method to take the hash of files returned by getImageFiles() and boil that down to an array of image file PIDs:

def has_image_files? image_files_hash
  image_file_pids = nil
  unless image_files_hash[:images].empty?
    image_file_pids = []
    image_files_hash[:images].each do |image_file|
      image_file_pids << image_file['id']
    end
  end
  image_file_pids
end

The image_file_pids array can then be passed to our page-turner code. (We use WDL-Viewer currently.)

Example: University of Hull

Let's be clear, please, that this work was done as a proof-of-concept exercise and was never intended to become the basis of a production system - but for what it's worth...

The demo Internet Archive page turning software relies on a sequentially structured set of files to provide pages - so, for instance, page01.jpg, page02.jpg, page03.jpg etc. We bent the code so that it retrieved sequentially named datastreams from a compound Hydra/Fedora object (content01, content02, content03,...). No clever relationships or anything like that; rather, it is crude and simple. But it works!

Clearly this is not scalable (we have an example, which will need dealing with one day, containing almost 500 pages: an object with 500+ datastreams?), and it relies on the fact that the pages we used for the demonstration are of uniform size and orientation. The work pre-dates Fedora 4 / ActiveFedora 9.
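For illustration only, the sequential naming scheme described above can be sketched in a few lines of Ruby. This is not the Hull code: the method name sequential_datastream_ids and its prefix and width parameters are hypothetical.

```ruby
# Hypothetical helper: generate the sequential datastream IDs
# (content01, content02, ...) that a page turner could request in order.
def sequential_datastream_ids(page_count, prefix: 'content', width: 2)
  (1..page_count).map { |n| format("%s%0#{width}d", prefix, n) }
end

puts sequential_datastream_ids(3).inspect
# prints ["content01", "content02", "content03"]
```

With two-digit padding, page 100 becomes content100, which sorts before content99 in a plain string sort, so a 500-page object would need a wider padding (or numeric ordering) on top of the datastream-count problem already mentioned.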

Example: Cornell University

At Cornell University we are converting our Digital Collections, which currently run in DLXS (University of Michigan’s Digital Library eXtension Service), to be served by Hydra. To do this, we created a Rails project and added the blacklight, hydra and active-fedora gems to the project’s Gemfile. This work was primarily done by John Cline (jac244@cornell.edu), with assistance on the DLXS conversions and page turner setup from George Kozak (gsk5@cornell.edu) and Blacklight and interface design by Melissa Wallace (mhk33@cornell.edu) and Jenn Colt (jrc88@cornell.edu).

Our current Digital Collections under DLXS use, for each collection, a single XML Document Type Definition (DTD) based on the TEI Lite DTD. In DLXS, we create an XML file for each object (a book, a pamphlet, or whatever) and then combine those objects into a single XML file for the collection, which is indexed using XPAT (based on Open Text Corporation's pat50 source code).

For our first conversion, we took our most complicated XML, from our Southeast Asia Visions collection, which consisted of DIV1 and DIV2 structures and detailed image metadata. To handle this, we created two OM (Opinionated Metadata) objects: a Book object and a Page object. To populate these objects in Hydra (Solr/Fedora), we needed to create the following files in our Rails project:

RAILS_ROOT/app/models/book.rb

RAILS_ROOT/app/models/page.rb

RAILS_ROOT/app/models/datastreams/book_metadata.rb

RAILS_ROOT/app/models/datastreams/page_metadata.rb

We also edited the RAILS_ROOT/config/predicate_mappings.yml file with custom mappings: :is_page_of and :has_book predicates.

For the conversion, we took the DLXS collection XML files and used the tag names as field names in our OM objects.

A small example:

From sea029_georgeJune24_dims.xml:

<DLPSTEXTCLASS>
<HEADER>
<FILEDESC>
<TITLESTMT>
   <TITLE TYPE="245">Pen pictures of Annam and its people</TITLE>
   <AUTHOR>Cadman, Grace Hazenberg</AUTHOR>
</TITLESTMT>
<EXTENT>192 600dpi JPEG page images</EXTENT>
<PUBLICATIONSTMT>
   <PUBLISHER>Cornell University Library</PUBLISHER>
   <PUBPLACE>Ithaca, New York</PUBPLACE>
   <IDNO TYPE="dlps">sea029</IDNO>
</PUBLICATIONSTMT>
<SOURCEDESC>
   <BIBL>
    <TITLE TYPE="main">Pen pictures of Annam and its people</TITLE>
    <AUTHOR>Cadman, Grace Hazenberg</AUTHOR>
    <PUBLISHER>Christian Alliance Publishing</PUBLISHER>
    <PUBPLACE>New York</PUBPLACE>
    <DATE>1920</DATE>
   </BIBL>
</SOURCEDESC>

From RAILS_ROOT/app/models/book.rb

class Book < ActiveFedora::Base

has_metadata 'descMetadata', type: BookMetadata

has_file_datastream :name=>'digitalImage', :type=>ActiveFedora::Datastream, :mimeType=>"image/jpeg", :controlGroup=>'M'

belongs_to :derivation, :property=>:has_derivation

has_many :pages, :property=>:has_pages

has_attributes :titlestmt_titletype, datastream: 'descMetadata', multiple: false
has_attributes :titlestmt_title, datastream: 'descMetadata', multiple: false
has_attributes :titlestmt_author, datastream: 'descMetadata', multiple: false
has_attributes :extent, datastream: 'descMetadata', multiple: false
has_attributes :pubstmt_publisher, datastream: 'descMetadata', multiple: false
has_attributes :pubstmt_pubplace, datastream: 'descMetadata', multiple: false
has_attributes :pubstmt_idno_type, datastream: 'descMetadata', multiple: false
has_attributes :pubstmt_idno, datastream: 'descMetadata', multiple: false
has_attributes :bibl_titletype, datastream: 'descMetadata', multiple: false
has_attributes :title, datastream: 'descMetadata', multiple: false
has_attributes :author, datastream: 'descMetadata', multiple: false
has_attributes :publisher, datastream: 'descMetadata', multiple: false
has_attributes :pubplace, datastream: 'descMetadata', multiple: false
has_attributes :pubdate, datastream: 'descMetadata', multiple: false

The <TITLESTMT><TITLE TYPE="245">Pen pictures of Annam and its people</TITLE> element from the DLXS XML is represented in the book.rb file as:

has_attributes :titlestmt_titletype, datastream: 'descMetadata', multiple: false
has_attributes :titlestmt_title, datastream: 'descMetadata', multiple: false

These two attributes are in turn defined in RAILS_ROOT/app/models/datastreams/book_metadata.rb as:

t.titlestmt_titletype(index_as: :stored_searchable)
t.titlestmt_title(index_as: :stored_searchable)
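The tag-to-field mapping illustrated above can be sketched with Ruby's REXML library. This is not the Cornell conversion code: the method names dlxs_field_paths and extract_book_fields are hypothetical, only a few of the fields declared in book.rb are covered, and the sample is a trimmed, well-formed version of the header shown earlier.

```ruby
require 'rexml/document'

# Hypothetical mapping from book.rb attribute names to a path below the
# root element plus an optional XML attribute to read instead of the text.
def dlxs_field_paths
  {
    titlestmt_titletype: ['HEADER/FILEDESC/TITLESTMT/TITLE', 'TYPE'],
    titlestmt_title:     ['HEADER/FILEDESC/TITLESTMT/TITLE', nil],
    titlestmt_author:    ['HEADER/FILEDESC/TITLESTMT/AUTHOR', nil],
    extent:              ['HEADER/FILEDESC/EXTENT', nil]
  }
end

# Returns a hash of field name => value for every path found in the XML.
def extract_book_fields(xml)
  root = REXML::Document.new(xml).root
  dlxs_field_paths.each_with_object({}) do |(field, (path, attribute)), fields|
    element = REXML::XPath.first(root, path)
    next if element.nil?
    value = attribute ? element.attributes[attribute] : element.text
    fields[field] = value.to_s.strip
  end
end

sample = <<~XML
  <DLPSTEXTCLASS>
    <HEADER>
      <FILEDESC>
        <TITLESTMT>
          <TITLE TYPE="245">Pen pictures of Annam and its people</TITLE>
          <AUTHOR>Cadman, Grace Hazenberg</AUTHOR>
        </TITLESTMT>
        <EXTENT>192 600dpi JPEG page images</EXTENT>
      </FILEDESC>
    </HEADER>
  </DLPSTEXTCLASS>
XML

extract_book_fields(sample).each { |field, value| puts "#{field}: #{value}" }
```

In a real conversion the resulting hash would be assigned to the attributes of a Book object before saving; that step is left out here because it depends on a running Fedora and Solr.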

Once these files are correct, we move on to the code that reads the DLXS XML and creates the actual Hydra (Solr/Fedora) objects.

We used two programs, createPages.rb and createRecordNew.rb, to do this work. To run them in batch, we dumped a listing of the directory containing all the XML into a .pl file and wrapped each line of that file in system('rails runner createPages.rb FOLDER_NAME_FROM_DIRECTORY_LIST ANY_OTHER_PARAMS_YOU_NEED');. We then executed the Perl file.

Our biggest problem was that the DLXS XML was not perfect, and cleanup was needed to make sure that it was well-formed.
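As a hedged alternative to the generated Perl wrapper, the same list of commands could be built directly in Ruby. This sketch swaps in a different technique from the one we actually used: the method name runner_commands and the second folder name are hypothetical, and the code only prints the commands rather than running them.

```ruby
require 'shellwords'

# Hypothetical helper: build one `rails runner` command line per folder,
# shell-escaping each argument so unusual folder names stay intact.
def runner_commands(folder_names, script: 'createPages.rb', extra_args: [])
  folder_names.map do |folder|
    Shellwords.join(['rails', 'runner', script, folder, *extra_args])
  end
end

# Folder names would normally come from a directory listing.
runner_commands(%w[sea029 sea030]).each { |command| puts command }
```

Each printed line could then be passed to Kernel#system, or reviewed by hand first, which is useful when the source XML still needs cleanup.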

For page turning, we used the Internet Archive BookReader.

In htdocs/rails/public/bookreader/ we created a directory for each book.

That directory contains a BookReaderJSSimple.js file holding the details for each book (including the height and width of each page image). These files were generated by a Perl script that pulls this information from the master XML file we have for each book. Each directory also contains a link to a general BookReader CSS file and an index.html file.

To see how this all comes together, check out our site: http://seasiavisions.library.cornell.edu/

Example: University of York

The book (or any resource that contains a structured sequence of children, e.g. audio) has a datastream called STRUCT. STRUCT simply contains a list of 'members' in order. It isn't indexed in Solr.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rel-york="http://dlib.york.ac.uk/rel-york">
  <rdf:Description rdf:about="info:fedora/york:817060">
    <rel-york:hasMembers rdf:parseType="Collection">
      <rdf:Description rdf:about="info:fedora/york:817652"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817653"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817654"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817655"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817656"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817657"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817658"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817659"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817660"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817661"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817662"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817663"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817664"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817665"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817666"></rdf:Description>
      <rdf:Description rdf:about="info:fedora/york:817667"></rdf:Description>
    </rel-york:hasMembers>
  </rdf:Description>
</rdf:RDF>

We haven't implemented any page turning on top of it, but we are using this technique to serve up ordered lists of audio tracks.
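Reading the member order back out of a STRUCT datastream can be sketched with Ruby's REXML library. This is an illustration rather than the York code: the method name ordered_member_pids is hypothetical, and the sample is a shortened copy of the datastream above.

```ruby
require 'rexml/document'

# Hypothetical helper: return the member PIDs listed under hasMembers,
# in document order, with the "info:fedora/" prefix removed.
def ordered_member_pids(struct_xml)
  root = REXML::Document.new(struct_xml).root
  pids = []
  root.each_recursive do |element|
    next unless element.name == 'hasMembers'
    element.elements.each do |member|
      uri = member.attributes['rdf:about']
      pids << uri.delete_prefix('info:fedora/') if uri
    end
  end
  pids
end

sample = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rel-york="http://dlib.york.ac.uk/rel-york">
    <rdf:Description rdf:about="info:fedora/york:817060">
      <rel-york:hasMembers rdf:parseType="Collection">
        <rdf:Description rdf:about="info:fedora/york:817652"></rdf:Description>
        <rdf:Description rdf:about="info:fedora/york:817653"></rdf:Description>
        <rdf:Description rdf:about="info:fedora/york:817654"></rdf:Description>
      </rel-york:hasMembers>
    </rdf:Description>
  </rdf:RDF>
XML

puts ordered_member_pids(sample).inspect
```

Because the order lives in a single document, reordering members means rewriting one datastream rather than updating relationships on every child object.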

Some thoughts on mapping York objects to IIIF
