Data Migration from DSpace to Hydra

The primary goal of this Interest Group is to identify, document, and communicate complete workflow solutions for migrating data out of DSpace and into Hydra. With that goal in mind, use the table below to see what is being worked on, add your own information, and contact any resources that may be able to help you out.


Use the template below to share your data migration path:


Example:

Institution: Miskatonic University

Data Migration Approach Narrative: We called on the Great Old Ones to use their cosmic magics to move the data.

Major Issues Encountered: Collaboration with Cthulhu is difficult and can be dangerous. 

Code: github.com/cthulhu_code_repo/dspace-sufia-migration


Institution | Data Migration Approach & Architecture | Major/Minor Issues Encountered | Status | Example Code

California State University
Aaron Collier 

DSpace AIP Packager export: Importing and Exporting Content via Packages

Writing a custom import rake task to process the AIP packages

Storing Community and Collection names in metadata fields for faceting (currently coverage and sponsorship, respectively, but that is likely to change).
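
A minimal sketch of the kind of import rake task described above (hypothetical task name and directory layout; the real code is the packager.rake linked under Example Code):

    namespace :packager do
      desc 'Process DSpace AIP packages from a local export directory'
      task :aip, [:dir] => :environment do |_t, args|
        # Each exported AIP is a zip containing mets.xml plus the bitstreams.
        Dir.glob(File.join(args[:dir] || 'aip_export', '*.zip')).sort.each do |package|
          puts "Processing #{package}"
          # unpack the zip, parse mets.xml, build the work, attach the bitstreams ...
        end
      end
    end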

Major: The size of the data and the frequency of calls with this approach appear to tax Fedora, leading to timeouts after approximately 150 items. We will likely need to address system resources on the Fedora server during migration.

Major: Ambiguity of bitstream naming. Bitstream file names come out as bitstream_[some number].[extension]. The original file name is in the metadata as "dc.title" with a dspaceType of "BITSTREAM", but this will likely be very difficult to resolve for items with multiple bitstreams.

SOLUTION: Complete

In the AIP package mets.xml:

      • Grab the MD5 hash and original filename from the PREMIS block
      • Query the exported filename from the fileGrp based on the MD5 checksum key (see the sketch below)
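
A minimal sketch of that lookup, assuming Nokogiri and a typical AIP mets.xml layout (an originalName and MD5 messageDigest in the PREMIS block, and a CHECKSUM attribute on each fileSec entry); the real implementation is in the packager.rake linked under Example Code:

    require 'nokogiri'

    mets = Nokogiri::XML(File.read('mets.xml'))
    mets.remove_namespaces!

    # MD5 => original filename, taken from each PREMIS object block
    originals = mets.xpath('//object').each_with_object({}) do |obj, map|
      name = obj.at_xpath('.//originalName')&.text
      md5  = obj.at_xpath('.//messageDigest')&.text
      map[md5] = name if name && md5
    end

    # Match the exported bitstream_NNN files back to their original names
    mets.xpath('//fileSec//file').each do |file|
      flocat   = file.at_xpath('./FLocat')
      exported = flocat && flocat['href']
      original = originals[file['CHECKSUM']]
      puts "#{exported} => #{original}" if exported && original
    end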


Minor: Duplication of Dublin Core fields for "system" data vs. item metadata,
i.e. <dc.creator>Collier, Aaron</dc.creator> is the author,
while <dc.creator>acollier@calstate.edu</dc.creator> is the submitter. (Solution: check for email address format; see the sketch below.)
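
A minimal sketch of that email-format check, assuming submitters are always recorded as bare email addresses (uses the EMAIL_REGEXP shipped in Ruby's standard library):

    require 'uri'

    # Treat a dc.creator value that looks like an email address as the
    # submitter rather than an author.
    def submitter?(value)
      value.to_s.match?(URI::MailTo::EMAIL_REGEXP)
    end

    submitter?('Collier, Aaron')         # => false (author)
    submitter?('acollier@calstate.edu')  # => true  (submitter)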

Minor: Embargoes work somewhat differently in Sufia. During the embargo period an item is completely private. That may be the expected behavior in some cases, but in most of our cases DSpace allows viewing of the metadata during the embargo. This functionality will need to be adapted in Sufia, but it is not a high priority yet.

In Progress 

https://github.com/scholarworks/dspace_packager/blob/master/lib/tasks/packager.rake

Very beta code for now; it could use a thorough refactoring to be more "Ruby-like". So far it is working OK.

University of Michigan

Jose Blanco

I have a Perl script that exports 5 items from each of the close to 400 collections we have, each item in its own directory. This is done using the "./dspace export" command. The script then creates a yml file for each item, which is used by a rake task we have to import the items into Hyrax as works; that step is sketched below.
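
An entirely hypothetical sketch of the per-item yml generation (the real field names and layout are defined by the content services linked under Example Code); it assumes the DSpace Simple Archive Format layout that "./dspace export" produces, with a dublin_core.xml plus a contents file listing the bitstreams:

    require 'nokogiri'
    require 'yaml'

    Dir.glob('exported_items/*/dublin_core.xml').each do |dc_path|
      item_dir = File.dirname(dc_path)
      dc       = Nokogiri::XML(File.read(dc_path))
      values   = dc.xpath('//dcvalue').group_by { |v| v['element'] }

      item = {
        'title'   => values.fetch('title',   []).map(&:text),
        'creator' => values.fetch('creator', []).map(&:text),
        # the contents file lists one bitstream per line, e.g. "thesis.pdf\tbundle:ORIGINAL"
        'files'   => File.readlines(File.join(item_dir, 'contents'), chomp: true)
                         .map { |line| line.split("\t").first }
      }

      File.write(File.join(item_dir, 'item.yml'), item.to_yaml)
    end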

We are doing this more to stress test Hyrax, so we are not presently concerning ourselves with permissions on the items or bitstreams, where to store the bitstream descriptions, or the mapping of the items to their respective collections. There are also other issues which we are tabling for the time being.

The one issue we did encounter is that the DSpace exports have lots of numeric character references in the dublin_core.xml file. To convert them back to UTF-8 characters we are using htmlentities. Here is a link that helped me out with this: https://makandracards.com/makandra/898-encode-or-decode-html-entities
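
A minimal sketch of that decoding step with the htmlentities gem:

    require 'htmlentities'

    coder = HTMLEntities.new
    # numeric character references from dublin_core.xml back to UTF-8
    coder.decode('Universit&#233; de Montr&#233;al')  # => "Université de Montréal"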

In Progress

These files:

https://github.com/mlibrary/deep-blue/blob/master/lib/append_content_service.rb

https://github.com/mlibrary/deep-blue/blob/master/lib/build_content_service.rb

https://github.com/mlibrary/deep-blue/blob/master/lib/tasks/populate_dev_app.rake



Oregon State University

Josh Gum

Steve Van Tuyl

DSpace Replicate plugin to export Collection and Item BagIt bags: dspace-replicate

Stand-alone Ruby application, with configuration, that maps metadata from each bag into properly formed data and publishes works into a Hyrax server "through the front door".

We are focused on striking a balance in how to operate the application; some of the concepts in mind are:

  • Operate on bags exported from DSpace at first, with the idea of extending the capability to include bags generated by something/someone else
  • Require explicit metadata mapping configuration so that nothing can slip through the cracks; we want to be confident that we handle all metadata and files purposefully, even if that means we explicitly ignore something (see the sketch after this list)
  • Drive the application from the command line, making the most of things like a shell for-loop over a list of items to migrate
  • Make a backup of the transformed data that was posted to the Hyrax server for troubleshooting/replay/sanity checks
  • Adhere to the API and shape of the data that the Hyrax server expects; the application cleans/maps/transforms and then posts data, and the server takes care of the proper logic and functionality for persisting the works, files, etc.
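
A minimal sketch (hypothetical field names) of the explicit-mapping idea: every incoming DSpace field must either be mapped to a Hyrax property or be explicitly ignored, and anything unmapped fails fast:

    MAPPING = {
      'dc.title'                  => :title,
      'dc.contributor.author'     => :creator,
      'dc.description.provenance' => :ignore  # dropped on purpose, not by accident
    }.freeze

    def map_metadata(dspace_fields)
      dspace_fields.each_with_object({}) do |(field, values), mapped|
        target = MAPPING.fetch(field) { raise "Unmapped DSpace field: #{field}" }
        mapped[target] = values unless target == :ignore
      end
    end

    map_metadata('dc.title' => ['A thesis'], 'dc.contributor.author' => ['Gum, Josh'])
    # => { title: ["A thesis"], creator: ["Gum, Josh"] }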
In Progress
https://github.com/osulp/dspace2hydra (See README.md)