"June6HydraPartnersNotes"

June 6 Hydra partners meeting:

Agenda topics:

- Technical updates (hydra-head & supporting tech)

- Project updates

- Hydra community

Project updates

hydra-head & supporting technologies

Matt Zumwalt talked us through the Hydra Technical Update, the details of which are here: https://wiki.duraspace.org/display/hydra/Hydra+Technical+Update+June+2011

Substantial growth & stabilization of these technologies in the past year. This time last year we didn't even have a beta, just 5-6 dev teams; now we have 8+ dev teams working on lots of hydra heads, and ~50 devs working on hydra parts.

We're working on: steady upgrade paths, strong support for a large, open source distributed development effort.

Patterns are emerging around collaboration, best practices, documentation. We've settled into regular release cycles, documentation, continuous integration.

On the horizon: Upgrade to rails 3.

Challenge: How do you provide consistency and quality of development standards across many different projects with different goals?

Challenge: discussion of Hydrangea past & future

What should we provide out of the box? Not a full app like hydrangea, but what?

Priorities: Easing the adoption path, stabilizing the upgrade path, and refining the hydra-head plugin so that it sticks to just the things hydra-head should focus on. E.g., hydra-head should not change the look & feel of your rails app; that's out of scope. Accessibility and usability standards are in scope, however.

Coming down the pipe: Rails 3 upgrade, usability and accessibility concerns

Hydrangea doesn't seem like the right place from which to start new projects. Hydra-head does. Everyone seems to agree with this statement.

Right now if you want to start a new hydra-head you can, but it might be better to wait for the rails 3 upgrade.

Discussion ensued about how this timeline fits with various institutions' development timelines.

Suggestion: If people need to start developing now, they could start with Rails3 Blacklight, work on their data modeling, indexing, and UI, and then install hydra-head once it has been released for Rails3. This is the path we're planning to take with hypatia in the short term.

Upcoming work: We will create an automated hudson build that runs through the installation process and tests the application that is produced at the end.

Rails 3 upgrade: All of the supporting gems have been updated. Next we move onto upgrading hydra-head itself.

Hydra community

We're currently at five official partners, and Northwestern wants to become an official partner. The legal counsel of the partner institutions wants to see a more formalized agreement around copyright and code licensing. Lawyers want to ensure that we've got good coverage around liability or implied warranty. We want to create a consensus about this legal process, and once that's done we will be better able to accept new development partners.

Last year at OR we wanted 6 partners by this OR, and we hit that target. What's our target for growth for next year? Tabled for discussion later.

Update on individual projects

UVA and Libra

Julie showed us Libra, the UVA institutional repository. They're adding more work types (e.g., book chapter). These open access work types get defined as the need arises: their metadata librarian defines the new object type and they create a new submission form for it. It has good adoption in some departments and the technology seems to work fine, but the biggest problem is getting feedback from faculty members who don't see why they should submit OA work to the repository. Linking articles to datasets is an anticipated new feature, as is adding Electronic Theses & Dissertations.

Julie wants to ensure that the baseline user experience provided by hydra-head has good business rules. Some of these are in regard to legal concerns that we should steer people around by default in hydra-head. Julie wasn't specific about what these concerns are, but we should follow up as we define the default behavior in hydra-head.

In libra, if you're a superuser, you can see anything that's in progress, and you can edit anyone's object. Users that are not logged in can see any public works, and can do everything except download the file. Even if it's UVA only, you can still see the metadata about that object, you just can't download the actual file.
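The access rules described above can be written down as a small sketch. This is not Libra's code: the class names, the "public"/"uva_only" labels, and the reading that logged-in UVA users may download UVA-only files are all assumptions made for illustration, and depositor access to a user's own in-progress works is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    logged_in: bool = False
    uva_member: bool = False
    superuser: bool = False


@dataclass(frozen=True)
class Work:
    visibility: str            # "public" or "uva_only" (hypothetical labels)
    in_progress: bool = False


def can_view_metadata(user: User, work: Work) -> bool:
    """Metadata is visible to everyone, except for works still in progress."""
    if work.in_progress:
        # Assumption: only superusers are modelled here; owner access is omitted.
        return user.superuser
    return True


def can_download(user: User, work: Work) -> bool:
    """Downloading the file is stricter than viewing the metadata."""
    if not can_view_metadata(user, work):
        return False
    if user.superuser:
        return True
    if not user.logged_in:
        # Anonymous users can do everything except download the file.
        return False
    if work.visibility == "uva_only":
        return user.uva_member
    return True
```

Keeping "view metadata" and "download" as separate checks mirrors the distinction drawn in the notes, and contrasts with the Hull approach described later, where anything visible is also downloadable.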

Discussion ensued about Administrative Policy Objects & how to specify who has access to an object. As they add datasets, UVA wants to be able to give access control to people in a certain access control group from their central IT.

Q: Any plans to receive data from other repositories? Or other systems on campus?

A: No, not right now, but probably something we'll need to consider in the future. More of a people problem than a technical problem, really.

Dataset prototype is being limited to objects < 100 Mb.

They might also want to look at tiered storage, so that if you have a large object that isn't accessed very often it can stay in archival storage.

The group has done a lot of work around licensing, storage, and UI work around data sets that we should revisit as a group and share.

Next steps for Libra: ETDs, small data sets, shoring up audit capabilities particularly w/r/t preservation.

Hull and Hydrangea at Hull

We're planning to switch over in September. We want the system to look as much like our existing Muradora system as possible. We've got a few thousand objects in the old system that need to migrate to the new one. We thought we were going to reproduce the drill-down UI that Muradora used, but Richard changed his mind over time. Facets work pretty well as a substitute for drilling down, especially if you present them in an appropriate order.

Richard showed us the Hull hydra installation and some screen shots from his 24/7 presentation. Access groups are public, students, staff, and "content and access team". Unlike UVA, anything you can see you can download. If you don't have permission to download something, it won't appear in your search results. They use Fedora DC only as an internal fedora resource, but have an additional Dublin Core datastream which is generated on the fly from the MODS. Really, they just have one metadata datastream, MODS, and anything else they need gets generated on the fly from that. Many groups of content types share a cModel but have subtly different Ruby models underneath to give different displays.

When people are creating an object, we give them a very tightly guided form. When it is edited by the access and content team, we just give them the whole MODS record. Every submission goes through a QA process. They want to be able to accept submissions from Sakai and Sharepoint (largely complete). They're also looking at full text indexing of PDFs.

They're also creating collection level objects ("display sets") to group like objects together and give them contexts. If you find an object that's part of a collection it will link you to its collection. Those collections will also appear as a facet. Queues for object creation & QA are implemented as Fedora sets. When you delete an object it doesn't get deleted, it gets marked as an inactive object in fedora.

File size is stored in the fedora datastream header, and also copied out to MODS and contentMetadata. The last is a one-stop shop for all UI display info.

Afternoon minutes, Hydra Project meeting
June 6, 2011
2:10pm

Meeting reconvening, 4 more updates, 20 minutes each. Will stop demos at 3:30 and move on to the other agenda items.

Northwestern update Mike Stroming & Bill Parod

Digital Image Library "Hydra Head". (also showed this at LibDevConX; haven't done new work on this since March 2011 due to other project priorities.)

Using Hydrangea, currently under development for 70,000 art historical images. Have an existing image server stack that they want to use in Hydra/Hydrangea. Create virtual crops of images using a crop tool (now in HTML5) to draw boundary information, saved as part of the Fedora object and passed as a datastream to the dissemination service, which forms the URL to the image needed. [Showing virtual crop removing the color bars]. Also supports rotation, all this information coded in SVG as a data stream. [question about whether the virtually cropped image is assumed to be rectangular? currently, yes, but not necessarily always will be in future]

Several slides showing image request processing, content models, data streams used for image Fedora objects, crop tool in action. Crop tool invokes a controller action to clone the source Fedora image object, load the clone and update its SVG for new crop.
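To make the virtual crop idea concrete, here is a minimal sketch of reading a rectangular crop out of an SVG datastream. The SVG layout (a single rect element inside the root) is an assumption for illustration only; the actual Northwestern datastream format was not shown in detail, and rotation is not handled here.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def crop_from_svg(svg_text: str) -> dict:
    """Return the crop rectangle stored in an SVG document as integers.

    Assumes (hypothetically) that the crop is the first <rect> child of
    the SVG root, with x, y, width and height in source-image pixels.
    """
    root = ET.fromstring(svg_text)
    rect = root.find(f"{SVG_NS}rect")
    if rect is None:
        raise ValueError("no crop rectangle found in SVG datastream")
    return {key: int(rect.get(key)) for key in ("x", "y", "width", "height")}


# Hypothetical datastream: a crop that trims the color bars from a scan.
EXAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="4000" height="3000">'
    '<rect x="120" y="80" width="3600" height="2700"/>'
    "</svg>"
)

if __name__ == "__main__":
    print(crop_from_svg(EXAMPLE_SVG))
```

A dissemination service could turn the resulting region into whatever request its image server expects. Because the crop lives in a datastream rather than in a derivative file, the source image stays untouched.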

Extended Hydrangea app to support creation of ad hoc collections and image upload into Fedora: slide showing ingest via Hydrangea uploading tool. Utilizes Northwestern's existing ingest conversion services, which produces necessary derivatives, copies images to archival servers, etc. Orchestration expressed in CAMEL. Hydra/Hydrangea adds value of doing all the ingesting and datastream creation based on content type, as well as SOLR indexing. So saved all of that work, just had to do the derivative processing. Controller action in Rails does the Fedora ingest, adds appropriate data streams, including the SVG data stream, and as a result it's available in Blacklight, Fedora, etc. Want the process to be as rapid as possible, particularly so that users who are uploading one image will get to view it as quickly as possible.

Slide showing the VRA Core metadata and how work/item/image display works so that related images are shown from the work record. Haven't implemented VRA core editing in Hydra. Are using Xforms currently for VRA Core editing. For more casual use, such as faculty image upload, will use Hydrangea metadata editor tools for that. DIL is one of two meta projects currently underway, the other is a production workflow project. The VRA Core cataloging tool is going to be invoked from the workflow system; the cataloging tool has save buttons that will update the status in the workflow system.

Support for ad hoc collections in Hydra, such as for faculty who want to assemble groups of images for a presentation. These collections become Fedora objects unto themselves, and Hydrangea has drag-and-drop features for creating and updating ad hoc collections. Collection objects have MODS data streams. Members of the collection are ordered as well. Not sure how much the collection object notion will expand, or if description will expand beyond basic captions, but the MODS datastream will be there if needed for deeper description.

List of Hydra code:
hydrangea_collections plugin, VRA Core datastream support, ModsCollection datastream support, Models, controllers, helpers, views for MultiresImage & UploadRequest, Javascript for HTML5 viewer, JQueryUI and Ajax for drag-and-drop, ActiveModel for tracking upload requests. The image crop tool is integrated; invoked in the view.

Future directions (Steve D): Northwestern/Indiana submitted a grant proposal to IMLS to collaborate on a video application built on Hydra with Fedora, probably Matterhorn for conversion. Hope to hear about the grant in October. Authentication and authorization other important areas of need/focus; need to continue to work on authentication and authorization, need to be able to leverage campus LDAP/AD services and information. Hydra has its own policy information stored alongside the Fedora rights information; we need to have some kind of middleware that can translate this so that we can still serve applications other than those built in Hydra; might be a factor for the streaming server, etc. Want to have a release of this project by the end of August.

Stanford update, Tom Cramer
Three production applications for Hydra:
ETD application for submission of ETDs. First quarter of last year; showing the confirmation screen. Initial submission screens are triggered by PeopleSoft student administration application, which creates a stub object in Fedora with name, program, thesis advisor. Student gets a link in email that brings them to a submission page where they enter abstract, upload dissertation, complete checkbox for copyright permissions & upload copyright permissions documents if relevant (never visible but available for auditing purposes). Creates a signature page with advisor sigs, inserts this into the PDF directly. There is a Progress panel on the right side of screen that shows their progress through the system. Once they have submitted everything, it gets routed to the final reader on the dissertation committee, then they verify it's OK, this information gets routed to PeopleSoft, etc. Approved by the registrar. Then complete, create a stub record inside integrated management system, shows up in digital delivery system.

SALT: Self Archiving Legacy Toolkit. Has been up for a year and a half now. 16,000 scanned documents for papers of Edward Feigenbaum. 3 documents are public, but with permissions and a login, can see all 16,000. Everything displayed in gallery view, can toggle to list view. Collection was not processed, no item level description, so series or box/folder information used for document titles. Once scanned, will be a lot easier to crowdsource description than it would be to work just from the paper materials. Can enrich and describe over time. Have done a lot of entity extraction and these elements show up as facets, which permits gradual reconciliation over time. Also provides an extraction from the finding aid that was prepared, can use that structure as a facet. Tags can also be applied either by archivist or other (donor, in the case of the collection being demoed). Drilling down to the detail view page: can toggle to edit if permissions allow. See the full structure of the collection in left panel, other entities presented as facets below. JP2 images delivered via Djatoka, viewed in a modified version of the IA flipbook. Can page through in the browser or open the viewer in a new fullscreen window. Also supports a 2-page view. Upon edit, can change metadata, change permissions from private to public, etc. Can add authors, date, links, tags, etc.

EEMS: Everyday Electronic Materials. Lets librarians find PDFs or other monographic or born-digital materials on the web and add them to the library's collection. Add a widget to their toolbar, and if they find a good article or publication on the web, open up a widget in the browser, use widget to provide metadata, drag-and-drop URL. Selector can flag for public access, Stanford access, or on request. Can indicate whether free or on fund; if on fund, kicks off a note to Acquisitions and an acq process. If the URL link doesn't work, can also upload from the desktop. At end of process, item goes automatically into the catalog and also into the Digital Stacks application.

Working on now:
Hydrus. Repository for any content that might be found across the institution. Library collections or beyond. Similar in spirit to UVA Libra and the Hull repository. Self-deposit, self-publication simple submission workflow, or it could be a two-step process where objects are submitted and then routed, using administrative policy objects, to designated persons (catalogers, administrative departments) for approval.

Significant work on refactoring Hydrangea Hydra plugin and putting in a common code base, take out cruft, streamline.

Hypatia. Implementation of AIMS grant, born-digital archiving project led by Virginia with Stanford, Hull and Yale. Purpose is to take SALT and amplify its capabilities, add more robust description and arrangement capabilities. Arrange things into series, add digital components, apply bulk descriptions, apply bulk permissions. Grant ends at the end of September. Looking at how to model an EAD or an archival collection in Fedora, and then how to get both edit and display views for these objects in Hydra. Will be used alongside something like Archivists Toolkit, which does better with accession information, physical content management. Mark Matienzo asked to comment on ArchivesSpace integration: too soon to know how well this tool will do with the digital objects, but will keep an eye on it for future direction decisions in Hypatia.

---
Bill Ingram, Medusa

At Illinois, have a lot of data all over the place. ContentDM (images), IR (5 years or so, lots of PDFs, ETDs for two years, DSpace). Rare books, Google, OA all scanning a lot of material. ContentDM mostly contains access copies, archival copies generally sitting offline, on a CD in storage somewhere. Want to be able to preserve these digital objects, not just have them on disk. Started out wanting to create a Hydra head. Bill is a 30% programmer, just hired someone else who will be 30% visiting research programmer, just getting going. Interest at Illinois in Archivematica workflow system, which starts at the point of scanning, will create an AIP and a DIP. Archivematica is built in Python, moves things from folder to folder in an Ubuntu OS. Illinois liked these features but decided to rebuild using Ruby; Ruby library RUOTE to build messaging queue, have micro services to do little jobs along the way to build up a digital object. Lots of existing data, but also lots of new data being created all the time. Workflow management system: as soon as something is scanned, it immediately becomes a Fedora object. Gathers provenance information, creating a PREMIS object with the object. Have a prototype working now which creates Fedora objects: uses ActiveFedora, OM, SOLRizer. Very close to the Hydra content model, but adding a PREMIS object. Haven't started any kind of access application. Hoping it will be easy to demonstrate that can use Blacklight also for delivery, maybe start working on a central infrastructure to replace the many different systems that are running all over Illinois to deliver digital library projects. Hope to have something by the end of the summer that is a good workflow manager.

Q: what are the microservices? FITS tool that Matt built, with slight modifications. Virus scanning, Checksumming, HDL service.

Workflow management: Archivematica has a grid view for the object as it goes through, can apply rules that permit a job to stop, say for human intervention, if needed. Each of the services can either pass or fail. If they fail the workflow stops.
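The pass/fail behavior described above can be sketched as a chain of small services, each of which either passes or fails, with the workflow stopping at the first failure. This is an illustration in Python, not the Illinois Ruby/RUOTE code; the step names and the dictionary-based object are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A step is a name plus a function that returns True (pass) or False (fail).
Step = Tuple[str, Callable[[Dict], bool]]


def reject_empty(obj: Dict) -> bool:
    """Fail objects with no content (stands in for a real validation service)."""
    return len(obj["content"]) > 0


def add_checksum(obj: Dict) -> bool:
    """Record a SHA-256 checksum on the object; always passes."""
    obj["sha256"] = hashlib.sha256(obj["content"]).hexdigest()
    return True


def run_workflow(obj: Dict, steps: List[Step]) -> Dict:
    """Run steps in order and stop at the first one that fails."""
    completed: List[str] = []
    for name, step in steps:
        if not step(obj):
            # A stopped workflow waits here, e.g. for human intervention.
            return {"status": "stopped", "failed_step": name, "completed": completed}
        completed.append(name)
    return {"status": "complete", "failed_step": None, "completed": completed}


STEPS: List[Step] = [("reject_empty", reject_empty), ("add_checksum", add_checksum)]

if __name__ == "__main__":
    print(run_workflow({"content": b"scanned page"}, STEPS))
    print(run_workflow({"content": b""}, STEPS))
```

Returning the list of completed steps along with the failed one gives a grid-style view enough information to show where an object stopped.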

Liked Archivematica a lot; general idea with that tool seemed to be to do the GUI first, and return later to do the inner workings. Illinois starting with the guts first, will worry about the GUI later.

Have also been working with Bill P on another project for Project Bamboo, Fedora repository with a CMIS connector on top of it.

Tom: next Hydra head in the queue for Stanford is a workflow management and preservation management head called Argus, which would basically be a repository manager view on top of the repository. See state of all the objects from the repository, can start new jobs, get reports, etc.

"WorkDo" is a common workflow management service? [didn't quite catch the name of this utility]

-----
Update from Dan Brubaker Horst, Notre Dame

ND just getting going with repository work on campus. Just finished a simple Hydra head that is a distributed metadata editing tool, could be used by different departments across campus. Was a proof-of-concept mostly, resembles Libra somewhat. Essentially took Hydrangea and put a layer on top of it to style it, and added PBCore support for editing video metadata.

A hierarchy of objects, each with its own state information that shows what actions can be performed. Also have more granular permission management, support functional roles rather than just people, with permission groups for functional roles. Currently this is internal to the application, b/c can delegate authorization to central service, but don't actually get good group information from the central service. Current version of this tool can be seen at video.library.nd.edu (not branded); it is in production, used by about 5 people.

Don't have an update yet on Atrium, which is the Hydra exhibits piece. Had to suspend work on that to get this other IR bit out the door.

Q from BP about state machines, can you say more?
In PBCore, there is the concept of an event, master copies, and derivatives. In the content model, are using the Ruby gem state_machine to outline the valid states and the valid transitions between states. Because the app had to be built in 6 weeks, information is managed in an active record model [missed this]. Do store the current state in a field, in case lose something, and can recreate a state activity if that happens. Some views are driven by who can make what state transitions happen, and what state it's currently in. Persistence layer has very little other than the PID and the current state. The state_machine provides everything else from these two bits and the state_machine DSL.
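The idea of persisting only the PID and the current state, and deriving everything else from a table of valid transitions, can be sketched as follows. This is a plain Python illustration, not the state_machine gem's DSL; the state and event names are hypothetical and were not given in the meeting.

```python
# event -> (states the event may fire from, state it leads to)
# All names here are made up for illustration.
TRANSITIONS = {
    "submit": ({"draft"}, "in_review"),
    "approve": ({"in_review"}, "published"),
    "reject": ({"in_review"}, "draft"),
}


class Record:
    """Persisted data is just the PID and the current state."""

    def __init__(self, pid: str, state: str = "draft"):
        self.pid = pid
        self.state = state

    def available_events(self) -> list:
        """Events valid from the current state; a view could use this to decide which buttons to show."""
        return sorted(
            event
            for event, (sources, _target) in TRANSITIONS.items()
            if self.state in sources
        )

    def fire(self, event: str) -> None:
        """Apply an event, refusing transitions that are not valid."""
        sources, target = TRANSITIONS[event]
        if self.state not in sources:
            raise ValueError(f"cannot {event} from state {self.state!r}")
        self.state = target


if __name__ == "__main__":
    record = Record("demo:1")
    record.fire("submit")
    print(record.state, record.available_events())
```

Role checks (who may fire which event) would sit on top of available_events; they are left out to keep the sketch short.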

Atrium can build ad hoc collection objects and information about the facets that should be available to these collections, and this info is all stored in the Fedora object.

Next work at Notre Dame and timeframe: back to Atrium, need to deliver something by September.

---
Matt Z showing another Hydra head under development

Having a lot of objects in Fedora but not actually showing
NARM: National Association of Recording Merchandisers example.
Get a full record of recorded materials being sold via any merchant, all comes in in legacy data format. ActiveFedora cracks legacy files, creates Fedora object for every line, make SOLR, metadata for every record. Slight modifications to Blacklight interface. Not a huge amount of work, now have a view into old data that had been trapped in old text files. Added in view XML serialization, (b/c this is demo, an unproxied link directly to Fedora URL to show the data stream). Can show the original content, underlying fixed width text. Everything re-exposed via Blacklight interface. All really easy to do with Blacklight and SOLR, once get content into Fedora via ActiveFedora, which means you already know how to get it into SOLR. Demonstrated ease of tackling custom one-off metadata formats with OM, SOLRizer and Blacklight. Demonstrates easy path of entry. Can also be used to batch-load metadata and reprocess for display, etc.
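As an illustration of cracking a fixed-width legacy file into one record per line, here is a minimal sketch. The field names and column positions are invented for the example (the real NARM layout was not shown), and the demo itself used ActiveFedora, OM and SOLRizer rather than Python.

```python
# (field name, start column, end column) - a hypothetical layout.
FIELDS = [
    ("artist", 0, 20),
    ("title", 20, 50),
    ("units_sold", 50, 56),
]


def parse_line(line: str) -> dict:
    """Slice one fixed-width line into named fields, keeping the original text."""
    record = {name: line[start:end].strip() for name, start, end in FIELDS}
    record["units_sold"] = int(record["units_sold"])
    record["raw"] = line  # the underlying fixed-width text stays available
    return record


def parse_file(text: str) -> list:
    """One record per non-blank line, ready to be turned into objects or index documents."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "Miles Davis".ljust(20) + "Kind of Blue".ljust(30) + "000042"
    print(parse_file(sample))
```

Keeping the raw line alongside the parsed fields matches the demo's ability to show the original content next to the re-exposed metadata.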

-------- BREAK ---------------

Discussion about picking a time tomorrow to do the EAD discussion.

Bess: trying to figure out EADs for Hypatia, and a lot of others are trying as well. Is this a kind of thing we want to try to figure out together, or is it OK for folks to go in their own direction? Expectation that things will still be maintained in Archivists Toolkit and exported to Fedora, but sometimes hears that people expect to create in AT, export to Fedora and then do further edits there, or do that and then also be able to reimport.

Tentative time for EAD/Hypatia discussion tomorrow at usual time for Hypatia call: 9:30 Pacific, 11:30 Central, 12:30 Eastern

This will allow other people time to meet to talk about the Design group and other topics.

BILL PAROD PICKING UP NOTETAKING

Notes for 3:50 - 5:00 session:

Note taker: Bill Parod

--------------------------

Rails3 Migration Discussion:

Spreadsheet for identifying functionality to figure out what should be migrated

Rails3 is the next item on the Hydra List

Also working on documentation;

Should plugin developers wait for Rails3 to move forward?

Segregate your tests from Hydra tests; Hydra tests will be moved;

--------------------------

HTML5 Discussion:

Is the UI group looking at whether the Hydra base code will be HTML5 enabled?

Will our Rails3 version be HTML5 compliant?

What is required for HTML5 compliance?

Looking at the HTML and Javascript to do a base end-user sweep

Our code should be standards compliant. Which standard?

If HTML4 - there are features of HTML5 not there.

What are we doing with the webapp wrapper so that it is forward-thinking, lightweight, and as migrateable as possible?

Will the Rails3 port be HTML5 compliant?

Want automated validation of the HTML; We have <rel> tags and @data that are HTML5 but XHTML validator doesn't pass them

Do you want to recode the UI to make the tests pass?

For audio and Video, we should use HTML5 as far as how media types are handled.

HTML5 interest also motivated by dynamic canvas elements - drawing with Javascript

Will HTML5 help us with our validation goals?

HTML5 is also relevant when considering interactivity ambitions

Validation is our near term goal. Does HTML5 help us with that?

Does HTML5 help us with accessibility? No.

It does provide some functionality that facilitates accessibility

How do we improve validation?

How do we accomplish accessibility?

Many of the moving parts have to do with layout.

HTML5 offers layout control that can be validated and that facilitates accessibility

Some of the HTML in our code needs improvement (simplification), though it is valid.

Much of this can be improved in the Helpers

Some of our complexity comes from using javascript libraries. That can be simplified.

How do we discourage (control?) developers from adding unnecessary HTML complexity?

People who are motivated to add complexity should be involved in the test development as well. They understand the motivation and necessity.

We need to have tests that are testing for exactly what we want.

The big question is what is our target for HTML testing?

Validation is one thing, but it's hard to tie a design aesthetic to a test.

Perhaps we can apply a style guide approach in writing our tests.

Can we sit down at this conference and review what HTML Hydra is producing. Yes.

We probably want to make sure that our HTML is solid in the current (XHTML/HTML4) environment before moving it to HTML5

It's ok to start using HTML5 tags

We should start validating against HTML5

We should use HTML compatible libraries. But this will take some time.

Need to pick a backwards compatible solution

How do we establish channels of communication for these topics?

Can we also provide developer guides for those building on the chosen HTML approaches?

Want to develop style guides around the approach with perhaps developer suggestions / guidelines

Things to think about when writing an HTML5 app: semantic elements: sidebar, footer, … (not divs). The application layout file would have to be rewritten.

We might be able to get away without touching the application layout

--------------------------

Code Contributions:

How do we incorporate code contributions?

Especially if there are new contributions coming out in the next few months.

We should encourage these contributions to be in the form of plugins.

For example:

Exhibits, Drag and Drop

A lot of the application plugins (especially since Hydra will likely just be editing) cross Hydra/Blacklight boundaries. Is it a Hydra plugin or a Blacklight plugin?

We need guidelines for making these decisions, design, and providing testing.

We also have a lot of overlap of development. How can we coordinate these efforts?

How do we establish best practice for plugin design that crosses Blacklight and Hydra functionality?

We shouldn't approach this question pre-Rails3

We need channels and marketplace for contributing plugins.

Need tighter loops of communication among folks that are working on similar projects

We should report to Hydra-tech on advances that are happening locally.

Others need to see others' code, Fedora objects, Solr documents…

Can use hudson to try transient builds to see how something is working

Talk to Bess about Hudson

Need tests that actually test things.:) Looking at Heckle report on Hudson.

Integration tests are very important!

Blacklight does not take contributions that don't provide tests.

Cucumber can be very difficult to use and you can spend a lot of time getting Cucumber to run correctly rather than getting the application right. It's a special skill. It is also a new technology that hasn't matured. The benefit of that type of test coverage is worth the pain, but we should understand that there is a cost.

UIUC, Northwestern, and Notre Dame recognize the value of writing tests and doing continuous integration, but haven't yet established that practice.

Documentation coverage is now at around 40%

Looking at your test coverage numbers motivates your attending to tests.

The Hydra upgrade plan assumes that you have tests you can run with the new head and then test things.

If you're going from no test coverage to some testing, Cucumber might be a good place to start.

Fall Hydra meeting will be an important one.

---------

Developer Skills:

Had a phone discussion on what makes a good UI developer - if we were to hire someone today? Noted baseline skills on the wiki at bottom of developers' page.

This can also be a parallel track for the GUI track tomorrow.