2016-09-29—PCDM FileSets Kick-off Meeting

Date and Time

September 29, 2016, 2pm EDT

Connection Information

Google Hangouts: https://hangouts.google.com/hangouts/_/artic.edu/pcdm-filesets

Attendees

Agenda

1. Set WG scope and goals in regard to recent discussion in the PCDM and Hydra forums
2. Define roles and availability
3. Agree on schedule for future calls
   a. Unconference session at HydraConnect?
4. Anything else?

Discussion points

PCDM 2.0?

In light of the several discussions in the PCDM GitHub space, we want to consider whether this project is tied to the PCDM 2.0 release.

Rough implementation draft

To make FileSets accessible in Curation Concerns (CC) as first-class citizens with their own search and CRUD workflows, a proposed implementation would go along the following lines:

The batch upload feature produces multiple FileSets, one per uploaded file. The metadata entered in the create form apply to each of the FileSets.
The individual upload feature has additional fields for uploading files, some pre-filled with a common role (e.g. original or intermediate) corresponding to pcdmuse terms. More upload fields can be added with a choice list of pcdmuse terms. The pre-filled fields should be easily modifiable via configuration.
Functionality is added to create single or multiple FileSets starting from a PCDM object. E.g. a "Create FileSet(s) related to this Object" button would take you to either the individual or batch upload form, with the relationship already pre-established. We actually have this functionality in LAKEshore, except that we are creating Objects instead of FileSets.
The FileSet resource would get its own update controller and view, where descriptive metadata can be edited. The current view (characterization metadata, etc.) and edit (new versions) pages for FileSets would be moved to Files. I am not quite sure if we want separate sharing permissions for Files and FileSets; that may be a good design discussion point.
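
As a rough illustration of the configurable pre-filled upload fields described above, here is a minimal Python sketch. It is not Curation Concerns code: the names `DEFAULT_UPLOAD_ROLES`, `AVAILABLE_ROLES`, `upload_fields` and `add_upload_field` are hypothetical, and the role names are assumed to match terms in the pcdmuse vocabulary.

```python
# Namespace of the PCDM use extension (assumed here for illustration).
PCDM_USE = "http://pcdm.org/use#"

# Hypothetical configuration: roles whose upload fields are pre-filled in the form.
DEFAULT_UPLOAD_ROLES = ["OriginalFile", "IntermediateFile"]

# Hypothetical choice list offered when a user adds another upload field.
AVAILABLE_ROLES = [
    "OriginalFile",
    "IntermediateFile",
    "PreservationMasterFile",
    "ServiceFile",
]


def make_field(role):
    """Describe one upload field: a role label, its pcdmuse URI, and no file yet."""
    return {"label": role, "use": PCDM_USE + role, "file": None}


def upload_fields(default_roles=DEFAULT_UPLOAD_ROLES):
    """Return the pre-filled upload fields for the individual upload form."""
    return [make_field(role) for role in default_roles]


def add_upload_field(fields, role):
    """Return a new list of fields with one more field for the chosen role."""
    if role not in AVAILABLE_ROLES:
        raise ValueError(f"unknown role: {role}")
    return fields + [make_field(role)]


fields = upload_fields()
fields = add_upload_field(fields, "PreservationMasterFile")
print([f["label"] for f in fields])
```

Changing `DEFAULT_UPLOAD_ROLES` in this sketch is all it would take to alter which fields appear pre-filled, which is the kind of configurability the draft asks for.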

Minutes

Present: Andrew, Esmé, Jennifer, Joshua, Julie, Julie, Michael, Stefano, Yinlin, Nikhil

1. Set WG scope and goals in regard to recent discussion in the PCDM and Hydra forums

Stefano: Uploaded graph to Duraspace Wiki page with tentative visual of prior conversations
Where do we stand with FileSet implementation in the PCDM conversations and PRs on GitHub?

Esmé: There's no comprehensive consensus
Lots of different value judgements based on varying situations

Stefano: The Islandora community is involved in discussions, and there were some pretty important differences of opinion
- Should it be part of PCDM or an optional entity?
- Islandora doesn't want to commit to it being a mandatory part of the ontology

Esmé: Pull request in PCDM to have a FileSet as an optional sidecar

Mike (or Esmé): Fine having it in ontology but not making it mandatory
Many use cases with simple ontology -- one or two files but not rich metadata etc. for each file.

Stefano: Is Hydra willing to make FileSet an optional entity that's implemented by default?

Esmé: We're not going to get agreement to have it be mandatory
We already haven't implemented some of the things we decided we wanted to do
Such as an independent book object

Michael: Agree with Esmé
Don't see this working group getting involved in PCDM discussion

Stefano: We don't have to come up with a joint agreement on making things mandatory
But concerned with interoperability issues between applications

Michael: We can do some of that in Hydra now, the PCDM work we have is already good and we can just go with it
Only concern is interoperability between Hydra heads
The best answer is that, at the very minimum, Hydra can be aware of different flavors of PCDM that don't use the Works extension and so won't have FileSets between Works and Files.

Esmé: We may end up with objects linking directly to files with no FileSet in between
You can check for FileSets and work with them if they're there, otherwise work with files directly
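
The check-then-fall-back approach Esmé describes could be sketched roughly as follows. This is an illustrative Python sketch over plain dictionaries, not Hydra or ActiveFedora code; the keys `members`, `files` and `type` and the function name `files_for` are assumptions made up for the example.

```python
def files_for(obj):
    """Return the files reachable from a PCDM-like object.

    Hypothetical structure: an object is a dict with optional "members"
    (child resources, some of which may be FileSets) and optional "files"
    (files attached directly to the object).
    """
    file_sets = [m for m in obj.get("members", []) if m.get("type") == "FileSet"]
    if file_sets:
        # FileSets are present: collect the files they group.
        return [f for fs in file_sets for f in fs.get("files", [])]
    # No FileSets in between: fall back to the directly attached files.
    return list(obj.get("files", []))


with_file_sets = {
    "members": [
        {"type": "FileSet", "files": ["page1.tif", "page1.jpg"]},
        {"type": "FileSet", "files": ["page2.tif"]},
    ]
}
without_file_sets = {"files": ["scan.tif"]}

print(files_for(with_file_sets))
print(files_for(without_file_sets))
```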

Stefano: Are there cases where it's ok to have free-floating Files with technical metadata and no related descriptive metadata?

Esmé: In Hydra, we find FileSets really useful
Figuring out how to group them while still making them optional is a challenge

Michael: We may be able to safely consider that out of scope at this point

Stefano: So this working group could be mainly concerned with implementation of PCDM Files in Hydra?

Esmé: Having agreement of what FileSet means and what user interface should look like, and adding that functionality with Curation Concerns
Other modeling and refactoring is a bigger question

Stefano: User interface, workflows and CRUD follow PCDM definition

Julie: Is PCDM Works Extension being implemented?

Michael: Works Extension is published in PCDM repo
We're using their URIs
In Hydra Works codebases, their types are PCDM Object and PCDM Work

Jen: So you can declare an object that's both?

Esmé: There's consensus on changing that, we just haven't changed it
We want to determine it by context.
There's nothing inherent that makes it a Work, it's determined by the context

Stefano: In our draft model, there's a separation

Andrew: Could you review what the problem was that forced you to reevaluate your data model?
If we used your proposed model, we'd run into some issues

Stefano: The problem is there's no differentiation between a real-world object and a digital asset

Also pcdmuse terms: If you follow the tree down, each FileSet has one file
By having one FileSet, and uploading multiple files that are different representations of the same asset, the roles of the files become easily recognizable

Jen: It would be easier to store files in different places

Stefano: Yes, you can apply storage policies, access policies that are reliant on pcdmuse type

Andrew: It's nice to distinguish between different derivatives
Since we're using Curation Concerns as a DAMS, we're interested in differentiating files based on their technical metadata
In Curation Concerns, fcr:metadata gets bubbled up to FileSet
Each file will have different technical metadata that we'll need to differentiate
And they'll all get bubbled up to one FileSet object, leading to conflicts
We use Solr for searching
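
The conflict Andrew describes can be illustrated with a small sketch. This is plain Python, not Curation Concerns or Solr code; the field names (`file_size`, `mime_type`), the role labels and the `bubble_up` helper are all hypothetical.

```python
def bubble_up(files):
    """Merge each file's technical metadata into one flat, multi-valued document."""
    doc = {}
    for f in files:
        for field, value in f["technical_metadata"].items():
            doc.setdefault(field, []).append(value)
    return doc


files = [
    {
        "role": "original",
        "technical_metadata": {"file_size": 52_000_000, "mime_type": "image/tiff"},
    },
    {
        "role": "intermediate",
        "technical_metadata": {"file_size": 1_200_000, "mime_type": "image/jpeg"},
    },
]

file_set_doc = bubble_up(files)
print(file_set_doc)
```

In the merged document every field holds the values of all files side by side, and nothing records which value came from which file, so a search on `file_size` cannot tell the original from a derivative.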

Stefano: We have the same problem
Example: Whose original file is larger than a certain file size?
Example: Find all FileSets whose Preservation Master has certain parameters
Triplestore can resolve this
Another way is to have a preferential file
Curation Concerns allows "has related images" or "has related media fragment"
You could have a direct link between a FileSet and its representative file and look for metadata there
You could have technical metadata bubble up only for a representative file
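
A minimal sketch of the representative-file idea, assuming each FileSet names one of its files as representative and only that file's technical metadata is indexed. The dictionary layout and the helpers `index_file_set` and `ids_larger_than` are hypothetical and do not reflect any Hydra API.

```python
def index_file_set(file_set):
    """Build a flat index document from the representative file's metadata only."""
    rep = next(
        f for f in file_set["files"] if f["role"] == file_set["representative"]
    )
    return {"id": file_set["id"], **rep["technical_metadata"]}


def ids_larger_than(docs, min_size):
    """Return ids of documents whose indexed file_size exceeds min_size."""
    return [d["id"] for d in docs if d["file_size"] > min_size]


file_sets = [
    {
        "id": "fs1",
        "representative": "original",
        "files": [
            {"role": "original", "technical_metadata": {"file_size": 52_000_000}},
            {"role": "intermediate", "technical_metadata": {"file_size": 1_200_000}},
        ],
    },
    {
        "id": "fs2",
        "representative": "original",
        "files": [
            {"role": "original", "technical_metadata": {"file_size": 800_000}},
        ],
    },
]

docs = [index_file_set(fs) for fs in file_sets]
print(ids_larger_than(docs, 10_000_000))
```

The trade-off in this sketch is that metadata for non-representative files is not searchable at the FileSet level.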

Andrew: We want all technical metadata for all derivatives to bubble up
The only place we've been able to store that is in separate FileSets

Michael: Current metadata privileges original file

Andrew: Technical metadata that we want to store with files as binaries is generated offsite
Looked into using iana:describedBy predicate
Possible to do it in Fedora, but Hydra assumes there will be only one

Michael: Is there a way we can be clever about producing PCDM solr documents?

Stefano: If there was a way to do joins in Solr, would that solve your issue?

Andrew: Yeah, that would probably work

Michael: There are probably some people who are deep in ActiveFedora that we could ask if this is possible

Andrew: Lots of this stems from the PCDM concept that files don't have descriptive metadata
Why is that? Is that reversible?

Michael: In terms of RDF it's not a real distinction because you could put whatever fcr:metadata you want for a file, but it gets messy

Jen: Wondering about giving FileSets themselves rich data
Making FileSets themselves descriptive objects

Andrew: Every FileSet has a managed file
There are other files in the FileSets that do nothing but describe that file in some way, but there's just one file
We couldn't accommodate the new proposed model

2. Define roles and availability

Development: Andrew, Jen

Michael: no time for development, but can be involved in discussions and plug people into existing bits of work and sprints

Andrew: What's the scope of work?

Stefano: Have FileSets accessible in Curation Concerns
Most of this is Curation Concerns work
Goal is for FileSets to be first class resources in Curation Concerns
No code has been written yet

Andrew: On board with FileSets having their own workflow
We've added FileSets to the basic query inside Curation Concerns
When you do a general query you also get all the FileSets
When you click on the FileSet it takes you to the file edit page
Some of this is already done in our implementation, just no CRUD for FileSets
All sitting in our GitHub repo
Deliverable would be an engine you can install that would give you the option to handle FileSets as equal citizens to Works
Right now it's all flavored with our opinions, would love to make it as general as possible

Joshua: Has it been determined what role FileSets will have in PCDM in terms of modeling?

Andrew: This meeting has outlined that we don't have consensus
How atomic should a FileSet be?
Current model more atomic, proposed model less atomic

Stefano: Not so much related to FileSets in PCDM as to the definition of FileSets in Hydra

Joshua: Dangerous to not consider wider implementations of PCDM

Michael: We talked a little about that earlier in the call, but let's talk more in Boston

3. Agree on schedule for future calls

Stefano: Will determine in Boston

a. Unconference session at HydraConnect?

Stefano: Everyone on the call is going to Boston?

All: Yes

Stefano: Meet at Hydra Connect?

+1 Josh, Michael, Julie A

Joshua: Would this be an unconference session?

Andrew: I already have my eye on some unconference sessions. Could we do dinner?

Stefano: We might want to draw or show something on computers
But if there's no other time, we could try dinner
Tentatively do Wednesday?
Dinner or some other non-conference time