02. Digital Preservation Recommendations

Overview

The goal of digital preservation is to provide a complete, technology-agnostic copy or record of a digital item for as far into the future as possible. Digital preservation targets the long term and is not necessarily intended for quick retrieval or restoration of current systems, as is the case with backing up and mirroring software systems. Storage, description, and management of preserved items are intended to let curators continue to manage digital objects in perpetuity, and to let people in the future, using unforeseen technologies, discover, review, and restore digital objects independently of the systems in which those items were created or managed.

In Avalon’s use case, the objects, their descriptive metadata, and the hierarchical information about the objects in relation to each other should be preserved. Preservation metadata may also include provenance information, which describes activities that have occurred over time, including fixity checks, metadata extraction, and any transformations of the source material (such as normalization and the creation of preservation or access copies of files). While this does not exclude the creation of derivatives or access copies, the maintenance and management of the original or master version of digital artifacts takes priority over other concerns.
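As an illustration of the fixity checks mentioned above, the following is a minimal sketch of how a checksum recorded at one point in time can later be compared against the file as it exists in storage. It assumes SHA-256 as the checksum algorithm; the function names are hypothetical and are not part of Avalon or Archivematica.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large audiovisual files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with a digest recorded earlier.
    A mismatch means the bitstream has changed since it was recorded."""
    return sha256_of(path) == recorded_digest
```

Each such check, with its date and outcome, is the kind of event that provenance metadata would record.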

The integration of Avalon and Archivematica is intended to let archivists perform preservation tasks that preserve master files first and foremost, and to apply identifiers and preservation metadata that will allow preservationists and archivists at a later date to update, migrate, and record the use of items that have passed through the Archivematica pipeline en route to Avalon ingestion.

At this phase, the goal is to give preservationists a single, simple workflow: at ingestion they send an item into Archivematica, where any preservation activities they wish to apply to the item occur. The item is given a unique identifier, which will be useful in future recovery and discovery processes. The master file and related material are then moved to preservation storage (preservation management being the responsibility of the user), and the master and manifest are dropped at Avalon for ingestion. From there, Avalon can create derivative copies for presentation of archival materials.

Unique identifiers

Avalon ingests media content and uses the Fedora repository software to create identifiers for ingested content, including NOIDs for item- and collection-level identifiers. Archivematica creates unique identifiers according to the Universally Unique Identifier (UUID) standard and bases its tracking and movement of archival objects on them. A UUID is a 128-bit number used to uniquely identify an object or entity on the Internet. UUIDs support the long-term, system-interoperable, and migratable storage of digital items intended to last in perpetuity even as technology changes. A UUID also reduces the opportunity for filename collisions, which may otherwise occur with similar items or re-ingested content, and assists in the findability and discovery of individual items.

UUIDs have been considered part of preservation and digital archiving standards for more than a decade. Providing identifiers similar to those used in other archival environments will help future preservationists and archivists discover content.
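To make the properties described in this section concrete, here is a small sketch using Python's standard uuid module. It assumes a random (version 4) UUID. The storage_name helper and the example filename are hypothetical and only illustrate how appending a UUID keeps two items with the same original name from colliding.

```python
import uuid

# Generate a random (version 4) UUID.
item_id = uuid.uuid4()

print(item_id)                 # canonical text form: hex digits grouped 8-4-4-4-12
print(item_id.version)         # 4
print(len(item_id.bytes) * 8)  # 128 (bits)


def storage_name(original_name: str, identifier: uuid.UUID) -> str:
    """Hypothetical helper: combine a human-readable name with a UUID so
    that re-ingested or similarly named items get distinct names."""
    return f"{original_name}-{identifier}"


# Two ingests of a file with the same name yield two different storage names.
first = storage_name("interview.mov", uuid.uuid4())
second = storage_name("interview.mov", uuid.uuid4())
```

Because the identifier does not depend on any one system's database or directory layout, it can travel with the item when storage is migrated.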

Transfer Size and Scope Recommendations

Scope of Materials

Files prepared for processing through Archivematica should be organized according to archival best practices. Ideally, each transfer should operate at the Item level: it may include multiple files, but together they should capture a single Work (as defined by the Functional Requirements for Bibliographic Records model). Examples of multiple files include additional audio tracks, multiple videos of the same event, or sidecar subtitle metadata.

The goal of Archivematica is to make system-agnostic, self-describing Archival Information Packages (AIPs). For easy identification and retrieval of desired assets in both the near term and the long term, it should be possible to retrieve AIPs from storage precisely.

Size

Audiovisual assets can be complex and very large. It is recommended that collections of materials prepared for ingest through Archivematica be of reasonable size. What counts as reasonable depends on the installation: while Archivematica can process very large collections of bitstreams, an Archivematica installation must be configured to support them. For documentation on supporting large files at a larger scale, please see the “Scaling Archivematica” portion of the official Archivematica documentation. It is generally recommended that the Archivematica pipeline have at least 3-4 times as much disk space as the total size of the assets actively being transferred through the system, and this recommendation is higher for processing configurations that include the normalization of audiovisual materials.
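The disk space guideline above is simple arithmetic, sketched here for planning purposes. The transfer sizes are made-up example values and the function name is hypothetical. The multiplier defaults to the upper end of the 3-4 times range and would need to be raised for configurations that normalize audiovisual materials.

```python
def recommended_free_space_gb(transfer_sizes_gb, multiplier=4.0):
    """Estimate pipeline disk space for the transfers in flight.

    transfer_sizes_gb: sizes (in GB) of all transfers being processed
    at the same time. multiplier: 3.0 to 4.0 per the guideline above.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return sum(transfer_sizes_gb) * multiplier


# Hypothetical example: two transfers of 120 GB and 80 GB in flight at once.
active_transfers = [120.0, 80.0]
low_estimate = recommended_free_space_gb(active_transfers, multiplier=3.0)   # 600.0
high_estimate = recommended_free_space_gb(active_transfers, multiplier=4.0)  # 800.0
```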

These considerations around transfer size and system requirements should be weighed together with the above recommendations for Scope of Materials.

Spreadsheet format

We recommend using the CSV file format over the other acceptable file formats when generating Manifest files that will pass through Archivematica and be used by the Avalon Media System. The acceptable formats are .xls, .xlsx, .csv, and .ods. All are valid spreadsheet formats, but CSV is a plain-text format that is not tied to any vendor, whereas .xls and .xlsx are Microsoft Excel formats intended for use in Microsoft Office products. We recommend CSV because it is much less likely to cause interoperability issues in the future. Because Archivematica works in a Linux environment, it is important to use a format that is universally cross-compatible, even if digital objects begin in a Windows environment. ODS is a spreadsheet file format used by OpenOffice/StarOffice, but it is not as widely used as CSV, so CSV is still preferred. Any of the above formats can be read and recognized as spreadsheets, but only CSV files are compliant with Archivematica’s CSV validator service.
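As a rough sketch of producing a manifest as plain-text CSV rather than an Excel workbook, the following uses Python's standard csv module. The column names and row values are illustrative placeholders only and should not be read as the Avalon manifest specification. UTF-8 encoding is an assumption made here for cross-platform consistency.

```python
import csv
from pathlib import Path


def write_manifest(path: Path, header, rows) -> None:
    """Write rows to a UTF-8 CSV file. newline="" lets the csv module
    control line endings itself, so output is the same on any platform."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_manifest(path: Path):
    """Read the CSV back as a list of rows (each row is a list of strings)."""
    with path.open("r", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Hypothetical columns and values, for illustration only.
example_header = ["Title", "Date Issued", "File"]
example_rows = [
    ["Concert, Part 1", "1998", "content/concert_part1.mov"],
    ["Concert, Part 2", "1998", "content/concert_part2.mov"],
]
```

The csv module quotes fields that contain commas, such as the titles above, so the values survive a round trip through any CSV-aware tool.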