06. Archivematica Recommendations



Transfer and Ingest microservices are performed via an Avalon-optimized pipeline.

Each file will step through all of the Transfer and Ingest microservices and create an AIP.

Processing configuration

Avalon-optimized processing configuration

Documentation on processing configuration fields.

Suggested choice is in bold, other options are normal weight.

Name

default

The Avalon workflow does not require the use of this feature.

Assign UUIDs to directories

Directories are given an entry in the fileSec and assigned a unique universal identifier (UUID). Note that the digital objects in the transfer are always assigned a UUID.

Options:

  1. None - the user is prompted for a decision.
  • Yes - UUIDs are assigned.
  2. No - UUIDs are not assigned.

Send transfer to quarantine

Transfers are sequestered until virus definitions can be updated.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - transfers are automatically quarantined.
  3. No - transfers are not sent to quarantine.

The Avalon workflow does not require the use of this feature.

Remove from quarantine after (days)

Transfers are automatically removed from quarantine after a defined number of days and made available for further processing.

Data entry field: 0

The Avalon workflow does not require the use of this feature.

Generate transfer structure report

A text file is generated showing a directory tree of the original transfer structure.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - structure report is created.
  • No - structure report is not created.

The Avalon workflow does not require the use of this feature.

Perform file identification (Transfer)

Choose whether or not to identify the format of the files in your transfer.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - use the enabled file identification command. See Identification for more information.
  3. No - files will not be identified.

By default, Siegfried is chosen because it uses the UK National Archives’ PRONOM database to identify and verify file formats, and the process is quicker than the other file identification option that relies on PRONOM (FIDO).

Extract packages

Packages (such as .zip files) are unzipped and extracted into a directory.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - the contents of the package are extracted.
  3. No - package is left as-is.

If a video file comes packaged with additional data in a compressed format, that data should remain compressed and not be extracted. The Avalon workflow is not expected to encounter data structured in this way.

Delete packages after extraction

Packages that have been extracted in the previous step can be deleted after extraction.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - the package is deleted.
  3. No - the package is preserved along with the extracted content.

Perform policy checks on originals

If you create policies using MediaConch, run the policies against the transfer to assess conformance.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - the transfer is checked against any policies.
  3. No - policies are ignored.

MediaConch specializes in analyzing media files and can be configured to confirm that files fall within the scope of acceptable formats. If files are already checked in a pre-validation step, MediaConch can be skipped; institutions concerned about very specific or unusual file formats being processed by Avalon may still want to use it. If no additional configuration has been done for this feature, set this to “No - policies are ignored.”

Examine contents

Run Bulk Extractor, a forensics tool that can recognize credit card numbers, social security numbers, and other patterns in data. For more information on reviewing Bulk Extractor logs, see the Analysis pane on the Appraisal tab.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - Bulk Extractor scans content and creates log outputs of recognized patterns for review.
  3. No - Bulk Extractor does not run.

The Avalon workflow does not require the use of this feature.

Create SIP(s)

Create a formal SIP out of the transfer or send it to the backlog.

Options:

  1. None - the user is prompted for a decision.
  2. Send to backlog - transfer is sent to a backlog storage space for temporary storage or appraisal.
  3. Create single SIP and continue processing - transfer becomes a SIP and is made available for further processing on the ingest tab.

The Avalon workflow does not require processes to stop between the Transfer and Ingest stages of Archivematica. A SIP can be created automatically without human intervention, and the Backlog feature is not being used with this workflow.

Select file format identification command (Ingest)

Choose a tool to identify the format of files in your SIP.

Options:

  1. None - the user is prompted for a decision.
  2. Use existing data - reuse file identification data from the transfer tab.
  3. Identify using Fido - use fido to identify files by their file signature.
  4. Identify using Siegfried - use Siegfried to identify files by their signature.
  5. Identify by File Extension - identify files by their extension rather than their signature.

There is no need to run this process twice, and running it during the Transfer stage is sufficient for this workflow.

Normalize

Convert ingested digital objects to preservation and/or access formats. See Normalize for more information.

Options:

  1. None - the user is prompted for a decision.
  2. Normalize for preservation and access - creates preservation copies of the objects plus access copies which will be used to generate the DIP.
  3. Normalize for preservation - creates preservation copies only. No access copies are created and no DIP will be generated.
  4. Normalize manually - see Manual Normalization for more information.
  5. Do not normalize - the AIP will contain originals only. No preservation or access copies are generated and no DIP will be generated.
  6. Normalize service files preservation - see Digitization for more information.
  7. Normalize for access - the AIP will contain originals only. No preservation copies will be generated. Access copies will be created which will be used to generate the DIP.

For the Avalon workflow, Avalon is the service responsible for creating transcoded access copies from preservation-level master files, so the Normalization step can be skipped. Normalization is normally the point at which Archivematica creates a DIP. Instead, the Avalon workflow should use the automation-tools feature for creating a DIP from an existing AIP, which transfers all assets and leaves decisions about the normalization of files to Avalon.

Approve normalization

The dashboard allows users to review the normalization output and the normalization report.

Options:

  1. None - the user has a chance to review and approve normalization.
  2. Yes - skip the review step and automatically continue processing.

Generate thumbnails

This gives the option of generating thumbnails for use in the AIP and DIP.

Options:

  1. None - the user is prompted for a decision.
  2. Yes, without default - thumbnails will be produced for any format which has a normalize for thumbnails rule in the FPR. Formats which do not have a rule will not have a thumbnail generated.
  • No - thumbnails will not be generated.
  3. Yes - thumbnails will be generated according to the format rules in the FPR. Formats which do not have a rule will have a default thumbnail generated (grey document icon).

The Avalon workflow does not require the use of this feature.

Perform policy checks on preservation derivatives

If you create policies using MediaConch, run the policies against the newly-created preservation derivatives to ensure conformance.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - the normalized files are checked against any policies.
  • No - policies are ignored.

The Avalon workflow does not require the use of this feature.

Perform policy checks on access derivatives

If you create policies using MediaConch, run the policies against the newly-created access derivatives to ensure conformance.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - the normalized files are checked against any policies.
  3. No - policies are ignored.

The Avalon workflow does not require the use of this feature.

Bind PIDs

Assign persistent identifiers and send the information to a Handle Server (must be configured).

Options:

  1. None - the user is prompted for a decision.
  2. Yes - PIDs are created and an API call posts the PIDs to the Handle Server.
  3. No - PIDs are not created.

The Avalon workflow does not require the use of this feature.

Document empty directories

By default, Archivematica removes empty directories and does not document that they existed.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - an entry for the directory is created in the structmap.
  3. No - the directory is not documented.

The Avalon workflow does not require the use of this feature.

Reminder: add metadata if desired

Archivematica allows users to add metadata to the SIP using the GUI. This reminder occurs at the last moment that it is possible to add metadata; once the ingest proceeds past this point, it is no longer possible to add metadata to the SIP.

Options:

  1. None - the user has a chance to add metadata.
  2. Continue - skip the reminder and automatically continue processing.

The Avalon workflow does not require the use of this feature.

Transcribe files (OCR)

Users can elect to run Tesseract, an OCR tool that is included in Archivematica, to produce text files containing file transcripts. For more information, see Transcribe SIP contents.

Options:

  1. None - the user is prompted for a decision.
  2. Yes - Tesseract runs on all OCR-able files.
  3. No - Tesseract does not run.

The Avalon workflow does not require the use of this feature. Tesseract does not work on media files.

Select file format identification command (Submission documentation & metadata)

Choose a tool to identify the format of any submission documentation and/or metadata files that were included in your transfer.

Options:

  1. None - the user is prompted for a decision.
  2. Identify using Siegfried - use Siegfried to identify files by their signature.
  3. Identify using Fido - use fido to identify files by their file signature.
  4. Identify by File Extension - identify files by their extension rather than their signature.
  5. Skip File Identification - file identification is not run on submission documentation or metadata files.

Select compression algorithm

AIPs created by Archivematica can be stored as compressed packages or uncompressed, depending on your storage requirements.

Options:

  1. None - the user is prompted for a decision.
  • 7z using bzip2 - a 7Zip file is created using the tool bzip2.
  2. 7z using LZMA - a 7Zip file is created using the tool LZMA.
  3. Uncompressed - the AIP is not compressed.
  4. Parallel bzip2 - a 7Zip file is created using the tool Parallel bzip2 (pbzip2).

The default packaging and compression algorithm is recommended.

Select compression level

If you selected a compression choice in the step above, you can determine how compressed you would like your AIP to be. Selecting a higher compression level means that the resulting AIP is smaller, but compression also takes longer. Lower compression levels mean quicker compression, but a larger AIP.

Options:

  1. None - the user is prompted for a decision.
  • 5 - normal compression mode - the compression tool will strike a balance between speed and compression to make a moderately-sized, moderately-compressed AIP.
  2. 7 - maximum compression - a smaller AIP that takes longer to compress.
  3. 9 - ultra compression - the smallest possible AIP.
  4. 3 - fast compression mode - a larger AIP that will be compressed quickly.
  5. 1 - fastest mode - the AIP will be compressed as quickly as possible.

The default compression level is recommended.

Store AIP

Once processing is complete, AIPs can be stored without interrupting the workflow in the dashboard.

Options:

  1. None - the user is prompted for a decision.
  • Yes - the AIP is marked for storage automatically.

The AIP can automatically be stored and sent to preservation storage.

Store AIP location

If the previous step and this step are configured, all AIPs will be sent to the selected storage location (unless you have included a custom processing configuration with the transfer that defines another location).

Options:

  1. None - the user is prompted for a decision.
  2. Default location - the AIP is stored in the AIP storage location that has been defined as the default in the Storage Service.
  3. [Other storage locations] - any other AIP storage locations that are available will also appear on this list.

The workflow should be configured with the desired storage option, and this option can be set as the default location. If this option is not set as the default location, the specific storage location can be used here instead. This may be the case if Archivematica is used for mixed workflows and other materials should go to a storage location that is not shared with the Avalon Media System.

Upload DIP

If a DIP was created, it can be automatically sent to an access system for which there is an Archivematica integration.

Options:

  1. None - the user is prompted for a decision.
  2. Upload DIP to CONTENTdm - see CONTENTdm DIP upload documentation.
  3. Upload DIP to Archivists Toolkit - see Archivists Toolkit DIP upload documentation.
  4. Upload DIP to AtoM - see AtoM DIP upload documentation.
  5. Do not upload - the DIP will not be uploaded to an access system.
  6. Upload DIP to ArchivesSpace - see ArchivesSpace DIP upload documentation.

DIP upload is not a component of the Avalon workflow.

Store DIP

If a DIP was created, it can be stored without interrupting the workflow in the dashboard. Note that DIP storage is not required, and that DIPs can be created on demand by reingesting the AIP.

Options:

  1. None - the user is prompted for a decision.
  • Yes - the DIP is marked for storage automatically.

DIP upload is not a component of the Avalon workflow.

Store DIP location

If the previous step and this step are configured, all DIPs will be sent to the selected storage location (unless you have included a custom processing configuration with the transfer that defines another location).

Options:

  1. None - the user is prompted for a decision.
  • Default location - the DIP is stored in the DIP storage location that has been defined as the default in the Storage Service.
  2. [Other storage locations] - any other DIP storage locations that are available will also appear on this list.

DIP upload is not a component of the Avalon workflow.

AIP Structure


The structure of the AIP will look similar to this example below:


AvalonCollection-76dd330d-bc61-4182-b339-04c3d8f78cc4
├── bag-info.txt
├── bagit.txt
├── data
│   ├── logs
│   │   ├── fileFormatIdentification.log
│   │   ├── filenameCleanup.log
│   │   └── transfers
│   │       └── AvalonCollection-252f53a9-0bad-4fb1-931e-012fac618167
│   │           └── logs
│   │               ├── fileFormatIdentification.log
│   │               └── filenameCleanup.log
│   ├── METS.76dd330d-bc61-4182-b339-04c3d8f78cc4.xml
│   ├── objects
│   │   ├── assets
│   │   │   ├── agz3068a.wav
│   │   │   ├── lunchroom_manners_512kb.mp4
│   │   │   ├── lunchroom_manners_512kb.mp4.structure.xml
│   │   │   ├── lunchroom_manners_512kb.mp4.vtt
│   │   │   ├── OrganClip.high.mp4
│   │   │   ├── OrganClip.low.mp4
│   │   │   └── OrganClip.medium.mp4
│   │   ├── Demo_Manifest.csv
│   │   ├── metadata
│   │   │   └── transfers
│   │   │       └── AvalonCollection-252f53a9-0bad-4fb1-931e-012fac618167
│   │   │           └── directory_tree.txt
│   │   └── submissionDocumentation
│   │       └── transfer-AvalonCollection-252f53a9-0bad-4fb1-931e-012fac618167
│   │           └── METS.xml
│   └── README.html
├── manifest-sha256.txt
└── tagmanifest-md5.txt
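
Because the AIP is packaged as a BagIt bag, the payload of an extracted (uncompressed) AIP can be checked against manifest-sha256.txt. The following is a minimal sketch of such a check, not part of Archivematica itself; the function name verify_bag_payload is hypothetical, and the sketch assumes each manifest line holds a checksum followed by a path relative to the bag root.

```python
import hashlib
from pathlib import Path


def verify_bag_payload(bag_dir):
    """Return the relative paths that are missing or fail their SHA-256 check.

    Assumes bag_dir is an extracted bag containing manifest-sha256.txt,
    with one "<checksum> <relative path>" entry per line.
    """
    bag_dir = Path(bag_dir)
    manifest = bag_dir / "manifest-sha256.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split only once so that paths containing spaces stay intact.
        expected, rel_path = line.split(maxsplit=1)
        file_path = bag_dir / rel_path
        if not file_path.is_file():
            failures.append(rel_path)
            continue
        digest = hashlib.sha256()
        with file_path.open("rb") as handle:
            # Read in 1 MiB chunks so large media files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every file named in the manifest is present and matches its recorded checksum.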




Interim storage location

At this point in the workflow, Archivematica continues to process the AIP and store it in the desired storage location. Copies are sent to a watched folder to be picked up by Avalon and processing continues, relying on the included Manifest.csv file to add metadata and establish connections between the files.

AIP Path

The AIP continues to be processed by Archivematica and is set to be stored in the final Preservation Storage system. Confirmation of deposit can be assured via the Archival Storage tab in Archivematica, or by viewing the Archivematica Storage Service. Independent tracking and monitoring may be set up specifically within the storage space.

DIP Path

For the Avalon workflow, Avalon is the service responsible for creating transcoded access copies from preservation-level master files, so the Normalization step can be skipped. The Normalization microservices are normally the point at which Archivematica creates a DIP. Instead, the Avalon workflow should use the automation-tools feature for creating a DIP from an existing AIP, which transfers all assets and leaves decisions about the normalization of files to Avalon.

The automation tools leverage the Archivematica API capabilities to perform tasks during or after Archivematica has performed preservation processing services. To create a DIP from the AIP without modifying the original contents, the Automation Tools should be configured to create a DIP after the AIP has been stored, pulling from the Archivematica Storage Service.


python -m aips.create_dip \
--ss-url http://archivematica-storage-service:8000 \
--ss-user test --ss-api-key test \
--dip-type avalon-manifest \
--output-dir /tmp \
--aip-uuid 1f160dff-1883-4089-9db0-1551b5842f2b
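
When several AIPs need DIPs, the command above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the command; the helper names build_create_dip_command and create_dips are hypothetical, and the sketch assumes it is run from the automation-tools directory so that `python -m aips.create_dip` resolves.

```python
import subprocess


def build_create_dip_command(aip_uuid, ss_url, ss_user, ss_api_key, output_dir):
    """Build the argument list for one create_dip run (flags as shown above)."""
    return [
        "python", "-m", "aips.create_dip",
        "--ss-url", ss_url,
        "--ss-user", ss_user,
        "--ss-api-key", ss_api_key,
        "--dip-type", "avalon-manifest",
        "--output-dir", output_dir,
        "--aip-uuid", aip_uuid,
    ]


def create_dips(aip_uuids, **settings):
    """Run create_dip once per AIP UUID, stopping at the first failure."""
    for aip_uuid in aip_uuids:
        command = build_create_dip_command(aip_uuid, **settings)
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs and API keys.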



Pre-Avalon Script


Automation-tools will send DIPs created from ingested archival assets to a folder using the create_avalon_dip.sh script, as described above. A script (or multiple scripts) should be configured to act on the contents of that folder and perform the following actions:

  • Unzip the DIP object
  • Append the UUID to the Manifest file
  • Identify the Collection folder by transfer name
  • Move the DIP object into the Collection folder for further processing

Extract from ZIP

Extract the DIP's contents from its ZIP carrier. Archivematica’s automation-tools DIP creator copies files from an AIP and delivers them as a ZIP file to a folder, where the file can be unzipped and processed.
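
The extraction step can be sketched with Python's standard zipfile module. This is an illustration only: the function name extract_dip is hypothetical, and the choice to extract into a folder named after the ZIP file is an assumption rather than documented behavior of the workflow.

```python
import zipfile
from pathlib import Path


def extract_dip(zip_path, dest_dir):
    """Extract a DIP ZIP into dest_dir/<zip name without .zip> and return that folder."""
    zip_path = Path(zip_path)
    # Assumption: one working folder per DIP, named after the ZIP file.
    target = Path(dest_dir) / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```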

Add UUIDs to Manifest

Take file UUIDs, minted by Archivematica, and add them to the Manifest meant for ingest into Avalon, setting the UUID as “Other Identifier” and setting “Other Identifier Type” to “archivematica” or whatever value is preferred by the system. 
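
A minimal sketch of this step follows, operating on manifest rows already read with Python's csv module. Several details are assumptions: that the first manifest row holds batch information and the second row holds column headers, that a column named “File” holds each item's file path, and that a mapping from file path to Archivematica UUID (uuid_by_file, hypothetical) has been built elsewhere, for example from the METS file. The function name add_uuids_to_manifest is also hypothetical.

```python
def add_uuids_to_manifest(rows, uuid_by_file, file_column="File",
                          id_type="archivematica"):
    """Return new manifest rows with identifier columns appended.

    rows[0] is assumed to be the batch-info row and rows[1] the header row;
    every later row describes one item.
    """
    batch_info, header, *items = rows
    file_idx = header.index(file_column)
    new_header = header + ["Other Identifier", "Other Identifier Type"]
    new_items = []
    for item in items:
        # Pad short rows so the new cells line up under the new headers.
        padded = item + [""] * (len(header) - len(item))
        uuid = uuid_by_file.get(padded[file_idx], "")
        new_items.append(padded + [uuid, id_type if uuid else ""])
    return [batch_info, new_header] + new_items
```

Rows whose file has no known UUID are left with empty identifier cells rather than dropped, so the manifest keeps every item.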

Collection identified by transfer name and moved

Using the transfer name as a method of identifying the appropriate Collection, the script can identify where to move the DIP folder and which Collection it should be moved into.
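
The identification and move can be sketched as follows. The sketch assumes that the extracted DIP folder is named “<transfer name>-<UUID>” (the pattern seen in the AIP name), that a Collection folder with the same name as the transfer already exists in the dropbox, and that the DIP should land in a subfolder named by its UUID. The function names split_dip_name and move_to_collection are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Matches "<transfer name>-<UUID>", e.g. "AvalonCollection-76dd330d-...".
DIP_NAME = re.compile(
    r"(?P<name>.+)-(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def split_dip_name(folder_name):
    """Split a DIP folder name into (transfer name, UUID)."""
    match = DIP_NAME.fullmatch(folder_name)
    if match is None:
        raise ValueError(f"Unexpected DIP folder name: {folder_name}")
    return match.group("name"), match.group("uuid")


def move_to_collection(dip_dir, dropbox_root):
    """Move dip_dir to <dropbox_root>/<transfer name>/<UUID> and return the new path."""
    dip_dir = Path(dip_dir)
    name, uuid = split_dip_name(dip_dir.name)
    collection_dir = Path(dropbox_root) / name
    if not collection_dir.is_dir():
        # Refuse to guess: the Collection folder must already exist.
        raise FileNotFoundError(f"No Collection folder for transfer: {name}")
    destination = collection_dir / uuid
    return Path(shutil.move(str(dip_dir), str(destination)))
```

Raising an error when no matching Collection folder exists keeps a mistyped transfer name from silently creating a new Collection.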

These actions can be configured to run at regular intervals using cron or a similar service, or the scripts can be set off manually by a user when ready to perform these actions.

DIP Structure


The created DIP will have the following structure:


/path/to/CollectionsDropbox/Collection/
└── UUID
    ├── assets
    │   ├── agz3068a.wav
    │   ├── lunchroom_manners_512kb.mp4
    │   ├── lunchroom_manners_512kb.mp4.structure.xml
    │   ├── lunchroom_manners_512kb.mp4.vtt
    │   ├── OrganClip.high.mp4
    │   ├── OrganClip.low.mp4
    │   └── OrganClip.medium.mp4
    └── Demo_Manifest.csv
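
Before Avalon picks up a DIP, its layout can be checked against the structure above. The following is a minimal sketch; the function name check_dip_layout is hypothetical, and the rule that a DIP folder holds exactly one manifest CSV alongside an assets directory is an assumption drawn from the example.

```python
from pathlib import Path


def check_dip_layout(dip_dir):
    """Return a list of layout problems; an empty list means the DIP looks as expected."""
    dip_dir = Path(dip_dir)
    problems = []
    if not (dip_dir / "assets").is_dir():
        problems.append("missing assets directory")
    # Only top-level CSV files count as candidate manifests.
    manifests = sorted(dip_dir.glob("*.csv"))
    if len(manifests) != 1:
        problems.append(f"expected one manifest CSV, found {len(manifests)}")
    return problems
```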


Confirmation

The final step of the process requires someone to verify contents have been successfully added as an AIP to archival storage and as an access copy into Avalon. The user should confirm deposit and clean up transfer representations from Archivematica.

Additional References

Diagram: https://docs.google.com/drawings/d/1rfFB9OlGk1NWFjeOIWET94D-TCEk_7-9f2uObBld8ws/edit

Processing configuration recommendation: https://docs.google.com/document/d/1ew2EaN7ijWwZ24J6oVZhsThIJdXEuyACKhZ6fzHPtLQ/edit#heading=h.cywgxboqhflj

Avalon/Archivematica crosswalk: https://docs.google.com/spreadsheets/d/1bSi380JW8piBFk99gMensJ4mhTr7y-b-Qf-AeVzt_U4/edit#gid=0

Automation tools DIP Creation: https://github.com/artefactual/automation-tools#dip-creation

Current sample bag structure


├── assets
│   ├── agz3068a.wav
│   ├── lunchroom_manners_512kb.mp4
│   ├── lunchroom_manners_512kb.mp4.structure.xml
│   ├── lunchroom_manners_512kb.mp4.vtt
│   ├── OrganClip.high.mp4
│   ├── OrganClip.low.mp4
│   └── OrganClip.medium.mp4
└── Demo_Manifest.xlsx


Extension options


Preservation-only materials

If files are meant for preservation but not for access, the creation of a DIP from the AIP can be skipped. Alternatively, the files can be removed before or after ingest into Avalon.


Existing transcoded materials

It is possible to adapt this workflow to suit a situation in which transcoded access copies have already been created, when those transcoded copies are in alignment with the rules and structure of the Avalon Media System. 

The following points in the workflow would have to be modified:

  • adjust the archival bag settings so that the access copies are located in a specific folder that Archivematica can identify,
  • change the Normalization settings to use existing service copies,
  • change the DIP upload settings so that upload happens during the Archivematica workflow process, and
  • store DIPs in the same space as the standard workflow.