Samvera Newspapers Interest Group Call: 2019-05-02

Connection Details:

Join Zoom Meeting
https://zoom.us/j/993200218

One tap mobile
+16699006833,,993200218# US (San Jose)
+16465588656,,993200218# US (New York)

Dial by your location
+1 669 900 6833 US (San Jose)
+1 646 558 8656 US (New York)
Meeting ID: 993 200 218
Find your local number: https://zoom.us/u/adKGAdrW7F

Date

02 May 2019

Time: 1PM EDT / 12PM CDT / 11AM MDT / 10AM PDT

Facilitator: Eben English (Boston Public Library)

Notetaker: Sean Upton (University of Utah)

Etherpad: https://etherpad.wikimedia.org/p/Samvera_Newspapers_Interest_Group_Call__2019-05-02

Attendees

Eben English (BPL, leading call)
nhomenda (Indiana University)
Gordon Leacock (Univ. of Michigan)
Clifford Wulfman (Princeton)
Sean Upton (University of Utah, taking notes)

Agenda

Newspapers Grant update
1. Github: https://github.com/marriott-library/newspaper_works
2. Recent dev work
  1. Calendar browse
  2. Highlight search term in show view
  3. ChronAm style URLs
  4. NDNP ingest rake task
Newspapers Testing
1. Vagrant: https://github.com/marriott-library/samvera-vagrant
2. Test site: https://newspaperworks.digitalnewspapers.org
3. Wiki: https://github.com/marriott-library/newspaper_works/wiki
Article segmentation in METS-ALTO
TIFF batch ingest workflow
Content Examples
1. https://drive.google.com/drive/folders/0BwKKtxaBVqjEbE5zMFdWUEU4WGM?usp=sharing
  1. Still need: TEI, Veridian
Intel sharing from other groups/projects
Next meeting:
1. Thursday, July 11 2019 ?

Notes

Grant Update

Almost ready for stress testing ingest (NDNP)
- Command line, rake task takes path argument pointing to NDNP batch
- Should create (or link existing, based on LCCN) title/publication work, ingest reels, issues, pages
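
The create-or-link behavior described above can be sketched as follows. This is a minimal illustration in Python rather than the project's own code, with an in-memory dictionary standing in for the repository; `Title`, `TitleRegistry`, and `find_or_create_title` are hypothetical names, not part of `newspaper_works`.

```python
from dataclasses import dataclass, field


@dataclass
class Title:
    """Hypothetical title/publication work, identified by its LCCN."""
    lccn: str
    issues: list = field(default_factory=list)


class TitleRegistry:
    """Hypothetical stand-in for the repository of title works."""

    def __init__(self):
        self._by_lccn = {}

    def find_or_create_title(self, lccn):
        # Link to the existing title when the LCCN is already known,
        # otherwise create a new title work for it.
        key = lccn.strip().lower()
        created = key not in self._by_lccn
        if created:
            self._by_lccn[key] = Title(lccn=key)
        return self._by_lccn[key], created


registry = TitleRegistry()
first, was_created = registry.find_or_create_title("sn83030214")
second, was_created_again = registry.find_or_create_title("sn83030214")
# Both calls return the same Title object; only the first call creates it.
```

Ingesting a second batch for the same LCCN would then attach its reels, issues, and pages to the existing title instead of creating a duplicate.
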
UI:
- Calendar-based browsing for any given title/publication
- Basic working functionality, later goals to style and improve look
- Search results: when a result links to the show view of a page or issue, the search term is highlighted within the IIIF viewer
- Semantic URLs:
  - for publication/title (using LCCN)
  - for issues, pages (not sequential)
  - not included: article URLs
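
As an illustration of LCCN-based semantic URLs, the sketch below builds paths following the public Chronicling America pattern (`/lccn/<LCCN>/<date>/ed-<n>/seq-<n>`). The function names are hypothetical, and the actual routes in `newspaper_works` may differ; this only shows the general shape.

```python
from datetime import date


def title_path(lccn):
    # Publication/title URL keyed on the LCCN.
    return f"/lccn/{lccn}"


def issue_path(lccn, issue_date, edition=1):
    # Issue URL: title path plus ISO date and edition number.
    return f"{title_path(lccn)}/{issue_date.isoformat()}/ed-{edition}"


def page_path(lccn, issue_date, sequence, edition=1):
    # Page URL: issue path plus the page's sequence within the issue.
    return f"{issue_path(lccn, issue_date, edition)}/seq-{sequence}"


example = page_path("sn83030214", date(1903, 5, 14), 12)
```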

Search within a newspaper title
Upcoming:
- Search interface just for newspaper content (like "advanced search" in the sense of fields, but not necessarily the boolean operators)
- Search within a title, select a date range, choose a language (facet), article type, etc.
- Question (Gordon): is this based on Blacklight advanced search?
  - Eben: This is TBD; trying to avoid collisions in view configuration when there are multiple advanced search interfaces.
- Batch ingest for PDF, TIFF
- Article segmentation in METS-ALTO?

Testing

Still a work in progress
- Vagrant instance is likely best choice
- Test site is up, works, provides the PDF ingest functionality, but still needs most of the UI features deployed to it.
  - Behind current master (of newspaper_works gem).
  - Hoping soon for feedback.
- Improving documentation in wiki of `newspaper_works` repository.

Article-segmented ALTO?

Open query to the people on the call...
- Anyone in group with experience with this?
- Thinking about metadata extraction (e.g. headline extraction, text classification)?
- Are there any projects anyone is aware of that have tackled this?
- Eben: "people like the idea of this, but very few have paid to have this done" ... ergo "little information on best practices".
- Title extraction: if there is something in the ALTO with cues for font size, "this block of text might be the title".
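
The font-size cue mentioned above can be sketched roughly as follows. This assumes the ALTO file declares `TextStyle` elements with `FONTSIZE` and that `TextBlock` elements point to them via `STYLEREFS`; many OCR outputs omit or flatten this information. The function name, the 1.5 ratio, and the sample XML are all illustrative assumptions, not a tested heuristic.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Styles>
    <TextStyle ID="TXT_0" FONTSIZE="9"/>
    <TextStyle ID="TXT_1" FONTSIZE="24"/>
  </Styles>
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="B1" STYLEREFS="TXT_1">
      <TextLine><String CONTENT="CITY"/><SP/><String CONTENT="COUNCIL"/><SP/><String CONTENT="MEETS"/></TextLine>
    </TextBlock>
    <TextBlock ID="B2" STYLEREFS="TXT_0">
      <TextLine><String CONTENT="The"/><SP/><String CONTENT="council"/><SP/><String CONTENT="met"/><SP/><String CONTENT="Tuesday."/></TextLine>
    </TextBlock>
    <TextBlock ID="B3" STYLEREFS="TXT_0">
      <TextLine><String CONTENT="Members"/><SP/><String CONTENT="discussed"/><SP/><String CONTENT="the"/><SP/><String CONTENT="budget."/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    # Strip the XML namespace so the code works across ALTO versions.
    return tag.rsplit("}", 1)[-1]


def headline_candidates(alto_xml, ratio=1.5):
    """Return text of blocks whose font size is well above the body size."""
    root = ET.fromstring(alto_xml)
    sizes = {
        el.get("ID"): float(el.get("FONTSIZE"))
        for el in root.iter()
        if local(el.tag) == "TextStyle" and el.get("FONTSIZE")
    }
    blocks = []
    for el in root.iter():
        if local(el.tag) != "TextBlock":
            continue
        refs = el.get("STYLEREFS", "").split()
        block_sizes = [sizes[r] for r in refs if r in sizes]
        if not block_sizes:
            continue  # no usable font-size cue for this block
        words = [s.get("CONTENT", "") for s in el.iter() if local(s.tag) == "String"]
        blocks.append((max(block_sizes), words))
    if not blocks:
        return []
    # Treat the font size carrying the most words as the body text size.
    weight = Counter()
    for size, words in blocks:
        weight[size] += len(words)
    body_size = weight.most_common(1)[0][0]
    return [" ".join(words) for size, words in blocks if size >= ratio * body_size]


candidates = headline_candidates(SAMPLE_ALTO)
```

On the sample fragment, only the 24-point block is flagged as a possible headline. Real pages would likely need position and line-count cues as well.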

TIFF batch ingest

Does anyone have experience with ingest workflows for directories of TIFF newspaper pages or PDFs of issues?
- Gordon: ingesting of PDFs, mostly.
  - Eben: was there some kind of manifest that gave hints to the files?
    - Gordon: I think it might have been XML descriptions of the files (will message Eben details)
- Eben: Leaning toward solution that is based on stipulated folder and file naming convention, instead of inventing some kind of required manifest.
  - PDF would need to have date, folder containing them would need to specify something like LCCN (or some other clue to publication name or identifier?).
    - Enough metadata for a batch ingest with a useful result.
    - Naming convention does not seem an arduous requirement.
  - Could have some kind of nesting, for TIFFs, with folder per date, possibly in folders for year, in a parent folder for the publication.
    - Would have to presume some kind of lexical order of the file naming for the pages within each issue. Files would not need issue date if the parent folder had the issue date in its directory name.
  - Hypothetically not complex to prepare materials for this structure.
  - Nicholas: may have some YML manifest example.
  - Sean: possible to eventually do configuration and/or manifest as a progressive enhancement at a future date.
  - Eben: how arduous is it to create the YML file vs. creating the file/folder structure/naming?
    - e.g. NDNP: a folder named with the LCCN of a title, containing folders for reels, with issue directories under each reel named by date. While the reel folder is somewhat superfluous to this example, the LCCN and the date are enough for an MVP (ingested pages linked to parent issue and title works, without being orphaned).
  - Newspapers project looking at this in next 3-4 weeks.
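
A minimal sketch of how the folder and file naming convention discussed above might be read, assuming a layout of `<LCCN>/<optional year>/<YYYY-MM-DD>/<page files>.tif`. The exact layout, the `plan_issues` name, and the ISO date format for issue folders are assumptions for illustration; the group had not settled on a convention.

```python
import re
from datetime import date
from pathlib import Path

DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")
TIFF_SUFFIXES = {".tif", ".tiff"}


def plan_issues(publication_dir):
    """Return (lccn, [(issue_date, [page paths in lexical order]), ...])."""
    publication_dir = Path(publication_dir)
    lccn = publication_dir.name  # assumption: the parent folder is named by LCCN
    issues = []
    for folder in publication_dir.rglob("*"):
        if not folder.is_dir() or not DATE_DIR.match(folder.name):
            continue  # skip files and intermediate folders such as a year
        try:
            issue_date = date.fromisoformat(folder.name)
        except ValueError:
            continue  # looks like a date but is not a valid one
        # Page order is presumed from the lexical order of file names.
        pages = sorted(
            f for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in TIFF_SUFFIXES
        )
        if pages:
            issues.append((issue_date, pages))
    issues.sort(key=lambda item: item[0])
    return lccn, issues
```

Calling `plan_issues("/path/to/sn83030214")` would yield the LCCN plus one entry per dated folder, which is the minimum needed to link pages to issue and title works. A manifest or configuration file could later override these defaults.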

Content examples, intel sharing

Standing item for content examples. Some of these materials have not come in so far, but we presume we are in okay shape for now for representative materials.

Next meeting

Thursday July 11, 1 PM EDT.