Samvera Newspapers Interest Group Call: 2019-05-02

Connection Details: 

Join Zoom Meeting
https://zoom.us/j/993200218

One tap mobile
+16699006833,,993200218# US (San Jose)
+16465588656,,993200218# US (New York)

Dial by your location
+1 669 900 6833 US (San Jose)
+1 646 558 8656 US (New York)
Meeting ID: 993 200 218
Find your local number: https://zoom.us/u/adKGAdrW7F

Date

Time: 1PM EDT / 12PM CDT / 11AM MDT / 10AM PDT

Facilitator: Eben English (Boston Public Library)

Notetaker: TK

Etherpad: https://etherpad.wikimedia.org/p/Samvera_Newspapers_Interest_Group_Call__2019-05-02

Attendees

Agenda

  1. Newspapers Grant update
    1. Github: https://github.com/marriott-library/newspaper_works
    2. Recent dev work
      1. Calendar browse
      2. Highlight search term in show view
      3. ChronAm style URLs
      4. NDNP ingest rake task

  2. Newspapers Testing
    1. Vagrant: https://github.com/marriott-library/samvera-vagrant
    2. Test site: https://newspaperworks.digitalnewspapers.org
    3. Wiki: https://github.com/marriott-library/newspaper_works/wiki

  3. Article segmentation in METS-ALTO

  4. TIFF batch ingest workflow

  5. Content Examples
    1. https://drive.google.com/drive/folders/0BwKKtxaBVqjEbE5zMFdWUEU4WGM?usp=sharing
      1. Still need: TEI, Veridian

  6. Intel sharing from other groups/projects

  7. Next meeting: 
    1. Thursday, July 11 2019 ?

Notes

Grant Update

  • Almost ready for stress testing ingest (NDNP)
    • Command line: a rake task takes a path argument pointing to an NDNP batch
    • Should create the title/publication work (or link to an existing one, based on LCCN), then ingest reels, issues, pages
  • UI:
    • Calendar-based browsing for any given title/publication
    • Basic working functionality, later goals to style and improve look
    • Search results: the search term is highlighted within the IIIF viewer when a search result links to the show view of a page/issue
    • Semantic URLs:
      • for publication/title (using LCCN)
      • for issues, pages (not sequential)
      • not included: article URLs
  • Search within a newspaper title
  • Upcoming: 
    • Search interface just for newspaper content (like "advanced search" in the sense of fields, but not necessarily the boolean operators)
    • Search within a title, select a date range, choose a language (facet), article type, etc.
    • Question (Gordon): is this based on Blacklight advanced search?
      • Eben: This is TBD; trying to avoid a collision in view configuration between multiple advanced searches.
    • Batch ingest for PDF, TIFF
    • Article segmentation in METS-ALTO?
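
The semantic URL scheme noted above (title, issue, and page URLs, with no article URLs) can be illustrated with a short sketch. It assumes a path layout modeled on Chronicling America, `/lccn/<LCCN>/<YYYY-MM-DD>/ed-<n>/seq-<n>`; the function names and the exact routes are hypothetical and are not taken from the newspaper_works gem.

```python
import re
from datetime import date

# Hypothetical path layout modeled on Chronicling America URLs; the routes
# actually used by newspaper_works may differ.
_PATH_RE = re.compile(
    r"^/lccn/(?P<lccn>[a-z0-9]+)"
    r"(?:/(?P<date>\d{4}-\d{2}-\d{2})/ed-(?P<edition>\d+)"
    r"(?:/seq-(?P<seq>\d+))?)?/?$"
)


def build_path(lccn, issue_date=None, edition=1, seq=None):
    """Build a title, issue, or page path from its identifying parts."""
    path = f"/lccn/{lccn}"
    if issue_date is not None:
        path += f"/{issue_date.isoformat()}/ed-{edition}"
        if seq is not None:
            path += f"/seq-{seq}"
    return path


def parse_path(path):
    """Return the parts of a path as a dict, or None if it does not match."""
    m = _PATH_RE.match(path)
    if m is None:
        return None
    parts = {"lccn": m.group("lccn")}
    if m.group("date"):
        try:
            parts["date"] = date.fromisoformat(m.group("date"))
        except ValueError:
            return None  # looks like a date but is not a valid calendar day
        parts["edition"] = int(m.group("edition"))
    if m.group("seq"):
        parts["seq"] = int(m.group("seq"))
    return parts
```

Keeping the LCCN and issue date in the path means a title or issue URL can be derived from metadata alone, without knowing an internal work identifier.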

Testing

  • Still a work in progress
    • Vagrant instance is likely best choice
    • Test site is up, works, provides the PDF ingest functionality, but still needs most of the UI features deployed to it.
      • Behind current master (of newspaper_works gem).
      • Hoping soon for feedback.
    • Improving documentation in wiki of `newspaper_works` repository.

Article-segmented ALTO?

  • Open query to the people on the call...
    • Anyone in group with experience with this?
    • Thinking about metadata extraction (e.g. headline extraction, text classification)?
    • Are there any projects anyone is aware of that have tackled this?
    • Eben: "people like the idea of this, but very few have paid to have this done" ... ergo "little information on best practices".
    • Title extraction: if there is something in the ALTO with cues for font size, "this block of text might be the title".
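
The font-size cue mentioned for title extraction can be sketched roughly as follows. This is a minimal heuristic, not an established practice from any project on the call: it assumes the ALTO file declares `TextStyle` elements with `FONTSIZE` and that `TextBlock` elements point at them via `STYLEREFS`. Real ALTO may carry `STYLEREFS` on lines or strings instead, or omit font sizes entirely, in which case the sketch returns nothing. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any XML namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def guess_headline(alto_xml):
    """Return the text of the TextBlock with the largest font size, or None."""
    root = ET.fromstring(alto_xml)

    # Map TextStyle IDs to their declared font sizes.
    sizes = {}
    for el in root.iter():
        if local(el.tag) == "TextStyle" and el.get("FONTSIZE"):
            sizes[el.get("ID")] = float(el.get("FONTSIZE"))

    best_size, best_text = 0.0, None
    for block in root.iter():
        if local(block.tag) != "TextBlock":
            continue
        # Only block-level style references are considered in this sketch.
        refs = (block.get("STYLEREFS") or "").split()
        size = max((sizes[r] for r in refs if r in sizes), default=0.0)
        words = [s.get("CONTENT", "") for s in block.iter()
                 if local(s.tag) == "String"]
        if words and size > best_size:
            best_size, best_text = size, " ".join(words)
    return best_text
```

On a newspaper page the largest block is often the masthead rather than an article headline, so any real use would need to apply this per article region, which is exactly the segmentation data that is usually missing.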

TIFF batch ingest

  • Does anyone have experience with ingest workflows for directories of TIFF newspaper pages or PDFs of issues?
    • Gordon: ingesting of PDFs, mostly.
      • Eben: was there some kind of manifest that gave hints to the files?
        • Gordon: I think it might have been XML descriptions of the files (will message Eben details)
    • Eben: Leaning toward solution that is based on stipulated folder and file naming convention, instead of inventing some kind of required manifest.
      • Each PDF file name would need to include the issue date; the folder containing the PDFs would need to specify something like the LCCN (or some other clue to publication name or identifier?).
        • Enough metadata for a batch ingest with a useful result.
        • Naming convention does not seem an arduous requirement.
      • Could have some kind of nesting, for TIFFs, with a folder per date, possibly in folders for year, in a parent folder for the publication.
        • Would have to presume some kind of lexical order of the file naming for the pages within each issue. Files would not need the issue date if the parent folder had the issue date in its directory name.
      • Hypothetically not complex to prepare materials for this structure.
      • Nicholas: may have some YML manifest example.
      • Sean: possible to eventually do configuration and/or manifest as a progressive enhancement at a future date.
      • Eben: how arduous is it to create the YML file vs. creating the file/folder structure/naming?
        • e.g. NDNP: a folder named with the LCCN of a title, containing folders for reels, with issue directories under each reel named by date. While the reel folder is somewhat superfluous to this example, the LCCN and the date are enough for an MVP (ingested pages linked to parent issue and title works, without being orphaned).
      • Newspapers project looking at this in next 3-4 weeks.
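
The folder and file naming convention discussed above can be sketched as a small planning step that runs before any ingest. The layout `<LCCN>/<YYYY>/<YYYY-MM-DD>/<page>.tif` and the function name `plan_batch` are assumptions for illustration; the group has not settled on a convention, and this is not how newspaper_works does it.

```python
import re
from datetime import date
from pathlib import Path

DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def plan_batch(publication_dir):
    """Group TIFF pages into issues based on an assumed folder layout.

    Assumed (hypothetical) layout: <LCCN>/<YYYY>/<YYYY-MM-DD>/<page>.tif
    """
    root = Path(publication_dir)
    issues = []
    for issue_dir in sorted(root.glob("*/*")):
        if not issue_dir.is_dir() or not DATE_DIR.match(issue_dir.name):
            continue
        try:
            issue_date = date.fromisoformat(issue_dir.name)
        except ValueError:
            continue  # folder name looks like a date but is not a valid one
        # Page order is taken from the lexical order of the file names.
        pages = sorted(
            p.name for p in issue_dir.iterdir()
            if p.suffix.lower() in (".tif", ".tiff")
        )
        if pages:
            issues.append({
                "lccn": root.name,        # parent folder names the publication
                "date": issue_date,       # issue folder names the date
                "pages": pages,
            })
    return issues
```

Because page order relies on lexical sorting, page file names would need zero padding (`0002.tif` before `0010.tif`). The resulting list carries the LCCN and date for every issue, which matches the minimum metadata the discussion identified for linking pages to issue and title works.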

Content examples, intel sharing

Standing item for content examples. Some of these materials have not come in so far, but we presume we are in okay shape for now for representative materials.

Next meeting

  • Thursday July 11, 1 PM EDT.

Action items

  •