8 Show processing
Dave Morriss edited this page 2024-07-10 09:30:42 +01:00

Introduction

At present (2023-06-22), show notes are processed externally. The task is performed by Dave Morriss on his PC workstation.

The details of the show processing stages are included here because the way many steps are done at present, compared with the way they will be done, needs consideration (and possibly debate) before anything is changed.

Overview

The overall process consists of these steps:

  • A new show is detected (or multiple shows if appropriate)
  • The status of the upload is determined; the next stage is performed only when the upload is complete
  • A subset of the files in the HPR server directory where incoming shows are dropped is synchronised using rsync
  • The files collected are written to a local directory for processing
  • Processing consists of:
    • Examination of the file shownotes.json
    • Parsing of this file
    • Extraction of elements of the file, particularly the show notes, title and summary
    • If there are assets (pictures, scripts, etc) they may need work
    • The notes are edited, and may need conversion, reformatting, etc.
    • If the notes are not already HTML, they are converted to HTML
    • A local stand-alone copy of the notes is generated and can be viewed in a browser
    • Further work may be needed to refine the notes
    • Any assets are sent to the HPR server
    • The HTML is sent back to the upload directory
    • The status of the show is set to METADATA_PROCESSED in the reservations table in the HPR database
    • A message is sent to the HPR Janitor's Closet room on Matrix
  • The local directory for the show is retained in case further work is required, and deleted after a time by a script

Details

Show detection

  • At present, new shows are detected by scraping the page at https://hub.hackerpublicradio.org/calendar.php
  • The scraper is a Perl script called scrape_HPR and it is run every 30 minutes during the day (UK time)
  • The scraper detects all of the shows in the main table and categorises them by matching with regular expressions. It is able to spot empty slots, reserved slots, slots that have been requested by clicking on their links and are uploading, uploaded shows awaiting processing and slots which contain already processed shows.
  • Uploading and uploaded shows in need of processing are reported by various methods: pop-up alerts, sounds, and the triggering of IoT devices (LED lights)
  • The scraper also detects what it sees as anomalies:
    • Shows that appear "fully formed" without apparently going through the upload process
    • Shows that disappear where they had once existed in the fully uploaded state
  • Since the presence of the local show directory affects other parts of the workflow, some extra actions are taken when anomalies are detected:
    • New unexpected shows cause "dummy" directories to be created
    • Show disappearances cause existing directories to be moved to a holding area (in case they are wanted in the future)
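The anomaly handling described above amounts to comparing two successive snapshots of the calendar. The following is a minimal Python sketch of that comparison, not the logic of scrape_HPR itself (which is a Perl script). The state names, the snapshot format (a mapping from show number to state) and the function find_anomalies are all assumptions made for illustration, as is the reading of "fully formed" as "processed where the slot was previously empty".

```python
from typing import Dict, List, Tuple

# Hypothetical slot states; the real scraper's categories may differ.
EMPTY = "empty"
UPLOADING = "uploading"
UPLOADED = "uploaded"
PROCESSED = "processed"


def find_anomalies(
    previous: Dict[int, str], current: Dict[int, str]
) -> Tuple[List[int], List[int]]:
    """Compare two snapshots mapping show number -> state.

    Returns (unexpected, disappeared):
      unexpected  - shows now processed whose slot was empty (or unseen) before
      disappeared - shows that were fully uploaded but are now empty or absent
    """
    unexpected = sorted(
        show
        for show, state in current.items()
        if state == PROCESSED and previous.get(show, EMPTY) == EMPTY
    )
    disappeared = sorted(
        show
        for show, state in previous.items()
        if state == UPLOADED and current.get(show, EMPTY) == EMPTY
    )
    return unexpected, disappeared
```

In such a scheme the first list would drive the creation of "dummy" directories and the second the move of existing directories to the holding area.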

Redesign

  • The use of a scraper as described here might not be optimal since it is very dependent on the format of the calendar.php page

  • There exists a database table called reservations which holds status information about shows being received and processed

  • An interface to this table exists which can be accessed through curl. This interface is used in other scripts within the show processing workflow.

  • Discussion of the use of this interface in preference to the current web scraping interface is ongoing and will be documented here.

  • The reservations table is populated as a new show is being set up by the host selecting a slot on the calendar.php page

  • The status values are:

Name                    Short description     Comments
REQUEST_UNVERIFIED      unverified            shouldn't be returned
REQUEST_EMAIL_SENT      email sent            host sent the email with a link
EMAIL_LINK_CLICKED      pending               filling in the form/sending the show
SHOW_SUBMITTED          uploaded              upload complete
METADATA_PROCESSED      metadata processed    notes processed, etc
SHOW_POSTED             in the database       awaiting audio transcoding
MEDIA_TRANSCODED        transcoded            audio transcoded
UPLOADED_TO_IA          uploaded to IA        uploaded to IA
UPLOADED_TO_RSYNC_NET   archived              archived on rsync.net [¹]

[¹] free allocation exhausted?

  • What cannot be detected from the above list?
    1. If a show existed at some point but has been deleted, there's no way of telling. The present system, because it keeps a record of processed shows, can spot a free slot where there was a show before.
    2. Slots which have been reserved (such as for Community News shows) cannot be detected since they are not in the reservations table.
    3. If a slot becomes filled in a "non-standard" way, bypassing the normal route of appearing in the reservations table and progressing through the above states, this cannot be detected. In the past this has sometimes happened as reserve shows have been added to this system, for example.
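Because a show normally progresses through the status values in the order listed, a script reading the reservations table could treat them as an ordered enumeration. The sketch below assumes this; the integer values are invented for illustration (the values actually stored in the database are not given here), and the helper names needs_processing and has_reached are hypothetical.

```python
from enum import IntEnum


class ReservationStatus(IntEnum):
    """Status names from this page, in the order listed.

    The numeric values are an assumption; only the ordering matters here.
    """

    REQUEST_UNVERIFIED = 1
    REQUEST_EMAIL_SENT = 2
    EMAIL_LINK_CLICKED = 3
    SHOW_SUBMITTED = 4
    METADATA_PROCESSED = 5
    SHOW_POSTED = 6
    MEDIA_TRANSCODED = 7
    UPLOADED_TO_IA = 8
    UPLOADED_TO_RSYNC_NET = 9


def needs_processing(status_name: str) -> bool:
    """True when the upload is complete but the notes are not yet processed."""
    return ReservationStatus[status_name] is ReservationStatus.SHOW_SUBMITTED


def has_reached(status_name: str, milestone: ReservationStatus) -> bool:
    """True when a show is at or beyond the given milestone."""
    return ReservationStatus[status_name] >= milestone
```

Note that this only covers shows that are present in the table, so it shares the three blind spots listed above.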

Copying files from the server

  • This is achieved with rsync over an ssh connection, run from a script called sync_hpr
  • The rsync command uses a filter which limits what is copied: --filter=". .rsync_hpr_upload". The rules in the filter file exclude all (likely) media types, files likely to have been written to the directories during processing, and various others.
  • The rsync command also deletes any local directories which have been deleted on the server
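As a rough illustration of the kind of command involved, the following Python sketch assembles an rsync command line. Only the --filter rule is taken from this page; the other options (--archive, --rsh=ssh, --delete, --dry-run) are standard rsync options chosen as plausible assumptions, and the remote and local paths are placeholders. It is not the content of sync_hpr.

```python
import shlex
from typing import List


def build_rsync_command(remote: str, local_dir: str, dry_run: bool = False) -> List[str]:
    """Assemble an rsync command similar in spirit to the one described.

    Only the --filter rule comes from this page; every other option and
    both path arguments are assumptions.
    """
    cmd = [
        "rsync",
        "--archive",                     # preserve times, permissions, etc.
        "--rsh=ssh",                     # transfer over an ssh connection
        "--filter=. .rsync_hpr_upload",  # merge the rules held in this file
        "--delete",                      # remove local items deleted on the server
    ]
    if dry_run:
        cmd.append("--dry-run")          # report what would change, copy nothing
    cmd += [remote, local_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder host and paths, shown quoted as a shell would need them.
    print(shlex.join(build_rsync_command("user@example.org:upload/", "upload/", dry_run=True)))
```

Passing the list to subprocess.run avoids shell quoting problems with the space inside the filter rule.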

Copying show files to a working area

  • Downloaded files are stored in a local directory named after the show, such as shownotes/hpr1234.
  • The script copy_shownotes selects new shows from the upload/ directory and copies them to a working directory as described. It does this using find with a regular expression matching the directory structure.
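The selection step can be pictured with the sketch below, which uses Python's re and shutil modules rather than find. The layout of the upload/ directory is not described on this page, so the pattern SHOW_DIR (a directory name beginning with a four-digit show number and an underscore) is purely hypothetical, as is the function name copy_new_shows.

```python
import re
import shutil
from pathlib import Path
from typing import List

# Hypothetical naming scheme for upload directories, e.g. "1234_something".
SHOW_DIR = re.compile(r"^(?P<show>\d{4})_")


def copy_new_shows(upload_root: Path, work_root: Path) -> List[Path]:
    """Copy each matching upload directory to work_root/hprNNNN.

    Directories that already have a working copy are skipped, so running
    the function again copies only shows that are new.
    """
    copied = []
    for src in sorted(p for p in upload_root.iterdir() if p.is_dir()):
        match = SHOW_DIR.match(src.name)
        if match is None:
            continue  # not a show directory under the assumed scheme
        dest = work_root / f"hpr{match.group('show')}"
        if dest.exists():
            continue  # already copied earlier
        shutil.copytree(src, dest)  # also creates work_root if needed
        copied.append(dest)
    return copied
```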

Processing shows

  • A pdmenu menu is used to manage show processing. A menu is created dynamically for each show that is ready for processing, and shows are handled in numerical order. A script called makemenu generates each menu.
  • Once a show eligible for processing has been found, statistics about it are collected by parsing the shownotes.json file (using jq), and the menu is tailored to the type of action which may be required, such as whether to pre-process images and whether to upload assets to the server.
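To make the idea of tailoring a menu from statistics concrete, here is a small Python sketch using the json module in place of jq. The structure of shownotes.json is not documented on this page, so the keys "title", "notes" and "assets" are hypothetical, as are the function name collect_stats and the names of the resulting flags.

```python
import json
from typing import Any, Dict

# File suffixes treated as images for the purposes of this sketch.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def collect_stats(raw_json: str) -> Dict[str, Any]:
    """Summarise a shownotes.json document for menu building.

    The keys "title", "notes" and "assets" are assumptions; the real
    file may be organised quite differently.
    """
    data = json.loads(raw_json)
    assets = data.get("assets", [])
    images = [name for name in assets if name.lower().endswith(IMAGE_SUFFIXES)]
    return {
        "title": data.get("title", ""),
        "notes_length": len(data.get("notes", "")),
        "asset_count": len(assets),
        "offer_image_preprocessing": bool(images),  # show the image menu entry?
        "offer_asset_upload": bool(assets),         # show the upload menu entry?
    }
```

A menu generator could then include or omit entries according to the two boolean flags.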

TBA