1 Working with the Internet Archive
Dave Morriss edited this page 2024-06-04 16:48:09 +01:00

Overview

We upload all HPR shows to the Internet Archive (referred to as the IA here).

Each show is an IA item with a URL such as: https://archive.org/details/hpr0144. Here the number 0144 is the show number written as 4 digits with leading zeroes.
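
The naming rule above can be sketched in a few lines of Python. This is only an illustration, not code taken from the HPR scripts; the function names ia_identifier and ia_details_url are made up for the example.

```python
def ia_identifier(show_number: int) -> str:
    """Return the IA item identifier for an HPR show, e.g. 144 -> 'hpr0144'."""
    if show_number < 1:
        raise ValueError("show numbers start at 1")
    # Pad to at least four digits with leading zeroes
    return f"hpr{show_number:04d}"


def ia_details_url(show_number: int) -> str:
    """Return the details-page URL of the show's IA item."""
    return f"https://archive.org/details/{ia_identifier(show_number)}"


print(ia_details_url(144))  # prints https://archive.org/details/hpr0144
```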

A show consists of a front page built from the HTML copied from the HPR database. Attached to the item are all the files associated with the show: always the audio files, plus any other assets such as photographs, added text, scripts, etc. The intention is to make the copy of the show on the IA stand-alone. For historical reasons, there are some shows where not all associated files have yet been uploaded. There should be a record of these, but nothing has yet been done to add missing files.

Status

  • At the time of writing, 2022-03-05, most of the older shows in the range 1-870 have been uploaded (in reverse numerical order) but the last three (1-3) have not, due to a naming clash.

  • Update 2022-08-04: the naming clash mentioned above was cleared and all shows have now been uploaded. The project to re-upload certain shows is ongoing. This will ensure all assets are on the IA and that any metadata is up to date.

History

We have been adding HPR shows to the Internet Archive since 2010, when shows 1-620 were uploaded as MP3 audio in batches of 10. For example, the audio for shows 121-130 exists as the batch: https://archive.org/details/Hackerpublicradio.org-archiveEp0121-Ep0130

There was a delay of four years before the current project began in 2014. Since then shows have been uploaded individually, with show notes. The original cycle was to upload the previous week's shows each weekend, and gradually work through the older shows going back in time.

The main tools used are make_metadata (a locally-developed Perl script) and ia (a Python script created by IA programmers).

Originally in the current project, all that was uploaded was the WAV format audio and the show notes. The WAV file was transcoded to other formats by the Internet Archive software.

Towards the end of 2017 we began uploading auxiliary files for shows that have them: files like pictures, examples, supplementary notes and so forth. Also, in December 2017 we started pointing our RSS feeds at the Internet Archive instead of the HPR server. Since the audio files transcoded on the Internet Archive machines do not include audio tags, we began generating all the formats ourselves, with tags, and uploading them too. We also needed to upload shows for the week ahead rather than the week just gone. A script called weekly_upload performed the necessary steps to preload shows. It is not currently used.

In early 2021 the upload strategy was changed. A script called future_upload was written which determines if there are shows to upload from the caching area on borg. It does this by consulting a history file and by querying the IA itself. If shows are found they are uploaded.

At around the same time, a script called past_upload was written to upload shows in the range 1-870. This collects the show audio from the HPR server - which is just MP3 format - transcodes it into all of the formats required on the IA, and uploads the results. This is run on a regular basis from borg, processing five shows a day so as not to overload the IA servers.

A SQLite database called ia.db holds information about shows uploaded to the IA. It is useful for keeping track of what has been done, it is used when generating the monthly Community News show notes, and it is intended to be incorporated into the planned new HPR database design.

Software and other components

This is an alphabetic list of scripts, for reference:

archive_metadata

This Bash script adds metadata files (produced by make_metadata - see below) to a compressed tar file (called meta.tar.bz2) and deletes the originals. There is currently no mechanism for purging the oldest files stored in this way.

check_week

This Bash script is used to check what shows exist in the HPR database for a particular week (by week number) and whether these shows have been uploaded to the IA. It was created to prevent gaps from appearing in the sequence of shows on the IA, caused by too infrequent runs of future_upload.

Documentation may be found here.
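
As a rough illustration of the kind of check check_week performs, the sketch below finds the dates in a numbered week and reports which scheduled shows are not yet on the IA. It assumes ISO week numbering, which may not match the real script. The function names, the schedule dictionary and the show numbers are all invented for the example; the real script reads from the HPR database and queries the IA.

```python
from datetime import date, timedelta


def week_dates(year: int, week: int) -> list[date]:
    """Return the seven dates (Monday to Sunday) of the given ISO week."""
    monday = date.fromisocalendar(year, week, 1)
    return [monday + timedelta(days=i) for i in range(7)]


def missing_from_ia(scheduled: dict[date, int], uploaded: set[int],
                    year: int, week: int) -> list[int]:
    """List show numbers scheduled in the week that are not yet on the IA."""
    days = set(week_dates(year, week))
    return sorted(show for day, show in scheduled.items()
                  if day in days and show not in uploaded)


# Invented sample data: release date -> show number
schedule = {date(2022, 3, 7): 9001, date(2022, 3, 8): 9002}
print(missing_from_ia(schedule, uploaded={9001}, year=2022, week=10))
```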

collect_show_data

This Bash script is used to collect data from the IA in JSON format for adding to the SQLite database (ia.db). It is run on a local workstation rather than on borg. The database is kept on Gitea, and a copy stored at borg:~perloid/InternetArchive/ia.db is synchronised daily.
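
To give a feel for the JSON-to-SQLite step, here is a minimal sketch. The table layout (items and files) is hypothetical and is not the real ia.db schema, and the sample JSON is hand-made, loosely shaped like IA item metadata with a "metadata" object and a "files" list. The real script also uses jq rather than Python.

```python
import json
import sqlite3

# Hand-made sample loosely shaped like IA item metadata; not real data
SAMPLE = json.loads("""
{
  "metadata": {"identifier": "hpr9001", "title": "Example show"},
  "files": [{"name": "hpr9001.mp3"}, {"name": "hpr9001.ogg"}]
}
""")


def store_item(conn: sqlite3.Connection, item: dict) -> None:
    """Insert or update one item and replace its file list."""
    ident = item["metadata"]["identifier"]
    conn.execute(
        "INSERT OR REPLACE INTO items (identifier, title) VALUES (?, ?)",
        (ident, item["metadata"].get("title")),
    )
    # Drop any earlier file rows so a re-run does not duplicate them
    conn.execute("DELETE FROM files WHERE identifier = ?", (ident,))
    conn.executemany(
        "INSERT INTO files (identifier, name) VALUES (?, ?)",
        [(ident, f["name"]) for f in item.get("files", [])],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # the real database is a file, ia.db
conn.executescript("""
    CREATE TABLE items (identifier TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE files (identifier TEXT, name TEXT);
""")
store_item(conn, SAMPLE)
print(conn.execute("SELECT COUNT(*) FROM files").fetchone()[0])
```

Deleting and re-inserting the file rows keeps the sketch idempotent, so collecting data for the same show twice leaves one copy of each row.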

future_upload

This Bash script runs on borg where it performs show uploads by looking at the cache of show files (/var/IA/uploads) and determining which have not yet been uploaded to the IA. Since the checks interrogate the IA and are expensive, the script maintains a history file in .future_upload.dat which lists the shows that have been uploaded.

Documentation may be found here.
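
The "history file first, IA second" idea used by future_upload can be sketched as below. This is not the script itself, which is written in Bash. The history file format (one identifier per line), the function names, and recording IA hits back into the history file are assumptions made for illustration. The IA check is passed in as a function so the sketch runs without network access.

```python
from pathlib import Path
from typing import Callable, Iterable


def read_history(history_file: Path) -> set[str]:
    """Read identifiers already recorded as uploaded (assumed one per line)."""
    if not history_file.exists():
        return set()
    return {line.strip() for line in history_file.read_text().splitlines()
            if line.strip()}


def shows_to_upload(cached: Iterable[str], history_file: Path,
                    exists_on_ia: Callable[[str], bool]) -> list[str]:
    """Return cached show identifiers that still need uploading."""
    history = read_history(history_file)
    pending = []
    for identifier in sorted(cached):
        if identifier in history:
            continue  # cheap local check; no need to ask the IA
        if exists_on_ia(identifier):
            # On the IA but not recorded locally: remember it for next time
            with history_file.open("a") as fh:
                fh.write(identifier + "\n")
            continue
        pending.append(identifier)
    return pending
```

In practice exists_on_ia could wrap a call to the ia tool; here it is left abstract so the expensive query only happens for shows missing from the history file.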

make_metadata

This Perl script generates CSV metadata for driving the upload of HPR shows to the Internet Archive. The script is mainly called from other scripts, because its use is rather complex. The script itself contains its own documentation, a copy of which is included here.
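
For readers unfamiliar with CSV-driven uploads, the sketch below writes a small metadata file of the general kind the ia tool can read as a spreadsheet. It is not a copy of what make_metadata produces: the choice of columns (identifier, file, title, creator) and the sample row are assumptions for illustration only.

```python
import csv
import io


def metadata_csv(rows: list[dict[str, str]]) -> str:
    """Render rows as CSV text with a header line."""
    # Assumed column set; the real script's columns may differ
    fields = ["identifier", "file", "title", "creator"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented sample row
print(metadata_csv([{
    "identifier": "hpr9001",
    "file": "hpr9001.mp3",
    "title": "Example, with comma",
    "creator": "A Host",
}]))
```

Using the csv module rather than joining strings by hand means that fields containing commas or quotes are escaped correctly.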

past_upload

A Bash script, run on borg, for uploading older shows to the IA. It downloads the audio (always MP3 for older shows) and transcodes it to the formats used for newer shows, maintaining ID3 tags and so forth along the way. It then generates CSV metadata with make_metadata and uploads the shows with the ia tool.

Documentation may be found here.

Dependencies

Aside from Perl modules (which are documented in the relevant POD sections in the scripts), the various Bash scripts perform checks on pre-requisite files and tools.

This is a list of these pre-requisites, starting with Bash and Perl scripts:

~/bin/close_tunnel

A Bash script to close down the SSH tunnel opened by open_tunnel.

~/bin/function_lib.sh

A file of shared Bash functions.

~/bin/open_tunnel

A Bash script used to open an SSH tunnel to the HPR server so that scripts can easily access the MariaDB database there.

~/bin/transfer_tags

A Perl script which transfers ID3 tags from a main file to a number of subsidiary files.

~/bin/tunnel_is_open

A Bash script that tests whether the SSH tunnel is open.
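
One common way to test whether a tunnel is up is to try connecting to the local port it forwards. The sketch below shows that idea only; it is not how tunnel_is_open is known to work, and the port number in the usage comment is a made-up placeholder.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out or unreachable: treat the tunnel as closed
        return False


# Hypothetical usage: 3307 stands in for whatever local port is forwarded
# if not port_is_open("127.0.0.1", 3307): open the tunnel first
```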

ia

A Python script from the Internet Archive used to interact with the IA servers. This is used to interrogate the state of the collection on the IA and to upload files.

The tool can be installed as described here: installing internetarchive. This provides the ia command.

jq

The command-line JSON processor used to manipulate JSON files imported from the IA.