Overview
We upload all HPR shows to the Internet Archive (referred to here as the IA).
Each show is an IA item with a URL such as https://archive.org/details/hpr0144. Here the number 0144 is the show number, written as four digits with leading zeroes.
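The identifier scheme can be expressed in a few lines. The following is a minimal Python sketch; the function names are illustrative and are not taken from any HPR script.

```python
IA_DETAILS_URL = "https://archive.org/details/"


def item_identifier(show_number: int) -> str:
    """Return the IA item identifier for a show, e.g. 144 -> 'hpr0144'."""
    if show_number < 1:
        raise ValueError("show numbers start at 1")
    # Pad the number to at least four digits with leading zeroes.
    return f"hpr{show_number:04d}"


def item_url(show_number: int) -> str:
    """Return the URL of the IA details page for a show."""
    return IA_DETAILS_URL + item_identifier(show_number)


print(item_url(144))  # https://archive.org/details/hpr0144
```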
A show consists of a front page built from the HTML copied from the HPR database. Attached to the item are all the files associated with the show: always the audio files, plus any other assets such as photographs, added text, scripts, etc. The intention is to make the copy of the show on the IA stand-alone. For historical reasons, there are some shows where not all associated files have yet been uploaded. There should be a record of these, but nothing has yet been done to add the missing files.
Status
- At the time of writing, 2022-03-05, most of the older shows in the range 1-870 have been uploaded (in reverse numerical order) but the last three (1-3) have not, due to a naming clash.
- Update 2022-08-04: the naming clash mentioned above was cleared and all shows have now been uploaded. The project to re-upload certain shows is ongoing. This will ensure all assets are on the IA and that any metadata is up to date.
History
We have been adding HPR shows to the Internet Archive since 2010, when shows 1-620 were uploaded as MP3 audio in batches of 10. For example, the audio for shows 121-130 exists as the batch: https://archive.org/details/Hackerpublicradio.org-archiveEp0121-Ep0130
There was a delay of four years before the current project began in 2014. Since then shows have been uploaded individually, with show notes. The original cycle was to upload the previous week's shows each weekend, and gradually work through the older shows going back in time.
The main tools used are make_metadata (a locally-developed Perl script) and ia (a Python script created by IA programmers).
Originally in the current project, all that was uploaded was the WAV format audio and the show notes. The WAV file was transcoded to other formats by the Internet Archive software.
Towards the end of 2017 auxiliary files were uploaded for shows that have
them: files like pictures, examples, supplementary notes and so forth. Also,
in December 2017 we started pointing our RSS feeds at the Internet Archive instead
of the HPR server. Since the audio files transcoded on the Internet
Archive machines do not include audio tags, we began generating all the
formats ourselves, with tags, and uploading them too. We also needed to upload
shows for the week ahead rather than the week just gone. A script called
weekly_upload performed the necessary steps to preload shows. This is not
currently used.
In early 2021 the upload strategy was changed. A script called future_upload
was written which determines whether there are shows to upload from the
caching area on borg. It does this by consulting a history file and by
querying the IA itself. If shows are found they are uploaded.
At around the same time, a script called past_upload was written to upload
shows in the range 1-870. This collects the show audio from the HPR server
(which is just MP3 format), transcodes it into all of the formats required on
the IA, and uploads the results. This is run on a regular basis from borg,
processing five shows a day so as not to overload the IA servers.
A SQLite database (called ia.db) is used to hold information about shows
uploaded to the IA. It keeps track of what has been done, is used when
generating the monthly Community News show notes, and is intended to be
incorporated into the planned new HPR database design.
Software and other components
This is an alphabetic list of scripts, for reference:
archive_metadata
This Bash script adds metadata files (produced by make_metadata, see below)
to a compressed tar file (called meta.tar.bz2) and deletes the originals.
There is currently no mechanism for purging the oldest files stored in this
way.
check_week
This Bash script is used to check what shows exist in the HPR database for a
particular week (by week number) and whether these shows have been uploaded to
the IA. It was created to prevent gaps from appearing in the sequence of shows
on the IA, caused by too infrequent runs of future_upload.
Documentation may be found here.
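The check described above can be sketched as follows. This is an illustration only, not the logic of check_week itself: it assumes ISO week numbering, and it assumes the release dates (from the HPR database) and the set of uploaded shows (from the IA) have already been collected. The function name and the show numbers in the example are hypothetical.

```python
from datetime import date


def shows_missing_for_week(release_dates, uploaded, year, week):
    """Return the shows released in an ISO week that are not yet uploaded.

    release_dates: dict mapping show number -> release date
    uploaded: set of show numbers already known to be on the IA
    """
    in_week = [
        show
        for show, released in release_dates.items()
        # isocalendar() yields (ISO year, ISO week, ISO weekday)
        if tuple(released.isocalendar())[:2] == (year, week)
    ]
    return sorted(show for show in in_week if show not in uploaded)


# Hypothetical data: three shows, one of which is already on the IA.
release_dates = {
    101: date(2022, 3, 7),
    102: date(2022, 3, 8),
    103: date(2022, 3, 14),
}
print(shows_missing_for_week(release_dates, {101}, 2022, 10))
```

A non-empty result for a week that has already passed would indicate a gap of the kind the script was written to catch.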
collect_show_data
This Bash script is used to collect data from the IA in JSON format for adding
to the SQLite database (ia.db). This is done on a local workstation rather
than on borg. The database is kept on Gitea, and a copy is stored on
borg:~perloid/InternetArchive/ia.db which is synchronised daily.
future_upload
This Bash script runs on borg where it performs show uploads by looking at
the cache of show files (/var/IA/uploads) and determining which have not yet
been uploaded to the IA. Since the checks interrogate the IA and are
expensive, the script maintains a history file in .future_upload.dat which
lists the shows that have been uploaded.
Documentation may be found here.
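The idea of checking a cheap local history before making an expensive remote query can be sketched like this. It is not the future_upload code: the one-identifier-per-line layout of the history file is an assumption, and exists_on_ia is a hypothetical callable standing in for the real query against the IA.

```python
from pathlib import Path


def load_history(history_file: Path) -> set:
    """Read identifiers recorded as uploaded (assumed: one per line)."""
    if not history_file.exists():
        return set()
    lines = history_file.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def select_uploads(cached, history, exists_on_ia):
    """Return the cached identifiers that still need to be uploaded.

    exists_on_ia (identifier -> bool) is only called for shows that are
    not already listed in the history, which keeps remote queries to a
    minimum.
    """
    to_upload = []
    for identifier in sorted(cached):
        if identifier in history:
            continue  # cheap local check first
        if not exists_on_ia(identifier):
            to_upload.append(identifier)
    return to_upload
```

For example, with a history containing only hpr0001 and a cache holding hpr0001 to hpr0003, only hpr0002 and hpr0003 would be looked up on the IA.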
make_metadata
This Perl script generates CSV metadata for driving the upload of HPR shows to the Internet Archive. The script is mainly called from other scripts, because its use is rather complex. The script itself contains its own documentation, a copy of which is included here.
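To give a feel for what CSV metadata of this kind looks like, here is a small Python sketch. The columns that make_metadata actually writes are not listed on this page, so the field names below are assumptions chosen for illustration: one row per file, with an identifier column and a file column followed by metadata fields.

```python
import csv
import io


def metadata_csv(rows, fields=("identifier", "file", "title", "mediatype")):
    """Render a header row plus one CSV row per file to upload."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row; the title and file name are placeholders.
rows = [
    {
        "identifier": "hpr0144",
        "file": "hpr0144.wav",
        "title": "Example title",
        "mediatype": "audio",
    }
]
print(metadata_csv(rows), end="")
```

Using the csv module rather than joining strings by hand means that titles containing commas or quotes are escaped correctly.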
past_upload
A Bash script, run on borg, for uploading older shows to the IA. It downloads
the audio (always mp3 for older shows) and transcodes it to the formats used
for newer shows, maintaining id3 tags and so forth along the way. It generates
CSV metadata with make_metadata and uploads the shows with the ia tool.
Documentation may be found here.
Dependencies
Aside from Perl modules (which are documented in the relevant POD sections in the scripts), the various Bash scripts perform checks on prerequisite files and tools.
This is a list of these prerequisites, starting with Bash and Perl scripts:
~/bin/close_tunnel
A Bash script to close down the SSH tunnel opened by open_tunnel.
~/bin/function_lib.sh
A file of shared Bash functions.
~/bin/open_tunnel
A Bash script used to open an SSH tunnel to the HPR server so that scripts can easily access the MariaDB database there.
~/bin/transfer_tags
A Perl script which transfers id3 tags from a main file to a number of
subsidiary files.
~/bin/tunnel_is_open
A Bash script that tests whether the SSH tunnel is open.
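One common way to test for an open tunnel is to try connecting to its local end. The sketch below shows that approach in Python; it is not the tunnel_is_open script, and the default port 3307 is an assumption, since the local port used by open_tunnel is not given here.

```python
import socket


def tunnel_is_open(host="localhost", port=3307, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

A successful connection only shows that the port is accepting connections; it does not prove that the MariaDB server at the far end is responding.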
ia
A Python script from the Internet Archive used to interact with the IA servers. This is used to interrogate the state of the collection on the IA and to upload files.
The tool can be installed as described here: installing internetarchive. This provides the ia command.
jq
The JSON parser used to manipulate JSON files imported from the IA.
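As an illustration of the kind of filtering done on this JSON, the Python sketch below lists the files in an item that were uploaded as originals, as opposed to derivatives generated by the IA. It assumes the item metadata has a top-level "files" list whose entries carry "name" and "source" keys; the sample document is hypothetical and heavily trimmed.

```python
import json


def original_files(item_json: str):
    """Return the names of files marked as originals in item metadata."""
    item = json.loads(item_json)
    return sorted(
        entry["name"]
        for entry in item.get("files", [])
        if entry.get("source") == "original"
    )


# Hypothetical, trimmed-down item metadata.
sample = json.dumps(
    {
        "metadata": {"identifier": "hpr0144"},
        "files": [
            {"name": "hpr0144.wav", "source": "original"},
            {"name": "hpr0144.mp3", "source": "original"},
            {"name": "hpr0144.png", "source": "derivative"},
        ],
    }
)
print(original_files(sample))
```

Under the same assumptions, the equivalent jq filter would be along the lines of .files[] | select(.source == "original") | .name.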