Internet Archive Workflow
Dave Morriss edited this page 2024-06-04 16:48:09 +01:00

Overview

This section describes the processes used to upload Hacker Public Radio episodes to the Internet Archive (archive.org).

Note: This text is taken from the Wiki built under GitLab several years ago. It's in the process of being updated for the current practices developed since then.

History

We have been adding HPR shows to the Internet Archive since 2010 when shows 1-620 were uploaded as MP3 audio in blocks of 10.

There was a delay of four years before the current project began in 2014. Since then shows have been uploaded individually, with show notes. The normal cycle has been to upload the previous week's shows each weekend, and gradually to work through the older shows, going back in time.

Originally in the current project, only the WAV format audio and the show notes were uploaded. The WAV file was transcoded to the other formats by the Internet Archive software.

Towards the end of 2017 we began uploading auxiliary files for shows that have them: pictures, examples, supplementary notes and so forth. In December 2017 we also started pointing our feeds at the Internet Archive instead of the HPR server. Since the audio files transcoded on the Internet Archive machines do not include audio tags, we began generating all the formats ourselves, with tags, and uploading them too. This also meant uploading shows for the week ahead rather than the week just gone.

Workflow

Obsolete, needs work

  1. As part of the process of preparing a new show, the audio is transcoded to a variety of formats: flac, mp3, ogg, opus, spx and wav.

  2. The audio files are copied from the HPR server to the Raspberry Pi borg in Ken's house, and named hpr<show>.<format> as appropriate for the show number and audio format (e.g. hpr2481.wav). They are stored in the directory /var/IA/uploads/.

  3. The upload process itself uses the internetarchive tool, which provides the ia command. The ia command offers a bulk mode, and this is what is used. It takes a comma-separated values (CSV) file, which is generated by an HPR tool called make_metadata, currently run under the account perloid.

  4. The shows to be uploaded are checked for HTML errors. A script called clean_notes, which uses the Perl module HTML::Tidy, performs the check. Errors are corrected manually at this point. (TODO: explain in more detail)

  5. The make_metadata script generates data for a block of shows. It collects any associated files and saves them in the /var/IA/uploads/ directory. It generates a CSV file which points to the various audio formats for each show, as well as any associated files. Further details of what this tool can do are provided in its documentation.

  6. During metadata creation the make_metadata script will halt if it finds that a given show does not have a summary (extremely rare for new shows) or tags (sadly fairly common). It is possible to override this step, but it is preferable to supply the missing elements because they are of great use on archive.org.

  7. Once the metadata has been created in a CSV file, it is processed with the ia tool. Run in bulk upload mode, ia reads the CSV file and creates an item on archive.org for each show. It uploads any audio files listed in the CSV file, as well as any associated files. (TODO: add an example)

  8. Once all uploads have completed, the script delete_uploaded is run to delete files in /var/IA/uploads that have been uploaded. The VPS does not have much disk space, so deleting unnecessary files is important.
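The internals of delete_uploaded are not documented on this page, but the idea in step 8 can be sketched as follows. This is a hypothetical illustration only: it works in a scratch directory with made-up show numbers so it is safe to run, whereas the real script operates on /var/IA/uploads and knows which shows have actually been uploaded.

```shell
# Hypothetical sketch of what delete_uploaded does: remove every local file
# belonging to shows that have been confirmed uploaded to archive.org.
# A scratch directory stands in for /var/IA/uploads here.
UPLOADS=$(mktemp -d)
touch "$UPLOADS/hpr2481.wav" "$UPLOADS/hpr2481.mp3" "$UPLOADS/hpr2482.ogg"

for show in 2481; do                 # show numbers already verified as uploaded
    rm -f "$UPLOADS/hpr${show}".*    # removes all formats and associated files
done

ls "$UPLOADS"                        # only files for not-yet-uploaded shows remain
```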

To be continued

Example commands
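
The transcode in step 1 might look something like the following. This page does not say which encoder HPR uses, so ffmpeg is an assumption and hpr2481 is just an example show number; the loop prints the commands rather than running them, since the flac/opus/speex encoders may not be installed.

```shell
# Hypothetical sketch of the transcode step for one show (assumes ffmpeg,
# which this page does not confirm). Echoes the commands instead of
# running them.
show=2481
for fmt in flac mp3 ogg opus spx; do
    echo "ffmpeg -i hpr${show}.wav hpr${show}.${fmt}"
done
```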

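The exact columns written by make_metadata (step 5) are not recorded here, but a CSV for ia's bulk mode must contain at least an identifier column and a file column, one row per file; any further columns become item metadata. The row below is purely illustrative — the title and creator values are assumptions, not output from the real tool.

```csv
identifier,file,title,creator
hpr2481,hpr2481.wav,hpr2481 :: Example episode title,Hacker Public Radio
```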

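The bulk upload in step 7 is driven by the --spreadsheet option of ia upload. The sketch below is hedged: the flags actually used by HPR are not recorded on this page, and a real run needs the internetarchive package installed plus credentials set up with ia configure, so the command is only printed unless IA_RUN=1 is set.

```shell
# Hypothetical sketch of the bulk upload step. The real invocation used by
# HPR may differ; this only prints the command unless IA_RUN=1, because a
# real upload needs the ia tool and configured archive.org credentials.
CSV=metadata.csv
CMD="ia upload --spreadsheet=$CSV"
if [ "${IA_RUN:-0}" = "1" ]; then
    $CMD
else
    echo "would run: $CMD"
fi
```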