Internet Archive Workflow
Overview
This section describes the processes used to upload Hacker Public Radio episodes to the Internet Archive (archive.org).
Note: This text is taken from the Wiki built under GitLab several years ago. It is in the process of being updated to reflect the practices developed since then.
History
We have been adding HPR shows to the Internet Archive since 2010, when shows 1-620 were uploaded as MP3 audio in blocks of 10.
There was a delay of four years before the current project began in 2014. Since then shows have been uploaded individually, with show notes. The normal cycle has been to upload the previous week's shows each weekend, and gradually to work through the older shows, going back in time.
Originally in the current project, all that was uploaded was the WAV format audio and the show notes. The WAV file was transcoded to other formats by the Internet Archive software.
Towards the end of 2017 auxiliary files were uploaded for shows that have them: files such as pictures, examples, supplementary notes and so forth. Also, in December 2017 we started pointing our feeds at the Internet Archive instead of the HPR server, and, since the audio files transcoded on the Internet Archive machines do not include audio tags, we began generating all the formats ourselves, with tags, and uploading them too. We also needed to upload shows for the week ahead rather than the week just gone.
Workflow
Obsolete, needs work
- As part of the process of preparing a new show the audio is transcoded to a variety of formats: flac, mp3, ogg, opus, spx and wav.
- The audio files are copied from the HPR server to the Raspberry Pi borg in Ken's house, and named hpr<show>.<format> as appropriate for the show number and audio format (e.g. hpr2481.wav). They are stored in the directory /var/IA/uploads/.
- The upload process itself uses the internetarchive tool, which provides the ia command. The ia command offers a bulk mode, and this is what is used. It takes a comma-separated values (CSV) file, which is generated by an HPR tool called make_metadata, currently run under the account perloid.
- The shows to be uploaded are checked for HTML errors. A script called clean_notes, which uses the Perl module HTML::Tidy, checks for errors; these are corrected manually at this point. (TODO: explain in more detail)
- The make_metadata script generates data for a block of shows. It collects any associated files and saves them in the /var/IA/uploads/ directory. It generates a CSV file which points to the various audio formats for each show, as well as to any associated files. Further details of what this tool can do are provided in its documentation.
- During metadata creation the make_metadata script will halt if it finds that a given show does not have a summary (extremely rare for new shows) or tags (sadly fairly common). It is possible to override this step, but it is preferable to supply the missing elements because they are of great use on archive.org.
- Having created the metadata in a CSV file, it is processed with the ia tool. This is run in bulk upload mode: it reads the CSV file, creates an item on archive.org, and uploads any audio files listed in the CSV file, as well as any associated files. (TODO: add an example)
- Once all uploads have completed, the script delete_uploaded is run to delete files in /var/IA/uploads which have been uploaded. The VPS does not have much disk space, so deleting unnecessary files is important.
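The upload and clean-up steps above can be sketched as a short shell session. This is a minimal sketch, not the real procedure: the CSV here is a hand-written stand-in for the file that make_metadata actually generates (the identifier, file and title columns are illustrative), and show 2481 is just the example number used earlier. The ia tool's bulk mode is selected with its --spreadsheet option.

```shell
# All uploads are staged in one place (the directory named above).
cd /var/IA/uploads/

# Stand-in for the CSV that make_metadata writes; column names are
# illustrative, with one row per item to be created on archive.org.
cat > metadata.csv <<'EOF'
identifier,file,title
hpr2481,hpr2481.wav,Example episode title
EOF

# Bulk mode: 'ia' reads the spreadsheet, creates one archive.org item
# per row, and uploads each listed file.
ia upload --spreadsheet=metadata.csv

# When everything is safely on archive.org, reclaim the disk space
# with the HPR clean-up script mentioned above.
delete_uploaded
```

In practice the CSV contains the full block of shows and associated files produced by make_metadata rather than a single hand-written row.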
To be continued
Example commands
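As a starting point for this section, the following sketch simply prints the set of audio files expected in /var/IA/uploads/ for a single show, following the hpr<show>.<format> naming convention described above (the show number is just an example):

```shell
#!/bin/sh
# Print the expected audio filenames for one show, one per format,
# following the hpr<show>.<format> convention.
show=2481
for fmt in flac mp3 ogg opus spx wav; do
    printf 'hpr%s.%s\n' "$show" "$fmt"
done
# prints hpr2481.flac through hpr2481.wav, one name per line
```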