make_metadata
Dave Morriss edited this page 2024-06-04 16:48:09 +01:00

NAME

make_metadata - Generate metadata from the HPR database for Archive.org

VERSION

This documentation refers to make_metadata version 0.4.11

USAGE

make_metadata [-help] [-documentation]

make_metadata -from=FROM [-to=TO] [-count=COUNT] [-output[=FILE]]
    [-script[=FILE]] [-[no]meta_only] [-[no]fetch]
    [-[no]assets] [-[no]silent] [-[no]verbose] [-[no]test]
    [-[no]ignore_missing] [-config=FILE] [-dbconfig=FILE] [-debug=N]

make_metadata -list=LIST [-output[=FILE]] [-script[=FILE]]
    [-[no]meta_only] [-[no]fetch] [-[no]assets] [-[no]silent]
    [-[no]verbose] [-[no]test] [-[no]ignore_missing] [-config=FILE]
    [-dbconfig=FILE] [-debug=N]

Examples:

make_metadata -from=1234 -nofetch

make_metadata -from=1234 -to=1235

make_metadata -from=1234 -count=10

make_metadata -from=1 -to=3 -output=metadata_1-3.csv

make_metadata -from=1500 -to=1510 -out=metadata_1500-1510.csv -verbose

make_metadata -from=1500 -to=1510 -out=metadata_%d-%d.csv -verbose

make_metadata -from=500 -to=510 -out=metadata_%04d-%04d.csv -verbose

make_metadata -from=1500 -to=1510 -out -verbose

make_metadata -from=1500 -to=1510 -out

make_metadata -from=1675 -to=1680 -out=metadata_%d-%d.csv -meta_only

make_metadata -from=1450 -test

make_metadata -list='1234,2134,2314' -out -meta_only

make_metadata -list="931,932,933,935,938,939,940" -out -meta -ignore

make_metadata -dbconf=.hpr_livedb.cfg -from=1234 -to=1235

make_metadata -from=3004 -out -meta_only -noassets

OPTIONS

  • -help

    Reports brief information about how to use the script and exits. To see the full documentation use the option -documentation or -man. Alternatively, to generate a PDF version use the pod2pdf tool from http://search.cpan.org/~jonallen/pod2pdf-0.42/bin/pod2pdf. This can be installed with the cpan tool as App::pod2pdf.

  • -documentation or -man

    Reports full information about how to use the script and exits. Alternatively, to generate a PDF version use the pod2pdf tool from http://search.cpan.org/~jonallen/pod2pdf-0.42/bin/pod2pdf. This can be installed with the cpan tool as App::pod2pdf.

  • -debug=N

    Run in debug mode at the level specified by N. Possible values are:

    • 0

      No debugging (the default).

    • 1

      TBA

    • 2

      TBA

    • 3

      TBA

    • 4 and above

      The metadata hash is dumped.

      Each call of the function find_links_in_notes is reported. On finding an <a> or <img> tag the uri value is shown, as is any fragment and the related link. The original file is reported here.

      Each call of the function find_links_in_file is reported. On finding an <a> or <img> tag the uri value is shown, as is any fragment and the related link. The original file is reported here, and if a link is to be ignored this is reported.

  • -from=NUMBER

    This option defines the starting episode number of a group. It is mandatory to provide either the -from=NUMBER option or the -list=LIST option (see below).

  • -to=NUMBER

    This option specifies the final episode number of a group. If not given the script generates metadata for the single episode indicated by -from.

    The value given here must be greater than or equal to that given in the -from option. The option must not be present with the -count option.

    The difference between the episode numbers given by the -from and -to options must not be greater than 20.

  • -count=NUMBER

    This option specifies the number of episodes to process (starting from the episode number specified by the -from option). The option must not be present with the -to option.

    The number of episodes specified must not be greater than 20.

  • -list=LIST

    This option is an alternative to -from=NUMBER and its associated modifying options. The LIST is a comma-separated list of not necessarily consecutive episode numbers, and must consist of at least one and no more than 20 numbers.

    This option is useful for the case when non-sequential episode numbers are to be uploaded, and is particularly useful when repairing elements of particular episodes (such as adding summary fields and tags) where they have already been uploaded.

    For example, the following shows have no summary and/or tags, but the shows are already in the IA. The missing items have been provided, so we wish to update the HTML part of the upload:

      $ ./make_metadata -list='2022,2027,2028,2029,2030,2033' -out -meta
      Output file: metadata_2022-2033.csv
    
  • -output[=FILE]

    This option specifies the file to receive the generated CSV data. If omitted the output is written to metadata.csv in the current directory.

    The file name may contain one or two instances of the characters '%d', with a leading width specification if desired (such as '%04d'). These are replaced by the -from=NUMBER and -to=NUMBER values. If -from=NUMBER and -count=NUMBER are used, the second number is the appropriate endpoint (derived by adding the count to the starting number). If neither -to=NUMBER nor -count=NUMBER is used then there should be only one instance of '%d' or the script will abort.

    If no value is provided to -output then a suitable template will be generated. It will be 'metadata_%04d.csv' if one episode is being processed, and 'metadata_%04d-%04d.csv' if a range has been specified.

    Example:

      ./make_metadata -from=1430 -out=metadata_%04d.csv
    

    the output file name will be metadata_1430.csv. The same effect can be achieved with:

      ./make_metadata -from=1430 -out=
    

    or

      ./make_metadata -from=1430 -out
    
  • -script[=FILE]

    This option specifies the file to receive commands required to upload certain files relating to a show. If omitted the commands are written to script.sh in the current directory.

    The file name may contain one or two instances of the characters '%d', with a leading width specification if desired (such as '%04d'). These are replaced by the -from=NUMBER and -to=NUMBER values. If -from=NUMBER and -count=NUMBER are used, the second number is the appropriate endpoint (derived by adding the count to the starting number). If neither -to=NUMBER nor -count=NUMBER is used then there should be only one instance of '%d' or the script will abort.

    If no value is provided to -script then a suitable template will be generated. It will be 'script_%04d.sh' if one episode is being processed, and 'script_%04d-%04d.sh' if a range has been specified.

    Example:

      ./make_metadata -from=1430 -script=script_%04d.sh
    

    the output file name will be script_1430.sh. The same effect can be achieved with:

      ./make_metadata -from=1430 -script=
    

    or

      ./make_metadata -from=1430 -script
    
  • -[no]fetch

    This option controls whether the script attempts to fetch the MP3 audio file from the HPR website should there be no WAV file in the upload area. The default setting is -fetch.

    Normally the script is run as part of the workflow to upload the metadata and audio to archive.org. The audio is expected to be a WAV file and to be in the location referenced in the configuration file under the 'uploads' label. However, not all of the WAV files exist for older shows.

    When the WAV file is missing and -fetch is selected or defaulted, the script will attempt to download the MP3 version of the audio and will store it in the 'uploads' area for the upload script (ias3upload.pl or ia) to send to archive.org. If the MP3 file is not found then the script will abort.

    If -fetch is specified (or defaulted) as well as -meta_only (see below) then the audio file fetching process will not be carried out. This is because it makes no sense to fetch this file if it's not going to be referenced in the metadata.

  • -[no]assets

    This option controls the downloading of any assets that may be associated with a show. Assets are the files held on the HPR server which are referenced by the show. Examples might be photographs, scripts, and supplementary notes. Normally all such assets are collected and stored in the upload area and are then sent to the archive via the script. The notes sent to the archive are adjusted to refer to the copies of these assets on archive.org, making the HPR episode completely self-contained.

  • -[no]meta_only (alias -[no]noaudio)

    This option controls whether the output file will contain a reference to the audio file(s) or only the metadata. The default is -nometa_only meaning that the file reference(s) and the metadata are present.

    Omitting the file(s) allows the metadata to be regenerated, perhaps due to edits and corrections in the database, and the changes to be propagated to archive.org. If the file reference(s) exist(s) in the metadata file then the file(s) must be available at the time the uploader is run.

    Note that making changes this way is highly preferable to editing the entry on archive.org using the web-based editor. This is because there is a problem with the way HTML entities are treated and this can cause the HTML to be corrupted.

  • -[no]silent

    This option enables (-silent) and disables (-nosilent) silent mode. When enabled the script reports nothing on STDOUT. However, if the script cannot find the audio files and downloads the MP3 version from the HPR site for upload to archive.org, the downloads are reported on STDERR. This cannot be disabled, though the STDERR output could be redirected to a file or to /dev/null.

    If -silent is specified with -verbose then the latter "wins".

    The script runs with silent mode disabled by default. When -nosilent is used with -noverbose the script reports the output file name and nothing else.

  • -[no]verbose

    This option enables (-verbose) and disables (-noverbose) verbose mode. When enabled the script reports the metadata it has collected from the database before writing it to the output file. The data is reported in a more readable form than the raw CSV file, although another script, show_metadata, is also available to help with this.

    If -verbose is specified with -silent then the former "wins".

    The script runs with verbose mode disabled by default.

  • -[no]ignore_missing

    The script checks each episode to ensure it has a summary and tags. If either of these fields is missing then a warning message is printed for that episode (unless -silent has been chosen), and if any episodes are lacking this information the script aborts without producing metadata. If the option -ignore_missing is selected then the warnings are produced (dependent on -silent) but the script runs to completion.

    The default setting is -noignore_missing; the script checks and aborts if any summaries or tags are missing.

  • -[no]test

    DO NOT USE!

    This option enables (-test) and disables (-notest) test mode. When enabled the script generates metadata containing various test values.

    In test mode the following changes are made:

    • The item names, which normally contain 'hprnnnn', built from the episode number, have 'test_' prepended to them.

    • The collection, which is normally a list containing 'hackerpublicradio' and 'podcasts', is changed to 'test_collection'. Items in this collection are normally deleted by Archive.org after 30 days.

    • The contributor, which is normally 'HackerPublicRadio', is changed to 'perlist'.

    NOTE: Test mode only works for the author!

  • -config=FILE

    This option allows an alternative script configuration file to be used. This file defines various settings relating to the running of the script, such as the place to look for the files to be uploaded. It is rare to need any file other than the default since the settings are specific to the environment in which the script runs. However, this option was added at the same time as the alternative database configuration option.

    See the CONFIGURATION AND ENVIRONMENT section below for the file format.

    If the option is omitted the default file is used: .make_metadata.cfg

  • -dbconfig=FILE

    This option allows an alternative database configuration file to be used. This file defines the location of the database, its port, its name and the username and password to be used to access it. This feature was added to allow the script to access alternative databases or the live database over an SSH tunnel.

    See the CONFIGURATION AND ENVIRONMENT section below for the file format.

    If the option is omitted the default file is used: .hpr_db.cfg
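
The '%d' substitution described for -output and -script above can be illustrated with a short sketch. It is written in Python for illustration only and is not the script's own Perl code. The function name expand_template and its parameters are hypothetical, and the handling of a name containing no '%d' at all (returned unchanged, as in the -output=metadata_1-3.csv example) is an assumption.

```python
import re

# '%d' with an optional width specification such as '%04d'
PLACEHOLDER = re.compile(r"%\d*d")


def expand_template(template, first, last=None, stem="metadata", ext="csv"):
    """Return a file name with the episode number(s) filled in."""
    is_range = last is not None
    if not template:
        # No value given: fall back to the documented default templates
        template = f"{stem}_%04d-%04d.{ext}" if is_range else f"{stem}_%04d.{ext}"

    found = len(PLACEHOLDER.findall(template))
    if found == 0:
        return template  # assumption: a literal name is used unchanged
    allowed = 2 if is_range else 1
    if found > allowed:
        raise ValueError(f"Invalid - too many '%d' sequences in '{template}'")
    if found < allowed:
        raise ValueError(f"Invalid - too few '%d' sequences in '{template}'")
    return template % ((first, last) if is_range else (first,))
```

Under these assumptions, expand_template("metadata_%04d-%04d.csv", 500, 510) gives 'metadata_0500-0510.csv', and passing stem="script" and ext="sh" gives the corresponding -script defaults.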

DESCRIPTION

This script generates metadata suitable for uploading Hacker Public Radio episodes to the Internet Archive (archive.org).

The metadata is in comma-separated variable (CSV) format suitable for processing with an upload script. The original upload script was called ias3upload.pl, and could be obtained from https://github.com/kngenie/ias3upload. This script is no longer supported and make_metadata no longer generates output suitable for it (though it is simple to make it compatible if necessary). The replacement is internetarchive, a Python tool that can also be run from the command line. It can be found at https://github.com/jjjake/internetarchive.

The make_metadata script generates CSV from the HPR database. It looks up details for each episode selected by the options, and performs various conversions and concatenations. The goal is to prepare items for the Internet Archive with as much detail as the format can support.

The resulting CSV file contains a header line listing the field names required by archive.org followed by as many CSV lines of episode data as requested (up to a limit of 20).

Since the upload method uses the HTTP protocol with fields stored in headers, there are restrictions on the way HTML can be formatted in the Details field. The script converts newlines, which are not allowed, into <br/> tags where necessary.
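
The shape of the output described above (a header line, then one line per episode, with newlines in the HTML replaced by <br/>) can be sketched as follows. This is an illustrative Python sketch, not the script's Perl code. The column names in FIELDS are placeholders and are not the real header written by make_metadata. The sketch replaces every newline, which is a simplification of "where necessary".

```python
import csv
import io

# Placeholder column names for this sketch, not the script's real header
FIELDS = ["identifier", "title", "description"]
MAX_EPISODES = 20  # limit stated in this document


def flatten_html(notes):
    """Replace line breaks with <br/> so the value fits on a single line."""
    return "<br/>".join(notes.splitlines())


def write_metadata(rows, handle):
    """Write a header line followed by one CSV line per episode."""
    if len(rows) > MAX_EPISODES:
        raise ValueError(f"too many episodes ({len(rows)} > {MAX_EPISODES})")
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "description": flatten_html(row["description"])})


if __name__ == "__main__":
    out = io.StringIO()
    write_metadata(
        [{"identifier": "hpr1234", "title": "Example", "description": "one\ntwo"}],
        out,
    )
    print(out.getvalue(), end="")
```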

HPR shows often have associated files, such as pictures, examples, long-form notes and so forth. The script finds these and downloads them to the cache area where the audio is kept and writes the necessary lines to the CSV file to ensure they are uploaded with the show. It modifies any HTML which links to these files to link to the archive.org copies in order to make the complete show self-contained.
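
The link rewriting described above might look something like the following Python sketch. It is illustrative only: the real script parses the HTML with HTML::TreeBuilder, whereas this sketch swaps in a regular expression for brevity. The function name rewrite_links is hypothetical, and treating every link under baseURL as an asset and keeping only the final path component as the file name are both assumptions.

```python
import posixpath
import re
from urllib.parse import urlsplit

BASE_URL = "http://hackerpublicradio.org"              # baseURL in the configuration
IA_URL_TEMPLATE = "http://archive.org/download/%s/%s"  # IAURLtemplate

# href/src attribute values in double quotes; enough for a sketch
LINK = re.compile(r'\b(href|src)="([^"]+)"')


def rewrite_links(html, item):
    """Point links to files on the HPR server at the archive.org item.

    Returns the edited HTML and the list of original URLs that would
    need to be downloaded as assets.
    """
    assets = []

    def replace(match):
        attr, url = match.groups()
        if not url.startswith(BASE_URL + "/"):
            return match.group(0)  # external link: leave untouched
        name = posixpath.basename(urlsplit(url).path)
        if not name:
            return match.group(0)  # no file name to copy: leave untouched
        assets.append(url)
        return f'{attr}="{IA_URL_TEMPLATE % (item, name)}"'

    return LINK.sub(replace, html), assets
```

Calling rewrite_links(notes, "hpr1234") returns the adjusted notes together with the URLs to fetch into the upload area.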

DIAGNOSTICS

  • Configuration file ... not found

    One or more of the configuration files has not been found.

  • Path ... not found

    The path specified in the uploads definition in the configuration file .make_metadata.cfg does not exist. Check the configuration file.

  • Configuration data missing

    While checking the configuration file(s) the script has detected that settings are missing. Check the details specified below and provide the missing elements.

  • Mis-match between @fields and %dispatch!

    An internal error in the script has been detected where the elements of the @fields array do not match the keys of the %dispatch hash. This is probably the result of a failed attempt to edit either of these components.

    Correct the error and run the script again.

  • Invalid list; no elements

    There are no list elements in the -list=LIST option.

  • Invalid list; too many elements

    There are more than the allowed 20 elements in the list specified by the -list=LIST option.

  • Failed to parse -list=...

    The value given to the -list=LIST option was not a comma-separated list of numbers.

  • Invalid starting episode number (...)

    The value used in the -from option must be greater than 0.

  • Do not combine -to and -count

    Using both the -to and -count options is not permitted (and makes no sense).

  • Invalid range; ... is greater than ...

    The -from episode number must be less than or equal to the -to number.

  • Invalid range; range is too big (>20)

    The difference between the starting and ending episode number is greater than 20.

  • Invalid - too many '%d' sequences in '...'

    The name of the output file contained more than two '%d' sequences when a range of episodes was being processed, or more than one when a single episode was specified.

  • Invalid - too few '%d' sequences in '...'

    There were fewer than two '%d' sequences in the name of the output file when a range of episodes was being processed.

  • Unable to open ... for output: ...

    The script was unable to open the requested output file.

  • Unable to find or download ...

    The script has not found a .WAV file in the cache area so has attempted to download the MP3 copy of the audio from the HPR website. This process has failed.

  • Failed to find requested episode

    An episode number could not be found in the database. This error is not fatal.

  • Nothing to do

    After processing the range of episodes specified the script could not find anything to do. This is most often caused by all of the episodes in the range being invalid.

  • Aborted due to missing summaries and/or tags

    One or more of the shows being processed does not have a summary or tags. The script has been told not to ignore this so has aborted before generating metadata.

  • HTML::TreeBuilder failed to parse notes: ...

    The script failed to parse the HTML in the notes of one of the episodes. This indicates a serious problem with these notes and is fatal since these notes need to be corrected before the episode is uploaded to the Internet Archive.

  • HTML::TreeBuilder failed to process ...: ...

    While parsing the HTML in a related file the parse has failed. The file being parsed is reported as well as the error that was encountered. This is likely due to bad HTML.

  • Unable to open ... for writing: ...

    The script is attempting to open an HTML file which it has downloaded to write back edited HTML, yet the open has failed. The filename is in the error message as is the cause of the error.
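
The option checks behind several of the diagnostics above can be sketched as follows. This is an illustrative Python sketch, not the script's Perl code. The function name, the order of the checks, and the way -count is turned into an end point (start + count - 1) are assumptions. The message texts are taken from the list above.

```python
MAX_EPISODES = 20  # limit stated in the OPTIONS section


def episodes_from_options(start=None, to=None, count=None, episode_list=None):
    """Return the episode numbers selected by -from/-to/-count or -list."""
    if episode_list is not None:
        items = [part.strip() for part in episode_list.split(",") if part.strip()]
        if not items:
            raise ValueError("Invalid list; no elements")
        if not all(item.isdigit() for item in items):
            raise ValueError(f"Failed to parse -list={episode_list}")
        if len(items) > MAX_EPISODES:
            raise ValueError("Invalid list; too many elements")
        return [int(item) for item in items]

    if start is None or start <= 0:
        raise ValueError(f"Invalid starting episode number ({start})")
    if to is not None and count is not None:
        raise ValueError("Do not combine -to and -count")
    if count is not None:
        if count > MAX_EPISODES:
            raise ValueError("Invalid range; range is too big (>20)")
        to = start + count - 1  # assumption: count includes the first episode
    elif to is None:
        to = start  # single episode
    if start > to:
        raise ValueError(f"Invalid range; {start} is greater than {to}")
    if to - start > MAX_EPISODES:
        raise ValueError("Invalid range; range is too big (>20)")
    return list(range(start, to + 1))
```

For example, episodes_from_options(start=1234, to=1235) yields [1234, 1235], while passing both to and count raises the "Do not combine -to and -count" error.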

CONFIGURATION AND ENVIRONMENT

This script reads two configuration files in Config::General format (similar to Apache configuration files) for the path to the files to be uploaded and for credentials to access the HPR database. Two files are used because the database configuration file is used by several other scripts.

The general configuration file is .make_metadata.cfg (although this can be overridden through the -config=FILE option) and contains the following lines:

uploads = "<path to files>"
filetemplate = "hpr%04d.%s"
baseURL = "http://hackerpublicradio.org"
URLtemplate = "http://hackerpublicradio.org/local/%s"
IAURLtemplate = "http://archive.org/download/%s/%s"

The uploads line defines where the WAV files are to be found (currently /var/IA/uploads on the VPS). The same area is used to store downloaded MP3 files and any supplementary files associated with the episode.

The filetemplate line defines the format of an audio file such as hpr1234.wav. This should not be changed.

The baseURL line defines the common base for download URLs. It is used when parsing and standardising URLs relating to files on the HPR server.

The URLtemplate line defines the format of the URL required to download the MP3 audio. This should not be changed except in the unlikely event that the location of audio files on the server changes.

The IAURLtemplate line defines the format of URLs on archive.org which is used when generating new links in HTML notes or supplementary files.
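
As a rough illustration of how these templates expand, the following Python sketch fills them in with printf-style formatting. It is not the script's Perl code and the function names are hypothetical. It assumes that the URLtemplate placeholder receives the MP3 file name and that the two IAURLtemplate placeholders receive the archive.org item name and the file name, in that order.

```python
FILE_TEMPLATE = "hpr%04d.%s"                            # filetemplate
URL_TEMPLATE = "http://hackerpublicradio.org/local/%s"  # URLtemplate
IA_URL_TEMPLATE = "http://archive.org/download/%s/%s"   # IAURLtemplate


def audio_file(episode, extension):
    """Name of an audio file, e.g. hpr1234.wav."""
    return FILE_TEMPLATE % (episode, extension)


def mp3_url(episode):
    """URL of the MP3 copy on the HPR server (assumed placeholder use)."""
    return URL_TEMPLATE % audio_file(episode, "mp3")


def ia_url(item, filename):
    """URL of a file within an archive.org item (assumed placeholder order)."""
    return IA_URL_TEMPLATE % (item, filename)


if __name__ == "__main__":
    print(audio_file(1234, "wav"))
    print(mp3_url(1234))
    print(ia_url("hpr1234", audio_file(1234, "mp3")))
```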

The database configuration file is .hpr_db.cfg (although this can be overridden through the -dbconfig=FILE option).

The layout of the file should be as follows:

<database>
    host = 127.0.0.1
    port = PORT
    name = DATABASE
    user = USERNAME
    password = PASSWORD
</database>
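
For readers who want to inspect these files outside Perl, the following Python sketch reads the small subset of the Config::General format used here: "key = value" lines and a single level of named blocks. It is an illustration under those assumptions, not a full implementation of the format and not how the script itself reads its configuration. Treating all five database settings as required is also an assumption, and the error text merely mirrors the "Configuration data missing" diagnostic.

```python
import re

BLOCK_OPEN = re.compile(r"^<(\w+)>$")
BLOCK_CLOSE = re.compile(r"^</(\w+)>$")
SETTING = re.compile(r"^(\w+)\s*=\s*(.*)$")

# Settings shown in the layout above (assumed to be mandatory)
REQUIRED = ("host", "port", "name", "user", "password")


def parse_config(text):
    """Parse 'key = value' lines and one level of <block> ... </block>."""
    config = {}
    current = config
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if BLOCK_CLOSE.match(line):
            current = config
        elif BLOCK_OPEN.match(line):
            current = config.setdefault(BLOCK_OPEN.match(line).group(1), {})
        elif SETTING.match(line):
            key, value = SETTING.match(line).groups()
            current[key] = value.strip().strip('"')
        else:
            raise ValueError(f"cannot parse line: {raw!r}")
    return config


def database_settings(text):
    """Return the <database> block, checking that nothing is missing."""
    settings = parse_config(text).get("database", {})
    missing = [key for key in REQUIRED if key not in settings]
    if missing:
        raise ValueError("Configuration data missing: " + ", ".join(missing))
    return settings
```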

DEPENDENCIES

Carp
Config::General
DBI
Data::Dumper
File::Find::Rule
File::Path
Getopt::Long
HTML::Entities
HTML::TreeBuilder
IO::HTML
LWP::Simple
List::MoreUtils
List::Util
Pod::Usage
Text::CSV_XS

BUGS AND LIMITATIONS

There are no known bugs in this module. Please report problems to Dave Morriss (Dave.Morriss@gmail.com). Patches are welcome.

AUTHOR

Dave Morriss (Dave.Morriss@gmail.com)

LICENCE AND COPYRIGHT

Copyright (c) 2014-2019 Dave Morriss (Dave.Morriss@gmail.com). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perldoc perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.