3 Managing_Assets
Ken Fallon edited this page 2023-09-02 16:39:32 +00:00

Static Site - asset management

Dave Morriss

Last update: 2023-09-01 11:55:51


Overview

This document describes a script which is being planned to perform actions on the HPR database. The main actions are:

  • Manage the existing contents in the assets table for a given show
    • This means that the table entries need to represent all of the files such as audio files, transcript files and any files included by the creator of the show.
    • The asset information is primarily used to provide information about audio files when creating RSS feeds including this show. In particular the audio file size is needed, as well as its name.
    • Show creator-provided assets are currently stored in the corresponding item on the Internet Archive. Previously they were held on the HPR server, but due to space restrictions on the new server they are now held on archive.org. The assets table entries should indicate file locations. It is possible that alternative locations will be provided in the future, and this will need to be reflected in this table.

Can you clarify the location? Do you mean relative to the server root? If this is on someone else's system we can only dictate the path under where they are serving the files from. So /some/path/to/web/server/hpr/ and from there hosts/ or series/hpr9999 etc.

  • It is intended that this script will be run to populate the assets table for a new show. It will also set the valid field of the eps (episodes) table to true once the asset details are ready.

Should we try to split these functions into two? Keeping to the Unix idea of doing only one job.

  • The script will additionally manage links to assets in the notes field of the eps table for the show.

    • Existing shows with creator-provided assets will contain links to them, assuming they are on the HPR server. These must be updated to reflect their current locations (on the Internet Archive).
    • This same process will be required should assets be moved to another location.
    • The process of changing HTML links will be considerably simplified if a record of their current location is kept in the assets table. A URL stored in the assets table might be the only record of which URLs in the HTML notes and any HTML assets are assets and which are other resources. At present all asset URLs are either absolute, referring to the HPR IP address, or relative, where the base URL is assumed to be HPR.
    • It is assumed that URLs stored in the assets table will be absolute. Other strategies might be used however.
  • Details of show creator-provided assets do not exist at present. It will be necessary to gather this information as new shows are added and existing shows post-processed.
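
The handling of absolute and relative asset URLs described above can be sketched as follows. This is only an illustration under stated assumptions: the base URLs, the path layout and the function names (to_absolute, relocate) are hypothetical and are not part of any existing HPR script.

```python
from urllib.parse import urljoin

# Hypothetical base URLs; real values would come from the configuration file.
OLD_BASE = "https://hpr.example/"
NEW_BASE = "https://archive.example/download/"


def to_absolute(url, base=OLD_BASE):
    """Resolve a possibly relative URL against the assumed base URL."""
    return urljoin(base, url)


def relocate(url, old_base=OLD_BASE, new_base=NEW_BASE):
    """Return the URL moved from old_base to new_base.

    Returns None when the URL does not live under old_base, in which
    case it is treated as some other resource rather than an asset.
    """
    absolute = to_absolute(url, old_base)
    if not absolute.startswith(old_base):
        return None
    return new_base + absolute[len(old_base):]
```

With these assumptions, relocate("eps/hpr9999/photo.png") and relocate("/eps/hpr9999/photo.png") give the same result, while a link to an unrelated site returns None and would be left alone.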

Script design

General

  • The script will use a configuration file for database credentials and other settings.

  • It will use options to specify which episode or episodes are to be processed, and it can be run in a dry-run mode to report what it would do without doing it.

  • The script has the name manage_assets at the moment.

  • It will log the actions it takes.
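
A minimal sketch of how these general requirements might fit together is shown below. It is written in Python purely for illustration; the option names, the configuration section name and the process function are assumptions, and the real manage_assets script may be structured quite differently.

```python
import argparse
import configparser
import logging

log = logging.getLogger("manage_assets")


def parse_args(argv=None):
    """Parse command-line options (the option names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="manage_assets")
    parser.add_argument("--config", default="manage_assets.cfg",
                        help="configuration file (hypothetical default name)")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be done without doing it")
    parser.add_argument("episodes", type=int, nargs="+",
                        help="one or more show numbers to process")
    return parser.parse_args(argv)


def load_config(text):
    """Read settings such as database credentials from INI-style text."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return config


def process(episodes, dry_run):
    """Log and return one action message per episode; no real work is done here."""
    actions = []
    for episode in episodes:
        verb = "would update" if dry_run else "updating"
        message = f"{verb} assets for show {episode}"
        log.info(message)
        actions.append(message)
    return actions
```

For example, parse_args(["--dry-run", "9999"]) yields dry_run set to True and episodes equal to [9999], which can then be passed to process.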

Information sources

  • One of the main sources of show creator asset information is currently the notes. This technique is used in make_metadata, the script which prepares shows for upload to the Internet Archive. In the past it has scanned the notes for file references on the HPR server, and has used this information to download these files in order to upload them.

    • The approach taken by make_metadata has been to scan the notes for files, as mentioned, and if any of these have themselves been HTML, to scan these too for file references.
    • The goal of scanning for files was first to ensure that they were uploaded to the Internet Archive, but secondly to rewrite their URLs such that the shows were self-contained on the Internet Archive.
  • Another source of the asset information, both the show creator-produced assets, and the audio and transcripts, is the Internet Archive itself. It is possible to collect metadata from there (in JSON format) which lists all the files originally uploaded.

  • There may have been a few files which were not uploaded to the Internet Archive because they were not referenced in the notes. If this is the case, only a scan of the backups of the files stored on the old HPR server can identify them, and hopefully allow them to be added to the Internet Archive and referenced by the episode.

  • In the past, the information gathered about assets was not stored in the database. It is important that this deficiency be rectified by the manage_assets script so that it will not be necessary to hunt through notes and subsidiary HTML files to find their names in the future.
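
The note-scanning approach described above can be illustrated with a short sketch. This is not the make_metadata implementation; it is a Python illustration using the standard html.parser module, and the class name, function name and base URL are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_file_references(html, base):
    """Return absolute URLs under base that are referenced in html.

    Relative links are resolved against base (assumed to be the HPR
    server). Links elsewhere are ignored. Order is preserved and
    duplicates are dropped.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    absolute = (urljoin(base, link) for link in collector.links)
    return list(dict.fromkeys(url for url in absolute if url.startswith(base)))
```

Any reference found this way that is itself an HTML file could be fetched and passed through the same function, which mirrors the second-level scan mentioned above. Fetching is deliberately left out of the sketch.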

Algorithms

[This is a first draft, and is likely to be incomplete]

  • Given a show number, the script will search for it in the database.
    • If not found, then that show will be skipped
    • If found then the entries in the assets table will be collected as well as the eps.valid setting.
    • If no assets are found and eps.valid = 0 then this is a new show, the asset details of which are to be loaded into the assets table.
      • The Internet Archive upload might be ongoing, which can be determined by querying the IA API for pending tasks. If all tasks have run, the metadata can be collected and used to fill in asset details. Rather than waiting for tasks to complete it will probably be easier to skip this show and process it later.
      • Currently, some audio file details are obtained from the files themselves. Exactly how to do this needs discussion. Unless the manage_assets script is being run on the system that holds the files it might be problematic (though an SSH connection could be used to do this remotely).
    • If assets are found but eps.valid = 0 this is an anomaly.
    • If assets are found and eps.valid = 1 this is a show that has previously been uploaded.
      • The assets can be collected from the Internet Archive metadata and compared with what is stored. Any that are missing can be added, and any that differ can be updated. Possibly, any asset records in the database but not on the Internet Archive can be deleted.
      • If it is necessary to obtain details of assets that are not stored in the Internet Archive then it might be necessary to download the files and store them in a cache for examination - after which they will be deleted.
    • NOTE: This section is in need of further thought! The notes and asset files will need to be scanned to determine if the URLs need to be changed.
      • Ideally, asset URLs should be absolute. If so, it is simple to determine if a change is needed.
      • We have no means of marking a show in the database as having been processed by the manage_assets script.
      • Storing the absolute asset URL in the assets table will help to simplify processing. If the URL in the table is the same as that in the notes, then no change is needed. If not, then presumably an update is needed.
      • This is complicated by the presence of relative URLs.
      • The change required in a given asset URL can be determined by a base URL in the configuration file.
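
Since this is a first draft, the following is only a sketch of the decision steps and of the comparison between stored asset records and the Internet Archive file list. It is written in Python for illustration; the function names, the returned labels and the use of file size as the compared detail are assumptions. The draft does not say what should happen when no assets are found but eps.valid = 1, so that case is reported as unspecified.

```python
def classify_show(found, has_assets, valid):
    """Map the database state of a show to one of the draft's cases."""
    if not found:
        return "skip"          # show is not in the database
    if not has_assets and not valid:
        return "new"           # asset details still have to be loaded
    if has_assets and not valid:
        return "anomaly"       # assets present but show not marked valid
    if has_assets and valid:
        return "uploaded"      # previously uploaded; compare with the IA
    return "unspecified"       # no assets but valid: not covered by the draft


def reconcile(stored, remote):
    """Compare stored asset records with files listed on the Internet Archive.

    Both arguments map a file name to a detail such as its size.
    Returns three sorted lists of names: to add, to update, to delete.
    """
    to_add = sorted(name for name in remote if name not in stored)
    to_update = sorted(name for name in remote
                       if name in stored and stored[name] != remote[name])
    to_delete = sorted(name for name in stored if name not in remote)
    return to_add, to_update, to_delete
```

In a dry run the three lists from reconcile could simply be reported. Whether records in the to-delete list should really be removed is left open, as it is in the draft.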