Addition of more design ideas to Managing_Assets.md

Dave Morriss 2023-09-01 23:17:46 +01:00
parent 8d8d4062d6
commit db898575db

@@ -1,11 +1,11 @@
# Static Site - asset management
### Dave Morriss
### Last update: 2023-08-31 11:49:34
### Last update: 2023-09-01 11:55:51
* * *
# Overview
## Overview
This document describes a script which is being planned to perform actions on
the HPR database. The main actions are:
@@ -50,6 +50,98 @@ the HPR database. The main actions are:
necessary to gather this information as new shows are added and existing
shows post-processed.
## Script design
### General
- The script will use a configuration file for database credentials and other
settings.
- It will use options to specify which episode or episodes are to be
processed, and it can be run in a `dry-run` mode to report what it
would do without doing it.
- The script has the name `manage_assets` at the moment.
- It will log the actions it takes.
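
As an illustration of the points above, the following is a minimal sketch in Python of how the options, configuration file, `dry-run` mode and logging might fit together. The language, the option names (`--config`, `--episode`, `--dry-run`) and the function names are all assumptions made for this example. None of them has been decided for `manage_assets`.

```python
import argparse
import configparser
import logging


def parse_args(argv=None):
    """Parse command-line options; every option name here is hypothetical."""
    parser = argparse.ArgumentParser(prog="manage_assets")
    parser.add_argument("--config", default="manage_assets.cfg",
                        help="path to the configuration file")
    parser.add_argument("--episode", type=int, nargs="+", default=[],
                        help="one or more episode numbers to process")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be done without doing it")
    return parser.parse_args(argv)


def load_config(text):
    """Read INI-style settings from a string; a file would use read()."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return config


def process_episode(episode, dry_run, log):
    """Log the action and report whether any real work was performed."""
    if dry_run:
        log.info("dry-run: would process episode %d", episode)
        return False
    log.info("processing episode %d", episode)
    # The real database work would go here.
    return True
```

In this sketch the `dry-run` check sits in the one place where changes would be made, so the logged report follows the same path as a real run.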
### Information sources
- One of the main sources of show creator asset information is currently the
notes. This technique is used in `make_metadata`, the script which prepares
shows for upload to the Internet Archive. In the past it has scanned the
notes for file references on the HPR server, and has used this information
to download these files in order to upload them.
- The approach taken by `make_metadata` has been to scan the notes for
files, as mentioned, and if any of these have themselves been HTML, to
scan these too for file references.
- The goal of scanning for files was first to ensure that they were
uploaded to the Internet Archive, but secondly to rewrite their URLs
such that the shows were self-contained on the Internet Archive.
- Another source of the asset information, both the show creator-produced
assets, and the audio and transcripts, is the Internet Archive itself. It is
possible to collect metadata from there (in JSON format) which lists all the
files originally uploaded.
- There may have been a few files which were not uploaded to the Internet
Archive because they were not referenced in the notes. If this is the case,
only a scan of the backups of the files stored on the old HPR server can
identify them, and hopefully allow them to be added to the Internet Archive
and referenced by the episode.
- In the past, the information gathered about assets was not stored in the
database. It is important that this deficiency be rectified by the
`manage_assets` script so that it will not be necessary to hunt through
notes and subsidiary HTML files to find their names in the future.
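
The scan of the notes for file references described above could be sketched as follows. This is not the code used by `make_metadata`. It is a minimal Python illustration which assumes the notes are HTML and that references appear in `href` and `src` attributes. The class name, the function name and the `server_host` parameter are invented for the example, and the recursive scan of assets that are themselves HTML is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FileReferenceParser(HTMLParser):
    """Collect the values of href and src attributes found in HTML."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.references.append(value)


def find_file_references(html, page_url, server_host):
    """Return absolute URLs on server_host referenced by the HTML, in order."""
    parser = FileReferenceParser()
    parser.feed(html)
    found = []
    for ref in parser.references:
        absolute = urljoin(page_url, ref)  # resolves relative references
        if urlparse(absolute).hostname == server_host and absolute not in found:
            found.append(absolute)
    return found
```

Resolving each reference against the URL of the page it came from means that relative and absolute references end up in the same form, which would make them easier to store and compare later.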
### Algorithms
[This is a first draft, and is likely to be incomplete]
- Given a show number, the script will search for it in the database.
- If not found, then that show will be skipped.
- If found then the entries in the `assets` table will be collected as
well as the `eps.valid` setting.
- If no assets are found and `eps.valid = 0` then this is a new show, the
asset details of which are to be loaded into the `assets` table.
- The Internet Archive upload might be ongoing, which can be
determined by querying the `IA` API for pending tasks. If all tasks
have run, the metadata can be collected and used to fill in asset
details. Rather than waiting for tasks to complete, it will probably
be easier to skip this show and process it later.
- Currently, some audio file details are obtained from the files
themselves. Quite how to do this needs discussion - unless the
`manage_assets` script is being run on the system that holds the
files it might be problematic (though an SSH connection could be
used to do this remotely).
- If assets are found but `eps.valid = 0` this is an anomaly.
- If assets are found and `eps.valid = 1` this is a show that has
previously been uploaded.
- The assets can be collected from the Internet Archive metadata and
compared with what is stored. Any that are missing can be added, and
any that differ can be updated. Possibly, any asset records in the
database but not on the Internet Archive can be deleted.
- If it is necessary to obtain details of assets that are not stored
in the Internet Archive then it might be necessary to download the
files and store them in a cache for examination - after which they
will be deleted.
- **NOTE** This section is in need of further thought! \
The notes and asset files will need to be scanned to determine if the
URLs need to be changed.
- Ideally, asset URLs should be absolute. If so, it is simple to
determine if a change is needed.
- We have no means of marking a show in the database as having been
processed by the `manage_assets` script.
- Storing the absolute asset URL in the `assets` table will help to
simplify processing. If the URL in the table is the same as that in
the notes, then no change is needed. If not, then presumably an
update *is* needed.
- This is complicated by the presence of relative URLs.
- The change required in a given asset URL can be determined by a
*base URL* in the configuration file.
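
The decisions in this draft can be expressed as a small sketch. The Python below is only an illustration under stated assumptions. The function names and the state labels are invented. The asset records are assumed to be available as mappings from file name to details, one built from the `assets` table and one from the Internet Archive metadata. The case of no assets with `eps.valid = 1` is not covered by the draft, so it is reported separately rather than guessed at.

```python
def classify_show(asset_count, valid):
    """Map the stored asset count and the eps.valid flag to a show state."""
    if asset_count == 0 and valid == 0:
        return "new"        # asset details are to be loaded
    if asset_count > 0 and valid == 0:
        return "anomaly"
    if asset_count > 0 and valid == 1:
        return "uploaded"   # compare with the Internet Archive metadata
    return "undefined"      # no assets but valid = 1: not covered by the draft


def reconcile_assets(stored, archived):
    """Compare two {filename: details} mappings and list the changes needed."""
    to_add = sorted(name for name in archived if name not in stored)
    to_update = sorted(name for name in archived
                       if name in stored and stored[name] != archived[name])
    to_delete = sorted(name for name in stored if name not in archived)
    return to_add, to_update, to_delete
```

Returning three separate lists keeps the deletions, which the draft treats as only a possibility, apart from the additions and updates, so they could be reported in a `dry-run` without being acted upon.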
<!--
- vim: syntax=markdown:ts=8:sw=4:ai:et:tw=78:fo=tcqn:fdm=marker:com+=fb\:-
-->