Update 'Managing_Assets'
parent
db898575db
commit
102d4d7efe
@ -1,147 +1,149 @@
# Static Site - asset management
### Dave Morriss
### Last update: 2023-09-01 11:55:51

* * *

## Overview

This document describes a planned script that will perform actions on the HPR
database. The main actions are:

- Manage the existing contents in the `assets` table for a given show
    - This means that the table entries need to represent all of the files
      such as audio files, transcript files and any files included by the
      creator of the show.
    - The asset information is primarily used to provide information about
      audio files when creating RSS feeds including this show. In particular
      the audio file size is needed, as well as its name.
    - Show creator-provided assets are currently stored in the corresponding
      item on the Internet Archive. Previously they were held on the HPR
      server, but due to space restrictions on the new server they are now
      held on archive.org. The `assets` table entries should indicate file
      locations. ***It is possible that alternative locations will be
      provided in the future, and this will need to be reflected in this
      table.***

> Can you clarify the location? Do you mean relative to the server root? If this is on someone else's system we can only dictate the path under the directory they are serving the files from, so `/some/path/to/web/server/hpr/` and from there `hosts/` or `series/hpr9999` etc.

- It is intended that this script will be run to populate the `assets` table
  for a new show. ***It will also set the `valid` field of the `eps`
  (episodes) table to `true` once the asset details are ready.***

> Should we try to split these functions into two, keeping to the Unix idea of doing only one job?

- The script will additionally manage links to assets in the `notes` field of
  the `eps` table for the show.
    - Existing shows with creator-provided assets will contain links that
      assume the assets are on the HPR server. These must be updated to
      reflect their current locations (on the Internet Archive).
    - This same process will be required should assets be moved to another
      location.
    - The process of changing HTML links will be considerably simplified if a
      record of their current location is kept in the `assets` table. A URL
      stored in the `assets` table might be the only record of which URLs in
      the HTML notes and any HTML assets are assets and which are other
      resources. At present all asset URLs are either absolute, referring to
      the HPR IP address, or relative, where the base URL is assumed to be
      HPR.
    - It is assumed that URLs stored in the `assets` table will be absolute.
      Other strategies might be used, however.

- Details of show creator-provided assets do not exist at present. It will be
  necessary to gather this information as new shows are added and existing
  shows post-processed.

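As a rough illustration of what a row in the `assets` table might hold, the following Python sketch models one record and derives an absolute Internet Archive URL for it. The field names, the `hpr<number>` item naming and the `https://archive.org/download/` prefix are assumptions made for this example, not the actual table schema, and the planned script may well be written in another language.

```python
from dataclasses import dataclass
from urllib.parse import quote

# Assumed prefix for files held in an Internet Archive item.
IA_DOWNLOAD_BASE = "https://archive.org/download"


@dataclass(frozen=True)
class AssetRecord:
    """One hypothetical row of the `assets` table."""
    episode_id: int   # show number
    filename: str     # file name within the show's item
    size: int         # size in bytes (needed for RSS feeds)
    url: str          # absolute URL of the file's current location


def ia_asset_url(episode_id: int, filename: str) -> str:
    """Build an absolute URL, assuming items are named like 'hpr0042'."""
    item = f"hpr{episode_id:04d}"
    # quote() escapes spaces and similar characters but keeps '/' intact.
    return f"{IA_DOWNLOAD_BASE}/{item}/{quote(filename)}"


def make_record(episode_id: int, filename: str, size: int) -> AssetRecord:
    """Create a record whose URL points at the Internet Archive copy."""
    return AssetRecord(episode_id, filename, size,
                       ia_asset_url(episode_id, filename))
```

If assets later move to another host, only the stored `url` values (and the prefix used to build them) would need to change in a design like this.
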
## Script design

### General

- The script will use a configuration file for database credentials and other
  settings.

- It will use options to specify which episode or episodes are to be
  processed and can be run in a `dry-run` mode to report what it would do
  without doing it.

- The script has the name `manage_assets` at the moment.

- It will log the actions it takes.

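The general design above could be sketched along the following lines. This is a minimal Python outline, not the planned script: the option names (`--config`, `--dry-run`), the default configuration file name and the `process_episode` helper are all hypothetical.

```python
import argparse
import configparser
import logging


def parse_args(argv=None):
    """Parse the command line; the option names here are assumptions."""
    parser = argparse.ArgumentParser(
        prog="manage_assets",
        description="Maintain the assets table for one or more shows.")
    parser.add_argument("episodes", nargs="+", type=int, metavar="EPISODE",
                        help="show number(s) to process")
    parser.add_argument("--config", default="manage_assets.cfg",
                        help="configuration file with database settings")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be done without doing it")
    return parser.parse_args(argv)


def load_config(path):
    """Read an INI-style file; a missing file simply yields no sections."""
    config = configparser.ConfigParser()
    config.read(path)
    return config


def process_episode(episode, dry_run, log):
    """Placeholder for the real work; returns True if changes were made."""
    action = f"update assets for show {episode}"
    if dry_run:
        log.info("dry-run: would %s", action)
        return False
    log.info("%s", action)
    return True


def main(argv=None):
    """Process each requested show and return those that were changed."""
    args = parse_args(argv)
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("manage_assets")
    load_config(args.config)
    return [ep for ep in args.episodes
            if process_episode(ep, args.dry_run, log)]
```

Calling `main(["--dry-run", "3900", "3901"])` would log what would be done for both shows and return an empty list, since nothing is changed in dry-run mode.
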
### Information sources

- One of the main sources of show creator asset information is currently the
  notes. This technique is used in `make_metadata`, the script which prepares
  shows for upload to the Internet Archive. In the past it has scanned the
  notes for file references on the HPR server, and has used this information
  to download these files in order to upload them.
    - The approach taken by `make_metadata` has been to scan the notes for
      files, as mentioned, and if any of these have themselves been HTML, to
      scan these too for file references.
    - The goal of scanning for files was first to ensure that they were
      uploaded to the Internet Archive, but secondly to rewrite their URLs
      such that the shows were self-contained on the Internet Archive.

- Another source of the asset information, both the show creator-produced
  assets, and the audio and transcripts, is the Internet Archive itself. It is
  possible to collect metadata from there (in JSON format) which lists all the
  files originally uploaded.

- There may have been a few files which were not uploaded to the Internet
  Archive because they were not referenced in the notes. If this is the case,
  only a scan of the backups of the files stored on the old HPR server can
  identify them, and hopefully allow them to be added to the Internet Archive
  and referenced by the episode.

- In the past, the information gathered about assets was not stored in the
  database. It is important that this deficiency be rectified by the
  `manage_assets` script so that it will not be necessary to hunt through
  notes and subsidiary HTML files to find their names in the future.

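The two main information sources described above can be illustrated with a short Python sketch. It is not how `make_metadata` works internally; the HPR host names, the function names and the assumed layout of the Internet Archive metadata (a `files` list whose entries carry `name`, `source` and a string `size`) are assumptions for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed host names under which HPR-hosted files appear in notes.
HPR_HOSTS = {"hackerpublicradio.org", "www.hackerpublicradio.org"}


class LinkCollector(HTMLParser):
    """Collect every href and src attribute value found in the HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def hpr_file_references(notes_html):
    """Return links that are relative or that point at an HPR host."""
    collector = LinkCollector()
    collector.feed(notes_html)
    refs = []
    for link in collector.links:
        parsed = urlparse(link)
        relative = not parsed.scheme and not parsed.netloc and parsed.path
        if relative or parsed.hostname in HPR_HOSTS:
            refs.append(link)
    return refs


def original_files(metadata):
    """List (name, size) for files marked as originally uploaded.

    `metadata` is the decoded JSON for one item; the key names used
    here are assumptions about its layout.
    """
    return [(f["name"], int(f["size"]))
            for f in metadata.get("files", [])
            if f.get("source") == "original" and "size" in f]
```

Any HTML file found by `hpr_file_references` could itself be passed through the same function to find files it refers to, which mirrors the two-level scan described for `make_metadata`.
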
### Algorithms

[This is a first draft, and is likely to be incomplete]

- Given a show number, the script will search for it in the database.
    - If not found, then that show will be skipped.
    - If found then the entries in the `assets` table will be collected as
      well as the `eps.valid` setting.
    - If no assets are found and `eps.valid = 0` then this is a new show, the
      asset details of which are to be loaded into the `assets` table.
        - The Internet Archive upload might be ongoing, which can be
          determined by querying the `IA` API for pending tasks. If all tasks
          have run, the metadata can be collected and used to fill in asset
          details. Rather than waiting for tasks to complete it will probably
          be easier to skip this show and process it later.
        - Currently, some audio file details are obtained from the files
          themselves. Quite how to do this needs discussion - unless the
          `manage_assets` script is being run on the system that holds the
          files it might be problematic (though an SSH connection could be
          used to do this remotely).
    - If assets are found but `eps.valid = 0` this is an anomaly.
    - If assets are found and `eps.valid = 1` this is a show that has
      previously been uploaded.
        - The assets can be collected from the Internet Archive metadata and
          compared with what is stored. Any that are missing can be added, and
          any that differ can be updated. Possibly, any asset records in the
          database but not on the Internet Archive can be deleted.
        - If it is necessary to obtain details of assets that are not stored
          in the Internet Archive then it might be necessary to download the
          files and store them in a cache for examination - after which they
          will be deleted.
    - **NOTE** This section is in need of further thought! \
      The notes and asset files will need to be scanned to determine if the
      URLs need to be changed.
        - Ideally, asset URLs should be absolute. If so, it is simple to
          determine if a change is needed.
        - We have no means of marking a show in the database as having been
          processed by the `manage_assets` script.
        - Storing the absolute asset URL in the `assets` table will help to
          simplify processing. If the URL in the table is the same as that in
          the notes, then no change is needed. If not, then presumably an
          update *is* needed.
        - This is complicated by the presence of relative URLs.
        - The change required in a given asset URL can be determined by a
          *base URL* in the configuration file.

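The draft algorithm can be sketched in Python as three small functions: one to classify a show, one to plan changes to its asset records, and one to decide whether a URL in the notes needs rewriting. All names are hypothetical. The case of a valid show with no asset records is not covered by the draft, so treating it as an anomaly here is an assumption, as is comparing assets by file size.

```python
from enum import Enum
from urllib.parse import urljoin


class ShowState(Enum):
    NOT_FOUND = "skip: show not in database"
    NEW = "new show: load asset details"
    ANOMALY = "anomaly: needs investigation"
    UPLOADED = "previously uploaded: compare and update"


def classify_show(found, asset_count, valid):
    """Map the database state of a show onto the cases in the draft."""
    if not found:
        return ShowState.NOT_FOUND
    if asset_count == 0 and not valid:
        return ShowState.NEW
    if asset_count > 0 and valid:
        return ShowState.UPLOADED
    # Assets with valid = 0 (from the draft), or valid = 1 with no
    # assets (not covered by the draft; an assumption made here).
    return ShowState.ANOMALY


def plan_changes(stored, archived):
    """Compare {filename: size} maps from the database and the archive.

    Returns sorted lists of names to add, update and (possibly) delete.
    """
    to_add = sorted(archived.keys() - stored.keys())
    to_update = sorted(name for name in archived.keys() & stored.keys()
                       if archived[name] != stored[name])
    to_delete = sorted(stored.keys() - archived.keys())
    return to_add, to_update, to_delete


def needs_update(notes_url, stored_url, base_url):
    """True if a URL in the notes differs from the stored absolute URL.

    Relative URLs are resolved against the configured base URL first;
    absolute URLs pass through urljoin() unchanged.
    """
    return urljoin(base_url, notes_url) != stored_url
```

For example, with a base URL of `https://hackerpublicradio.org/eps/` (an assumed value), a relative link `hpr3900/photo.jpg` resolves to an HPR address, which differs from a stored archive.org URL, so `needs_update` reports that the link should be rewritten.
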
<!--
- vim: syntax=markdown:ts=8:sw=4:ai:et:tw=78:fo=tcqn:fdm=marker:com+=fb\:-
-->