Updates for missing asset "repair"

InternetArchive/recover_transcripts: Bash script to be run on 'borg'
    which collects files missing on the IA ready for upload as part of
    the missing asset repair process.

InternetArchive/repair_assets: Bash script to take assets from the IA
    (after they had been repaired on 'borg') and copy them to the HPR
    server for the notes to access. The local machine, where this was
    run, was used to store files being uploaded. The planned script to
    modify the notes to reflect the new file locations was never
    finished. Notes were edited with Vim using a few macros.

InternetArchive/repair_item: Bash script which is best run on 'borg',
    which repairs an IA item by comparing the files on the IA with the
    files on 'borg' (or a local machine). These files are either in
    '/data/IA/uploads/' or in the temporary file hierarchy used by
    'recover_transcripts' (which calls it). Used after a normal IA
    upload to check for and make good any missed file uploads (due to
    timeouts, etc). Also used during asset repairs, but that project is
    now finished.

InternetArchive/snapshot_metadata: Bash script which collects detailed
    metadata from the IA in JSON format and saves it locally (run on
    a local PC). Older shows on the IA often contained derivative files
    which were identified by the script 'view_derivatives'. These files
    were never needed, they were IA artefacts, so can be deleted (see
    the script header for how).

InternetArchive/view_derivatives: Perl script to interpret a file of
    JSON metadata from the IA for an HPR show in order to determine the
    parent-child hierarchy of files where there may be derivatives. We
    don't want IA-generated derivatives, but this process was hard to
    turn off in earlier times. Generates a hierarchical report and
    a list of unwanted derivatives (see 'snapshot_metadata' for more
    details of how this was used).
This commit is contained in:
Dave Morriss
2024-11-23 22:28:52 +00:00
parent 6aff45a10b
commit 31eb5d200f
5 changed files with 249 additions and 88 deletions

View File

@@ -6,7 +6,7 @@
# USAGE: ./repair_assets showid
#
# DESCRIPTION: Given a show where there was a directory of asset files on the
# old HPR server whichj got lost in the migration, rebuild it
# old HPR server which got lost in the migration, rebuild it
# and fill it with assets from the IA. Modify the show notes to
# point to these recovered assets.
#
@@ -15,15 +15,15 @@
# BUGS: ---
# NOTES: ---
# AUTHOR: Dave Morriss (djm), Dave.Morriss@gmail.com
# VERSION: 0.0.8
# VERSION: 0.0.10
# CREATED: 2024-05-10 21:26:31
# REVISION: 2024-08-23 11:55:25
# REVISION: 2024-10-02 17:34:47
#
#===============================================================================
# set -o nounset # Treat unset variables as an error
VERSION="0.0.8"
VERSION="0.0.10"
SCRIPT=${0##*/}
# DIR=${0%/*}
@@ -96,6 +96,38 @@ trap 'cleanup_temp $TMP1 $TMP2' SIGHUP SIGINT SIGPIPE SIGTERM EXIT
# $3 Name of array to receive list of missing assets
# RETURNS: Nothing
#===============================================================================
# find_missing () {
# local -n IA="${1}"
# local -n HPR="${2}"
# local output="${3}"
#
# local -A hIA hHPR
# local i key
#
# #
# # Make a hash keyed by the IA file base names from an indexed array
# #
# for (( i=0; i<${#IA[@]}; i++ )); do
# hIA+=([${IA[$i]##*/}]=${IA[$i]})
# done
#
# #
# # Make a hash keyed by the HPR file base names from an indexed array
# #
# for (( i=0; i<${#HPR[@]}; i++ )); do
# hHPR+=([${HPR[$i]##*/}]=${HPR[$i]})
# done
#
# #
# # Use the basename keys to check what's missing, but return the full path
# # names.
# #
# for key in "${!hIA[@]}"; do
# if ! exists_in hHPR "$key"; then
# eval "$output+=('${hIA[$key]}')"
# fi
# done
# }
find_missing () {
local -n IA="${1}"
local -n HPR="${2}"
@@ -105,26 +137,29 @@ find_missing () {
local i key
#
# Make a hash keyed by the IA file base names from an indexed array
# Make a hash keyed by the full IA paths from an indexed array
#
for (( i=0; i<${#IA[@]}; i++ )); do
hIA+=([${IA[$i]##*/}]=${IA[$i]})
hIA+=([${IA[$i]}]=$i)
done
#
# Make a hash keyed by the HPR file base names from an indexed array
# Make a hash keyed by the HPR file paths from an indexed array, but
# remove the first element for parity with the IA paths. We are going to
# copy the IA paths, not these, so we never need the full paths again
# here.
#
for (( i=0; i<${#HPR[@]}; i++ )); do
hHPR+=([${HPR[$i]##*/}]=${HPR[$i]})
hHPR+=([${HPR[$i]#*/}]=$i)
done
#
# Use the basename keys to check what's missing, but return the full path
# names.
# Use the full path keys to check what's missing, and return the IA full
# path names.
#
for key in "${!hIA[@]}"; do
if ! exists_in hHPR "$key"; then
eval "$output+=('${hIA[$key]}')"
eval "$output+=('$key')"
fi
done
}
@@ -267,10 +302,13 @@ fi
show="${1,,}"
#
# Ensure show id is correctly formatted. We want it to be 'hpr1234'
# Ensure show id is correctly formatted. We want it to be 'hpr1234' but we
# allow the 'hpr' bit to be omitted, as well as any leading zeroes. We need to
# handle the weirdness of "leading zero means octal" though, but we always
# store it as 'hpr1234' once processed.
#
if [[ $show =~ (hpr)?([0-9]+) ]]; then
printf -v show 'hpr%04d' "${BASH_REMATCH[2]}"
printf -v show 'hpr%04d' "$((10#${BASH_REMATCH[2]}))"
else
coloured 'red' "Incorrect show specification: $show"
coloured 'yellow' "Use 'hpr9999' or '9999' format"
@@ -443,37 +481,45 @@ ignore_re="index.html$"
# Run the command and save the output. Save the asset names returned in an
# array. TODO: Handle errors from the command
#
if [[ $DRYRUN -eq 0 ]]; then
eval "$command" > "$TMP2"
RES=$?
if [[ $RES -eq 0 ]]; then
_verbose "$(coloured 'green' "Remote command successful")"
while read -r hprfile; do
if [[ ! $hprfile =~ $ignore_re ]]; then
hpr_asset+=("${hprfile}")
fi
done < "$TMP2"
_verbose "$(coloured 'green' "Assets found on HPR server = ${#hpr_asset[@]}")"
_verbose "$(printf '%s\n' "${hpr_asset[@]}")"
_log "Assets found on HPR server = ${#hpr_asset[@]}"
else
coloured 'red' "Remote command failed"
_log "Failed while searching for HPR assets"
exit 1
fi
#
# NOTE: We also want to interrogate the HPR state in dry-run mode
#
# if [[ $DRYRUN -eq 0 ]]; then
# else
# coloured 'yellow' "Would have searched for assets on the HPR server"
# fi
eval "$command" > "$TMP2"
RES=$?
if [[ $RES -eq 0 ]]; then
_verbose "$(coloured 'green' "Remote command successful")"
while read -r hprfile; do
if [[ ! $hprfile =~ $ignore_re ]]; then
hpr_asset+=("${hprfile}")
fi
done < "$TMP2"
_verbose "$(coloured 'green' "Assets found on HPR server = ${#hpr_asset[@]}")"
_verbose "$(printf '%s\n' "${hpr_asset[@]}")"
_log "Assets found on HPR server = ${#hpr_asset[@]}"
else
coloured 'yellow' "Would have searched for assets on the HPR server"
coloured 'red' "Remote command failed"
_log "Failed while searching for HPR assets"
exit 1
fi
#-------------------------------------------------------------------------------
# Compare the two asset lists and return what's missing on the HPR server
#-------------------------------------------------------------------------------
# TODO: This algorithm does not handle the instance where there are pictures
# in one directory and a lower directory containing thumbnails, AND THE FILE
# NAMES ARE THE SAME!
# TODO: The algorithm in find_missing does not handle the instance where there
# are pictures in one directory and a lower directory containing thumbnails,
# AND THE FILE NAMES ARE THE SAME!
#
declare -a missing
find_missing ia_asset hpr_asset missing
if [[ ${#hpr_asset[@]} -eq 0 ]]; then
missing=( "${ia_asset[@]}" )
else
find_missing ia_asset hpr_asset missing
fi
_verbose "$(coloured 'cyan' "** missing (${#missing[@]}):")"
_verbose "$(printf '%s\n' "${missing[@]}")"