A tool to extract data from html files #3

ken_fallon · 2024-12-24T11:50:10Z

ken_fallon commented

2024-12-24 11:50:10 +00:00

We cannot find a tool that parses the HTML, extracts the embedded binary images, and replaces each removed data stream with a link to the local image.

Requirement

Requires one or more HTML files to be passed to the app.
Optional prefix -p --prefix for the image, defaults to "image".
Option -nb --no-backup to disable backup files.
Option -f --force to enable overwriting of images.
Loop through each file in turn
- If the file is missing or is not an HTML file, report an error and continue with the next HTML file
- Unless the option -nb --no-backup is set, create a backup of the HTML file with a .bak extension. E.g. index.html would be copied to index.html.bak
- Find all data:.*"> data streams
  - If any are not image <img src=data:image.*"> data streams, exit the entire program with an error.
- Save the image file to disk as ${fileprefix}_${prefix}_${increment}.${extension}, e.g. index_image_1.jpg. Where:
  - ${fileprefix} is the ${html file name%.*} of the current HTML file.
  - ${prefix} is the provided --prefix string, or image if no --prefix is provided.
  - ${increment} is the current count of images parsed by the script
  - ${extension} is the file extension matching the media type of the image as discovered by the script
- If the image already exists, stop processing this HTML file and continue with the next, unless the option -f --force is passed, in which case overwrite the images.
- Replace the removed data stream with a placeholder <img src="${fileprefix}_${prefix}_${increment}.${extension}">, e.g. <img src="index_image_1.jpg">
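
To make the requirement concrete, here is a rough Python sketch of the loop described above. It is only an illustration under several assumptions, not the eventual tool: the function names, the EXTENSIONS mapping and the regular expression are all hypothetical, it only looks at base64 data URIs inside a double-quoted src attribute, and it checks for existing image files before writing any of them.

```python
import argparse
import base64
import re
import shutil
from pathlib import Path

# Assumption: only base64 data URIs in a double-quoted src attribute are handled.
DATA_URI = re.compile(r'src="data:(?P<mime>[^;",]+);base64,(?P<data>[^"]+)"')

# Hypothetical media-type-to-extension mapping; not part of the requirement.
EXTENSIONS = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Extract inline images from HTML files")
    parser.add_argument("files", nargs="+", type=Path)
    parser.add_argument("-p", "--prefix", default="image")
    parser.add_argument("-nb", "--no-backup", action="store_true")
    parser.add_argument("-f", "--force", action="store_true")
    return parser.parse_args(argv)


def extract_images(html, fileprefix, prefix, out_dir, force=False):
    """Return the rewritten HTML, or None if an image file already exists."""
    count = 0
    pending = {}  # file name -> decoded image bytes

    def replace(match):
        nonlocal count
        mime = match.group("mime")
        if mime not in EXTENSIONS:
            # The requirement asks for the whole program to stop here.
            raise SystemExit(f"error: unsupported data stream type: {mime}")
        count += 1
        name = f"{fileprefix}_{prefix}_{count}.{EXTENSIONS[mime]}"
        pending[name] = base64.b64decode(match.group("data"))
        return f'src="{name}"'

    new_html = DATA_URI.sub(replace, html)
    if not force and any((out_dir / name).exists() for name in pending):
        return None
    for name, data in pending.items():
        (out_dir / name).write_bytes(data)
    return new_html


def main(argv=None):
    args = parse_args(argv)
    for path in args.files:
        if not path.is_file() or path.suffix.lower() not in (".html", ".htm"):
            print(f"error: {path} is missing or is not an HTML file")
            continue
        if not args.no_backup:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
        html = path.read_text(encoding="utf-8")
        result = extract_images(html, path.stem, args.prefix, path.parent, args.force)
        if result is None:
            print(f"skipping {path}: image file already exists (use --force)")
            continue
        path.write_text(result, encoding="utf-8")
```

Calling main() with no arguments would read the options from the command line. A real implementation would probably want a proper HTML parser rather than a regular expression, since attribute quoting and ordering vary between documents.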

davmo commented

2024-12-24 12:54:10 +00:00

Include a table of MIME types which are acceptable for further processing.
- Maybe this could be in a configuration file to make it more straightforward to change.
- Any data URI instances found which don't match the table would be rejected, or each instance could have a go/no-go marker.
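
As a possible shape for this, here is a small Python sketch of a MIME table with a go/no-go check. The table contents, the JSON configuration format and the function names are assumptions made for illustration; nothing here reflects an agreed design.

```python
import json

# Hypothetical default table: acceptable media type -> file extension.
DEFAULT_TABLE = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}


def load_mime_table(text=None):
    """Parse a JSON object of media types, or fall back to the default table."""
    return dict(DEFAULT_TABLE) if text is None else json.loads(text)


def check_data_uri(uri, table):
    """Return (go, media_type, extension) for a data URI."""
    if not uri.startswith("data:"):
        return (False, None, None)
    header = uri[5:].split(",", 1)[0]
    # RFC 2397: an omitted media type defaults to text/plain.
    mime = header.split(";", 1)[0].strip().lower() or "text/plain"
    extension = table.get(mime)
    return (extension is not None, mime, extension)
```

With a table like this in a configuration file, adding a new image type would only need a new entry, for example "image/webp": "webp", rather than a code change.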

ken_fallon commented

2024-12-27 13:08:44 +00:00

If external images are referenced, download them locally
- e.g. <img src="https://example.com/image.png" /> would be saved locally as ${fileprefix}_${prefix}_${increment}.${extension}
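
A minimal Python sketch of how this might work, again under assumptions: the names localise_images and fetch, the EXTENSIONS mapping and the start parameter (for continuing the image count) are hypothetical. The extension is taken from the Content-Type header of the response, and images of an unknown type are left untouched.

```python
import re
import urllib.request
from pathlib import Path

# Assumption: external images appear as http(s) URLs in a double-quoted src attribute.
EXTERNAL_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")(https?://[^"]+)(")', re.IGNORECASE)

# Hypothetical media-type-to-extension mapping.
EXTENSIONS = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}


def fetch(url):
    """Download url and return (media_type, body)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.headers.get_content_type(), response.read()


def localise_images(html, fileprefix, prefix, out_dir, start=1, fetch=fetch):
    """Save external images into out_dir and point each src at the local copy."""
    count = start - 1

    def replace(match):
        nonlocal count
        mime, body = fetch(match.group(2))
        extension = EXTENSIONS.get(mime)
        if extension is None:
            return match.group(0)  # unknown type: keep the external reference
        count += 1
        name = f"{fileprefix}_{prefix}_{count}.{extension}"
        (Path(out_dir) / name).write_bytes(body)
        return match.group(1) + name + match.group(3)

    return EXTERNAL_SRC.sub(replace, html)
```

The fetch argument can be swapped for a stub when testing, so no network access is needed to check the renaming logic.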

davmo commented

2025-01-04 11:51:46 +00:00

A first release of extract_images was made available in early January 2025.
It does not contain a MIME types table at present.
The request for external image collection has not yet been added.

Reference: HPR/hpr-tools#3