A tool to extract data from html files #3

Open
opened 2024-12-24 11:50:10 +00:00 by ken_fallon · 3 comments
Owner

We cannot find a tool that would parse the html and extract the binary images, replacing the removed data stream with a link to the local image.

Requirement

  • Requires one or more html files to be passed to the app.

  • Optional prefix -p --prefix for the image, defaults to "image".

  • Option -nb --no-backup to disable backup files.

  • Option -f --force to enable overwriting of images.

  • Loop through each file in turn

    • If the file is missing or is not an html file, report an error and continue with the next html file
    • Unless the option -nb --no-backup is set, create a backup of the html file with a .bak extension. Eg index.html would be copied to index.html.bak
    • Find all data:.*"> data streams
      • if any fail to be image <img src=data:image.*"> data streams, then exit the entire program with an error.
    • Save the image file to disk as ${fileprefix}_${prefix}_${increment}.${extension}, eg index_image_1.jpg. Where:
      • ${fileprefix} is the ${html file name%.*} of the current html file.
      • ${prefix} is the provided --prefix string, or image if there is no --prefix provided.
      • ${increment} is the current count of images parsed by the script
      • ${extension} is the file extension matching the Media type of the image as discovered by the script
    • If the image already exists, stop processing this html file and continue with the next, unless the option -f --force is passed, in which case overwrite the images.
    • Replace the removed data stream with a placeholder <img src="${fileprefix}_${prefix}_${increment}.${extension}">, eg <img src="index_image_1.jpg">
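
For discussion, the core of the loop above could be sketched roughly as follows in Python. This is only an illustration of the requirement, not the released tool: the function name `extract_images`, the `EXTENSIONS` mapping and the regular expressions are all assumptions, and it only handles base64-encoded data streams with quoted src attributes.

```python
import base64
import re
from pathlib import Path

# Matches <img ... src="data:image/<subtype>;base64,<payload>"
DATA_IMG = re.compile(
    r'(<img\b[^>]*?\bsrc=)(["\'])data:image/([A-Za-z0-9.+-]+);base64,([^"\']*)\2',
    re.IGNORECASE,
)
# Any quoted attribute value that is a data stream but not an image
NON_IMAGE = re.compile(r'=\s*["\']data:(?!image/)', re.IGNORECASE)

# Assumed mapping from media subtype to file extension (not part of the issue)
EXTENSIONS = {"jpeg": "jpg", "png": "png", "gif": "gif", "svg+xml": "svg"}


def extract_images(html, fileprefix, prefix="image", outdir=".", force=False):
    """Return (new_html, written_paths) for one html document.

    Raises ValueError on a non-image data stream and FileExistsError when an
    image already exists and force is not set, so the caller can decide
    whether to exit or move on to the next html file.
    """
    if NON_IMAGE.search(html):
        raise ValueError("non-image data stream found")

    written = []
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        subtype = match.group(3).lower()
        extension = EXTENSIONS.get(subtype, subtype)
        name = f"{fileprefix}_{prefix}_{count}.{extension}"
        target = Path(outdir) / name
        if target.exists() and not force:
            raise FileExistsError(name)
        target.write_bytes(base64.b64decode(match.group(4)))
        written.append(target)
        # Keep the original tag prefix and quoting, swap only the src value
        return f"{match.group(1)}{match.group(2)}{name}{match.group(2)}"

    return DATA_IMG.sub(replace, html), written
```

Note that in this sketch any images written before a FileExistsError stay on disk, and the backup and option handling are left to the caller.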
Owner
  • Include a table of MIME types which are acceptable for further processing.
    • Maybe this could be in a configuration file to make it more straightforward to change.
    • Any data URI instances found which don't match the table would be rejected, or each instance could have a go/no-go marker.
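
One possible shape for such a table, sketched in Python. The entries, the go/no-go values, the JSON layout and the names `MIME_TABLE`, `load_mime_table` and `check_mime` are all hypothetical, chosen only to illustrate the idea:

```python
import json

# Hypothetical default table: MIME type -> (file extension, go/no-go flag)
MIME_TABLE = {
    "image/jpeg": ("jpg", True),
    "image/png": ("png", True),
    "image/gif": ("gif", True),
    "image/svg+xml": ("svg", False),  # no-go here purely as an example
}


def load_mime_table(json_text):
    """Build a table from configuration text in an assumed JSON layout:
    {"image/png": {"extension": "png", "go": true}, ...}
    """
    raw = json.loads(json_text)
    return {
        mime.lower(): (entry["extension"], bool(entry["go"]))
        for mime, entry in raw.items()
    }


def check_mime(mime_type, table=MIME_TABLE):
    """Return the file extension for an accepted MIME type, else None."""
    extension, accepted = table.get(mime_type.strip().lower(), (None, False))
    return extension if accepted else None
```

A data URI whose type is missing from the table, or marked no-go, returns None and would be rejected.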
Author
Owner
  • If external images are referenced then download them locally
    • eg: <img src="https://example.com/image.png" /> would be saved locally as ${fileprefix}_${prefix}_${increment}.${extension}
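
A rough Python sketch of how this might work. The names `local_name` and `localise_images` are hypothetical, and taking the extension from the URL path (falling back to "bin") is an assumption; using the Content-Type header of the response would probably be more reliable.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse

# Matches <img ... src="http(s)://..."
EXTERNAL_IMG = re.compile(
    r'(<img\b[^>]*?\bsrc=)(["\'])(https?://[^"\']+)\2', re.IGNORECASE
)


def local_name(url, fileprefix, prefix, increment):
    """Build the local file name, taking the extension from the URL path."""
    # "bin" is an assumed fallback for URLs without a file extension
    extension = os.path.splitext(urlparse(url).path)[1].lstrip(".") or "bin"
    return f"{fileprefix}_{prefix}_{increment}.{extension}"


def localise_images(html, fileprefix, prefix="image", start=1,
                    fetch=urllib.request.urlretrieve):
    """Download each external image and point its src at the local copy.

    fetch(url, filename) defaults to urllib.request.urlretrieve; it is a
    parameter so the download step can be replaced, for example in tests.
    """
    count = start - 1

    def replace(match):
        nonlocal count
        count += 1
        url = match.group(3)
        name = local_name(url, fileprefix, prefix, count)
        fetch(url, name)
        return f"{match.group(1)}{match.group(2)}{name}{match.group(2)}"

    return EXTERNAL_IMG.sub(replace, html)
```

The `start` argument is there so the numbering can continue from the count of embedded images already extracted from the same file.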
Owner

A first release of extract_images was made available in early January 2025.
It does not contain a MIME types table at present.
The request for external image collection has not yet been added.

Reference: HPR/hpr-tools#3