A tool to extract data from html files #3
We cannot find a tool that would parse the html and extract the binary images, replacing the removed data stream with a link to the local image.
Requirement
- Requires one or more html files to be passed to the app.
- Optional prefix `-p --prefix` for the image, defaults to "image".
- Option `-nb --no-backup` to disable backup files.
- Option `-f --force` to enable overwriting of images.
- Loop through each file in turn:
  - Unless `-nb --no-backup` is set, create a backup of the html file with a `.bak` extension. Eg `index.html` would be copied to `index.html.bak`.
  - … `data:.*">` data streams
  - … `<img src=data:image.*">` data streams, then exit the entire program with an error.
  - Save each image as `${fileprefix}_${prefix}_${increment}.${extension}`, eg `index_image_1.jpg`. Where:
    - `${fileprefix}` is the `${html file name%.*}` of the current html file.
    - `${prefix}` is the provided `--prefix` string, or `image` if there is no `--prefix` provided.
    - `${increment}` is the current count of images parsed by the script.
    - `${extension}` is the media type of the image as discovered by the script.
  - If `-f --force` is passed, then overwrite the images.
  - Replace the data stream with `<img src="${fileprefix}_${prefix}_${increment}.${extension}">`, eg `<img src="index_image_1.jpg">`.
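The steps listed above can be sketched in Python roughly as follows. This is only an illustration of the requirement, not the code of any released tool: the names `extract_images`, `process_file` and `main` are hypothetical, only quoted base64 `src="data:image/..."` attributes are handled, the extension is guessed from the media subtype, and refusing to overwrite without `--force` is an assumed behaviour.

```python
import argparse
import base64
import re
import shutil
from pathlib import Path

# Matches <img ... src="data:image/<subtype>;base64,<payload>"> (quoted src only).
DATA_IMG = re.compile(
    r'(<img\b[^>]*?\bsrc=")data:image/([A-Za-z0-9.+-]+);base64,([^"]+)(")',
    re.IGNORECASE,
)


def extract_images(html, fileprefix, prefix="image"):
    """Return (rewritten_html, {filename: image_bytes})."""
    images = {}

    def swap(match):
        increment = len(images) + 1
        # Assumption: the media subtype doubles as the extension
        # (svg+xml -> svg, jpeg -> jpg).
        extension = match.group(2).lower().split("+")[0].replace("jpeg", "jpg")
        name = f"{fileprefix}_{prefix}_{increment}.{extension}"
        images[name] = base64.b64decode(match.group(3))
        return f"{match.group(1)}{name}{match.group(4)}"

    return DATA_IMG.sub(swap, html), images


def process_file(path, prefix="image", backup=True, force=False):
    """Extract the images of one html file and rewrite it in place."""
    path = Path(path)
    html = path.read_text(encoding="utf-8")
    # path.stem plays the role of ${fileprefix} (index.html -> index).
    new_html, images = extract_images(html, path.stem, prefix)
    if not force:
        # Assumption: without --force, existing image files are an error.
        clashes = [n for n in images if (path.parent / n).exists()]
        if clashes:
            raise FileExistsError(f"refusing to overwrite: {clashes}")
    if backup:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    for name, data in images.items():
        (path.parent / name).write_bytes(data)
    path.write_text(new_html, encoding="utf-8")
    return sorted(images)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Extract inline images from html files")
    parser.add_argument("files", nargs="+", help="one or more html files")
    parser.add_argument("-p", "--prefix", default="image")
    parser.add_argument("-nb", "--no-backup", action="store_true")
    parser.add_argument("-f", "--force", action="store_true")
    args = parser.parse_args(argv)
    for name in args.files:
        process_file(name, args.prefix, not args.no_backup, args.force)
```

Calling `main(["-p", "pic", "index.html"])` would, under these assumptions, copy `index.html` to `index.html.bak`, write `index_pic_1.<extension>` and so on next to it, and point each rewritten `<img>` tag at the new file.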
`data` URI instances found which don't match the table would be rejected, or each instance could have a `go/no-go` marker.

`<img src="https://example.com/image.png" />` would be saved locally as `${fileprefix}_${prefix}_${increment}.${extension}`.
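As a rough illustration of the table idea, a lookup keyed on the media type could accept or reject each `data` URI. The table entries and the names `MIME_EXTENSIONS`, `media_type` and `extension_for` below are hypothetical and chosen only for this sketch; a real table would need to be agreed on separately.

```python
# Hypothetical table: media type -> file extension. Entries are illustrative.
MIME_EXTENSIONS = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
    "image/webp": "webp",
    "image/svg+xml": "svg",
}


def media_type(data_uri):
    """Return the lowercased media type of a data URI, e.g. 'image/png'."""
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything between 'data:' and the first comma is the header.
    header = data_uri[5:].split(",", 1)[0]
    # Drop parameters such as ';base64' or ';charset=utf-8'.
    return header.split(";", 1)[0].strip().lower()


def extension_for(data_uri):
    """Return the extension for an accepted data URI, or None to reject it."""
    return MIME_EXTENSIONS.get(media_type(data_uri))
```

A caller could treat a `None` result as the "no-go" case and leave that instance untouched or stop with an error, depending on which behaviour is preferred.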
A first release of `extract_images` was made available in early January 2025. It does not contain a MIME types table at present.
The request for external image collection has not yet been added.