A tool to extract data from html files #3
We cannot find a tool that would parse the html and extract the binary images, replacing the removed data stream with a link to the local image.
Requirement
- Requires one or more HTML files to be passed to the app.
- Optional prefix `-p --prefix` for the image, defaults to "image".
- Option `-nb --no-backup` to disable backup files.
- Option `-f --force` to enable overwriting of images.
- Loop through each file in turn:
  - Unless `-nb --no-backup` is set, create a backup of the HTML file with a `.bak` extension. E.g. `index.html` would be copied to `index.html.bak`.
  - Find `data:.*">` data streams.
  - Find `<img src=data:image.*">` data streams.
  - [...], then exit the entire program with an error.
  - Save each image as `${fileprefix}_${prefix}_${increment}.${extension}`, e.g. `index_image_1_.jpg`, where:
    - `${fileprefix}` is the `${html file name%.*}` of the current HTML file.
    - `${prefix}` is the provided `--prefix` string, or `image` if no `--prefix` is provided.
    - `${increment}` is the current count of images parsed by the script.
    - `${extension}` is the media type of the image as discovered by the script.
  - If `-f --force` is passed, then overwrite the images.
  - Replace the data stream with `<img src="${fileprefix}_${prefix}_${increment}.${extension}">`, e.g. `<img src="index_image_1_.jpg">`.

Data URI instances found which don't match the MIME types table would be rejected, or each instance could have a go/no-go marker.

`<img src="https://example.com/image.png" />` would be saved locally as `${fileprefix}_${prefix}_${increment}.${extension}`.

A first release of `extract_images` was made available in early January 2025. It does not contain a MIME types table at present.
The request for external image collection has not yet been added.
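The requirement above can be sketched in Python, purely for illustration. This is not the code of the `extract_images` release. The regular expression, the small `EXTENSIONS` mapping, and the choice to treat an already existing image file without `--force` as the error case are all assumptions of this sketch. External image collection is not covered.

```python
import argparse
import base64
import re
import shutil
from pathlib import Path

# Assumed pattern: only <img ... src="data:image/<subtype>;base64,<payload>"> is handled.
DATA_IMG = re.compile(
    r'(<img\b[^>]*?\bsrc=)(["\'])data:image/([\w.+-]+);base64,([^"\']+)\2',
    re.IGNORECASE,
)

# Assumed, deliberately tiny subtype-to-extension mapping (hypothetical, not from the issue).
EXTENSIONS = {"jpeg": "jpg", "svg+xml": "svg"}


def extract_images(html_path, prefix="image", backup=True, force=False):
    """Write each embedded image next to the HTML file and relink it; return the count."""
    html_path = Path(html_path)
    if backup:
        # index.html is copied to index.html.bak
        shutil.copy2(html_path, html_path.with_name(html_path.name + ".bak"))
    fileprefix = html_path.stem  # same idea as ${name%.*}
    count = 0

    def relink(match):
        nonlocal count
        count += 1
        subtype = match.group(3).lower()
        extension = EXTENSIONS.get(subtype, subtype)
        target = html_path.with_name(f"{fileprefix}_{prefix}_{count}.{extension}")
        if target.exists() and not force:
            # Assumption: refusing to overwrite is the error case.
            raise FileExistsError(f"{target} exists, pass --force to overwrite")
        target.write_bytes(base64.b64decode(match.group(4)))
        quote = match.group(2)
        return f"{match.group(1)}{quote}{target.name}{quote}"

    text = html_path.read_text(encoding="utf-8")
    html_path.write_text(DATA_IMG.sub(relink, text), encoding="utf-8")
    return count


def main(argv=None):
    parser = argparse.ArgumentParser(description="Extract data URI images from HTML files")
    parser.add_argument("files", nargs="+", help="one or more HTML files")
    parser.add_argument("-p", "--prefix", default="image")
    parser.add_argument("-nb", "--no-backup", action="store_true")
    parser.add_argument("-f", "--force", action="store_true")
    args = parser.parse_args(argv)
    for name in args.files:
        try:
            extract_images(name, args.prefix, backup=not args.no_backup, force=args.force)
        except FileExistsError as exc:
            # Exit the entire program with an error.
            raise SystemExit(str(exc))
```

Calling `main(["index.html", "--prefix", "photo"])` would process a single file. Following the naming template literally, the first image found in `index.html` with the default prefix is written as `index_image_1.jpg`.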