Episode: 3596
Title: HPR3596: Extracting text, tables and images from docx files using Python
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3596/hpr3596.mp3
Transcribed: 2025-10-25 01:55:30

---

This is Hacker Public Radio Episode 3596 from Monday the 16th of May 2022. Today's show is entitled "Extracting text, tables and images from docx files using Python". It is part of the series A Little Bit of Python. It is hosted by Be Easy and is about 9 minutes long. It carries a clean flag. The summary is: in this episode I describe how I use two Python libraries to extract important data from docx files.

Hello, this is Be Easy once again with a new episode for Hacker Public Radio. In this episode I want to talk about a project that I'm working on, and the ask was this: we have this directory of a couple thousand docx documents, you know, Word files, Microsoft Word files in the docx format. Go through all these files, extract all the text, get all the images out, and for all the tables that are in there, turn the tables into CSV files.

So I thought to myself, how can I solve this problem? The first thing I thought of, because I've used it before, and I've actually talked about it on Hacker Public Radio before, is a command line tool called docx2txt. It's really simple: you just do docx2txt, input file, output file, and if you want the output on the screen, you just use a dash for the output file. The problem that I ran into with this was that, although it got the data out, it only gets the data out of the body, and I needed data out of the headers and the footers of the documents as well.

So I looked on, you know, PyPI for some options in Python on how to do this, because I figured Python seems like a language that would have this. And what do you know, it does. The first tool I found was a tool called python-docx. And similar to what openpyxl does for Excel files, python-docx is a library that helps you both create and read docx files.

So you instantiate a new object from the docx Document constructor. And then there are different attributes that you are presented with, even if that's not an existing document. If it is an existing document, you have the sections of the document. In the first section, or in every section of the document, you have a header and a footer. And also at that top layer, you have paragraphs. And inside of every paragraph, there is text. Inside of every header and every footer object, there are also paragraphs, and inside those paragraphs there's text.

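The object layout described above can be sketched in a few lines. This is a minimal illustration, not the code from the project: the function names collect_text and read_docx_text are made up for this example, and it assumes the python-docx package is installed when read_docx_text is called.

```python
def collect_text(document):
    """Gather paragraph text from the body, then each section's header and footer."""
    lines = [paragraph.text for paragraph in document.paragraphs]
    for section in document.sections:
        for part in (section.header, section.footer):
            lines.extend(paragraph.text for paragraph in part.paragraphs)
    return lines


def read_docx_text(path):
    """Open a docx file with python-docx and return its text lines.

    The import sits inside the function so this sketch loads even
    where python-docx is not installed.
    """
    from docx import Document  # provided by the python-docx package

    return collect_text(Document(path))
```

Calling read_docx_text("example.docx") (the file name is hypothetical) would return a list of strings, body paragraphs first, then header and footer paragraphs section by section.
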
And then the other thing, which was so amazing, was that inside of the headers, the footers, and the main document, there's also an attribute called tables, which finds all the tables in the document, exactly what I was looking for. And inside of every table, there are rows. And inside of every row, there are cells. And inside of every cell, there's text.

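The table, row, cell, text nesting maps naturally onto CSV rows. Here is a rough sketch of how that conversion might look, assuming objects shaped like python-docx tables (a rows attribute, each row with cells, each cell with text). The helper names and the output file naming scheme are assumptions for illustration only.

```python
import csv
from pathlib import Path


def table_to_rows(table):
    """Turn a table into a list of rows, each row a list of cell strings."""
    return [[cell.text for cell in row.cells] for row in table.rows]


def write_tables(tables, out_dir, stem):
    """Write each table to <out_dir>/<stem>_table<N>.csv and return the paths."""
    paths = []
    for index, table in enumerate(tables, start=1):
        path = Path(out_dir) / f"{stem}_table{index}.csv"
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(table_to_rows(table))
        paths.append(path)
    return paths
```

With python-docx, the tables argument could be something like document.tables for the body, or the tables attribute of a header or footer.
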
And so I still had this problem with the images. So I looked for another way to get the images out, you know, just using DuckDuckGo. And I ran into a project called docx2txt, with the number two: docx2txt. What that does is get all the text out of the document in one go, and all the images at the same time. And the way that you call it is really simple, and I was really amazed at how simple it is. All you have to do is say docx2txt.process and then the path of the file, and that returns the text of the file to you. And if you include a second argument in that .process call, a folder location, it will automatically put all the images into that folder. So I did a simple thing where I just said full_text equals docx2txt.process, the source file, the image destination, and then opened up a file object and wrote the full text into that file buffer to create the full text file.

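The call described above might look roughly like this. It is a sketch under assumptions: the function name extract_text_and_images and its parameters are invented for this example, the docx2txt package must be installed for the call to work, and the image folder is created first on the assumption that the library expects it to exist.

```python
from pathlib import Path


def extract_text_and_images(source, text_path, image_dir):
    """Save the full text of a docx file and drop its images into image_dir."""
    import docx2txt  # third-party package; imported here so the sketch loads without it

    Path(image_dir).mkdir(parents=True, exist_ok=True)
    # First argument: the docx file. Second argument: folder for extracted images.
    full_text = docx2txt.process(str(source), str(image_dir))
    Path(text_path).write_text(full_text, encoding="utf-8")
    return full_text
```
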
And like I said, there's more than one file; there are over 2,000 files. So put that in a for loop, done.

Because it was so many files, I thought I might be able to do one better. So I used the multiprocessing library. I keep on meaning to use concurrent.futures instead, but I've just used multiprocessing so many more times. It's really simple to do something that's that parallelizable. You just do pool equals multiprocessing.Pool, and then you just say pool.map, the function that you want to run, and the iterator that you want to run it over. And the iterator that I chose was the paths of all the items in the directory. So the main part of the function is only one, two, three, four lines. First I declare the input directory. Then I create a list of all the files in that directory using pathlib's glob. And then I create a multiprocessing pool and call pool.map on the big process file function that I have, with that list of files as the iterator. And I have a nice beefy 12-core processor; I think it took three or four minutes to process all those files.

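The four-line main part described above could be sketched as follows. The process_file body here is only a placeholder standing in for the real per-file work, and the names list_docx_files and main, as well as the "*.docx" pattern, are assumptions for illustration.

```python
import sys
from multiprocessing import Pool
from pathlib import Path


def process_file(path):
    """Placeholder for the per-file work (text, tables, images); returns the file name."""
    return path.name


def list_docx_files(input_dir):
    """List the docx files directly inside input_dir, in sorted order."""
    return sorted(Path(input_dir).glob("*.docx"))


def main(input_dir):
    files = list_docx_files(input_dir)
    # Pool() defaults to one worker process per available CPU core
    with Pool() as pool:
        results = pool.map(process_file, files)
    print(f"processed {len(results)} files")


# The guard keeps worker processes from re-running main when the module is imported
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run as a script with the input directory as its only argument. The concurrent.futures alternative mentioned above would use ProcessPoolExecutor and its map method in much the same shape.
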
So what I did for every file: I made a folder, and I decided to put in it the original docx file, plus the full text of the file in a txt file, then a CSV file for every table, and then a folder called img inside that folder with a file for every image.

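One way to set up that per-file folder layout is sketched below. The function names, the folder name taken from the file stem, and the CSV naming scheme are all assumptions made for this example rather than details given in the episode.

```python
import shutil
from pathlib import Path


def prepare_output(source, output_root):
    """Create <output_root>/<stem>/ holding a copy of the original and an img folder."""
    source = Path(source)
    folder = Path(output_root) / source.stem
    image_dir = folder / "img"
    image_dir.mkdir(parents=True, exist_ok=True)  # also creates the parent folder
    shutil.copy2(source, folder / source.name)    # keep the original next to the outputs
    text_path = folder / f"{source.stem}.txt"     # where the full text would be written
    return folder, text_path, image_dir


def table_csv_path(folder, index):
    """Path for the CSV of the index-th table (1-based); the naming is hypothetical."""
    return Path(folder) / f"table_{index}.csv"
```
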
And the whole thing, I don't know, with good formatting is, what is it, 89 lines of code. And I was able to get this whole thing done. And with a couple more lines of code, I can dump the whole thing in, like, an S3 bucket or something if I want to put it somewhere where it's accessible for other members of the team at my company.

So I thought that would be useful to you guys in the hacking community that like to tinker. You might have to tinker with Word documents. There are actually similar tools that you can use for doing the same thing with ODT files and ODS files, but I'm not going to go into that here because that's not what the show is about. And also, I know LibreOffice has built-in Python bindings that you can connect into and create documents too. You can do a lot of amazing things with those; I've seen some projects that use that as well.

So I just wanted to give a couple of shout-outs to some of those other projects. But yeah, that's what I have been working on. Like I said, it was, after a little bit of searching on DuckDuckGo, 80-something lines of code, extracting all the tables and all of the rows and all the images and all the text out of docx files. So thank you for listening. This has been another episode of Hacker Public Radio. This is Be Easy signing off, reminding you to use free software.

You have been listening to Hacker Public Radio at hackerpublicradio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International License.