hpr-knowledge-base/hpr_transcripts/hpr3596.txt

Episode: 3596
Title: HPR3596: Extracting text, tables and images from docx files using Python
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3596/hpr3596.mp3
Transcribed: 2025-10-25 01:55:30

---

This is Hacker Public Radio Episode 3596 from Monday the 16th of May 2022.
Today's show is entitled Extracting text, tables and images from docx files using Python.
It is part of the series A Little Bit of Python.
It is hosted by b-yeezi and is about 9 minutes long.
It carries a clean flag.
The summary is: in this episode I describe how I use two Python libraries to extract
important data from docx files.
Hello, this is b-yeezi once again with a new episode for Hacker Public Radio.
In this episode I want to talk about a project that I'm working on, and the ask was this:
we have this directory of a couple thousand docx documents, you know, Word files, Microsoft
Word files in the docx format.
Go through all these files, extract all the text, get all the images out, and for all the
tables that are in there, turn the tables into CSV files.
So I thought to myself, how can I solve this problem?
The first thing I thought of, because I've used it before, and I've actually talked about it
on Hacker Public Radio before, is a command line tool called docx2txt.
It's really simple: you just do docx2txt, input file, output file, and if you
want the output on the screen, you just use a dash for the output.
The problem that I ran into with this was that, although it got the data out, it only gets the data
out of the body, and I needed data out of the headers and the footers of the documents as well.
So I looked on, you know, PyPI for some options in Python on how to do this, because I figured,
you know, Python seems like a language that would have this.
And what do you know, it does.
The first tool I found was a tool called python-docx.
And similar to what openpyxl does for Excel files, python-docx is a library
that helps you both create and read docx files.
So you instantiate a new object from the docx Document constructor.
And then there are different attributes that you are presented with if that's not
an existing document.
If that is an existing document, you have sections of the document.
In the first section or in every section of the document, you have a header and a footer.
And also at that top layer, you also have paragraphs.
And inside of every paragraph, there is text.
Inside of every header and every footer object, there's also paragraphs.
And inside those paragraphs there's text.
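
A minimal sketch of the structure just described, assuming the python-docx package is installed (pip install python-docx). The function names paragraph_texts, all_text and read_docx are made up for illustration and are not from the episode; only Document, sections, header, footer, paragraphs and text come from python-docx.

```python
def paragraph_texts(container):
    """Return the non-empty paragraph texts of anything that has .paragraphs.

    This works for the document body as well as header and footer objects.
    """
    return [p.text for p in container.paragraphs if p.text.strip()]


def all_text(document):
    """Collect text from the body plus every section's header and footer."""
    parts = {"body": paragraph_texts(document), "headers": [], "footers": []}
    for section in document.sections:
        parts["headers"].extend(paragraph_texts(section.header))
        parts["footers"].extend(paragraph_texts(section.footer))
    return parts


def read_docx(path):
    """Open an existing .docx file and return its text grouped by location."""
    # Imported here so the helpers above can be used without python-docx.
    from docx import Document

    return all_text(Document(path))
```

Calling read_docx("report.docx") (a hypothetical file name) would return a dictionary with the keys body, headers and footers, each holding a list of paragraph strings.
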
And then the other thing which was so amazing was that inside of the headers, the footers,
and the main document, there's also an attribute called tables, which finds all the tables
in the documents exactly like I was looking for.
And inside of every table, there are rows.
And inside of every row, there are cells.
And inside of every cell, there's text.
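
The table, row, cell, text nesting maps quite directly onto CSV output. The following is a sketch under assumptions, not the code from the episode: the helper names and the table_N.csv naming scheme are hypothetical, and python-docx is assumed to be installed for the last function.

```python
import csv
from pathlib import Path


def table_to_rows(table):
    """Turn one table into a list of rows, each a list of cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def write_tables_as_csv(tables, out_dir, stem="table"):
    """Write one CSV file per table and return the paths that were written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, table in enumerate(tables, start=1):
        csv_path = out_dir / f"{stem}_{number}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(table_to_rows(table))
        written.append(csv_path)
    return written


def export_docx_tables(docx_path, out_dir):
    """Export tables from the body, headers and footers of a .docx file."""
    # Imported here so the helpers above can be used without python-docx.
    from docx import Document

    document = Document(docx_path)
    tables = list(document.tables)
    for section in document.sections:
        tables.extend(section.header.tables)
        tables.extend(section.footer.tables)
    return write_tables_as_csv(tables, out_dir)
```

One design note: cell text can contain line breaks and commas, which is why the sketch leaves quoting to the csv module instead of joining strings by hand.
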
And so I still had this problem with the images.
So I looked for another way to get the images out, you know, just using DuckDuckGo.
And I ran into a project called docx2txt, with the number two, docx2txt.
What that does is get all the text out of the document in one go
and all the images at the same time.
And the way that you call that is really simple.
And I was really amazed at how simple it is.
All you have to do is say docx2txt.process
and then the path of the file, and that returns the text of the file to you.
And if you include a second argument in that .process call, a folder location,
it will automatically put all the images into that folder.
So I did a simple thing where I just said full_text equals
docx2txt.process, the source file, the image destination,
and then open up a file object and write the full text into that file buffer
to create the full text file.
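
Roughly, that step could look like the sketch below. It assumes the docx2txt package is installed (pip install docx2txt); the function name extract_text_and_images, the img folder name and the .txt naming are illustrative choices, not taken from the episode.

```python
from pathlib import Path


def extract_text_and_images(source_file, out_dir):
    """Write the full text of a .docx file to out_dir and its images to out_dir/img."""
    # Imported here so the module loads even where docx2txt is not installed.
    import docx2txt

    out_dir = Path(out_dir)
    img_dir = out_dir / "img"
    # Create the image folder up front so docx2txt has somewhere to write.
    img_dir.mkdir(parents=True, exist_ok=True)

    # The second argument is the folder that receives the extracted images.
    full_text = docx2txt.process(str(source_file), str(img_dir))

    text_path = out_dir / (Path(source_file).stem + ".txt")
    text_path.write_text(full_text, encoding="utf-8")
    return text_path
```

For a single file this would be called as extract_text_and_images("report.docx", "out/report"), where both paths are placeholders.
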
And like I said, there's more than one file, there's over 2,000 files.
So put that in a for loop, done.
Because it was so many files, I thought I might be able to do one better.
So I use the multiprocessing library.
And I keep on meaning to use concurrent.futures instead.
But I've just used multiprocessing so many more times.
So it's really simple to do something that's that parallelizable.
You just do pool equals multiprocessing.Pool.
And then you just say pool.map, the function that you want to run,
and the iterable that you want to run it on.
And the iterable that I chose was the paths of all the items in the directory.
So the main part of the function is only one, two, three, four lines.
First, I declare the input directory.
And then I create a list of all the files in that directory
using pathlib's glob method.
And then I create a multiprocessing pool and call pool.map on the big process-file
function that I have, and that list of files.
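
A sketch of that main part, using only the standard library. The process_file body here is a stand-in that just returns the file name, and the directory name docs is a placeholder; the real function from the episode would do the text, table and image extraction.

```python
import multiprocessing
from pathlib import Path


def process_file(path):
    """Stand-in for the per-file work; here it only reports the file name."""
    return Path(path).name


def list_docx_files(input_dir):
    """Return every .docx file directly inside input_dir, sorted by path."""
    return sorted(Path(input_dir).glob("*.docx"))


def main(input_dir):
    files = list_docx_files(input_dir)
    # Pool() defaults to one worker process per CPU core.
    with multiprocessing.Pool() as pool:
        return pool.map(process_file, files)


if __name__ == "__main__":
    # "docs" is a placeholder directory name.
    print(main("docs"))
```

The function handed to pool.map has to be defined at module level so that worker processes can find it, and the __main__ guard matters on platforms where workers are started by re-importing the script. The same shape also works with concurrent.futures.ProcessPoolExecutor and its map method.
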
And I have a nice beefy 12-core processor.
I think it took three or four minutes to process all those files.
So what I did for every file: I made a folder and I decided to put the original docx file
plus the full text of the file in a .txt file, and then a CSV file for every table,
and then a folder called img inside that folder with a file for every image.
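
One way to set up that per-file folder, sketched with the standard library only. The function name prepare_output_folder, the use of the file stem as the folder name, and the dictionary keys are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def prepare_output_folder(source_file, output_root):
    """Create output_root/<stem>/ with an img subfolder and a copy of the original."""
    source_file = Path(source_file)
    folder = Path(output_root) / source_file.stem
    img_dir = folder / "img"
    img_dir.mkdir(parents=True, exist_ok=True)

    # Keep the original document next to everything extracted from it.
    copied = folder / source_file.name
    shutil.copy2(source_file, copied)

    return {
        "folder": folder,
        "docx": copied,
        "text": folder / f"{source_file.stem}.txt",
        "img_dir": img_dir,
    }
```

The returned paths can then be passed to whatever writes the text file, the CSV files and the images for that document.
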
And the whole thing, I don't know, with good formatting is, what is it, 89 lines of code.
And I was able to get this whole thing done.
And with a couple more lines of code, I can dump the whole thing in like an S3 bucket
or something if I want to put it somewhere where it's accessible for other members of
my team at my company.
So I thought that would be useful to you guys in the hacking community who like to tinker.
You might have to tinker with Word documents.
There are actually similar tools that you can use for doing the same thing with ODT files
and ODS files.
But I'm not going to go into that here because that's not what the show is about.
And also I know LibreOffice has built-in Python bindings that you can also connect
to and create documents with too.
And you can do a lot of amazing things with those. I've seen some projects that use that
as well.
So I just want to give a couple of shout-outs to some of those other projects.
But yeah, that's what I have been working on.
Like I said, it was, after a little bit of searching on DuckDuckGo, 80-something lines of code,
extracting all the tables and all of the rows and all the images and all the text out
of docx files.
So thank you for listening.
This has been another episode of Hacker Public Radio.
This is b-yeezi signing off to remind you to use free software.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out
how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive and
rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International
License.