Initial commit: HPR Knowledge Base MCP Server

- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 10:54:13 +00:00
commit 7c8efd2228
4494 changed files with 1705541 additions and 0 deletions
--- a/hpr_transcripts/hpr3596.txt
+++ b/hpr_transcripts/hpr3596.txt
@@ -0,0 +1,115 @@
+Episode: 3596
+Title: HPR3596: Extracting text, tables and images from docx files using Python
+Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3596/hpr3596.mp3
+Transcribed: 2025-10-25 01:55:30
+
+---
+
+This is Hacker Public Radio Episode 3596 from Monday the 16th of May 2022.
+Today's show is entitled Extracting Text, Tables and Images from docx Files Using Python.
+It is part of the series A Little Bit of Python.
+It is hosted by b-yeezi and is about 9 minutes long.
+It carries a clean flag.
+The summary is: in this episode I describe how I use two Python libraries to extract
+important data from docx files.
+Hello, this is b-yeezi once again with a new episode for Hacker Public Radio.
+Now this episode I want to talk about a project that I'm working on and the ask was this.
+We have this directory of a couple thousand docx documents, you know, Word files, Microsoft
+Word files in the docx format.
+Go through all these files, extract all the text, get all the images out and for all the
+tables that are in there, turn the tables into CSV files.
+So I thought to myself, how can I solve this problem?
+The first thing I thought of, because I've used it before, and I've talked about actually
+on Hacker Public Radio before, is a command line tool called docx2txt.
+It's really simple, you know, just do docx2txt, input file, output file, and if you
+wanted to do it to the screen, you just do dash for the output.
+The problem that I ran into with this was that, although it got the data out, it only gets the data
+out of the body and I needed data out of the headers and the footers as well of the documents.
+So I looked on, you know, PyPI for some options in Python on how to do this because I figured,
+you know, Python seems like a language that would have this.
+And what do you know, it does.
+The first tool I found was a tool called python-docx.
+And similar to what openpyxl does for Excel files, python-docx is a library
+that helps you both create and read docx files.
+So you instantiate a new object from the docx Document constructor.
+And then there are different attributes that you are presented with if that's not
+an existing document.
+If that is an existing document, you have sections of the document.
+In the first section or in every section of the document, you have a header and a footer.
+And also at that top layer, you also have paragraphs.
+And inside of every paragraph, there is text.
+Inside of every header and every footer object, there's also paragraphs.
+And inside those paragraphs there's text.
+And then the other thing which was so amazing was that inside of the headers, the footers,
+and the main document, there's also an attribute called tables, which finds all the tables
+in the documents exactly like I was looking for.
+And inside of every table, there are rows.
+And inside of every row, there are cells.
+And inside of every cell, there's text.
+And so I still have this problem with the images.
+So I looked for another way to get the images out, you know, just using DuckDuckGo.
+And I ran into a project called docx2txt, with the number two, docx2txt.
+What you do with that is it'll get all the text out of the document in one go
+and all the images at the same time.
+And the way that you call that is really simple.
+And I was really amazed at how simple it is.
+All you have to do is say docx2txt.process
+and then the path of the file, and that returns the text of the file to you.
+And if you include a second argument in that .process call, a folder location,
+it will automatically put all the images into that folder.
+So I did a simple thing where I just said full text equals
+docx2txt.process, the source file, the image destination,
+and then open up a file object and write the full text into that file buffer
+to create the full text file.
+And so like I said, there's more than one file, there's over 2,000 files.
+So put that in a for loop, done.
+Because it was so many files, I thought I might be able to do one better.
+So I used the multiprocessing library.
+And I keep on meaning to use concurrent.futures instead.
+But I've just used multiprocessing so many more times.
+So it's really simple to do something that's that parallelizable.
+You just do pool equals multiprocessing.Pool.
+And then you just say pool.map, the function that you want to run,
+and the iterator that you want to run it on.
+And the iterator that I chose was the path of all the items in the directory.
+So the main part of the function is only one, two, three, four lines.
+First I say, I declare the input directory.
+And then I create a list that describes all the files in that directory
+using the pathlib glob method.
+And then I create a multiprocessing pool and call pool.map on the big process file
+function that I have, and that iterator of the list of files.
+And I have a nice beefy 12-core processor.
+I think it took three or four minutes to process all those files.
+So what I did for every file, I made a folder and I decided to put the original .docx file
+plus the full text of the file in a .txt file, and then a CSV file for every table,
+and then a folder called img inside that folder with a file for every image.
+And the whole thing, I don't know, with good formatting is, what is it, 89 lines of code.
+And I was able to get this whole thing done.
+And with a couple more lines of code, I can dump the whole thing in like an S3 bucket
+or something if I want to put it somewhere where it's accessible for other members of
+the team of my company.
+So I thought that would be useful to you guys and the hacking community that like to tinker.
+You might have to tinker with Word documents.
+There's actually similar items that you can use for doing the same thing with ODT files
+and ODS files.
+But I'm not going to go into that here because that's not what the show is about.
+And also I know LibreOffice has built-in Python bindings that you can also connect
+to and create documents too.
+And you can do a lot of amazing things with those. I've seen some projects that use that
+as well.
+So I just want to give a couple of shout-outs to some of those other projects.
+But yeah, that's what I have been working on.
+Like I said, it was, after a little bit of searching on DuckDuckGo, 80-something lines of code,
+extracting all the tables and all of the rows and all the images and all the text out
+of docx files.
+So thank you for listening.
+This has been another episode of Hacker Public Radio.
+This is b-yeezi signing off to remind you to use free software.
+You have been listening to Hacker Public Radio at HackerPublicRadio.org.
+Today's show was contributed by an HPR listener like yourself.
+If you ever thought of recording a podcast, then click on our contribute link to find out
+how easy it really is.
+Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive and
+rsync.net.
+Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International
+License.
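Editor's note on hpr3596.txt: the episode describes walking a docx document with python-docx (body, section headers and footers, then tables, rows, cells and text) and writing each table out as CSV. The host's actual script is not included in the transcript, so the following is only a minimal sketch of that traversal under stated assumptions. The helper names (`table_to_rows`, `iter_tables`, `export_tables`) and the `table_N.csv` file naming are hypothetical. The attributes used (`Document.tables`, `Document.sections`, `section.header`, `section.footer`, `table.rows`, `row.cells`, `cell.text`) are the ones python-docx exposes.

```python
import csv
from pathlib import Path


def table_to_rows(table):
    """Flatten a table into a list of rows, each a list of cell strings.

    Works on any object shaped like a python-docx Table:
    table.rows -> row.cells -> cell.text.
    """
    return [[cell.text for cell in row.cells] for row in table.rows]


def iter_tables(document):
    """Yield tables from the body, then from each section's header and footer."""
    yield from document.tables
    for section in document.sections:
        yield from section.header.tables
        yield from section.footer.tables


def export_tables(docx_path, out_dir):
    """Write every table of one docx file to its own CSV (hypothetical helper)."""
    import docx  # third-party package: pip install python-docx

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    document = docx.Document(str(docx_path))
    written = []
    for index, table in enumerate(iter_tables(document), start=1):
        csv_path = out_dir / f"table_{index}.csv"  # assumed naming scheme
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(table_to_rows(table))
        written.append(csv_path)
    return written
```

The python-docx import sits inside `export_tables` so the two pure helpers can be exercised without the library installed. One caveat of this sketch: when a section's header or footer is linked to the previous section, the same tables may be yielded more than once.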