Initial commit: HPR Knowledge Base MCP Server

- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 10:54:13 +00:00
commit 7c8efd2228
4494 changed files with 1705541 additions and 0 deletions
--- a/hpr_transcripts/hpr3998.txt
+++ b/hpr_transcripts/hpr3998.txt
@@ -0,0 +1,343 @@
+Episode: 3998
+Title: HPR3998: Using open source OCR to digitize my mom's book
+Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3998/hpr3998.mp3
+Transcribed: 2025-10-25 18:27:32
+
+---
+
+This is Hacker Public Radio Episode 3998 for Wednesday, the 29th of November 2023.
+Today's show is entitled, Using open source OCR to digitize my mom's book.
+It is hosted by Delta Ray and is about 31 minutes long.
+It carries a clean flag.
+The summary is, how I used open source tools such as gphoto and the OCR software
+Tesseract to digitize pages.
+Hello, I'm Delta Ray and welcome to Hacker Public Radio.
+Today I'm going to talk about a bucket list item that I was recently able to cross off
+my list. A long time ago, back in the 1990s, in the 20th century,
+my mom wrote a book, and she passed away a few years ago, and this book was never published,
+unfortunately, but it's been on my list for a while now to try to get it published, get
+it digitized and so on.
+I've been able to use several open source tools to help me do that, at least to get it digitized.
+The problem was that when she wrote the book, it was about 1994 or so when she started
+and she used a computer that ran Windows 3.1 and a piece of software that is no longer
+supported and at the time was honestly not very common.
+It was something that was bundled with the computer and of course, since it wasn't very
+common, it used a document format that couldn't be converted into something more modern.
+So she was a bit stubborn in that regard.
+We tried to get her WordPerfect and have her use that, but she went ahead and used this
+other program.
+Fortunately, she actually printed the whole thing out. Over the next four years she worked
+on this book and printed out a few different copies.
+I have one of the copies and it's 338 pages.
+So trying to digitize all this by typing it all in, well, I tried that at first, and it
+was going to take probably a month or two of typing about an hour a day or so.
+I mean, I can type fairly fast, but it was just a lot and I also found that even though
+I could catch most of my mistakes, I still ended up with a few mistakes when I was typing
+it in.
+So I thought, well, the nice thing about how this is printed out is that it's just a page
+per sheet, just 8.5x11 paper and I can just scan it, but using a scanner is often kind
+of clunky and, you know, you have to wait for the scanner, scanning bed to run the scanner
+over it and then wait for the image to be imported and stuff like that.
+It's a bit slow, I think, and I needed my workflow to be a bit faster than that.
+So I decided to try to use, obviously I need to use like an optical character recognition
+system to be able to, you know, turn the image of the words into actual text, like digital
+text that can be copied and pasted into a word processor.
+And so, you know, instead of using a scanner, I have a digital SLR camera that has a USB
+output on it and there's some programs that you can use on Linux at least for accessing
+that.
+One of them is called gphoto2, we'll get to that in a second, but basically I thought if
+I could set up my camera instead of having like a top-down view so that, you know, I have
+to like mount it overhead and lay the pages flat on a table or something like that.
+I thought, why don't I just use like a document stand, which I already had a document stand,
+you know, something that you'd get at like an office supply store and it would hold your
+document up at like an 80 degree angle or something like that.
+And then a tripod for the camera and kind of aim the camera, you know, just a few, from
+a few feet away, aim the camera at the page through the open air and light up the document,
+you know, well enough so that it gets a really good shot so that I can be more certain
+that the OCR is going to have a really good input to work with and have less errors.
+So I set that up pretty much right in front of me right next to the computer so that, you
+know, like on the right side of me was the tripod with the camera on it pointing to the
+left side of me, which is on the table where the document reader is and the computer is
+right in front of me so I can just lay out the book pages right in front of me and quickly
+flip, you know, one page to the document reader, I'm sorry, the document stand and then
+pull the previous page down to like a, you know, finished stack of pages.
+And then, you know, with the computer right in front of me, I can just trigger it to take
+the next picture, import it and so on.
+So like I said, the open source tools for this are, you know, available.
+The first one is called gphoto2 and this is software for working with digital cameras
+over a USB connection, maybe some other connections too, but I've been using it over USB and just
+had a USB cable to, I think it was a mini USB port because that's what this older camera
+that I have uses, but you could probably use any type of USB connection.
+So it can trigger, gphoto2 can actually trigger the camera shutter and transfer the photo
+that way I don't have to like, I can just leave the camera turned on and in place and
+instead of using battery on the camera and then running out of battery and then having to change
+the battery, I actually have an AC adapter for the camera so that, you know, I can plug
+in the AC adapter to keep it running and so on.
+So I had to do a lot of alignment, you know, with the camera and do some testing to make
+sure I have it lined up properly and one way that I did that was to take a picture using
+gphoto2 and then open it up in GIMP or some photo editing program that has guidelines
+that you can pull down and then you can look at the page of text and pull the guideline
+to like the bottom of a line of text and make sure from left to right that it lines up properly
+and there's no, you know, it doesn't like go up or go down and make sure that the margin
+along the left side is also all lined up vertically.
+That way you can be sure that the page on your document stand is perpendicular, you know,
+almost precisely with the camera lens which will just increase the accuracy of the OCR.
+So you want to have a good setup so you can have good results.
+But gphoto, you know, has some options for triggering the shutter and then transferring
+the photo and then from there I just, you know, saved the photo to a file called from
+camera.jpeg and then there's another piece of software called ImageMagick which has
+some options for cropping the photo and it's, you know, ImageMagick is a command line tool
+and it has a program called convert and with this I just, you know, run convert space
+from camera.jpeg, space, dash crop and then I give it, you know, the crop resolution.
+So in this case I had to, you know, take it into GIMP and find the right pixels, you
+know, the right dimensions for the crop using the guide.
+So it's, in this case it was 2275x meaning times, you know, 1700 plus 785 plus 560.
+So in other words, the width and height dimensions of the box that's going to be the crop and
+then the offset from the upper left and then, and then the save file name, you know, output
+dash crop.jpeg to further use it from there.
+So the crop is going to make it so that, you know, it's, it's nice and cut.
+The image has been cropped down to just the page and stuff like that, that way any of
+the background that was in the picture can be removed that might confuse the OCR software.
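The capture and crop steps described above could be sketched roughly as follows. The file names `from-camera.jpg` and `output-crop.jpg` are assumed spellings of the names spoken in the episode, and the helper names `crop_geometry`, `capture_page` and `crop_page` are hypothetical; this is an illustration, not the script from the show notes.

```shell
# Sketch only: helper and file names are assumptions, not the speaker's script.

# Build an ImageMagick geometry string: WIDTHxHEIGHT+XOFFSET+YOFFSET,
# where the offsets are measured from the upper-left corner of the image.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Trigger the shutter over USB and download the photo to the given file name.
capture_page() {
    gphoto2 --capture-image-and-download --force-overwrite --filename "$1"
}

# Crop the photo down to just the page so the background cannot confuse the OCR.
crop_page() {
    convert "$1" -crop "$(crop_geometry 2275 1700 785 560)" "$2"
}

# Values quoted in the episode: a 2275x1700 box, offset 785 across and 560 down.
crop_geometry 2275 1700 785 560    # prints 2275x1700+785+560
```

With gphoto2 and ImageMagick installed and a camera attached, a single page would then be handled with `capture_page from-camera.jpg && crop_page from-camera.jpg output-crop.jpg`. The crop numbers only fit one particular camera and document stand position, so they would need to be measured again for any other setup.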
+And then it's pulled into the OCR software, and there's a piece of open source OCR software
+called Tesseract, it's T-E-S-S-E-R-A-C-T, I'll put this information in the show notes.
+But Tesseract is a program that's been around for a long time, it was originally developed
+by Hewlett Packard in the 80s and 90s and then made open source in 2005.
+I guess it was worked on by HP in partnership with the University of Nevada, Las Vegas, I guess
+UNLV.
+They might have needed OCR software and HP developed it for them or something, but fortunately
+for us, it was made open source and now it's, you know, freely available just by, you
+know, in your program repos on your Linux distribution.
+And it works really well, I think, you know, I had, I didn't have too many problems with
+errors and for the most part I was able to rely on it.
+I didn't really use the configuration options to, you know, an extent that might have cut
+down on the errors or something, I probably could have done that, like, set the language
+and maybe excluded some characters or something.
+One reason why I didn't want to use excluding characters was because even though I've read
+my mom's book, I wasn't sure whether she used characters like a slash, you know, like
+this or that using a slash there or using like a vertical bar.
+These are the two characters that I ended up running into sometimes.
+Like for instance, if she said I'll, like I apostrophe LL and if it was italics, it would
+actually sometimes turn the I into a forward slash and, you know, that's obviously wrong.
+But if I had programmed it to exclude, I can easily catch those types of mistakes by doing
+like a search or something or using the spell checker.
+But if it had replaced like a slash that was literally used in the book by just excluding
+or something, I might have lost that information and not been able to easily find it.
+So that's why I didn't actually program to exclude certain characters.
+So anyways, Tesseract takes in the image and then it just basically, you can say Tesseract
+space, image name, space and then I used a dash to basically say just write the data
+to standard out.
+You can also, you know, have it like output it to a file name or something like that.
+Actually in my script, I wrote a bash script for actually running all this stuff to make
+it easier.
+But you can have it, you know, write out to a file name.
+So I just said like OCR dash text.
+But if you want to, you can just put a dash there to say write to standard out so it
+just shows up.
+That's good for when you're testing it.
+And then from Tesseract, I used the grep program with the dash capital E option for extended
+regular expressions and also dash capital P for the Perl compatible regular expressions
+because what I wanted to do on each page, each page of the book has like a header and
+a footer.
+The header has the title of the book and the footer has, you know, my mom's name and also
+page numbers and stuff.
+And so the way she did page numbers was to put dash number dash.
+And I thought, well, I don't, you know, when I copy this into, I used LibreOffice, or
+LibreOffice Writer, for actually doing the word processing.
+And what I wanted to do was make a script that accelerates my workflow so that I can
+do this very quickly.
+And so I wanted to basically scan a page, have it do all the necessary processing so that
+all I have to do after it's done scanning the page is just paste right into LibreOffice.
+And so I wanted to remove the header and the footer if it matched something like I, you
+know, I set up so that it uses the dash V option in grep and gave it some expressions
+to like remove blank lines because my mom wrote the book in double-spaced
+lines.
+So there's like a blank line in between each line.
+And of course, I can just remove this by using a regular expression like caret and then
+dollar, which means just match a blank line and remove that from the output.
+And then, you know, another expression with the dash E option and say, you know, the
+title of the book and another dash E and give it my mom's name.
+I want to give it as much context as possible.
+So I use a caret before the title and then a dollar, you know, maybe a dollar sign
+after it to catch just the title of the book on a line by itself.
+What you want to keep in mind is, does that appear anywhere else, you know, in the book
+or something, you don't want to end up excluding it if it's like in the middle of a page
+or something like that.
+And of course, the same thing with anything that you're trying to match here, it's like,
+you want to make sure that you catch all the corner cases and consider all those.
+So giving grep as much context as possible is a good idea.
+And then, let me see.
+And then also removing like the page number, so, you know, use caret.
+The caret is basically shift six on U.S. keyboards; it's that, you know, it looks
+like a greater than sign that's been turned up on its side.
+And then backslash space to, to say, match a space and then an asterisk to say zero or
+more spaces.
+So basically, if there were spaces that popped up before, it's going to match those.
+And then dash and square brackets, zero dash nine, closing square brackets, plus, and
+then another dash.
+That matches the page number, and it could be any arbitrary page number, and it's going
+to skip that line basically in the output, so it won't display that.
+And then pipe that into grep again, but this time using dash capital P. And the reason
+why I wanted to pass it through grep again is because sometimes I would end up with this
+form feed control character, it's like the hexadecimal code for it is zero C. And I guess
+Tesseract would insert that in cases where it found like multiple blank lines or
+something it would insert a form feed.
+And I didn't want that to actually show up in the output, especially when I passed it
+into the copy paste program, which I'll talk about in a second.
+So I passed it into grep dash capital B, P, I'm sorry, grep, space, dash capital P, space
+dash V to exclude what it matches, space, caret, backslash, X, which sets up the hexadecimal
+matching, and then zero C, dollar sign.
+And so that will make it so it skips a line that has just a form feed on it.
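The two-stage grep filter described above might look something like this sketch. The title and author strings are placeholders, the function name `clean_ocr_text` is hypothetical, and the exact patterns in the original script may differ. The second stage relies on `grep -P`, which needs a GNU grep built with PCRE support.

```shell
# Sketch only: placeholder header/footer strings and a hypothetical helper name.
BOOK_TITLE='My Book Title'
AUTHOR_NAME='Jane Author'

# Read OCR text on stdin and drop the lines that are not body text.
# First grep (-E, extended regex; -v, print only non-matching lines):
#   ^$            blank lines left over from the double spacing
#   ^TITLE$       the header, only when it is alone on a line
#   ^AUTHOR$      the footer name, only when it is alone on a line
#   ^ *-[0-9]+-$  a page number such as "-12-", with optional leading spaces
# Second grep (-P, Perl-compatible regex): a line holding only a form feed (0x0C).
clean_ocr_text() {
    grep -E -v \
        -e '^$' \
        -e "^${BOOK_TITLE}\$" \
        -e "^${AUTHOR_NAME}\$" \
        -e '^ *-[0-9]+-$' |
    grep -P -v '^\x0C$'
}

# Small demo with made-up page content.
printf '%s\n' 'My Book Title' '' 'First line.' '' 'Second line.' '-12-' | clean_ocr_text
```

In a real run the input would come from the OCR step, for example `tesseract output-crop.jpg - | clean_ocr_text`. Anchoring each pattern with `^` and `$` is what keeps a mention of the title in the middle of a paragraph from being removed. A title containing regex metacharacters would need escaping first.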
+And so this first grep pipeline is for the purpose of displaying to the screen so that
+I can see what it matched in the OCR software.
+Then I run it again, basically the same expression, a second time.
+And what I'm doing here is passing this into the xsel program, X, S, E, L, space,
+dash B. And what that does is in Linux on an X11 desktop, it'll take the input from
+standard input and put it into your copy paste buffer.
+This is a really handy program, especially in like a script like this, so that I don't
+have to select the text on the screen, and then get the mouse selected correctly or something
+like that, just basically removes the errors that might happen and speeds up the process.
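A minimal sketch of the clipboard step follows. The helper name `copy_or_print` is hypothetical, and the fallback to plain printing is an addition for illustration so the function still does something useful without an X11 session; the episode only describes piping into `xsel -b`.

```shell
# Sketch only: copy_or_print is a hypothetical helper, and the fallback
# branch is not part of the workflow described in the episode.
copy_or_print() {
    if command -v xsel >/dev/null 2>&1 && [ -n "${DISPLAY:-}" ]; then
        # -b selects the CLIPBOARD selection, -i reads its new content from stdin.
        xsel -b -i
    else
        # No xsel or no X11 display: pass the text through unchanged.
        cat
    fi
}
```

It would sit at the end of the filtering pipeline, for example `tesseract output-crop.jpg - | copy_or_print`, after which the text can be pasted into the word processor with Ctrl+V.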
+So from there, I can just do a control V to paste my copy paste buffer directly into
+LibreOffice, which is pretty handy.
+This script, I'm going to put the script in the show notes.
+So if you're not following all this, just orally, then you'll be able to read along.
+And then the next thing I do is detect the actual page number.
+And the reason why I do this is because one of the things that the script does is it takes
+the cropped image and copies it to an image file name with the page number in it.
+So that later as I'm going through and kind of editing the digitized text, I can compare
+it with the scan of the page to check for formatting problems and also to double check that
+the text actually matches properly and stuff.
+So I set up a variable called detected page and then I give it a process substitution,
+operator, so dollar, parentheses, space, tail space, OCR-text.txt to get the last 10 lines
+of the OCR text output, pipe that into Grep and then turn off the color, which you don't
+really need to do, but I do it anyways explicitly.
+And then dash capital P, because I'm going to use a special feature of the Perl regular
+expressions called look ahead and look behind assertions.
+And then dash O, which means only give me the matched output from the regular expression,
+only give me the text that matches the regular expression.
+And then dash E to give it the expression.
+And here's where I use something called look ahead and look behind assertions.
+If you're not familiar with more advanced regular expressions, Perl regular expressions
+give you the option to basically match context around what you want to match without adding
+that context to the match itself.
+So in this case, the page numbers have a dash before and after the number.
+And I don't want that to be part of the match.
+And so I can use this special syntax, which is basically, it's kind of complicated.
+You'll see it in the script, but it's parentheses, question mark, greater than equals
+caret dash to match the dash, a closed parentheses, and that matches the dash before.
+And then I give it, you know, the grouping operator to match the number in between.
+And so it's like square brackets, zero dash nine, closing square brackets, plus that's
+going to match the numbers, any arbitrary number.
+And then followed by the look ahead assertion for the dash after the number.
+And so that basically will end up printing out just the page number itself without the
+surrounding dashes.
+I could also use a sed expression or something to do the same thing, but I thought I'd use
+look ahead, look behind assertions instead.
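The page detection described above can be sketched as below. In Bash the `$( ... )` form is called command substitution, a lookbehind is written `(?<=...)` and a lookahead `(?=...)`. The helper name `detect_page` and the sample file contents are assumptions for illustration.

```shell
# Sketch only: detect_page is a hypothetical helper name.
# Print the page number found in the last 10 lines of an OCR text file.
#   (?<=^-)  lookbehind: a dash at the start of the line, not included in the match
#   [0-9]+   the page number itself
#   (?=-)    lookahead: a dash right after the number, not included in the match
# -o prints only the matched text, so the surrounding dashes are left out.
detect_page() {
    tail "$1" | grep --color=never -P -o -e '(?<=^-)[0-9]+(?=-)'
}

# Demo on a throwaway file standing in for the OCR output.
ocr_file="$(mktemp)"
printf '%s\n' 'Last line of body text.' 'Jane Author' '-42-' > "$ocr_file"
detected_page="$(detect_page "$ocr_file")"
echo "$detected_page"    # prints 42
rm -f "$ocr_file"
```

Because the lookbehind is anchored with `^`, a page number that the OCR indented with spaces would not match, and a body line that happens to start with the same dash-number-dash shape would produce a second match. That is one reason a confirmation prompt after detection is useful.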
+And so detected page ends up with the detected page number.
+I pass that into an if statement that makes sure that it's actually a number.
+And if there's a number there, it asks you a question, what page is this, followed
+by, you know, showing the detected page in square brackets to kind of show that that's
+the default if you just press enter.
+Or you could actually type in the detected page or you could type in a page number of your
+own.
+So, you know, for instance, if it didn't get the number right or something like that,
+you can correct it there.
+And then, you know, giving the base case of it, it couldn't find, you know, there was
+no number detected in the detected page.
+So it's going to just say error couldn't determine the page and exit out.
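The confirm-or-override prompt could be sketched like this. The function name `confirm_page` and the exact wording of the messages are assumptions, and the sketch returns a non-zero status where the described script exits.

```shell
# Sketch only: confirm_page is a hypothetical helper name.
# $1 is the detected page number; the user's answer is read from stdin.
confirm_page() {
    detected="$1"
    # Reject an empty value or anything containing a non-digit.
    case "$detected" in
        ''|*[!0-9]*)
            echo "Error: couldn't determine the page" >&2
            return 1
            ;;
    esac
    # Show the detected value in brackets as the default answer.
    printf 'What page is this? [%s] ' "$detected" >&2
    IFS= read -r answer || answer=""
    # An empty answer (just pressing Enter) keeps the detected value.
    printf '%s\n' "${answer:-$detected}"
}
```

Writing the prompt to standard error keeps it out of the captured value, so the result can be collected with something like `page="$(confirm_page "$detected_page")" || exit 1`.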
+So then at the end of the script, you know, it's going to run convert again on the output
+cropped page to basically, I don't need like the full size image, I can scale it down a
+little bit and also reduce the quality a little bit in the JPEG image to save space.
+That way I end up with like a, you know, 500 kilobyte image per page instead of a four
+megabyte image per page.
+So, you know, 10 time size reduction almost.
+So I run the convert command again with the dash quality option, space 85 to give it
+like an 85 JPEG compression, followed by dash resize space 80% to reduce the size by
+80, down to 80, 80% of the original size, so like a 20% reduction.
+And then followed by the input file name and the output file name, but this time with
+the title of the book, dash, the dollar page variable, but putting the word page for the
+variable name inside curly braces, that way, you don't run into a case where any text that
+comes after that in the file name doesn't become part of the variable name, so you're kind
+of isolating the variable.
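A sketch of the archiving step is shown below. The helper names, the `mybook` title and the `.jpg` extension are assumptions. The options are placed after the input file here, which is the order ImageMagick's command-line processing applies them in, whereas the episode describes giving them first.

```shell
# Sketch only: helper names, the title and the .jpg extension are assumptions.

# Build the archive file name, e.g. "mybook-42.jpg".
page_image_name() {
    printf '%s\n' "${1}-${2}.jpg"
}

# Re-encode at JPEG quality 85 and scale to 80% of the original dimensions.
# $1 = cropped input image, $2 = book title, $3 = page number.
shrink_page() {
    convert "$1" -resize 80% -quality 85 "$(page_image_name "$2" "$3")"
}

page_image_name mybook 42      # prints mybook-42.jpg

# The braces mark where the variable name ends. Without them the shell would
# look for a variable called page_small instead of page.
page=42
echo "scan_${page}_small"      # prints scan_42_small
```

A typical call would be `shrink_page output-crop.jpg mybook "$page"`, which keeps a smaller per-page image for later proofreading against the digitized text.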
+And then just printing done at the end of it to say, you know, this is all done.
+And then there's little, like, error detection stuff, and also this is just a one off script.
+So you know, I'm not really super concerned about general use cases and stuff, but the
+script may be useful for you if you want to do something similar in helping you get
+that set up.
+So from there, I took it in.
+So every time I would need to scan a page, I just hit, I just ran this get page dot
+sh program again.
+And I'd hear the camera click to take a picture and then wait a few seconds for Tesseract
+to do its analysis.
+And then it would print out the text on the page and I can just take a quick look comparing
+the output of Tesseract with the actual physical page.
+And then once I'm satisfied, just pressing control V into LibreOffice on the next page
+and making sure that it pasted in OK.
+And then creating a page break to make sure that the page numbers are, you know, going
+to match up that way, I can be sure that I'm on the right page and stuff like that.
+Because after doing 338 pages, it, you know, it's kind of mundane and everything
+and I could end up making a mistake.
+In the end, I was able to scan all 338 pages in about six hours worth of
+time over a two day period, which was a substantial improvement over, you know, having to take
+like two months of typing or something.
+And I was able to get done, you know, that was the most important thing.
+This is something I've wanted to do for like 10 years or so.
+And so I'm happy that I was able to finally do it.
+You know, the funny thing is that about page numbering in LibreOffice is, word processors,
+you know, have always been kind of a hot mess.
+They've, I've been using word processors since the 1980s and formatting issues and the
+way you do certain things has always been complicated and everything.
+And one of the things I needed to do was make the book document so that the page numbering
+starts a couple pages in because, you know, you have the title page and the dedication
+page and the table of contents and stuff like that.
+And so you don't want to start the page numbering on page one of the document.
+You want to start like in chapter one and doing that turned out to be, you know, a bit
+complex of a process.
+And when I just searched, um, LibreOffice, uh, on, on the search engine for LibreOffice,
+how to restart the page numbering, I came across a LibreOffice page where, you know,
+the guy said, well, it's not as complex as you think, uh, follow these 10 steps or whatever.
+And you know, it's, it's like you go through it and you're like, oh, this is kind of complicated
+or whatever.
+But I, you know, I wasn't too surprised that it was, it was complicated because word
+processors are complicated.
+But that didn't stop some people who are angry at, uh, open source being complicated
+or they're, they're perceiving open source to be complicated, uh, from complaining about
+it and criticizing the LibreOffice developers and everything.
+And so I was like, I think this is maybe a bit unfair or something.
+Um, I ended up going to a YouTube video about how to renumber stuff in LibreOffice.
+And there in the comments, I found a bunch of people complaining about how it, you know,
+the user interface of LibreOffice is not helpful and, and, you know, this could be a lot
+easier.
+You, the developers need to think about the users and so on.
+So for comparison, I thought, well, how hard is this to do in, um, in Microsoft Word?
+And I found another YouTube video that was actually a little bit longer than the LibreOffice
+video.
+And, um, I watched the process for how to do this same thing in Word.
+And it actually takes more steps and is longer and is more complicated.
+But, hypocritically, you know, somewhat, I look down the comments and the, uh, the comments
+are all like, thank you for showing me this.
+Oh, this is great.
+Not criticizing that it would, you know, it takes a long time to do this in Microsoft Word.
+And they somehow hold Microsoft Word to a different standard, um, thinking that somehow
+it's better or like it's okay that it's complicated because I guess they paid for it.
+So they feel like, oh, this is just the way it is because Microsoft Word is the better
+tool or the, you know, the status quo tool.
+If anything, uh, the fact that people are, you know, talking about how LibreOffice should
+be more like Microsoft Word shows where the problem is, you know, we have a monopoly here.
+We have a problem with the monopoly, um, anyways, so I was happy I was able to come up with
+this modular setup to make the whole workflow of, uh, digitizing my, uh, mom's book a lot
+easier and I hope you were able to, uh, get some use out of this.
+So thanks.
+If, um, if you enjoyed this and you have some stories of your own or, or stuff of your
+own to contribute, I encourage you to, uh, download a program like, um, Audacity or SoX
+or, or something and find a microphone or use your phone, uh, you can even call into
+hacker public radio to record a show and tell, tell the world about your own, uh, story,
+your own ideas and, and your own adventures, uh, um, in, um, in computing and, and other
+things.
+Okay.
+So, until next time, talk to you later, bye.
+You have been listening to Hacker Public Radio at HackerPublicRadio.org.
+Today's show was contributed by an HPR listener like yourself. If you ever thought of recording
+a podcast, then click on our contribute link to find out how easy it really is.
+Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive
+and rsync.net.
+Unless otherwise stated, today's show is released under Creative Commons, Attribution 4.0 International