Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
343
hpr_transcripts/hpr3998.txt
Normal file
@@ -0,0 +1,343 @@
Episode: 3998
Title: HPR3998: Using open source OCR to digitize my mom's book
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3998/hpr3998.mp3
Transcribed: 2025-10-25 18:27:32

---

This is Hacker Public Radio Episode 3998 for Wednesday, the 29th of November 2023.
Today's show is entitled, Using open source OCR to digitize my mom's book.
It is hosted by Delta Ray and is about 31 minutes long.
It carries a clean flag.
The summary is, how I used open source tools such as gphoto2 and the OCR software
Tesseract to digitize pages.
Hello, I'm Delta Ray and welcome to Hacker Public Radio.
Today I'm going to talk about a bucket list item that I was recently able to cross off
my list. A long time ago, back in the 1990s, in the 20th century,
my mom wrote a book. She passed away a few years ago, and this book was never published,
unfortunately, but it's been on my list for a while now to try to get it published, get
it digitized and so on.
I've been able to use several open source tools to help me do that, or at least to get it
digitized.
The problem was that when she wrote the book, it was about 1994 or so when she started,
and she used a computer running Windows 3.1 and a piece of software that is no longer
supported and, at the time, was honestly not very common.
It was something that was bundled with the computer and, of course, since it wasn't very
common, it used a document format that couldn't be converted into something more modern.
So she was a bit stubborn in that regard.
We tried to get her WordPerfect and have her use that, but she went ahead and used this
other program.
Fortunately, she actually printed the whole thing out. Over the next four years she worked
on this book and printed out a few different copies.
I have one of the copies and it's 338 pages.
So trying to digitize all this by typing it all in, well, I tried that at first, and it
was going to take probably a month or two of typing about an hour a day or so.
I mean, I can type fairly fast, but it was just a lot, and I also found that even though
I could catch most of my mistakes, I still ended up with a few mistakes when I was typing
it in.
So I thought, well, the nice thing about how this is printed out is that it's just a page
per sheet, just 8.5x11 paper, and I can just scan it. But using a scanner is often kind
of clunky. You have to wait for the scanning bed to run the scanner
over it and then wait for the image to be imported and stuff like that.
It's a bit slow, I think, and I needed my workflow to be a bit faster than that.
So obviously I need to use an optical character recognition
system to be able to turn the image of the words into actual text, digital
text that can be copied and pasted into a word processor.
And instead of using a scanner, I have a digital SLR camera that has a USB
output on it, and there are some programs that you can use, on Linux at least, for accessing
that.
One of them is called gphoto2, we'll get to that in a second. But basically I thought,
instead of setting up my camera with a top-down view, where I'd have
to mount it overhead and lay the pages flat on a table or something like that,
why don't I just use a document stand? I already had a document stand,
something that you'd get at an office supply store, and it holds your
document up at something like an 80 degree angle.
And then a tripod for the camera, and aim the camera from
a few feet away at the page through the open air, and light up the document
well enough so that it gets a really good shot, so that I can be more certain
that the OCR is going to have a really good input to work with and have fewer errors.
So I set that up pretty much right in front of me, right next to the computer.
On the right side of me was the tripod with the camera on it, pointing to the
left side of me, which is the table where the document stand is, and the computer is
right in front of me. So I can just lay out the book pages right in front of me and quickly
flip one page onto the document stand and then
pull the previous page down to a finished stack of pages.
And then, with the computer right in front of me, I can just trigger it to take
the next picture, import it and so on.
So like I said, the open source tools for this are available.
The first one is called gphoto2, and this is software for working with digital cameras
over a USB connection, maybe some other connections too, but I've been using it over USB. I just
had a USB cable to, I think it was a mini USB port, because that's what this older camera
that I have uses, but you could probably use any type of USB connection.
gphoto2 can actually trigger the camera shutter and transfer the photo.
That way I can just leave the camera turned on and in place. And
instead of using the battery on the camera, running out of battery, and then having to change
the battery, I actually have an AC adapter for the camera, so I can plug
in the AC adapter to keep it running and so on.
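A minimal sketch of that capture step, assuming a camera that gphoto2 supports is connected over USB. The function name capture_page and the default file name fromcamera.jpeg are placeholders chosen for illustration; the episode's actual script may differ.

```shell
# Hypothetical helper: fire the shutter over USB and download the photo.
# $1 = output file name (the default is an assumed name, not a confirmed one).
capture_page() {
    # --capture-image-and-download triggers the shutter and transfers the file;
    # --force-overwrite replaces the previous page's photo without prompting.
    gphoto2 --capture-image-and-download \
            --filename "${1:-fromcamera.jpeg}" \
            --force-overwrite
}

# Usage, with the camera switched on and connected:
#   capture_page fromcamera.jpeg
```

Keeping the capture in its own function means the rest of the script only ever deals with one fixed file name per page.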
I had to do a lot of alignment with the camera and do some testing to make
sure I had it lined up properly. One way that I did that was to take a picture using
gphoto2 and then open it up in GIMP, or some photo editing program that has guidelines
that you can pull down. Then you can look at the page of text and pull the guideline
to the bottom of a line of text and make sure that from left to right it lines up properly,
that it doesn't go up or go down, and make sure that the margin
along the left side is also lined up vertically.
That way you can be sure that the page on your document stand is perpendicular,
almost precisely, to the camera lens, which will just increase the accuracy of the OCR.
So you want to have a good setup so you can have good results.
gphoto2 has some options for triggering the shutter and then transferring
the photo, and from there I just saved the photo to a file called
fromcamera.jpeg. Then there's another piece of software called ImageMagick, which has
some options for cropping the photo. ImageMagick is a command line tool,
and it has a program called convert. With this I just run convert, space,
fromcamera.jpeg, space, -crop, and then I give it the crop geometry.
In this case I had to take the photo into GIMP and find the right pixels,
the right dimensions for the crop, using the guides.
In this case it was 2275x1700+785+560, the x meaning times.
In other words, the width and height dimensions of the box that's going to be the crop, and
then the offset from the upper left, and then the save file name,
output-crop.jpeg, to further use it from there.
So the crop is going to make it so that it's nice and cut.
The image has been cropped down to just the page, and that way any of
the background that was in the picture that might confuse the OCR software can be removed.
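The crop step can be sketched as follows, assuming ImageMagick's convert program is installed. The geometry values are the ones quoted in the episode; the helper names crop_geometry and crop_page are hypothetical, and the file names are taken from the spoken description, so their exact spelling is an assumption.

```shell
# Build an ImageMagick geometry string: WIDTHxHEIGHT+XOFFSET+YOFFSET,
# where the offsets are measured from the upper-left corner of the photo.
crop_geometry() {
    echo "${1}x${2}+${3}+${4}"
}

# Hypothetical helper: crop the camera photo down to just the page.
# $1 = input photo, $2 = cropped output
crop_page() {
    convert "$1" -crop "$(crop_geometry 2275 1700 785 560)" "$2"
}

# Usage:
#   crop_page fromcamera.jpeg output-crop.jpeg
```

The numbers only hold for one camera and stand position, so they would need to be measured again (for example with guides in GIMP) whenever the setup moves.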
And then it gets pulled into the OCR software. There's a piece of open source OCR software
called Tesseract, that's T-E-S-S-E-R-A-C-T. I'll put this information in the show notes.
Tesseract is a program that's been around for a long time. It was originally developed
by Hewlett-Packard in the 80s and 90s and then made open source in 2005.
I guess it was worked on by HP in partnership with the University of Nevada, Las Vegas,
UNLV.
They might have needed OCR software and HP developed it for them or something, but fortunately
for us, it was made open source and now it's freely available in
the program repos on your Linux distribution.
And it works really well, I think. I didn't have too many problems with
errors and for the most part I was able to rely on it.
I didn't really use the configuration options to an extent that might have cut
down on the errors. I probably could have done that, like set the language
and maybe excluded some characters or something.
One reason why I didn't want to exclude characters was because, even though I've read
my mom's book, I wasn't sure whether she used characters like a slash, you know, like
this or that with a slash there, or like a vertical bar.
These are the two characters that I ended up running into sometimes.
For instance, if she said I'll, like I, apostrophe, L, L, and it was in italics, it would
actually sometimes turn the I into a forward slash, and that's obviously wrong.
But I can easily catch those types of mistakes by doing
a search or using the spell checker.
If it had replaced a slash that was literally used in the book, just by excluding it
or something, I might have lost that information and not been able to easily find it.
So that's why I didn't actually program it to exclude certain characters.
So anyways, Tesseract takes in the image. You can basically say tesseract,
space, image name, space, and then I used a dash to say just write the data
to standard out.
You can also have it output to a file name.
Actually, I wrote a bash script for running all this stuff to make
it easier, and in the script
you can have it write out to a file name.
So I just said ocr-text.
But if you want to, you can just put a dash there to say write to standard out so it
just shows up.
That's good for when you're testing it.
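The two ways of calling Tesseract described here can be sketched like this, assuming the tesseract command is installed. The function names are hypothetical; the output base name ocr-text follows the episode.

```shell
# Write the recognized text to standard output: a "-" in place of the
# output base name tells Tesseract to print instead of creating a file.
ocr_to_stdout() {
    tesseract "$1" -
}

# Write the recognized text to a file. Tesseract appends ".txt" to the
# output base name, so "ocr-text" produces ocr-text.txt.
ocr_to_file() {
    tesseract "$1" "${2:-ocr-text}"
}

# Usage:
#   ocr_to_stdout output-crop.jpeg      # quick look while testing
#   ocr_to_file output-crop.jpeg        # creates ocr-text.txt
```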
And then from Tesseract, I used the grep program with the -E option for extended
regular expressions, and also -P for Perl-compatible regular expressions.
Each page of the book has a header and
a footer.
The header has the title of the book and the footer has my mom's name and also
page numbers and stuff.
And the way she did page numbers was to put dash, number, dash.
I used LibreOffice, LibreOffice Writer, for actually doing the word processing.
And what I wanted to do was make a script that accelerates my workflow so that I can
do this very quickly.
I wanted to basically scan a page, have it do all the necessary processing, so that
all I have to do after it's done scanning the page is just paste right into LibreOffice.
So I wanted to remove the header and the footer if it matched.
I set it up so that it uses the -v option in grep and gave it some expressions,
for one to remove blank lines, because my mom wrote the book with double-spaced
lines.
So there's a blank line in between each line.
And of course, I can just remove this by using a regular expression like caret and then
dollar, which means just match a blank line and remove that from the output.
And then another expression with the -e option to give it the
title of the book, and another -e to give it my mom's name.
I want to give it as much context as possible.
So I use a caret before the title and then a dollar sign
after it to catch just the title of the book on a line by itself.
What you want to keep in mind is, does that appear anywhere else in the book?
You don't want to end up excluding it if it's in the middle of a page
or something like that.
And of course, it's the same thing with anything that you're trying to match here.
You want to make sure that you catch all the corner cases and consider all those.
So giving grep as much context as possible is a good idea.
And then, let me see.
And then also removing the page number. So, use a caret.
The caret is basically shift-six on U.S. keyboards. It looks
like a greater-than sign that's been turned up on its side.
And then backslash, space, to say match a space, and then an asterisk to say zero or
more spaces.
So basically, if there were spaces that popped up before, it's going to match those.
And then dash, and square brackets, zero dash nine, closing square brackets, plus, and
then another dash.
That matches the page number, and it could be any arbitrary page number, and it's going
to skip that line in the output, so it won't display that.
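A rough sketch of the filtering described above, using grep -E -v with one -e expression per kind of line to drop. The title and author strings are placeholders, since the episode does not give them, and a plain space followed by an asterisk stands in for the spoken pattern for leading spaces.

```shell
# Placeholder header and footer text (assumptions for illustration only).
BOOK_TITLE='The Book Title'
AUTHOR_NAME='Author Name'

# Read OCR text on standard input and drop page furniture:
# blank lines, the title line, the author line, and "-N-" page numbers.
strip_page_furniture() {
    grep -E -v \
        -e '^$' \
        -e "^${BOOK_TITLE}\$" \
        -e "^${AUTHOR_NAME}\$" \
        -e '^ *-[0-9]+-'
}

# Usage:
#   tesseract output-crop.jpeg - | strip_page_furniture
```

Anchoring the title and name with a caret and a dollar sign means a sentence that merely mentions the title in the body text is kept. If the title contained regular expression metacharacters, they would need escaping.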
And then I pipe that into grep again, but this time using -P. The reason
why I wanted to pass it through grep again is because sometimes I would end up with this
form feed control character. The hexadecimal code for it is 0C. I guess
Tesseract would insert that in cases where it found multiple blank lines or
something.
And I didn't want that to actually show up in the output, especially when I passed it
into the copy-paste program, which I'll talk about in a second.
So I passed it into grep, space, -P, space,
-v to exclude what it matches, space, caret, backslash x, which sets up the hexadecimal
matching, and then 0c, dollar sign.
And so that makes it skip a line that has just a form feed on it.
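The form feed filter can be sketched as a small function, assuming GNU grep built with Perl-compatible regular expression support (the -P option). The function name is hypothetical.

```shell
# Drop lines that consist of nothing but a form feed character (hex 0C).
# -P enables Perl-compatible regexes, which understand the \x0c escape;
# -v inverts the match so that matching lines are removed.
strip_form_feeds() {
    grep -P -v '^\x0c$'
}

# Usage:
#   tesseract output-crop.jpeg - | strip_form_feeds
```

Because the pattern is anchored at both ends, a line that has a form feed followed by real text is left alone.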
This first grep pipeline is for the purpose of displaying to the screen so that
I can see what it matched in the OCR software.
Then I run it again, basically the same expression, a second time.
And what I'm doing here is passing this into the xsel program, x-s-e-l, space,
-b. What that does, in Linux on an X11 desktop, is take the input from
standard input and put it into your copy-paste buffer.
This is a really handy program, especially in a script like this, so that I don't
have to select the text on the screen and get the mouse selection exactly right or something
like that. It just removes the errors that might happen and speeds up the process.
So from there, I can just do a Control-V to paste my copy-paste buffer directly into
LibreOffice, which is pretty handy.
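A minimal sketch of the clipboard step, assuming an X11 session with xsel installed. The function name is hypothetical, and the file name in the usage comment is the OCR output name used elsewhere in the episode.

```shell
# Put whatever arrives on standard input into the X11 clipboard.
# -b selects the CLIPBOARD selection (the one Control-V pastes from);
# xsel reads from standard input when its input is a pipe, not a terminal.
copy_to_clipboard() {
    xsel -b
}

# Usage (needs a running X11 desktop):
#   grep -E -v -e '^$' ocr-text.txt | copy_to_clipboard
```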
I'm going to put the script in the show notes.
So if you're not following all this just orally, you'll be able to read along.
And then the next thing I do is detect the actual page number.
The reason why I do this is because one of the things that the script does is it takes
the cropped image and copies it to an image file name with the page number in it,
so that later, as I'm going through and editing the digitized text, I can compare
it with the scan of the page to check for formatting problems and also to double check that
the text actually matches properly and stuff.
So I set up a variable called detected page and then I give it a command substitution
operator, so dollar, parenthesis, space, tail, space, ocr-text.txt to get the last 10 lines
of the OCR text output. I pipe that into grep and turn off the color, which you don't
really need to do, but I do it anyways explicitly.
And then -P, because I'm going to use a special feature of Perl regular
expressions called lookahead and lookbehind assertions.
And then -o, which means only give me
the text that matches the regular expression.
And then -e to give it the expression.
And here's where I use lookahead and lookbehind assertions.
If you're not familiar with more advanced regular expressions, Perl regular expressions
give you the option to match context around what you want to match without adding
that context to the match itself.
In this case, the page numbers have a dash before and after the number,
and I don't want that to be part of the match.
So I can use this special syntax, which is kind of complicated.
You'll see it in the script, but it's parenthesis, question mark, less than, equals,
caret, dash to match the dash, close parenthesis, and that matches the dash before.
And then I give it the character class to match the number in between.
So it's square brackets, zero dash nine, closing square brackets, plus. That's
going to match any arbitrary number.
And then that's followed by the lookahead assertion for the dash after the number.
And so that will end up printing out just the page number itself without the
surrounding dashes.
I could also use a sed expression or something to do the same thing, but I thought I'd use
lookahead and lookbehind assertions instead.
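The page detection can be sketched as below, assuming GNU grep with -P support. The function name detect_page and the variable name detected_page are guesses at how the spoken names might be spelled, and the pattern is reconstructed from the description rather than copied from the show notes.

```shell
# Print the page number found in the last 10 lines of an OCR text file.
# (?<=^-)  lookbehind: a dash at the start of the line must come before
# [0-9]+   the digits, which are the only part that is actually matched
# (?=-)    lookahead: a dash must follow, but is not part of the match
detect_page() {
    tail "$1" | grep --color=never -P -o -e '(?<=^-)[0-9]+(?=-)'
}

# Usage:
#   detected_page=$(detect_page ocr-text.txt)
```

With -o, grep prints only the matched digits, so a footer line reading -42- yields 42. As written, the pattern expects the dash in the first column; a footer with leading spaces would need the lookbehind relaxed or the spaces stripped first.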
And so the detected page variable ends up with the detected page number.
I pass that into an if statement that makes sure that it's actually a number.
If there's a number there, it asks you a question, what page is this, followed
by showing the detected page in square brackets to show that that's
the default if you just press enter.
So you could accept the detected page, or you could type in a page number of your
own.
For instance, if it didn't get the number right or something like that,
you can correct it there.
And then there's the base case, where there was
no number detected in the detected page.
Then it's just going to say, error, couldn't determine the page, and exit out.
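One way to sketch that confirmation step, under the assumption that the prompt and error wording roughly follow the spoken description. The function name confirm_page is hypothetical, and a case statement is used here for the number check where the episode describes an if statement.

```shell
# Ask the user to confirm or correct the detected page number.
# $1 = page number detected by OCR (may be empty or not a number)
# Prints the chosen page number on standard output.
confirm_page() {
    detected="$1"
    case "$detected" in
        ''|*[!0-9]*)
            # Empty, or contains something other than digits
            echo "Error: couldn't determine the page" >&2
            return 1
            ;;
    esac
    # The prompt goes to stderr so only the answer lands on stdout
    printf 'What page is this? [%s] ' "$detected" >&2
    read -r answer
    # Pressing Enter leaves answer empty, so the detected number is used
    echo "${answer:-$detected}"
}

# Usage:
#   page=$(confirm_page "$detected_page") || exit 1
```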
Then at the end of the script, it's going to run convert again on the output
cropped page. I don't need the full size image, so I can scale it down a
little bit and also reduce the quality a little bit in the JPEG image to save space.
That way I end up with something like a 500 kilobyte image per page instead of a four
megabyte image per page.
So almost a 10 times size reduction.
I run the convert command again with the -quality option, space, 85, to give it
an 85 JPEG quality, followed by -resize, space, 80% to scale it
down to 80% of the original size, so like a 20% reduction.
That's followed by the input file name and the output file name, but this time with
the title of the book, dash, the dollar page variable, putting the word page for the
variable name inside curly braces. That way, you don't run into a case where any text that
comes after that in the file name becomes part of the variable name, so you're kind
of isolating the variable.
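The archiving step and the curly brace point can be sketched like this, assuming ImageMagick's convert is installed. The prefix mybook and the _small suffix are placeholders, since the episode does not give the real file name, and the input file is placed first here, which is the argument order ImageMagick documents.

```shell
# Hypothetical file name prefix standing in for the book's title.
BOOK_TITLE='mybook'

# Build the per-page image name. The braces end the variable name at
# "page"; writing $page_small would look up a variable named page_small.
page_image_name() {
    page="$1"
    echo "${BOOK_TITLE}-${page}_small.jpeg"
}

# Shrink the cropped page image for long-term storage.
# $1 = cropped input image, $2 = page number
archive_page() {
    convert "$1" -resize 80% -quality 85 "$(page_image_name "$2")"
}

# Usage:
#   archive_page output-crop.jpeg 42     # writes mybook-42_small.jpeg
```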
And then it just prints done at the end to say this is all done.
There's very little error detection in it, and this is just a one-off script.
So I'm not really super concerned about general use cases and stuff, but the
script may be useful for you if you want to do something similar, in helping you get
that set up.
So from there, every time I needed to scan a page, I just ran this getpage.sh
program again.
I'd hear the camera click to take a picture and then wait a few seconds for
Tesseract to do its analysis.
Then it would print out the text on the page, and I could just take a quick look, comparing
the output of Tesseract with the actual physical page.
And then once I'm satisfied, just pressing Control-V in LibreOffice on the next page
and making sure that it pasted in OK.
And then creating a page break to make sure that the page numbers are going
to match up. That way, I can be sure that I'm on the right page and stuff like that.
Because after doing 338 pages, it's kind of mundane and everything,
and I could end up making a mistake.
In the end, I was able to scan all 338 pages in about six hours' worth of
time over a two-day period, which was a substantial improvement over having to take
like two months of typing or something.
And I was able to get it done, that was the most important thing.
This is something I've wanted to do for like 10 years or so,
so I'm happy that I was able to finally do it.
The funny thing about page numbering in LibreOffice is, word processors
have always been kind of a hot mess.
I've been using word processors since the 1980s, and formatting issues and the
way you do certain things have always been complicated.
One of the things I needed to do was make the book document so that the page numbering
starts a couple of pages in, because you have the title page and the dedication
page and the table of contents and stuff like that.
So you don't want to start the page numbering on page one of the document.
You want to start in chapter one, and doing that turned out to be a bit
of a complex process.
When I searched on a search engine for LibreOffice,
how to restart the page numbering, I came across a LibreOffice page where
the guy said, well, it's not as complex as you think, follow these 10 steps or whatever.
And you go through it and you're like, oh, this is kind of complicated.
But I wasn't too surprised that it was complicated, because word
processors are complicated.
But that didn't stop some people, who are angry at open source being complicated,
or who perceive open source to be complicated, from complaining about
it and criticizing the LibreOffice developers and everything.
And so I was like, I think this is maybe a bit unfair or something.
I ended up going to a YouTube video about how to renumber stuff in LibreOffice.
And there in the comments, I found a bunch of people complaining about how
the user interface of LibreOffice is not helpful, and this could be a lot
easier,
the developers need to think about the users, and so on.
So for comparison, I thought, well, how hard is this to do in Microsoft Word?
I found another YouTube video that was actually a little bit longer than the LibreOffice
video,
and I watched the process for how to do this same thing in Word.
It actually takes more steps and is longer and is more complicated.
But, somewhat hypocritically, I look down the comments and the comments
are all like, thank you for showing me this.
Oh, this is great.
Not criticizing that it takes a long time to do this in Microsoft Word.
They somehow hold Microsoft Word to a different standard, thinking that somehow
it's better, or that it's okay that it's complicated because, I guess, they paid for it.
So they feel like, oh, this is just the way it is, because Microsoft Word is the better
tool or the status quo tool.
If anything, the fact that people are talking about how LibreOffice should
be more like Microsoft Word shows where the problem is. We have a monopoly here.
We have a problem with the monopoly. Anyways, I was happy I was able to come up with
this modular setup to make the whole workflow of digitizing my mom's book a lot
easier, and I hope you were able to get some use out of this.
So thanks.
If you enjoyed this and you have some stories of your own or stuff of your
own to contribute, I encourage you to download a program like Audacity or SoX
or something and find a microphone or use your phone. You can even call into
Hacker Public Radio to record a show and tell the world about your own story,
your own ideas, and your own adventures in computing and other
things.
Okay.
So, until next time, talk to you later, bye.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself. If you ever thought of recording
a podcast, then click on our contribute link to find out how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive
and rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0
International license.