Episode: 3998
Title: HPR3998: Using open source OCR to digitize my mom's book
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3998/hpr3998.mp3
Transcribed: 2025-10-25 18:27:32

---

This is Hacker Public Radio Episode 3998 for Wednesday, the 29th of November 2023.
Today's show is entitled, Using open source OCR to digitize my mom's book.
It is hosted by Delta Ray and is about 31 minutes long.
It carries a clean flag.
The summary is, how I used open source tools such as gphoto and the OCR software
Tesseract to digitize pages.
Hello, I'm Delta Ray and welcome to Hacker Public Radio.
Today I'm going to talk about a bucket list item that I was recently able to cross off
my list. A long time ago, back in the 1990s in the 20th century,
my mom wrote a book. She passed away a few years ago and this book was never published
unfortunately, but it's been on my list for a while now to try to get it published, get
it digitized and so on.
I've been able to use several open source tools to help me do that, at least to get it digitized.
The problem was that when she wrote the book, it was about 1994 or so when she started
and she used a computer that ran Windows 3.1 and a piece of software that is no longer
supported and at the time was honestly not very common.
It was something that was bundled with the computer and of course, since it wasn't very
common, it used a document format that couldn't be converted into something more modern.
So she was a bit stubborn in that regard.
We tried to get her WordPerfect and have her use that, but she went ahead and used this
other program.
Fortunately, she actually printed the whole thing out. Over the next four years she worked
on this book and printed out a few different copies.
I have one of the copies and it's 338 pages.
So trying to digitize all this by typing it all in, well, I tried that at first, and it
was going to take probably a month or two of typing about an hour a day or so.
I mean, I can type fairly fast, but it was just a lot and I also found that even though
I could catch most of my mistakes, I still ended up with a few mistakes when I was typing
it in.
So I thought, well, the nice thing about how this is printed out is that it's just a page
per sheet, just 8.5x11 paper and I can just scan it, but using a scanner is often kind
of clunky and, you know, you have to wait for the scanner, scanning bed to run the scanner
over it and then wait for the image to be imported and stuff like that.
It's a bit slow, I think, and I needed my workflow to be a bit faster than that.
So I decided to try to use, obviously I need to use like an optical character recognition
system to be able to, you know, turn the image of the words into actual text, like digital
text that can be copied and pasted into a word processor.
And so, you know, instead of using a scanner, I have a digital SLR camera that has a USB
output on it and there's some programs that you can use on Linux at least for accessing
that.
One of them is called gphoto2, we'll get to that in a second, but basically I thought,
instead of having like a top-down view with my camera, where, you know, I have
to like mount it overhead and lay the pages flat on a table or something like that,
I thought, why don't I just use like a document stand, which I already had a document stand,
you know, something that you'd get at like an office supply store and it would hold your
document up at like an 80 degree angle or something like that.
And then a tripod for the camera and kind of aim the camera, you know, just a few, from
a few feet away, aim the camera at the page through the open air and light up the document,
you know, well enough so that it gets a really good shot so that I can be more certain
that the OCR is going to have a really good input to work with and have fewer errors.
So I set that up pretty much right in front of me right next to the computer so that, you
know, like on the right side of me was the tripod with the camera on it pointing to the
left side of me, which is on the table where the document stand is, and the computer is
right in front of me so I can just lay out the book pages right in front of me and quickly
flip, you know, one page to the document stand and then
pull the previous page down to like a, you know, finished stack of pages.
And then, you know, with the computer right in front of me, I can just trigger it to take
the next picture, import it and so on.
So like I said, the open source tools for this are, you know, available.
The first one is called gphoto2 and this is software for working with digital cameras
over a USB connection, maybe some other connections too, but I've been using it over USB and just
had a USB cable to, I think it was a mini USB port because that's what this older camera
that I have uses, but you could probably use any type of USB connection.
So gphoto2 can actually trigger the camera shutter and transfer the photo.
That way I can just leave the camera turned on and in place, and
instead of using battery on the camera and then running out of battery and then having to change
the battery, I actually have an AC adapter for the camera so that, you know, I can plug
in the AC adapter to keep it running and so on.
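
The capture step described here can be sketched as a small shell function. This is only a sketch under assumptions: the talk does not spell out the exact gphoto2 options that were used, and the file name fromcamera.jpg is a guess at how the spoken name is written. The options shown are standard gphoto2 options for shooting and downloading in one step.

```shell
#!/usr/bin/env bash
# Sketch: trigger the shutter over USB and download the photo in one step.
# Assumption: the default output file name; the talk only gives it verbally.
capture_page() {
    out="${1:-fromcamera.jpg}"
    gphoto2 --capture-image-and-download --force-overwrite --filename "$out"
}
```

With the camera plugged in and switched on, calling capture_page should leave the new photo in the named file. Running gphoto2 --auto-detect first is a quick way to confirm that the camera is visible over USB.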
So I had to do a lot of alignment, you know, with the camera and do some testing to make
sure I have it lined up properly and one way that I did that was to take a picture using
gphoto2 and then open it up in GIMP or some photo editing program that has guide lines
that you can pull down and then you can look at the page of text and pull the guideline
to like the bottom of a line of text and make sure from left to right that it lines up properly
and there's no, you know, it doesn't like go up or go down and make sure that the margin
along the left side is also all lined up vertically.
That way you can be sure that the page on your document stand is perpendicular, you know,
almost precisely with the camera lens which will just increase the accuracy of the OCR.
So you want to have a good setup so you can have good results.
But gphoto2, you know, has some options for triggering the shutter and then transferring
the photo, and then from there I just, you know, saved the photo to a file called from
camera.jpeg, and then there's another piece of software called ImageMagick which has
some options for cropping the photo. ImageMagick is a command line tool
and it has a program called convert, and with this I just, you know, run convert, space,
from camera.jpeg, space, dash crop, and then I give it, you know, the crop geometry.
So in this case I had to, you know, take it into GIMP and find the right pixels, you
know, the right dimensions for the crop using the guides.
So in this case it was 2275x, meaning times, you know, 1700, plus 785, plus 560.
So in other words, the width and height dimensions of the box that's going to be the crop, and
then the offset from the upper left, and then the save file name, you know, output
dash crop.jpeg, to further use it from there.
So the crop is going to make it so that, you know, it's, it's nice and cut.
The image has been cropped down to just the page and stuff like that, that way any of
the background that was in the picture can be removed that might confuse the OCR software.
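
The crop step can be sketched like this. The numbers are the ones given in the talk (a 2275 by 1700 pixel box, offset 785 pixels from the left and 560 from the top); the input file name and the .jpg extensions are assumptions, since the names are only spoken aloud. ImageMagick writes this geometry as WIDTHxHEIGHT+XOFFSET+YOFFSET.

```shell
#!/usr/bin/env bash
# Build an ImageMagick geometry string: WIDTHxHEIGHT+XOFFSET+YOFFSET.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Crop the camera photo down to just the page so the background
# does not confuse the OCR step.
# Assumption: file names are guesses at the spoken names.
crop_page() {
    convert "${1:-fromcamera.jpg}" \
        -crop "$(crop_geometry 2275 1700 785 560)" \
        "${2:-output-crop.jpg}"
}
```

The four numbers depend entirely on where the camera and document stand sit, so they would need to be measured again (for example with guides in GIMP) whenever the setup moves.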
And then it's pulled into the OCR software, and there's a piece of open source OCR software
called Tesseract, it's T-E-S-S-E-R-A-C-T, I'll put this information in the show notes.
But Tesseract is a program that's been around for a long time, it was originally developed
by Hewlett Packard in the 80s and 90s and then made open source in 2005.
I guess it was worked on by HP in partnership with the University of Nevada, Las Vegas,
UNLV.
They might have needed OCR software and HP developed it for them or something, but fortunately
for us, it was made open source and now it's, you know, freely available just by, you
know, in your program repos on your Linux distribution.
And it works really well, I think, you know, I had, I didn't have too many problems with
errors and for the most part I was able to rely on it.
I didn't really use the configuration options to, you know, an extent that might have cut
down on the errors or something, I probably could have done that, like, set the language
and maybe excluded some characters or something.
One reason why I didn't want to use excluding characters was because even though I've read
my mom's book, I wasn't sure whether she used characters like a slash, you know, like
this or that using a slash there or using like a vertical bar.
These are the two characters that I ended up running into sometimes.
Like for instance, if she said I'll, like I apostrophe L L, and if it was in italics, it would
actually sometimes turn the I into a forward slash and, you know, that's obviously wrong.
But I can easily catch those types of mistakes by doing
like a search or something or using the spell checker.
But if it had replaced like a slash that was literally used in the book by just excluding it
or something, I might have lost that information and not been able to easily find it.
So that's why I didn't actually program it to exclude certain characters.
So anyways, Tesseract takes in the image and then basically you can say tesseract,
space, image name, space, and then I used a dash to basically say just write the data
to standard out.
You can also, you know, have it like output it to a file name or something like that.
Actually in my script, I wrote a bash script for actually running all this stuff to make
it easier.
But you can have it, you know, write out to a file name.
So I just said like ocr dash text.
But if you want to, you can just put a dash there to say write to standard out so it
just shows up.
That's good for when you're testing it.
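
The two ways of calling Tesseract described here can be sketched as follows. The function names are made up for illustration and the image file name is an assumption. When given an output base name, Tesseract adds the .txt extension itself, so the base ocr-text produces ocr-text.txt.

```shell
#!/usr/bin/env bash
# Print the recognized text to standard out; handy while testing the setup.
ocr_to_stdout() {
    tesseract "$1" -
}

# Write the recognized text to a file. Tesseract appends .txt to the
# base name, so the default base "ocr-text" yields ocr-text.txt.
ocr_to_file() {
    tesseract "$1" "${2:-ocr-text}"
}
```

For example, ocr_to_stdout output-crop.jpg would show the page text in the terminal, while ocr_to_file output-crop.jpg would save it for the later steps of a script.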
And then from Tesseract, I used the grep program with the dash capital E option for extended
regular expressions and also dash capital P for the Perl compatible regular expressions
because what I wanted to do on each page, each page of the book has like a header and
a footer.
The header has the title of the book and the footer has, you know, my mom's name and also
page numbers and stuff.
And so the way she did page numbers was to put dash number dash.
And I thought, well, you know, when I copy this into, I used LibreOffice,
LibreOffice Writer, for actually doing the word processing.
And what I wanted to do was make a script that accelerates my workflow so that I can
do this very quickly.
And so I wanted to basically scan a page, have it do all the necessary processing so that
all I have to do after it's done scanning the page is just paste right into LibreOffice.
And so I wanted to remove the header and the footer if it matched something. So, you
know, I set it up so that it uses the dash v option in grep and gave it some expressions
to like remove blank lines, because my mom wrote the book in double-spaced
lines.
So there's like a blank line in between each line.
And of course, I can just remove this by using a regular expression like caret and then
dollar, which means just match a blank line and remove that from the output.
And then, you know, another expression with the dash e option and say, you know, the
title of the book and another dash e and give it my mom's name.
I want to give it as much context as possible.
So I use a caret before the title and then a dollar, you know, maybe a dollar sign
after it to catch just the title of the book on a line by itself.
What you want to keep in mind is, does that appear anywhere else, you know, in the book
or something, you don't want to end up excluding it if it's like in the middle of a page
or something like that.
And of course, the same thing with anything that you're trying to match here, it's like,
you want to make sure that you catch all the corner cases and consider all those.
So giving grep as much context as possible is a good idea.
And then, let me see.
And then also removing like the page number, so, you know, use a caret.
The caret is basically shift six on U.S. keyboards; you know, it looks
like a greater than sign that's been turned up on its side.
And then backslash space to say match a space, and then an asterisk to say zero or
more spaces.
So basically, if there were spaces that popped up before, it's going to match those.
And then dash and square brackets, zero dash nine, closing square brackets, plus, and
then another dash.
That matches the page number, and it could be any arbitrary page number, and it's going
to skip that line basically in the output, so it won't display that.
And then pipe that into grep again, but this time using dash capital P. And the reason
why I wanted to pass it through grep again is because sometimes I would end up with this
form feed control character, the hexadecimal code for it is zero C. And I guess
Tesseract would insert that in cases where it found like multiple blank lines or
something, it would insert a form feed.
And I didn't want that to actually show up in the output, especially when I passed it
into the copy paste program, which I'll talk about in a second.
So I passed it into grep, space, dash capital P, space,
dash v to exclude what it matches, space, caret, backslash, x, which sets up the hexadecimal
matching, and then zero C, dollar sign.
And so that will make it so it skips a line that has just a form feed on it.
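
The two grep passes described so far can be sketched as one filter. The title and author strings below are hypothetical stand-ins, since the real ones are not given in the talk, and the page number pattern is written with a plain space rather than backslash space. The second pass needs a grep built with Perl regular expression support, such as GNU grep.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real title and name are not given in the talk.
BOOK_TITLE='The Book Title'
AUTHOR_NAME='Author Name'

# Read OCR text on standard input and drop blank lines, the header and
# footer lines, "-N-" page number lines, and lone form feed lines.
strip_page_furniture() {
    grep -v -E \
        -e '^$' \
        -e "^${BOOK_TITLE}\$" \
        -e "^${AUTHOR_NAME}\$" \
        -e '^ *-[0-9]+- *$' |
    grep -P -v '^\x0c$'
}
```

Anchoring every pattern with a caret and a dollar sign means a line is only dropped when it consists of nothing but the title, the name, or the page number, so a mention of the title in the middle of a paragraph survives. A title containing regular expression characters would need escaping first.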
And so this first grep pipeline is for the purpose of displaying to the screen so that
I can see what it matched in the OCR software.
Then I run it again, basically the same expression, a second time.
And what I'm doing here is passing this into the xsel program, x, s, e, l, space,
dash b. And what that does is, in Linux on an X11 desktop, it'll take the input from
standard input and put it into your copy paste buffer.
This is a really handy program, especially in like a script like this, so that I don't
have to select the text on the screen, and then get the mouse selected correctly or something
like that, just basically removes the errors that might happen and speeds up the process.
So from there, I can just do a control V to paste my copy paste buffer directly into
LibreOffice, which is pretty handy.
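
The clipboard step can be sketched as a one-line function. The function name is made up for illustration; the -b option selects the clipboard and -i, added here to be explicit, tells xsel to read standard input. It needs a running X11 session to work.

```shell
#!/usr/bin/env bash
# Put whatever arrives on standard input into the X11 clipboard,
# ready to paste with control V.
copy_to_clipboard() {
    xsel -b -i
}
```

A pipeline such as tesseract output-crop.jpg - | copy_to_clipboard would then make the recognized text available to paste, with the filtering greps placed in between as needed.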
This script, I'm going to put the script in the show notes.
So if you're not following all this just aurally, then you'll be able to read along.
And then the next thing I do is detect the actual page number.
And the reason why I do this is because one of the things that the script does is it takes
the cropped image and copies it to an image file name with the page number in it.
So that later as I'm going through and kind of editing the digitized text, I can compare
it with the scan of the page to check for formatting problems and also to double check that
the text actually matches properly and stuff.
So I set up a variable called detected page and then I give it a command substitution
operator, so dollar, parenthesis, space, tail, space, ocr-text.txt to get the last 10 lines
of the OCR text output, pipe that into grep and then turn off the color, which you don't
really need to do, but I do it anyways explicitly.
And then dash capital P, because I'm going to use a special feature of the Perl regular
expressions called look ahead and look behind assertions.
And then dash o, which means only give me the matched output from the regular expression,
only give me the text that matches the regular expression.
And then dash e to give it the expression.
And here's where I use something called look ahead and look behind assertions.
If you're not familiar with more advanced regular expressions, Perl regular expressions
give you the option to basically match context around what you want to match without adding
that context to the match itself.
So in this case, the page numbers have a dash before and after the number.
And I don't want that to be part of the match.
And so I can use this special syntax, which is basically, it's kind of complicated.
You'll see it in the script, but it's parenthesis, question mark, less than, equals,
caret, dash to match the dash, a closed parenthesis, and that matches the dash before.
And then I give it, you know, the grouping operator to match the number in between.
And so it's like square brackets, zero dash nine, closing square brackets, plus that's
going to match the numbers, any arbitrary number.
And then followed by the look ahead assertion for the dash after the number.
And so that basically will end up printing out just the page number itself without the
surrounding dashes.
I could also use a sed expression or something to do the same thing, but I thought I'd use
look ahead and look behind assertions instead.
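
The page number detection can be sketched like this. The function name is made up for illustration, and the expression is a reconstruction from the spoken description rather than a copy of the script in the show notes. The look behind is written (?<=^-) and the look ahead (?=-), so only the digits between the dashes are printed. It needs a grep with Perl regular expression support, such as GNU grep.

```shell
#!/usr/bin/env bash
# Print the page number found in the last 10 lines of an OCR text file.
# A footer line like "-42-" yields "42"; the dashes are matched as
# context by the assertions but are not part of the output.
detect_page() {
    tail "$1" | grep --color=never -P -o -e '(?<=^-)[0-9]+(?=-)'
}
```

In a script the result would be captured with command substitution, for example detected_page=$(detect_page ocr-text.txt). Because the look behind requires the dash at the very start of the line, a footer that OCR indented with spaces would not match with this exact pattern.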
And so detected page ends up with the detected page number.
I pass that into an if statement that makes sure that it's actually a number.
And if there's a number there, it asks you a question, what page is this, followed
by, you know, showing the detected page in square brackets to kind of show that that's
the default if you just press enter.
Or you could type in a page number of your
own.
So, you know, for instance, if it didn't get the number right or something like that,
you can correct it there.
And then, you know, there's the fallback case where, you know, there was
no number detected in the detected page.
So it's going to just say error couldn't determine the page and exit out.
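
The confirmation prompt can be sketched as a function. This is an illustration under assumptions: the talk describes an if statement, while this sketch uses a case pattern to test for digits, and it returns an error status where the described script exits. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Ask the user to confirm or override a detected page number.
# Prints the chosen page on standard output; fails if nothing
# numeric was detected.
confirm_page() {
    detected="$1"
    case "$detected" in
        ''|*[!0-9]*)
            echo "Error: couldn't determine the page" >&2
            return 1
            ;;
    esac
    # The detected value in brackets is the default if Enter is pressed.
    printf 'What page is this? [%s] ' "$detected" >&2
    read -r answer || true
    printf '%s\n' "${answer:-$detected}"
}
```

A caller could write page=$(confirm_page "$detected_page") || exit 1 to stop the script when no page number was found.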
So then at the end of the script, you know, it's going to run convert again on the output
crop image because, basically, I don't need like the full size image, I can scale it down a
little bit and also reduce the quality a little bit in the JPEG image to save space.
That way I end up with like a, you know, 500 kilobyte image per page instead of a four
megabyte image per page.
So, you know, 10 time size reduction almost.
So I run the convert command again with the dash quality option, space 85 to give it
like an 85 JPEG quality setting, followed by dash resize space 80% to reduce the size
down to 80% of the original size, so like a 20% reduction.
And then followed by the input file name and the output file name, but this time with
the title of the book, dash, the dollar page variable, but putting the word page for the
variable name inside curly braces. That way, you don't run into a case where any text that
comes after that in the file name becomes part of the variable name, so you're kind
of isolating the variable.
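
The archiving step and the curly brace point can be sketched together. The short title booktitle and the _scan suffix are hypothetical, chosen only to show why the braces matter, and the input file is placed before the options here, which ImageMagick also accepts.

```shell
#!/usr/bin/env bash
# Hypothetical short title for file names; the real one is not in the talk.
BOOK_SLUG='booktitle'

# Build the archive file name for a page. The braces end the variable
# name, so "${page}_scan" is not read as a variable called page_scan.
page_image_name() {
    page="$1"
    printf '%s\n' "${BOOK_SLUG}-${page}_scan.jpg"
}

# Shrink the cropped image to 80% of its size at JPEG quality 85.
archive_page_image() {
    convert "${2:-output-crop.jpg}" -resize 80% -quality 85 \
        "$(page_image_name "$1")"
}
```

Without the braces, the shell would look up a variable named page_scan, find it unset, and silently produce a file name with no page number in it.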
And then just printing done at the end of it to say, you know, this is all done.
And then there's only light error detection stuff, and also this is just a one off script.
So you know, I'm not really super concerned about general use cases and stuff, but the
script may be useful for you if you want to do something similar in helping you get
that set up.
So from there, I took it in.
So every time I would need to scan a page, I just ran this getpage.sh
program again.
And I'd hear the camera click to take a picture and then wait a few seconds for Tesseract
to do its analysis.
And then it would print out the text on the page and I can just take a quick look comparing
the output of Tesseract with the actual physical page.
And then once I'm satisfied, just pressing control V into LibreOffice on the next page
and making sure that it pastes in OK.
And then creating a page break to make sure that the page numbers are, you know, going
to match up that way, I can be sure that I'm on the right page and stuff like that.
Because after doing 338 pages, it, you know, it's kind of mundane and everything
and I could end up making a mistake.
In the end, I was able to scan all 338 pages in about six hours worth of
time over a two day period, which was a substantial improvement over, you know, having to take
like two months of typing or something.
And I was able to get done, you know, that was the most important thing.
This is something I've wanted to do for like 10 years or so.
And so I'm happy that I was able to finally do it.
You know, the funny thing about page numbering in LibreOffice is, word processors,
you know, have always been kind of a hot mess.
I've been using word processors since the 1980s and formatting issues and the
way you do certain things has always been complicated and everything.
And one of the things I needed to do was make the book document so that the page numbering
starts a couple pages in because, you know, you have the title page and the dedication
page and the table of contents and stuff like that.
And so you don't want to start the page numbering on page one of the document.
You want to start like in chapter one and doing that turned out to be, you know, a bit
complex of a process.
And when I just searched, um, on the search engine for LibreOffice,
how to restart the page numbering, I came across a LibreOffice page where, you know,
the guy said, well, it's not as complex as you think, uh, follow these 10 steps or whatever.
And you know, it's, it's like you go through it and you're like, oh, this is kind of complicated
or whatever.
But I, you know, I wasn't too surprised that it was, it was complicated because word
processors are complicated.
But that didn't stop some people who are angry at, uh, open source being complicated,
or they're perceiving open source to be complicated, uh, from complaining about
it and criticizing the LibreOffice developers and everything.
And so I was like, I think this is maybe a bit unfair or something.
Um, I ended up going to a YouTube video about how to renumber stuff in LibreOffice.
And there in the comments, I found a bunch of people complaining about how, you know,
the user interface of LibreOffice is not helpful and, and, you know, this could be a lot
easier.
You, the developers need to think about the users and so on.
So for comparison, I thought, well, how hard is this to do in, um, in Microsoft Word?
And I found another YouTube video that was actually a little bit longer than the LibreOffice
video.
And, um, I watched the process for how to do this same thing in Word.
And it actually takes more steps and is longer and is more complicated.
But, hypocritically, you know, somewhat, I look down the comments and the, uh, the comments
are all like, thank you for showing me this.
Oh, this is great.
Not criticizing that, you know, it takes a long time to do this in Microsoft Word.
And they somehow hold Microsoft Word to a different standard, um, thinking that somehow
it's better or like it's okay that it's complicated because I guess they paid for it.
So they feel like, oh, this is just the way it is because Microsoft Word is the better
tool or the, you know, the status quo tool.
If anything, uh, the fact that people are, you know, talking about how LibreOffice should
be more like Microsoft Word shows where the problem is, you know, we have a monopoly here.
We have a problem with the monopoly, um, anyways, so I was happy I was able to come up with
this modular setup to make the whole workflow of, uh, digitizing my, uh, mom's book a lot
easier and I hope you were able to, uh, get some use out of this.
So thanks.
If, um, if you enjoyed this and you have some stories of your own or, or stuff of your
own to contribute, I encourage you to, uh, download a program like, um, Audacity or SoX
or, or something and find a microphone or use your phone, uh, you can even call into
hacker public radio to record a show and tell, tell the world about your own, uh, story,
your own ideas and, and your own adventures, uh, um, in, um, in computing and, and other
things.
Okay.
So, until next time, talk to you later, bye.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself. If you ever thought of recording
a podcast, then click on our contribute link to find out how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive
and rsync.net.
Unless otherwise stated, today's show is released under Creative Commons, Attribution 4.0 International