Episode: 1657
Title: HPR1657: Hacking Gutenberg eBooks
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1657/hpr1657.mp3
Transcribed: 2025-10-18 06:28:23
---
It's Tuesday 9th of December 2014.
This is HPR Episode 1657 entitled Hacking Gutenberg Ebooks.
It is hosted by John Kulp and is about 27 minutes long.
Feedback can be sent to JohnlandChickelp at mail.com or by leaving a comment on this episode.
The summary is, I talk about ebook formatting and how to customize an ebook from Project
Gutenberg.
This episode of HPR is brought to you by AnHonestHost.com.
Get a 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Get your web hosting that's honest and fair at AnHonestHost.com.
Hey everybody, John Kulp in Lafayette, Louisiana here and it's been quite a long time since
I recorded an episode for HPR.
I went back and looked and it was in May and so it's high time that I did another one
especially since apparently shows are running short.
So I'm going to talk for a few minutes today about something that's really interested
me a lot lately and that is ebooks.
Now I've been a book lover for most of my life and in fact there was quite a while when
I was in my 20s when I collected rare books and I really prize the book as an artifact.
However, in the last couple of years I've really grown to love ebooks almost as much
if not more than regular books.
Part of this is the convenience and part of it is the fact that they are so much more
accessible than physical books in terms of things like font size and cross platform availability
and also accessible in my pocket.
I mean with ebooks everywhere I go I have a book that I can read if I get bored.
It's on my phone, it's on my laptop, it's on my tablet and the thing that really got
me interested in ebook formatting was my purchase at the end of the spring semester
I think last year I got a Kindle.
And a Kindle is a wonderful device.
It's not the only really good ebook reader, but it's the only one I have. Well, my kids
have the Nook Color, which to me is not as good an ebook reading experience.
The thing that's great about the Kindle is the E Ink technology, which is a really wonderful
looking, I don't even know what to call it, but it's a way of displaying text on a screen
that does not use a glowing screen.
When you use an E Ink device you can take it right out in direct sunlight and see
it perfectly.
In fact you can see it better in direct sunlight than you can in the dark, which is exactly
the opposite of a smartphone or a tablet, which you cannot possibly read if you're out
in the sun.
The Kindle that I got is the Kindle Paperwhite, and it's got built-in LED backlighting
if you have to read in low lighting situations, and most of the time I keep those lights
on.
The battery life is incredible, it'll last a long time.
It does not have expandable storage but it holds enough books.
I use the Calibre ebook management program to manage my ebook library and transfer books
over to the Kindle when I want to.
Now what got me interested in hacking ebooks was the fact that the Kindle, wonderful as
it is, has one really serious flaw, which is that it is not able to do decent justified text,
and almost every ebook comes with text fully justified, so that in other words the left
and right margins are straight. While that looks wonderful in a printed book, it looks
awful on a Kindle because the Kindle is not able to break words in a sane way.
In fact it does not try to break words at all, and so a book that has justified text
ends up, when read on the Kindle, with all these giant spaces between words, which is extremely
annoying to me. So I decided I'm going to learn how to get into these ebooks and
fix that, so that every ebook I read has left justification instead of full justification.
Left alignment maybe I should say.
So the only margin I care about then is the left one.
Everything lines up on the left and the right is a ragged margin which I don't mind so
much.
Maybe it looks a little bit prettier if the right margin is all nice and straight but I
would prefer to have the ragged right margin and have equal spacing between words instead
of having both margins nice and straight but having really irregular wildly erratic spacing
between words.
Okay so my workflow when I get a new book I read a lot of books from Gutenberg.
I thankfully have a terrific appreciation of 19th century literature and that means that
I can get tons and tons of stuff to read for free from Project Gutenberg and I will have
a link in the notes for Gutenberg.
If you've never gone there then you should.
If you're a reader and you like public domain fiction Project Gutenberg is awesome.
And as a test case I'm going to use a book that I read recently from there called Washington
Square by the American author Henry James.
Now I normally will go right to the Gutenberg website and download the book and I'm actually
going to put a link to this book in the show notes as well.
And I download the ePub version of the book even though the Calibre ebook manager cannot,
sorry, the Kindle does not read ePub format.
The Kindle reads a different format, AZW3 or MOBI, either one of those.
I normally download the ePub anyway and then I work on it and convert it to the AZW3 format.
So I'm going to download the ePub file and I'm on Firefox on Linux.
Everything I'm doing is using the Linux versions of everything.
So I download it and it puts it into my downloads folder and then I go to my Calibre ebook
management program. That's Calibre, spelled C-A-L-I-B-R-E.
It looks like "Calibre", which would make sense.
I mean the word "libre" implies books, but whatever, I think it's supposed to be pronounced "caliber".
And I will have a link to the Calibre website also.
There are versions of Calibre for Linux, Windows and Mac, and I have used it on all three.
It works beautifully.
Calibre is a great tool for organizing your library and keeping track of everything.
You can add tags, you can sort things by title, author, date and so forth.
And you can use it to sideload books over to your reading device.
So far I've only used it with the Kindle and with the Nook Color, but for both of those
devices, as soon as I plug it in, it recognizes that a device has been attached and it will
load up the library on that device, and you can easily transfer books back and forth to
it.
So I've downloaded the Henry James book and it's in my downloads folder right now.
So what I need to do is add it to my Calibre library, and I will do that by clicking the
upper left-hand button in the Calibre interface that says Add books.
When I do that it opens up a file selector window and I'll go and find the file. In this
case it's pg2870.epub, and it is adding it to my library.
I used to have this. I actually deleted it from my library, and then it says it's already
here, so I'm just going to select add it anyway.
Not sure what's going to happen here. Okay, so it's in my library now.
And when I select it it shows a funny looking ebook reader device image over there on the
right hand side.
There are a few things that you can do with it.
One thing I like to do is go find a picture for the cover, because the Project Gutenberg books
do not come with cover images, they just have plain text. So often, if it's a
book I know I want to keep around, I will go and find a picture of some edition of that
book on an image search and then add it in the metadata editing window.
For now I'm just going to open up the book and start poking around with the style sheet
to see and you know to make the adjustments that I like to make.
The most important adjustments for me are the justification, changing it from full justification
to left, and also the line height. And if there has been any kind of indication about font
size, I remove that, at least from the body text of the book.
In general ebooks should be formatted as simply as possible so that they can just adapt naturally
to whatever ebook device is being used to view it.
Like in my own style sheets for ebooks I never indicate a specific font for the main
body text because I want to be able to use the embedded fonts or the built-in fonts
on my devices for that.
I think by specifying certain fonts you're kind of interfering with a user's ability
to choose what fonts he or she wants, and I'm all about choice.
So the style should be fairly simple and normally the books that I get from Project Gutenberg
are pretty good in that respect.
Sorry I just took a look at my recorder to make sure it was still recording.
One time I did this and I got finished talking half an hour later and realized that I had
not been recording so that's why I took a moment and looked there.
I'm going to open up Washington Square by right-clicking on it and choosing Edit book, and
it opens up the ebook editor that is part of Calibre.
When you open that up you can see a great big blank gray spot in the middle and then
a left hand file browser and then over on the right side there's a live preview area.
This one appears to be done in one giant HTML file.
Best practice would be for each chapter to have a separate HTML file and that's something
that will happen when I run the conversion to make an AZW3 here in a couple of minutes.
When I open it up, by the way, a little knowledge of HTML goes a long way in editing an ebook,
because ebooks are essentially HTML files that are packaged up in a certain way.
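
To illustrate the point that an ebook is essentially HTML packaged up in a certain way: an ePub is a ZIP archive, so Python's standard zipfile module can list what is inside one. The sketch below builds a tiny stand-in archive in memory rather than reading the real pg2870.epub, and the internal file names other than pgepub.css are placeholders, not the actual layout of the Gutenberg file.

```python
import io
import zipfile


def build_toy_epub() -> bytes:
    """Build a tiny EPUB-like ZIP archive in memory (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # An EPUB starts with an uncompressed "mimetype" entry.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        # Hypothetical internal layout, for illustration only.
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")
        archive.writestr("OEBPS/pgepub.css", "body { text-align: left; }")
        archive.writestr("OEBPS/part1.html", "<html><body><h3>I</h3></body></html>")
    return buffer.getvalue()


def list_epub_members(data: bytes) -> list[str]:
    """Return the names of the files packaged inside an EPUB archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


members = list_epub_members(build_toy_epub())
html_files = [name for name in members if name.endswith((".html", ".xhtml"))]
print(members)
print(html_files)
```

To inspect a real book, pass the bytes of the downloaded file instead, for example `list_epub_members(open("pg2870.epub", "rb").read())`.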
In this one it appears that every chapter heading is done with an H3, and I would prefer to
have it done with H2, because my conversion settings in Calibre are done so that whenever
it detects an H2, or heading level 2, it will insert a page break there to make sure that
the new chapter starts on a new page.
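
The host sets this up in the Calibre conversion dialog and does not say exactly which settings are used. As a rough command-line equivalent, Calibre ships an `ebook-convert` tool whose `--page-breaks-before` option takes an XPath expression. The sketch below only builds and prints such a command. The output file name is made up for illustration, and the XPath is an assumption about what "page break at every H2" would look like.

```python
import shlex


def build_convert_command(epub_path: str, azw3_path: str) -> list[str]:
    """Build an ebook-convert command that starts a new page at every h2."""
    return [
        "ebook-convert",
        epub_path,
        azw3_path,
        # Assumed XPath: match every h2 element, whatever its namespace.
        "--page-breaks-before",
        "//*[name()='h2']",
    ]


# "washington_square.azw3" is a hypothetical output name.
command = build_convert_command("pg2870.epub", "washington_square.azw3")
print(shlex.join(command))

# To actually run the conversion (requires Calibre to be installed):
# import subprocess
# subprocess.run(command, check=True)
```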
The first thing I'm going to do, now that I have opened this up and I'm looking at it, is
change all of the H3s to H2s. What I did
first was, under the Text area in the left-hand file browser, I selected the second of the two HTML
files.
The first one normally is just some random front matter.
The second one in this case is where the whole book is, and so, actually, you know what, it
looks like I was wrong about that, I'm sorry. They've got two HTML files.
The first one has maybe the first half of the book and the second one has the second
half, and as I look through it I see a few things that I want to change.
First of all it does not have any indentation of paragraphs.
This one is basically done like it would be if you were going to read it on the web rather
than as a book so it has a good bit of space between every paragraph and no indentation.
What I want to do is remove most of the space between the paragraphs and then do a first
line indent on all of those.
And as I mentioned, the chapter headings are done as heading level three and I want to change
those to heading two.
So underneath the source code there's a little search and replace thing, or if you don't
see that you can do Ctrl+F and it will appear. Ctrl+F for find.
So I'm going to find H3 and I'm going to replace it with H2. There are a couple of
options here. There's a mode, and I'm going to use normal mode. You can also use regex mode,
which allows you to use regular expressions. I'm going to have it search through all
text files. You can also search through just the current file, or all of the style files,
or whatever. I'm going to use all the text files. In the find field I put H3
and in the replace field H2, and I'm going to click Replace All. It did 68 replacements, so that
looks like there are 34 chapters, since it does an opening and closing tag for each chapter.
So now all of the headers are H2 and that's what I want.
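
The same replacement could be scripted outside the editor. This is a minimal Python sketch, not what the Calibre editor does internally: it uses a regular expression restricted to tag names, so that the characters "h3" appearing anywhere else in the text are left alone, and it reports the number of replacements the way Replace All does. The sample markup is invented for the example.

```python
import re

# Match the start of an opening or closing h3 tag, e.g. "<h3" or "</h3".
H3_TAG = re.compile(r"<(/?)h3\b", flags=re.IGNORECASE)


def promote_h3_to_h2(html: str) -> tuple[str, int]:
    """Rename every h3 tag to h2; return the new markup and the tag count."""
    return H3_TAG.subn(r"<\1h2", html)


# Invented sample markup standing in for the book's HTML.
sample = '<h3 id="c1">CHAPTER I</h3>\n<p>Text.</p>\n<h3>CHAPTER II</h3>\n'
updated, count = promote_h3_to_h2(sample)
print(count)
print(updated)
```

Each heading contributes two replacements, one for the opening tag and one for the closing tag.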
Now let's look at the style sheet.
The style sheet will be in the file browser on the left-hand side. This one is called
pgepub.css, which I assume stands for Project Gutenberg ePub.css. I'm going to select it
and then press Enter, and I can see the style settings that they have here.
This is a very, very simple style sheet, which in general I like. I appreciate that.
I don't like it when they get too fancy.
It has a few settings for body, it has a couple of settings for H2, oddly, because it didn't
have any H2s in the whole thing, it only had H3, and then it has a couple of settings
for the Project Gutenberg disclaimers and various things.
So the first thing I'm going to do is delete all of this, select all and backspace, because
I have my own basic ebook style sheet that I always start with. I call it basic ebook.css.
I'm going to copy and paste my style sheet into the little style sheet source code
window, and I have a link in the show notes to my Pastebin site where I put the style sheet.
Now suddenly everything is different.
The line height is set at 1.25em.
I set the margins to have 0.1em above and 0.1em below on each paragraph, and then I set
the text indent at 1em. Em is a unit of measurement that's used in CSS.
You could also use pixels as a unit of measurement but I normally use either em or a percentage.
So now I also have in my style sheet a setting for H2 and H1.
This is one place where I do sometimes change the font family. I changed it to sans,
and that's certainly not necessary but I like to do it for my own ebooks.
If I were publishing this I probably would not do that.
I would leave it undefined and let people's ebook readers determine what font is shown there.
For my headings I also have a good bit of margin below and that allows it to have a little
bit of separation between the text of the paragraph and the chapter heading.
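
The host's actual style sheet is linked from the show notes and is not reproduced in this transcript, so the following is only a reconstruction from the values mentioned in the episode: a 1.25em line height, 0.1em paragraph margins, a 1em text indent, left alignment in the body rule, and sans headings with extra space below. Which selector carries each property, and the 1.5em heading margin, are assumptions. The sketch keeps the rules in a Python dictionary and renders them to CSS text that could be pasted into the editor.

```python
# Values come from the episode; selector placement is a guess.
BASIC_RULES = {
    "body": {"text-align": "left"},
    "p": {
        "line-height": "1.25em",
        "margin-top": "0.1em",
        "margin-bottom": "0.1em",
        "text-indent": "1em",
    },
    # The 1.5em bottom margin is an assumed value ("a good bit of margin below").
    "h1, h2": {"font-family": "sans-serif", "margin-bottom": "1.5em"},
}


def render_css(rules: dict[str, dict[str, str]]) -> str:
    """Render a {selector: {property: value}} mapping as CSS text."""
    blocks = []
    for selector, declarations in rules.items():
        body = "\n".join(
            f"  {prop}: {value};" for prop, value in declarations.items()
        )
        blocks.append(f"{selector} {{\n{body}\n}}")
    return "\n\n".join(blocks) + "\n"


print(render_css(BASIC_RULES))
```

No font family is set for the body text, which matches the host's preference for leaving that choice to the reading device.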
What other settings do I have?
So right now all of the paragraphs have a first line indent of 1em.
Now that's not ideal because in normal books, you may never have noticed this, but the first
paragraph of a chapter normally is not indented and then all subsequent paragraphs are.
So what I'm going to do is fix this so that every
first paragraph of a chapter will not have an indent. What I do is
look for the closing header 2 tag, so it's less than, slash, H2, greater than, followed by a
new line, followed by less than, P. So that would be the closing H2 tag followed by a blank
line followed by the opening paragraph tag.
I'm going to search for that by pressing Ctrl+F, and that string automatically appears
in the find field.
Actually I'm going to copy it too, and then in the replace field I'm going to replace it
with the same thing except add a class to it, and that is my class equals no indent.
I have a class in my style sheet called no indent which has a first line indent of 0.
I'm going to click Replace All, and it did 35 replacements, so that should be correct. Now when
I go through there, the first paragraph of each chapter has no indent and then every
subsequent paragraph is indented 1em.
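
As a scripted version of that find and replace, here is a small Python sketch. The class name "noindent" is a guess at how the host spells it, and the matching style rule would be something like `p.noindent { text-indent: 0; }`. The pattern accepts any whitespace between the closing H2 tag and the paragraph tag, which is slightly more permissive than the literal string searched for in the episode.

```python
import re

# A closing h2 tag, optional whitespace or blank lines, then a bare <p> tag.
FIRST_PARAGRAPH = re.compile(r"(</h2>\s*)<p>")


def mark_first_paragraphs(html: str, class_name: str = "noindent") -> tuple[str, int]:
    """Add a class to the first paragraph after each h2; return markup and count."""
    return FIRST_PARAGRAPH.subn(rf'\1<p class="{class_name}">', html)


# Invented sample markup standing in for the book's HTML.
sample = (
    "<h2>I</h2>\n\n<p>First paragraph.</p>\n<p>Second paragraph.</p>\n"
    "<h2>II</h2>\n<p>Another first paragraph.</p>\n"
)
updated, count = mark_first_paragraphs(sample)
print(count)
print(updated)
```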
So part of my style sheet is to align everything on the left, and I do that in the body part
of the style sheet. What else?
If you want to get really fancy with this, if it's a favorite book or one that you are
going to want to share with other people or something, and you want to make it look really
nice, you can do a drop cap, which is something I think I did when I was reading this book the
first time. I'm not looking at my own copy of this right now, I'm looking at one that I'm
doing on the fly for this podcast. A drop cap is when the very first letter of a chapter
is big enough to span about two or three lines vertically, and the way you
do that is to go into the source code and find the first letter of the paragraph there.
In this case it says "When the child was about ten years old", and so on the word "When" I can
select the W, and then there's a little tool here, actually I
can't use that. What you have to do is put span tags around that W, so an opening span tag and then after
the W a closing span tag, and then you have to give that letter a class.
I call it the drop cap class, I think. Yeah, in my style sheet I have a dot drop cap. So my drop cap
class will tell that letter to float left, it has a font size of 2.8em, and then it sets a couple of
margin settings. When you do that, that one letter is going to be much bigger than all the
others and it will span a couple of lines. It looks kind of nice, it makes it look a little bit
more like a real book.
One more thing I typically do with Project Gutenberg books is to smarten
up the punctuation, because they use all straight quotes and straight single quotes, and I like
the look of the smart quotes. They have a little tool called Smarten Punctuation. If you look
at your set of buttons across the top there, one of them has a pair of right-hand quotes, and if you
hover over it, it says Smarten Punctuation. So I'm going to click that now and it will turn all of
those straight quotes into smart quotes, and it will also take things like double hyphens and make
em dashes out of them, so it's a nice touch.
When you're done with these things, or
whatever else you want to do, you want to save the file by doing Ctrl+S, and at that point you
can exit out of the ebook editor and transfer the book over to your reading device or email it
to yourself or something like that.
Now this one is still an ePub, and I would convert it over to
AZW3 to be able to read on my Kindle. That might be material for another episode: how to
optimize an ebook in the conversion process. What essentially will happen is when I convert this
it will chop those two giant HTML files up into probably 35 HTML files, one for each chapter, plus
some front matter and so forth, and that way it will always have a new page for each new chapter.
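
The two finishing touches described above, the drop cap and the smartened punctuation, could be sketched in Python as follows. None of this is Calibre's own code. The class name "dropcap" and the margin values are guesses, since the episode only gives the float and the 2.8em font size, and the punctuation function is a deliberately naive stand-in for the editor's Smarten Punctuation tool that is only safe on plain text, not on markup containing quoted attributes.

```python
import re

# Float and font size are from the episode; the margin values are assumed.
DROP_CAP_CSS = """\
.dropcap {
  float: left;
  font-size: 2.8em;
  margin: 0 0.1em 0 0;
}
"""

# First letter of the first paragraph after a closing h2 tag.
FIRST_LETTER = re.compile(r"(</h2>\s*<p[^>]*>)([A-Za-z])")


def add_drop_caps(html: str, class_name: str = "dropcap") -> str:
    """Wrap the first letter of each chapter in a span with the given class."""
    return FIRST_LETTER.sub(rf'\1<span class="{class_name}">\2</span>', html)


def smarten(text: str) -> str:
    """Naively convert straight quotes and double hyphens in plain text."""
    text = text.replace("--", "\u2014")  # double hyphen to em dash
    # A quote at the start, or after whitespace, a bracket or a dash, opens.
    text = re.sub(r'(^|[\s(\[\u2014])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")  # remaining double quotes close
    text = re.sub(r"(^|[\s(\[\u2014\u201c])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")  # closing quotes and apostrophes
    return text


sample = '<h2>II</h2>\n<p class="noindent">When the child was about ten years old</p>'
print(add_drop_caps(sample))
print(smarten("\"Well,\" she said--'it's late.'"))
```

The host adds the span by hand for a single chapter. Applying it after every H2, as this sketch does, is an extension of that idea rather than something shown in the episode.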
Anyway, I hope you guys have enjoyed that. All of this relates to editing books that are not
covered by DRM. Now, you can open up books with DRM on them if you've got certain plugins installed.
I'm not going to go into how to do that, but there is ample information online on how to make
Calibre do that. I've done it on my laptop because even with books that I buy that are published with DRM,
I don't want to have them fully justified. I want the left justification, so I fix it.
So anyway, hope you've enjoyed that. Go grab yourself an ebook, hack it, and then read it. It's fun. Talk to
you all later. Bye.
You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast
network that releases shows every weekday, Monday through Friday. Today's show, like all our shows,
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was
founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary
revolution at binrev.com. If you have comments on today's show, please email the host directly, leave
a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's
show is released under the Creative Commons Attribution-ShareAlike 3.0 license.