Episode: 1657
Title: HPR1657: Hacking Gutenberg eBooks
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1657/hpr1657.mp3
Transcribed: 2025-10-18 06:28:23

---

It's Tuesday 9th of December 2014. This is HPR Episode 1657 entitled Hacking Gutenberg eBooks. It is hosted by John Kulp and is about 27 minutes long. Feedback can be sent to JohnlandChickelp at mail.com or by leaving a comment on this episode. The summary is: I talk about ebook formatting and how to customize an ebook from Project Gutenberg.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Get your web hosting that's honest and fair at AnHonestHost.com.

Hey everybody, John Kulp in Lafayette, Louisiana here, and it's been quite a long time since I recorded an episode for HPR. I went back and looked and it was in May, and so it's high time that I did another one, especially since apparently shows are running short. So I'm going to talk for a few minutes today about something that's really interested me a lot lately, and that is ebooks.

Now, I've been a book lover for most of my life, and in fact there was quite a while when I was in my 20s when I collected rare books, and I really prize the book as an artifact. However, in the last couple of years I've really grown to love ebooks almost as much as, if not more than, regular books. Part of this is the convenience, and part of it is the fact that they are so much more accessible than physical books in terms of things like font size and cross-platform availability, and also accessible in my pocket. I mean, with ebooks, everywhere I go I have a book that I can read if I get bored. It's on my phone, it's on my laptop, it's on my tablet. And the thing that really got me interested in ebook formatting was my purchase, at the end of the spring semester I think last year, of a Kindle. And a Kindle is a wonderful device.
It's not the only really good ebook reader, but it's the only one I have. Well, my kids have the Nook Color, which to me is not as good an ebook reading experience. The thing that's great about the Kindle is the E Ink technology, which is a really wonderful looking, I don't even know what to call it, but it's a way of displaying text on a screen that is not using a glowing screen. When you use an E Ink device you can take it right out in direct sunlight and see it perfectly. In fact, you can see it better in direct sunlight than you can in the dark, which is exactly the opposite of a smartphone or a tablet, which you cannot possibly read if you're out in the sun.

The Kindle that I got is the Kindle Paperwhite, and it's got built-in LED backlighting if you have to read in low lighting situations, and most of the time I keep those lights on. The battery life is incredible, it'll last a long time. It does not have expandable storage, but it holds enough books. I use the calibre ebook management program to manage my ebook library and transfer books over to the Kindle when I want to.

Now, what got me interested in hacking ebooks was the fact that the Kindle, wonderful as it is, has one really serious flaw, which is that it is not able to do decent justified text. Almost every ebook comes with text fully justified, so that, in other words, the left and right margins are straight. While that looks wonderful in a printed book, it looks awful on a Kindle, because the Kindle is not able to break words in a sane way. In fact, it does not try to break words at all, and so a book that has justified text ends up, on the Kindle, with all these giant spaces between words, which is extremely annoying to me. And so I decided I'm going to learn how to get into these ebooks and fix that, so that every ebook I read has left justification instead of full justification. Left alignment, maybe I should say. So the only margin I care about then is the left one.
Everything lines up on the left, and the right is a ragged margin, which I don't mind so much. Maybe it looks a little bit prettier if the right margin is all nice and straight, but I would prefer to have the ragged right margin and equal spacing between words instead of having both margins nice and straight but really irregular, wildly erratic spacing between words.

Okay, so my workflow when I get a new book. I read a lot of books from Gutenberg. I thankfully have a terrific appreciation of 19th century literature, and that means that I can get tons and tons of stuff to read for free from Project Gutenberg, and I will have a link in the notes for Gutenberg. If you've never gone there then you should. If you're a reader and you like public domain fiction, Project Gutenberg is awesome. And as a test case I'm going to use a book that I read recently from there called Washington Square, by the American author Henry James.

Now, I normally will go right to the Gutenberg website and download the book, and I'm actually going to put a link to this book in the show notes as well. And I download the ePub version of the book even though the calibre ebook manager cannot, sorry, the Kindle does not read ePub format. The Kindle reads a different format, AZW3 or MOBI, either one of those. I normally download the ePub anyway and then I work on it and convert it to the AZW3 format. So I'm going to download the ePub file, and I'm on Firefox on Linux. Everything I'm doing is using the Linux versions of everything.

So I download it and it puts it into my downloads folder, and then I go to my calibre ebook management program. That's "caliber" spelled C-A-L-I-B-R-E. It looks like "ca-libre," which would make sense, I mean the word "libre" implies books, but whatever, I think it's supposed to be pronounced "caliber." And I will have a link to the calibre website also. There are versions of calibre for Linux, Windows and Mac, and I have used it on all three. Works beautifully.
calibre is a great tool for organizing your library and keeping track of everything. You can add tags, you can sort things by title, author, date and so forth. And you can use it to sideload books over to your reading device. So far I've only used it with the Kindle and with the Nook Color, but for both of those devices, as soon as I plug it in it recognizes that a device has been attached and it will load up the library on that device, and you can easily transfer books back and forth to it.

So I've downloaded the Henry James book and it's in my downloads folder right now. What I need to do is add it to my calibre library, and I will do that by clicking the upper left-hand button in the calibre interface that says "Add books." When I do that it opens up a file selector window, and I'll go and find the file, in this case it's pg2870.epub, and it is adding it to my library. I used to have this; I actually deleted it from my library, and then it says it's already here, so I'm just going to select "add it anyway." Not sure what's going to happen here. Okay, so it's in my library now. And when I select it, it shows a funny-looking ebook reader device image over there on the right-hand side.

There are a few things that you can do with it. One thing I like to do is go find a picture for the cover, because the Project Gutenberg books do not come with cover images, they just have plain text. So often, if it's a book I know I want to keep around, I will go and find a picture of some edition of that book on an image search and then add it in the metadata editing window.

For now I'm just going to open up the book and start poking around with the style sheet to make the adjustments that I like to make. The most important adjustments for me are the justification, changing it from full justification to left, and also the line height. And if there has been any kind of indication about font size, I remove that, at least from the body text of the book.
In general, ebooks should be formatted as simply as possible so that they can just adapt naturally to whatever ebook device is being used to view them. In my own style sheets for ebooks I never indicate a specific font for the main body text, because I want to be able to use the embedded fonts or the built-in fonts on my devices for that. I think by specifying certain fonts you're kind of interfering with a user's ability to choose what fonts he or she wants, and I'm all about choice. So the style should be fairly simple, and normally the books that I get from Project Gutenberg are pretty good in that respect.

Sorry, I just took a look at my recorder to make sure it was still recording. One time I did this and I got finished talking half an hour later and realized that I had not been recording, so that's why I took a moment and looked there.

I'm going to open up Washington Square by right-clicking on it and choosing "Edit book," and it opens up the ebook editor that is part of calibre. When you open that up you can see a great big blank gray spot in the middle, then a left-hand file browser, and over on the right side there's a live preview area. This one appears to be done in one giant HTML file. Best practice would be for each chapter to have a separate HTML file, and that's something that will happen when I run the conversion to make an AZW3 here in a couple of minutes.

By the way, a little knowledge of HTML goes a long way in editing an ebook, because ebooks are essentially HTML files that are packaged up in a certain way. In this one it appears that every chapter heading is done with an H3, and I would prefer to have it done with H2, because my conversion settings in calibre are done so that whenever it detects an H2, or heading level 2, it will insert a page break there to make sure that the new chapter starts on a new page.
The first thing I'm going to do, now that I have opened this up and I'm looking at it, is change all of the H3s to H2s. What I did first was, under the text area in the left-hand file browser, I selected the second of the two HTML files. The first one normally is just some random front matter. The second one in this case is where the whole book is. And so, actually, you know what, it looks like I was wrong about that, I'm sorry. They've got two HTML files. The first one has maybe the first half of the book and the second one has the second half.

As I look through it I see a few things that I want to change. First of all, it does not have any indentation of paragraphs. This one is basically done like it would be if you were going to read it on the web rather than as a book, so it has a good bit of space between every paragraph and no indentation. What I want to do is remove most of the space between the paragraphs and then do a first-line indent on all of those. And as I mentioned, the chapter headings are done as heading level three and I want to change those to heading two.

Underneath the source code there's a little search and replace thing, or if you don't see that you can do Ctrl+F and it will appear, Ctrl+F for find. So I'm going to find H3 and I'm going to replace it with H2. There are a couple of options here. There's a mode: I'm going to use normal mode, but you can also use regex mode, which allows you to use regular expressions. And I'm going to have it search through all text files; you can also search through just the current file, or all of the style files, or whatever. I'm going to use all the text files. In the find field I put H3 and in the replace field H2, and I'm going to click "Replace all," and it did it 68 times. So that looks like there are 34 chapters; it does an opening and a closing tag for each chapter. So now all of the headers are H2 and that's what I want. Now let's look at the style sheet.
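For anyone who would rather script the H3-to-H2 step than click through the editor, here is a minimal Python sketch of the same idea. It is not how calibre does it internally; the function name and the sample markup are made up for illustration, and it matches only actual h3 tags rather than every occurrence of the text "h3".

```python
import re

# Matches the start of an opening or closing h3 tag: "<h3" or "</h3".
H3_TAG = re.compile(r"<(/?)h3\b", flags=re.IGNORECASE)


def promote_headings(html: str) -> tuple[str, int]:
    """Rewrite h3 tags as h2 tags; return the new text and the tag count."""
    return H3_TAG.subn(r"<\1h2", html)


# Hypothetical sample, not copied from the Gutenberg file.
sample = '<h3 id="c1">CHAPTER I</h3>\n\n<p>Text.</p>\n<h3>CHAPTER II</h3>'
converted, count = promote_headings(sample)
print(count)      # opening and closing tags are counted separately
print(converted)
```

As in the editor, each heading accounts for two replacements, one for the opening tag and one for the closing tag.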
The style sheet will be in the file browser on the left-hand side. This one is called pgepub.css; that would stand for, I assume, Project Gutenberg ePub.css. I'm going to select it and then press Enter, and I can see the style settings that they have here. This is a very, very simple style sheet, which in general I like. I appreciate that; I don't like it when they get too fancy. It has a few settings for body, it has a couple of settings for H2, oddly, because it didn't have any H2s in the whole thing, it only had H3, and then it has a couple of settings for the Project Gutenberg disclaimers and various things.

So the first thing I'm going to do is delete all of this, select all and backspace, because I have my own basic ebook style sheet that I always start with. I call it basic ebook.css. I'm going to copy and paste my style sheet into the little style sheet source code window, and I have a link in the show notes to my Pastebin site where I put the style sheet.

Now suddenly everything is different. The line height is set at 1.25em. I set the margins to have 0.1em above and 0.1em below on each paragraph, and then I set the text indent at 1em. Em is a unit of measurement that's used in CSS. You could also use pixels as a unit of measurement, but I normally use either em or a percentage.

I also have in my style sheet a setting for H2 and H1. This is one place where I do sometimes change the font family. I changed it to sans, and that's certainly not necessary, but I like to do it for my own ebooks. If I were publishing this I probably would not do that. I would leave it undefined and let people's ebook readers determine what font is shown there. For my headings I also have a good bit of margin below, and that allows a little bit of separation between the text of the paragraph and the chapter heading.

What other settings do I have? So right now all of the paragraphs have a first-line indent of 1em.
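The actual style sheet lives on the Pastebin page linked in the show notes, so the following is only a rough reconstruction of the settings mentioned so far, kept as a Python string so it can be pasted or written out from a script. The numeric values for line height, paragraph margins and text indent come from the talk; the choice of selectors, the left alignment on body, and the heading margin value are assumptions.

```python
# Hypothetical reconstruction, NOT the host's real style sheet.
# Values named in the talk: 1.25em line height, 0.1em paragraph margins,
# 1em text indent, sans-serif headings. Everything else is assumed.
BASIC_EBOOK_CSS = """\
body {
    text-align: left;      /* ragged right instead of full justification */
    line-height: 1.25em;
}

p {
    margin-top: 0.1em;
    margin-bottom: 0.1em;
    text-indent: 1em;
}

h1, h2 {
    font-family: sans-serif;
    margin-bottom: 1.5em;  /* assumed; the talk only says "a good bit" */
}
"""

print(BASIC_EBOOK_CSS)
```

Note that no font family is set for body text, which matches the point about leaving the reading font up to the device.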
Now, that's not ideal, because in normal books, you may never have noticed this, but the first paragraph of a chapter normally is not indented and then all subsequent paragraphs are. There's a way to fix this so that every first paragraph of a chapter will not have an indent. What I do is look for the closing header 2 tag, so it's less-than, slash, H2, greater-than, followed by a new line, followed by less-than P. So that would be the closing H2 tag, followed by a blank line, followed by the opening paragraph tag. I'm going to search for that by pressing Ctrl+F, and that string automatically appears in the find field. Actually, I'm going to copy it too, and then in the replace field I'm going to replace it with the same thing except add a class to it, and that is my class equals "no indent." I have a class in my style sheet called "no indent" which has a first-line indent of 0. I'm going to click "Replace all," and it did it 35 times, so that should be correct. And now when I go through there, the first paragraph of each chapter has no indent and then every subsequent paragraph is indented 1em.

Part of my style sheet is to align everything on the left, and I do that in the body part of the style sheet. What else?

If you want to get really fancy with this, if it's a favorite book or one that you are going to want to share with other people or something, and you want to make it look really nice, you can do a drop cap, which is something I think I did when I was reading this book the first time. I'm not looking at my own copy of this right now, I'm looking at one that I'm doing on the fly for this podcast. A drop cap is where the very first letter of a chapter will sometimes be big enough to span about two or three lines vertically, and the way you do that is to go into the source code and find the first letter of the paragraph there.
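Before getting into the drop cap details, the first-paragraph fix described a moment ago can be sketched in Python. This is only an illustration of the search and replace, not what the calibre editor runs; the class name `noindent` is an assumed spelling of the "no indent" class, and the sample markup is invented.

```python
import re

# A closing </h2>, then any whitespace (including the blank line),
# then a bare opening <p> tag. The whitespace is captured so it is kept.
FIRST_PARAGRAPH = re.compile(r"(</h2>\s*)<p>")


def mark_first_paragraphs(html: str) -> tuple[str, int]:
    """Add class="noindent" to the paragraph that follows each h2."""
    return FIRST_PARAGRAPH.subn(r'\1<p class="noindent">', html)


# Hypothetical sample, not copied from the Gutenberg file.
sample = (
    "<h2>CHAPTER I</h2>\n\n"
    "<p>First paragraph.</p>\n\n"
    "<p>Second paragraph.</p>"
)
marked, count = mark_first_paragraphs(sample)
print(count)
print(marked)
```

The style sheet then needs a matching rule, something like `.noindent { text-indent: 0; }`, for the class to have any effect.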
In this case it says "When the child was about ten years old," and so on. On the word "When," I can select the W, and then there's a little tool here... actually, I can't use that. What you have to do is put span tags around that W, so span, and then after the W put a closing span tag, and then you have to give that letter a class. I call it the drop cap class, I think. Yeah, in my style sheet I have a dot drop cap. My drop cap class will tell that letter to float left, I have a font size of 2.8em, and then it sets a couple of margin settings. When you do that, that one letter is going to be much bigger than all the others and it will span a couple of lines, and it looks kind of nice. It makes it look a little bit more like a real book.

One more thing I typically do with Project Gutenberg books is to smarten up the punctuation, because they use all straight quotes and straight single quotes, and I like the look of the smart quotes. They have a little tool called "Smarten punctuation." If you look at your set of buttons across the top there, one of them has a pair of right-hand quotes, and if you hover over it, it says "Smarten punctuation." So I'm going to click that now, and it will turn all of those straight quotes into smart quotes, and it will also take things like double hyphens and make em dashes out of them. It's a nice touch.

When you're done with these things, or whatever else you want to do, you want to save the file by doing Ctrl+S, and at that point you can exit out of the ebook editor and transfer the book over to your reading device, or email it to yourself, or something like that. Now, this one is still an ePub, and I would convert it over to AZW3 to be able to read it on my Kindle. That might be information for another episode, how to optimize an ebook in the conversion process. What essentially will happen is that when I convert this, it will chop those two giant HTML files up into probably 35 HTML files, one for each chapter, plus some front matter and so forth, and that way it will always have a new page for each new chapter.

Anyway, hope you guys have enjoyed that. All of this relates to editing books that are not covered by DRM. Now, you can open up books with DRM on them if you've got certain plugins installed. I'm not going to go into how to do that, but there is ample information online on how to make calibre do that. I've done it on my laptop, because even with books that I buy that are published and have DRM, I don't want to have them fully justified, I want the left justification, so I fix it. So anyway, hope you've enjoyed that. Go grab yourself an ebook, hack it, and then read it. It's fun. Talk to y'all later. Bye.

You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the Binary Revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.