Episode: 3384
Title: HPR3384: Page Numbers in EPUB eBook Files
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3384/hpr3384.mp3
Transcribed: 2025-10-24 22:28:54
---
This is Hacker Public Radio Episode 3384 for Thursday, the 22nd of July 2021.
Today's show is entitled Page Numbers in EPUB eBook Files.
It is hosted by John Culp and is about 28 minutes long and carries a clean flag.
The summary is: in response to HPR 3367, I describe how to specify page numbers in an EPUB ebook.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hey everybody, this is John Culp in Lafayette, Louisiana.
Yes, I'm still alive.
It's been a long time since I recorded an episode.
Although I do think I have one in calendar year 2021.
It seems like I recorded one right at the beginning of the year about something or other.
I don't remember what it was. But anyway, like the last time,
I've been away from HPR for a pretty good while, not only as a contributor but also as a listener,
sadly. I just haven't had time to listen every day the way I probably should.
But I did check back in a couple days ago and saw that there was an episode
by Dave Morris and Andrew Conway where they talked about e-books.
And of course, I'm a degenerate for e-books and e-book readers and formatting.
And I love all that stuff.
I totally geek out on it, so I had to listen to that one anyway.
And they brought up some things in there that I thought it would be worth exploring.
And then based on what I found, I decided to go ahead and record a response episode.
So what I'm doing here is responding to HPR 3367, which I definitely encourage everyone to go
listen to if they haven't already. And in that one, Andrew was talking about his process for
creating e-books. And Dave apparently is going to be having a follow-up episode where he talks
about his process. And so I'm always interested in how people create these things, the tools they
use and all that kind of stuff. So first of all, thanks guys for the name check on me.
Andrew, I think, said that he was somewhat inspired by the episode HPR 1512 that I did a few years
ago about creating a digital edition of an old counterpoint textbook. And so thanks for that.
I haven't done that kind of work in a while. Most of the e-book work I do lately is just
fixing whatever e-books I purchase to read or e-books that I get from Project Gutenberg for free
or something like that. I like to do a little bit of tweaking and reformatting to suit my
preferences. And also I have to convert them to EPUB format. If they're in Kindle format,
I have to convert them to EPUB before I can put them on my wonderful Kobo Aura ONE e-book reader.
But most of the time it's just doing a couple of tweaks to the CSS. I haven't really gotten into
the nitty gritty of the e-book internals in a while. So it was kind of fun for this topic to come
up and have an excuse to get back in there and poke around. The issue that I really wanted to focus
on was that of page numbers. Now if I recall right, it came up when Andrew and Dave were discussing
the notion of an index. An index in an e-book is something... I'm not sure I've really seen it before.
I haven't seen a lot of academic titles in... Actually I should look. I have a couple of like
leadership books and stuff like that that I've read. I might check it at the very back and see
if they've got indices in them. But I... they definitely have bibliographies. But I don't know that
they have indices the way a paper book would have. But in academic books and technical books,
it's super important to have a good index in there to make the book much more useful. And of course
in e-books, the index becomes a little bit less necessary than in a paper book. Because of course
you can quickly search through the text of an e-book and find what you need. You can find all the
instances of a person's name or a topic that you're looking for or a term or something like that.
And it's not hard at all. But what you can't do with an e-book that has no index is
just browse the index, which, for an academic, is one of the things they do. The first time I
get a new book, I'll kind of flip right back to the bibliography to see what sources they used
and also check through the index to see what topics they cover. And this might sound kind of
weird to people who just read books for pleasure. But I assure you in scientific fields and academia
and stuff, it's perfectly normal to jump right to the index and start looking around. And so I can
certainly understand Andrew's concern in making an index for the e-book that has a certain
functionality like going, you want to be able to have your list of search terms and be able to
tap on something and have it go right there in the book and having good page numbers that refer
specifically to the places in the original paper version would be kind of important. And also
in academia, it's important when we are doing research ourselves, writing papers and books and
we always have to cite our sources. And part of that is not only saying what book you got
it from, but what page it was on in that book. And with e-books, it kind of throws this into confusion.
And so of course it'd be wonderful if there were a predictable, reliable way to have the same
pagination in an e-book that you do in a paper book. And by that I don't mean that I want every
page to look the same. I mean to me, it's critical that an e-book be able to flow
to fit the screen that you're looking at. So when I'm reading an e-book on my phone, which has
what, a six inch screen or something, or on my Kobo Aura ONE that has about a seven or eight inch
screen, or my iPad with a ten inch screen, or my shiny new Kobo Mini with a five inch screen,
that book should reflow to fit all those screens. And I should be able to reliably change the
font size and have it still fill up the screen just fine and not end up, you know, reading
really tiny words to try to fit all the, you know, what I don't want is an image of every page,
right? The text needs to flow. But we also, in academia, we kind of need to know where the page
numbers fall. So all of this is to say, I perfectly understand the issue that Andrew and Dave
were talking about. And it's something that has concerned me a little bit,
but I've never really tried to follow up on it. Incidentally, Dave mentioned that his son
told him about indexing using LaTeX, and I can confirm it's very easy to make an index in
LaTeX. But of course, LaTeX is something that's normally meant to end up with a print product,
or I mean a PDF, but to me, a PDF is barely better than paper because it's completely inflexible.
It doesn't have the reflowing capability that a true ebook format does. But it is very
easy to make an index. And I remember because several years ago, I made a cookbook for my wife
of all of her favorite recipes so that she'd have them in one place in a book and I actually printed
it out. But I made an index for it, so that in the index it has the names of all of the
recipes, but also names of certain kinds of ingredients, so that you could look at an ingredient and see
which recipes it shows up in and that kind of thing. But anyway, whenever you
have a word in the text that you want to appear in the index, you just tag it with a certain
thing and then you run an indexing command and, voilà, it generates it for you. It's wonderful.
No such thing exists for EPUBs.
So after listening to their episode, I decided I wanted to try to figure this out because I thought
I remembered hearing at some point or reading somewhere that, as part of the EPUB
3 specification, there was support for page numbers. In other words, for publishers to put
in there the actual page numbers that correspond to the paper versions of their books. And so I did
some reading and found that yes, that's true. And there was some limited support under EPUB 2,
but I couldn't make it work under EPUB 2. And I mean, to be honest, I didn't really
know much about the difference between EPUB 2 and EPUB 3, but essentially all of the EPUB
files that I've got in my library, and there are thousands, are in EPUB 2 format
unless I'm unaware of it. But the main difference is in the navigation file.
But there's a way to convert your EPUB 2 book into an EPUB 3, and that's the first step in
putting page numbers into your e-book. And so I did that to one of my, there's a reading that I
like to have my music history students do like a 19th century German critic writing about the
music of Beethoven. And it's only about six pages long. And so I decided, well I'm going to start
with a short reading like this that comes from an academic book where I kind of do want them to
have the page numbers handy. And these are page, you know, it's only six pages long, but the page
numbers in the paper copy are like 776 to 782 or something like that. And so of course when you open
it up in an e-book reader it's going to display page numbers like one, two, three, four, five and six
instead of the page numbers that are actually in the 700s. So I thought that would be a pretty good
proof of concept thing. So the first thing to do was to figure out how to convert it into EPUB 3.
And what I ended up using was calibre. Calibre is what I use for management of my entire e-book
collection, but also for editing EPUB files. And before I could use it, though, I had to uninstall
the repository version of calibre. I'm on Ubuntu 16.04 and, you know, don't @ me. I know it's an
old version, but it's the one that still has compatibility with Blather speech recognition,
which is really important for me. So I have not upgraded. So I uninstalled calibre from the
repository and then just downloaded the latest version from the calibre website. I think it...
well, let's see what version this is. 5.23. This is calibre 5.23 that I've got here.
And in the newest version, I think even after version 4 point something, he has a way in there
to very easily convert from EPUB 2 to EPUB 3. And so what you do is you open up
whatever book it is that you want to work on. So I have here selected one of my books and I just press
T, or you can right click and choose edit e-book. And then once it's open in the editor,
you go to the Tools menu. That's the third one from the upper left. And the very bottom item on the
menu says Upgrade book internals. Now that's not the most discoverable EPUB 2 to EPUB 3
conversion, but that was actually the first one I tried and it did it just fine.
So what it does is it creates a different kind of navigation file. The default navigation file in
EPUB 2 is called toc.ncx. So, NCX. And it's an XML file. And it's kind of
cumbersome and difficult to navigate and understand. And when you upgrade to EPUB 3,
what you get is a new file called nav.xhtml, which is much easier to read, for me anyway.
It's a lot less cluttered and easier to work with. And so anyway, once you've done that,
you've got one of the key pieces in place. You've got your book upgraded to EPUB 3 and it's
ready for you to start inserting pages. Now, after you do that, you've got to insert page anchors:
you just put an anchor everywhere that you want a page break to be and you tell it
what page number it should be. Now, for some of the books that I've either edited or recreated
or whatever, I already had a rudimentary form of this. Like in the one that I was working on
for my music history students, I had already put right in line visible in the text,
just page numbers in square brackets. So they'd be reading right along and in the middle of a
sentence, it would say 777 for a new page number in square brackets, which is not very elegant,
but it did tell them what page they were on. And so that made it easy for me to go through and find
first of all where the page breaks were and then what page number to assign to those.
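For readers following in text, here is a minimal Python sketch of this step. It swaps visible bracketed markers such as [777] for empty page-break anchors, using the span format that the episode goes on to spell out (an epub:type, an id, and a title). The function names are hypothetical and this is not the host's own script. It assumes every bracketed number in the file is a page marker and that the XHTML file already declares the epub namespace.

```python
import re


def page_break_span(page: int) -> str:
    """Return an empty EPUB 3 page-break anchor for one page number."""
    return f'<span epub:type="pagebreak" id="page{page}" title="{page}"></span>'


def replace_bracketed_numbers(xhtml: str) -> str:
    """Swap visible markers such as [777] for invisible page-break anchors.

    Assumption: every [digits] run is a page marker, not a footnote reference.
    """
    return re.sub(r"\[(\d+)\]", lambda m: page_break_span(int(m.group(1))), xhtml)


sample = "<p>the finale of the symphony [777] opens quietly.</p>"
print(replace_bracketed_numbers(sample))
```

If a book also uses bracketed numbers for footnotes, the pattern would need to be narrowed before running it over the whole file.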
And once you have that, what you want to do is put in an empty span. So it's a span tag.
And I will have an example in the show notes. If you want to follow along, it might be easier.
It says, so, open span tag, and then right after the word span, there's a space and then epub colon
type equals quote pagebreak, end quote, space, id equals quote page57... well, in the one I have
here in the show notes, it's page 57: id equals quote page57, end quote, space, title equals
quote 57, end quote, and then you close the opening span tag and then immediately you put the
closing span tag. But that probably doesn't make sense the way I'm... you really need to see it to
make better sense of it. Anyway, it's kind of a cumbersome bunch of text that you've got to put
in there just to get a single page number. And of course, I like to try to automate any tedious
repetitive tasks. And so I made a Blather voice command that would do this for me. So all I
have to do is in my file, I type in the page number. In this case, it would be 57. And then I select it.
And then I speak the words page break. And when it hears that command, it copies that number into
the clipboard and then runs a Python script that I wrote and puts the entire bit of HTML span
stuff there and then inserts the number 57 at the two appropriate spots and then pastes it into
the ebook. So it's a pretty quick way to do that. Now, in my counterpoint book, the subject of
HPR, what, 1512? Yeah. I actually had the foresight, as part of the kind of
infrastructure of that book, to specify page numbers all the way through in kind of invisible
page anchors. Now, they're not formatted the way you would need for EPUB 3. But they're formatted
very consistently. And the page numbers are all in there. And so I could very easily do a search
and replace to replace the anchors that I've put with the correct ones that will work. And I haven't
done that yet, but I probably will very soon. And while I was working on the book, the reason I did
it was in part because I thought, well, at some point I'm probably going to want to know where the pages
are and maybe there'll be a way to have it show it correctly in an ebook reader. But in a more
practical way, I was dealing with making a digital version of a paper book. And it just helped me
find my place in the HTML file to be able to go up into the address field and put like a hashtag
followed by a page number and press Enter. And it would take me directly to that spot of the
HTML file that would correspond to a certain page in the book. So it just helped me navigate
things a lot easier. But it's all still in there invisible, but there. And it's ready to be called
into service. Okay, so once you've got your page anchors and you put those just right in line
in like right in the middle of a sentence, wherever there's a page break, just put the page number
there as an empty span. And it won't be visible while you're reading like in the middle of the
sentence. But when everything works correctly, and if you look at it in the right, well, the only
reader that seems to work with it, over in the margin, it'll say what page you're on based on your
specified page numbers. Okay, so you got your page anchors. The next thing you need to do is create
a page list. And that goes in the navigation file. That's the new navigation file that's generated
when you convert from EPUB 2 to EPUB 3 format. And I've got in the show notes an example of a page
list. And it's kind of a minimal example where it just goes from page 122 to 126.
And as I say here, that's the kind of thing that would happen if like let's say you wanted to make
an ebook out of a five-page article from an academic journal. And that article appears
kind of toward the end of the volume. It's going to have pages in the hundreds. It won't start with
page one. And so this would enable you to specify that these are pages 122 to 126 from that journal.
And then you'd be able to use that appropriately to cite your sources blah, blah, blah, whatever.
So there's a navigation block that has a very simple ordered list inside it. And the ordered list
is just a series of list items with hyperlinks to the page anchors that you've created.
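As a rough illustration of the navigation block just described, the Python sketch below builds a page list for pages 122 to 126. The function name and the content file name are placeholders, not taken from the show notes, and the id pattern (page122 and so on) assumes the anchors were named as in the span example above.

```python
def page_list_nav(first: int, last: int, href_file: str) -> str:
    """Build an EPUB 3 page-list nav block linking to anchors page<first>..page<last>."""
    items = "\n".join(
        f'    <li><a href="{href_file}#page{n}">{n}</a></li>'
        for n in range(first, last + 1)
    )
    return (
        '<nav epub:type="page-list">\n'
        "  <ol>\n"
        f"{items}\n"
        "  </ol>\n"
        "</nav>"
    )


# "index_split_000.xhtml" is a placeholder; use the name of the file holding the anchors.
print(page_list_nav(122, 126, "index_split_000.xhtml"))
```

The printed block is what would be pasted into nav.xhtml. Taking the file name as a parameter avoids a search and replace afterwards when the whole reading lives in a single content file.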
It's a much more simple and elegant way to deal with it than the old NCX XML kind of thing.
I actually tried doing it the NCX way too and it failed. When I tried to open the book, my ebook readers
choked and said there's something wrong with this file. I don't know that it matters very much
where you put this navigation block in your nav.xhtml file. But I decided to put mine between the
table of contents block and what they call a landmarks block. I don't even know what the landmarks
block does. But I stuck it between those. And when I saved it and opened it up in an ebook app,
it worked. Now creating this list, I've got an example of a script I wrote
to automate some of the process of creating your page list. Because of course it could be very
tedious if you've got hundreds of pages making an ordered list that's hundreds of list items long
would be very tedious. So that definitely needs to be scripted. And so I wrote a little bash script.
Forgive me, Dave, in advance, for writing a script that's probably going to make you choke a
little bit. But I just use bash. You can probably make a better one in Perl or Python or something.
This is what I know best and I figured I could probably do it. So I wrote a script that I call
pagelist.sh. And this script takes two command line arguments. The first... and they're both
numbers. The first is the opening page number. And the second is the closing page number.
So in my example in the HPR show notes, I just say the command that you'd run would be pagelist.sh
space 42 space 61. So this would create the navigation block for something where you wanted pages 42
through 61. It just grabs those command line arguments and passes them in there. What it does is
there's a for loop. It says for i in dollar sign, open parenthesis, seq, space, and then I've got
the beginning and ending numbers. And then it has it do the stuff. And it's way easier to
look at this. I should not be trying to read scripts in your ear. But it iterates through
all the numbers between 42 and 61 and creates a list item for each one and just keeps adding it
to the temporary file. And then at the end of my own script, I actually have it opening up in my
editor. Although I left that part out of the example here. The one thing that you'll need to do
is make sure that the URLs in your page list are correct. I didn't really incorporate that
part very well into my script. And so after it was done running, I opened it up in the editor and
just did a search and replace to put in the correct HTML file name, which you get when you
open up your ebook in the editor in calibre. If you look over at the file browser pane on the left
hand side, under the Text block, it will have the file names for all of the files.
And so on the one that I've got open right now, the file name is index underscore split underscore
zero zero zero dot xhtml. And then, you know, there are a bunch more after that: zero zero one, zero zero
two, zero zero three and so forth. On my minimal examples that I did, there was only one file that had
all of that stuff in it. So it was fairly easy to get the URLs correct in the page list. But you
just got to make sure they're all pointing to the right place. So once you've got your page list
in your navigation file, then just save the book and try opening it up in a book reader.
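Before opening the book in a reader, the "make sure they're all pointing to the right place" check can be sketched in Python. This is an illustrative helper, not something from the episode: the function name is hypothetical, it uses plain string matching rather than a real XML parser, and it assumes the content files are keyed by the same relative paths that appear in the links.

```python
import re


def missing_targets(nav_xhtml: str, content: dict[str, str]) -> list[str]:
    """List hrefs in the navigation file whose file or anchor id cannot be found."""
    missing = []
    # Only links with a fragment (file.xhtml#anchor) are checked.
    for href in re.findall(r'href="([^"]+#[^"]+)"', nav_xhtml):
        file_name, anchor = href.split("#", 1)
        if f'id="{anchor}"' not in content.get(file_name, ""):
            missing.append(href)
    return missing


nav = (
    '<li><a href="text.xhtml#page42">42</a></li>'
    '<li><a href="text.xhtml#page43">43</a></li>'
)
files = {
    "text.xhtml": '<p>One <span epub:type="pagebreak" id="page42" title="42"></span> two</p>'
}
print(missing_targets(nav, files))
```

Here the second link is reported because no anchor with id page43 exists in the sample content. An empty list means every link has a matching anchor.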
Now here's where some of the problems start to come. There's not widespread support for displaying
the publisher page numbers in these things. So when I opened it up on my Kobo, for example,
there was no difference at all. It made no difference in what page numbers were displayed.
The Kobo displays page numbers based on an algorithm that it's got in its internals. I think it
just counts 250 words and then puts a new page. And there might be a way to go in and adjust
the word count to make it divide up the pages a little bit differently. But it does something
like that. It doesn't look for page numbers that you have specified in your book.
The only application I found that will display your shiny new custom page numbers is iBooks.
And I know that in the crowd that I'm talking to here, Apple is not one of the most favorite
companies and I only have one Apple device. It's just a regular iPad. I like the device fairly well.
But I like to have at least one iOS device to be able to test things and be able to see what my
students are looking at because so many of them use these things. Anyway, the iBooks app,
when you open up your book with the new page numbers embedded in there,
if you tap on the table of contents menu item and then at the very bottom it will say show
publisher page numbers. If you tap that, then when you go back to reading, it will suddenly show
the page numbers that you've told it to show instead of the ones that it generates automatically.
And so it works very well. Now, I also tried it in OverDrive on my Android phone. I tried it in
Marvin, which is an EPUB reading app for iOS that I like quite a lot. It didn't work in either of
those. It didn't work on my Kobo. I have not tried converting it to a Kindle format and then
opening it on a Kindle. I haven't tried that yet. And I'm curious whether it might work in one
of those open source alternative ebook readers like KOReader. If you hack your Kindle
and put an alternate reader on it, it might work in there. I haven't tried that either, but maybe
that's something for the future. But anyway, hopefully at some point the firmware for all these
ebook devices will be upgraded so that they will support the display of these page numbers.
But even if you can't see them displayed in the page number area down at the bottom of your ebook,
it could still be useful for the purpose that Andrew and Dave were talking about, which was
to make an index. Because in your index, you could put whatever search term you
are trying to show. And you could put a series of page numbers that are linked to the page numbers
you've put in your file. And it will jump right there. So for that purpose, it might be very useful.
For actually displaying the page numbers, the only one that will do it that I found is iBooks.
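The index idea mentioned a moment ago can be sketched in Python as well. This is only an illustration under stated assumptions: the function name, the paragraph markup, and the file name are hypothetical, and it assumes all the page anchors live in one content file and use ids like page776.

```python
import html


def index_entry(term: str, pages: list[int], href_file: str) -> str:
    """Build one index line whose page numbers link to the page-break anchors."""
    links = ", ".join(f'<a href="{href_file}#page{n}">{n}</a>' for n in pages)
    return f"<p>{html.escape(term)}: {links}</p>"


# Placeholder term, pages, and file name for illustration only.
print(index_entry("Beethoven", [776, 779], "index_split_000.xhtml"))
```

Because the links are ordinary hyperlinks to the anchors, they should work for jumping around even in readers that never display the publisher page numbers. In a book split across several files, each page would need to be mapped to its own file first.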
Anyway, that's probably enough for that. You guys have probably had enough of me talking about ebook
stuff. But I've had fun learning about it and enabling it in a couple of things. And I've
definitely got a few more books in the queue that I want to do it to. So if I learn anything more
about it, I will write... I'll do another episode, I mean. Anyway, it's been fun. Glad to be talking to
y'all again. And I hope I'll have time to listen to some more episodes very soon. And I've actually
got a couple of ideas for follow-up episodes for myself. One about my Kobo Mini ebook reader and
another about watermarks in LibreOffice. But those will be left for another day. That's all for now.
It's been fun. I will talk to you later. This has been John Culp in Lafayette, Louisiana. Bye, y'all.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast
network that releases shows every weekday, Monday through Friday. Today's show, like all our shows,
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded
by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution
at binrev.com. If you have comments on today's show, please email the host directly, leave a comment
on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is
released under a Creative Commons Attribution-ShareAlike 3.0 license.