|
|
Episode: 3367
|
||
|
|
Title: HPR3367: Making books with linux - part 1
|
||
|
|
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3367/hpr3367.mp3
|
||
|
|
Transcribed: 2025-10-24 22:00:44
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
This is Hacker Public Radio Episode 3367 for Tuesday, the 29th of June 2021.
|
||
|
|
Today's show is entitled, Making Books with Linux, Part 1.
|
||
|
|
It is hosted by Andrew Conway and is about 56 minutes long and carries a clean flag.
|
||
|
|
The summary is a discussion about assembling books, using simple tools commonly found in most Linux distros.
|
||
|
|
This episode of HPR is brought to you by AnHonestHost.com.
|
||
|
|
Get 15% discount on all shared hosting with the offer code HPR15.
|
||
|
|
That's HPR15.
|
||
|
|
Better web hosting that's honest and fair at AnHonestHost.com.
|
||
|
|
Hello everybody, welcome to Hacker Public Radio.
|
||
|
|
This is Dave Morris and today mcnalu and I, that's Andrew, are having a bit of a chat about
|
||
|
|
a particular subject. So I think you're going to kick off today, Andrew.
|
||
|
|
And would you like to talk about where we're coming from with this?
|
||
|
|
Yes, well thanks Dave, yes and hello to all the HPR folks out there.
|
||
|
|
Yes, well this is, I think, classic HPR material in that it turned out that Dave had an itch
|
||
|
|
and I had an itch and we were both scratching our respective itches,
|
||
|
|
and discovered that they had something in common in how we were doing the scratching.
|
||
|
|
I am talking purely metaphorically here, of course.
|
||
|
|
That's a relief.
|
||
|
|
Yep, with coronavirus restrictions, I can't scratch Dave and I don't think he'd want me to.
|
||
|
|
But anyway, what I'm talking about to be less generic is that we're both generating documents
|
||
|
|
to be published, made public and we want to do it with simple,
|
||
|
|
maybe sort of Unix/Linux-like text processing tools. So we both have ended up starting from
|
||
|
|
Markdown and we want to do a lot to transform it into something we can put on the web, for example,
|
||
|
|
or republished in some way, but we also are interested in doing processing, for example,
|
||
|
|
for my case, for references and also to make an index. I think it was generating an index
|
||
|
|
of, for some material, that's the question that you asked that I latched on to,
|
||
|
|
some questions as well. That's right. Yes, I was searching for a generally available way of
|
||
|
|
making an index out of a Markdown thing, without really thinking it through and you said that
|
||
|
|
you'd done this and pointed me at your methodology, which approaches it from a route that I hadn't
|
||
|
|
quite thought about. So yeah, there's lots of mileage there for talking about how it's done and
|
||
|
|
you know what we wanted to do and ways and means of achieving it. Indeed, and I should also
|
||
|
|
mention that the route that I started down in generating a book, because that's what I'm doing,
|
||
|
|
I'm creating a book to be published by actually a regular publisher, ultimately. But the reason
|
||
|
|
that I started down this route years ago, I think, was an HPR episode by John Culp, where he was,
|
||
|
|
if I remember correctly, I think he might have been taking an out of print music book and
|
||
|
|
republishing it under Creative Commons, or public domain, I don't recall details, but I really like
|
||
|
|
the way he just kept it simple with a bit of, I don't know if he started from Markdown or HTML,
|
||
|
|
but one of the two, a little bit of CSS, it was such, there was so light touch, and I thought
|
||
|
|
that it was just so simple. That's what I'm going to do, and I really have never regretted it.
|
||
|
|
It's been, I'm able to automate everything about the whole process, and yeah, well,
|
||
|
|
rather than talk, talk around it, maybe we should just get stuck in.
|
||
|
|
Yes, yes. Well, we decided in our sort of pre-chat that there was probably enough material here
|
||
|
|
for a couple of shows, which I'm sure Ken would be applauding in the distance, cheering.
|
||
|
|
So we would sort of give a summary of our two different positions, and our needs, and how we've
|
||
|
|
had to solve them, and maybe have a chat at the end. So today, you are going to kick
|
||
|
|
off with your situation, I think Andrew, yeah. That's right, and you can quiz me and pull me up
|
||
|
|
if I'm not being clear, and then we'll switch roles for the next one where you'll discuss what you're
|
||
|
|
doing, and I'll quiz you as we go. Okay, so yeah, I'm generating a book, which is composed of
|
||
|
|
chapters, and I have figures, like graphs and charts, that kind of thing, and I also have
|
||
|
|
tables in the book. But my starting point is essentially a text file, so each chapter
|
||
|
|
is a text file, and it's in Mark, I write it in Markdown, and I can also throw, I actually do
|
||
|
|
throw occasionally some HTML in there for something that's either not supported in Markdown, or
|
||
|
|
ambiguously supported, depending on flavor of Markdown. So it's mainly Markdown with a little bit
|
||
|
|
of HTML, and one of the actual bits of HTML comes in with the figures. Now, when I write, and I want
|
||
|
|
a figure in there, I'd actually just write in the text, I think I write a greater than sign,
|
||
|
|
which I think means an indentation or something, I forget what the greater than sign means in Markdown,
|
||
|
|
so, but, and then I write figure, and then ampersand, nbsp, semicolon (&nbsp;), for non-breaking space,
|
||
|
|
and then I will write a tag, like age underscore distribution, that shows the distribution of ages
|
||
|
|
in the population, you know, and then when I want to discuss this figure somewhere in the text,
|
||
|
|
I will just use that same tag, that when that tag appears in the text, and the idea is that in some
|
||
|
|
post-processing, these tags will be found and numbered in order within the chapters, so the first figure
|
||
|
|
in chapter three will be 3.1, the second will be 3.2, and I don't need to worry about their
|
||
|
|
references. Now, I mean, that's, I'm sure there's other tools out there to do that, and I know I've
|
||
|
|
used LaTeX in the past to do this, but LaTeX was just too big a versatile hammer for this job,
|
||
|
|
I want to keep it much more simple than that, as I said before. So that's the first job,
|
||
|
|
is that I have references for the figures and tables. Now, not only that, but the, the other thing
|
||
|
|
that my post-processing will do is it will, where it finds one of these figures, it then knows from
|
||
|
|
that tag, like I forget what I said before, but like ages underscore population, say, it will then go
|
||
|
|
off to a directory, and it'll look for ages underscore population dot CSV, and if it finds that,
|
||
|
|
it'll then fire up another script, which will turn that CSV into my figure, and the CSV file will
|
||
|
|
contain not only the data, but some meta information about what the graph should look like, whether
|
||
|
|
it should be a bar chart, where the legend should be, if there should be a legend, all this kind of
|
||
|
|
stuff and the scale of the graphs. So, so the principle was to everything, every bit of material for
|
||
|
|
the book, it starts life as a text file in the soul processing, and so the workflow is to write
|
||
|
|
the source material, which I do entirely separately, and then when I'm ready, I then run a script,
|
||
|
|
and the script then it literally just takes the chapter, some chapters don't have any figures,
|
||
|
|
like introductory chapters, no figures, so I can just literally, what I actually do, is I take
|
||
|
|
that chapter, and I, and I, I cat it to a file called all.markdown, and then I use
|
||
|
|
Python, some very short Python programs to go through and put in the references that I just
|
||
|
|
described, and it also will generate the image file, and the IMG tag, which is a bit of HTML that
|
||
|
|
will go with the figure caption is in the chapter, and then it spits all of that out, and it
|
||
|
|
appends it to the all.markdown file, so I'll call that for chapters one, two, three, and it'll
|
||
|
|
just go through and do the whole lot, and then tack on at the very end, it will spit out, using cat,
|
||
|
|
again, or an echo commands, just the, the end material that goes right at the end of the HTML file,
|
||
|
|
to close off the tags and stuff like that, and then I run it through markdown_py,
|
||
|
|
that takes the all.markdown file, and generates the HTML file, and then the script that pulls
|
||
|
|
all this together, well then I can tell it to create a draft, which will then open up in the web
|
||
|
|
browser automatically for me, so I can even, I have got the set up so that as soon as I save one of
|
||
|
|
the chapter files, I use a command that will, will monitor the text file, see if they've changed,
|
||
|
|
and if they have changed, run the script, and it will build everything, and then immediately refresh
|
||
|
|
the web browser, so that I've, I've got almost a live rendering of the book, or a part of the book,
|
||
|
|
as I'm writing it, because sometimes that's useful if you're checking layout, and proofreading when
|
||
|
|
that's brief. I do something very similar, and it is immensely useful, because you can, you can
|
||
|
|
prang the look of your doc, or rather, I shouldn't say you can, I can make a horrible
|
||
|
|
mess of my document without realizing, it looks fine in the markdown, but it's awful when it gets
|
||
|
|
converted to HTML, so seeing that in as close to real time as possible helps me a huge amount.
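[Editor's sketch] The figure-tag numbering Andrew describes above could look something like the following in Python. This is only a minimal illustration under stated assumptions, not his actual code: the function name number_figures, the exact "> Figure&nbsp;tag" line format, and the rule that tags are whole words are all assumed for the example.

```python
import re

# Assumed convention: a figure is declared on a line such as
#   > Figure&nbsp;age_distribution shows the distribution of ages...
FIG_DEF = re.compile(r"^>\s*Figure&nbsp;(\w+)", re.MULTILINE)


def number_figures(text, chapter):
    """Replace every figure tag with a number such as 3.1, 3.2, ...

    Tags are numbered in the order their declarations appear in the
    chapter. Returns the rewritten text and the tag -> number mapping.
    """
    numbers = {}
    for tag in FIG_DEF.findall(text):
        if tag not in numbers:
            numbers[tag] = f"{chapter}.{len(numbers) + 1}"
    if not numbers:
        return text, numbers
    # One pass over the text, matching whole words only, so a tag is
    # never replaced inside a longer identifier.
    alternatives = "|".join(re.escape(tag) for tag in numbers)
    pattern = re.compile(rf"\b({alternatives})\b")
    return pattern.sub(lambda m: numbers[m.group(1)], text), numbers
```

Because tags are replaced wherever they occur as whole words, this sketch relies on tags being distinctive (for example, containing an underscore) so that ordinary prose is not rewritten by accident.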
|
||
|
|
Yes, I mean I think, and it's not WYSIWYG, but it's as close as you can get to it with this method,
|
||
|
|
I mean that is the downside of this method, it's not WYSIWYG, but then if you're used to,
|
||
|
|
you know, writing HTML, or LaTeX, or any of the, or markdown, you're used to not seeing what you
|
||
|
|
exactly, what you're going to get until later in the process. I should mention that I just looked,
|
||
|
|
the, the way I did that automatically, it's just a one line thing on the command line,
|
||
|
|
if I'm working in chapter five, the command I would issue would be ls, chapter five dot markdown,
|
||
|
|
pipe, and then entr, E-N-T-R, space, and then the name of the script, which in my case is called
|
||
|
|
markdown to HTML, so that entr command is the one that effectively monitors, in this case, chapter five,
|
||
|
|
and it'll notice if I've changed anything to do with chapter five, and then regenerate
|
||
|
|
chapter five accordingly, should I, should I change it? So, so the other things that I,
|
||
|
|
other options I have is I've got, I've got basically four options, one is draft, in fact there's five,
|
||
|
|
one is draft, one is just to check references, that's, many of my references are of course web links,
|
||
|
|
so the check references, all it does is goes through and checks all the links are valid and tells me
|
||
|
|
if there's any, if it gets any 404s or 403s or something, what's the "moved" one, whatever the "moved"
|
||
|
|
one is, or some sort of server error, that comes occasionally, so check refs is one, draft is the working one
|
||
|
|
I mentioned, there's print, which is basically generating the final thing for the publisher,
|
||
|
|
and then there's web, which I don't really use, but that I could use to put a version on a website,
|
||
|
|
I haven't used that in a while, but the last one is, will generate an ebook, so it'll generate,
|
||
|
|
I think, looking at it, it generates an ePub, I think, although maybe it's actually more flexible than
|
||
|
|
that, I wrote this so long ago, I don't actually quite remember what it does, but it definitely does
|
||
|
|
generate an ebook of some kind, I can see that, I see lots of opf and ncx files,
|
||
|
|
I should also mention that a lot of the formatting takes place in the CSS file, that's a very
|
||
|
|
important part of the process, in that I say nothing about how it should look beyond tags,
|
||
|
|
formatting tags, the formatting, what h1 means and what the p tag means, all of that stuff
|
||
|
|
is kept strictly in the CSS file, and that actually greatly helps with keeping the draft,
|
||
|
|
the print, the web, the ebook, version all quite distinct, because that difference all takes place
|
||
|
|
in the CSS file, really. So yeah, that you have covered areas that I've been tangling with,
|
||
|
|
it's not just the book in my case, I actually came up with something similar for making my
|
||
|
|
HPR show notes for any show of any complexity that I do, so I actually do two types of show,
|
||
|
|
as far as the computation is concerned, one of which is fairly simple, it's just like
|
||
|
|
the notes are in the database on the HPR site, which is when you post stuff through the form,
|
||
|
|
it gets dropped into the, eventually gets dropped into the database, and that's what served up
|
||
|
|
when you go and look at an HPR page, but I also do a thing where I write longer, more complicated
|
||
|
|
notes with images and whatever, and examples as a separate file that gets put up alongside the
|
||
|
|
the audio and stuff on the HPR website, so I wrote a thing which manages all of that, and I used
|
||
|
|
make to build it, and I've got a thing that creates a make file, depending on what type of thing it is,
|
||
|
|
and whether it's got pictures and stuff, so yeah, I ended up looking at my book requirements
|
||
|
|
with those eyes thinking, oh yes, I could make a, in fact I have written a make file to manage it
|
||
|
|
all, so you can do make PDF and bang out comes a PDF and that type of thing, but yeah, different
|
||
|
|
approaches to the same sort of idea, fascinating, that we've come at it, come at the same similar
|
||
|
|
problem in such different ways. Yes, yeah, now it's interesting you mention make, I did look at
|
||
|
|
going down that route, just, you know, it does feel like you're compiling the book,
|
||
|
|
aren't you, I mean it's like you compile computer code, it does feel very much like that,
|
||
|
|
that I'm compiling the book, now why did I not go down the make route, I did look at it and then
|
||
|
|
decided against, I think, I think it was just, I think it was just another layer of complexity I
|
||
|
|
didn't want to tangle with, I felt that the bash script, I mean I'm looking at everything I've
|
||
|
|
just described, it takes place in the bash script, and the bash script is only 57 lines long,
|
||
|
|
and about a quarter of those lines are comments, you know, just, so, even the majority of it,
|
||
|
|
yeah, it's very simple actually, and the bash script really only calls Python, I do refs.py
|
||
|
|
through Python, and it uses, as I mentioned, markdown_py, so that's it, you know,
|
||
|
|
I don't think there's anything else other than that, it's all text-based, you know, cats and
|
||
|
|
echoes and pipes redirects to a file, that's all there is. Yeah, so I think that was part of what I
|
||
|
|
was, part of the simplicity that I was going for is that really I wanted to use as few tools as
|
||
|
|
possible, and Python is the only what you might call dependency that I've got here.
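[Editor's sketch] The "cat each chapter onto one big file, then add the closing material" workflow described above is done in Bash in the episode. Purely as an illustration of the same idea, here it is in Python; the function names assemble and identity, and the idea of passing a per-chapter processing function, are assumptions made for the example rather than details of Andrew's script.

```python
from pathlib import Path


def identity(text, chapter_number):
    """Default per-chapter step: leave the text unchanged."""
    return text


def assemble(chapters, out_path, process=identity, footer=""):
    """Concatenate chapter files into one Markdown file.

    chapters: paths to the chapter files, in book order.
    process:  called as process(text, chapter_number) on each chapter,
              e.g. to fill in figure references before appending.
    footer:   closing material written after the last chapter.
    """
    parts = []
    for number, chapter in enumerate(chapters, start=1):
        text = Path(chapter).read_text(encoding="utf-8")
        parts.append(process(text, number))
    parts.append(footer)
    out = Path(out_path)
    out.write_text("".join(parts), encoding="utf-8")
    return out
```

The combined file would then be handed to a Markdown converter (markdown_py in the episode) to produce the HTML.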
|
||
|
|
Yes, yes, I came at this, as I said, I don't want to digress too much, but I came at this
|
||
|
|
originally thinking, wouldn't it be nice if there was a way in which you could make your HPR show
|
||
|
|
relatively easily, I started writing a bash script to hand out to the to the world,
|
||
|
|
that would allow you to do things like bring together notes and maybe turn them into HTML
|
||
|
|
through some route or whatever, and would even submit the show for you to HPR with all of your
|
||
|
|
credentials in the days when we use FTP for it, but the thing grew and grew and grew, I'm terrible
|
||
|
|
at coming up with an idea, and then, you know, like Wallace and Gromit, attaching a few planks
|
||
|
|
on the end and the big nail and stuff, and it grows, and you know, you know, that's stable,
|
||
|
|
so I'll just put another chicken on the end of that, and you know, so yeah, you've come at this
|
||
|
|
from a much cleaner position, I think, much simpler and more maintainable, I would imagine
|
||
|
|
go. Well, I don't know. Yeah, I don't know, I mean, you were thinking of, you mentioned there,
|
||
|
|
you were thinking of other people using your script, weren't you, and that was a consideration
|
||
|
|
what you were doing. It was, yeah, originally. Yeah, and I think that's a difference, because this
|
||
|
|
was, this is just to scratch my own itch, this whole file, I never, I mean, I'm perfectly happy,
|
||
|
|
this particular file that the script I'm talking about, it isn't, I haven't put that
|
||
|
|
out online anywhere for people to hack around with, but the other components of this whole thing
|
||
|
|
are the bit that does the figures, for example, and the index, they're all online, we can put the
|
||
|
|
I can put the GitHub link in the show notes for those, but it's actually the screen in the script
|
||
|
|
isn't, you know, I'm still, it's only for my own personal consumption, so I'm slightly embarrassed
|
||
|
|
about bits of it, and of course, there's bits of it that still have bespoke links to paths
|
||
|
|
that only makes sense in my file system, so I have to, you know, I have to do a bit of work
|
||
|
|
before I can share it with the world. I feel, a lot of the people might say no, I just publish it,
|
||
|
|
but, you know, there's also the thing is, did I accidentally write my password in this window,
|
||
|
|
I don't think I'll have it. I'm going to say the same sort of mental process is, oh,
|
||
|
|
looks interesting, I haven't got time to check it now, I'll do it later.
|
||
|
|
Yeah, well you know that thing where you're, you've got a command line window and you're,
|
||
|
|
I mean, yeah, I use SSH keys, but still sometimes you have type in the password,
|
||
|
|
and you type in like, you know, for example, if you're doing a sudo or something like that,
|
||
|
|
yes, su to root, and you know that thing where you're typing, and you don't, you've forgotten
|
||
|
|
which window you're in, and your password goes in plain text into another file, and you don't notice,
|
||
|
|
yes, I've done that so many times, well, maybe two or three times, but you don't notice at the
|
||
|
|
time, it's like, why is my password not working, I'm pressing return, and then later on you're
|
||
|
|
looking at text file, thinking why is there, why are there like ten new lines and my password in
|
||
|
|
plain text in the text file? Oh yes, oh yes, yes, I have paranoia about this, I have to build
|
||
|
|
systems that prevent me being an idiot in order to avoid, just on that subject actually,
|
||
|
|
I'm using a thing called Keychain that comes from Funtoo, which lets you manage, through a
|
||
|
|
SSH agent, you can set, you give it a passphrase, I've got a SSH passphrase solution I've
|
||
|
|
had before, give it a passphrase at the start of the day, and then it runs all day long as long as
|
||
|
|
your machine's up, and it feeds keys to whoever needs it, and that sort of stuff, so that's made
|
||
|
|
life a lot easier for me, I hardly ever typed my passwords in. All right, okay, and maybe
|
||
|
|
that's something I should look into, that's not as useful. Anyway, we digress a little bit,
|
||
|
|
which is fine, you know, but the other bit that I wanted to talk about, unless there's anything
|
||
|
|
else that you wanted to go over first, I'll talk a bit more about my stuff in the next show,
|
||
|
|
I think, so rather than keep interrupting you. Okay, no problem, the next bit that I wanted to talk
|
||
|
|
about is how I create the index, and this is where we cross paths initially, and this was born
|
||
|
|
of a conversation, I was sat over with a publisher from a book over in Edinburgh, and I live in Glasgow,
|
||
|
|
Edinburgh seems like a long way away, which is where you are today, of course.
|
||
|
|
Yeah, so I'd go over to especially, and we're having this conversation, and he said, you know,
|
||
|
|
my book's full of facts and figures about Scotland, that's what it's about, and he was saying,
|
||
|
|
oh, yes, well, we don't really need an index in this book, and I think my look in my face must have
|
||
|
|
been of utter horror, like, as a former academic and nerd, a book with an index, especially one
|
||
|
|
that's factual, it's not like a novel or something, which doesn't need an index, I always like,
|
||
|
|
I wish somebody had taken a picture of me because I was horrified at the suggestion, and I pretty
|
||
|
|
much said to him, I'm horrified, is probably, you know, I tried to modulate, you know, my reactions,
|
||
|
|
but I was really genuinely horrified that he would suggest my book would not have an index,
|
||
|
|
and then he went on to explain the technical difficulties, and I went, and then the
|
||
|
|
in a bit of a broad old character, and I acted, act generating an index like, like that, and it's
|
||
|
|
played a snap of fingers, so, and he went, well, if, if Andrew, if you think you can generate an
|
||
|
|
index that quickly, then then then yeah, let's do it, so the deal was that we'd get through the
|
||
|
|
proofs, I'd make all my corrections, the very final version of the book, just before it went to
|
||
|
|
the printers, they would send me the PDF, and my job was to then create an index. Now, I have talked
|
||
|
|
to other authors, and they sit and they read through the book, the final copy of the book, and they
|
||
|
|
write down a word, and then they write down the page number, I'm not having any of that nonsense,
|
||
|
|
I'm far too lazy for that, so I don't blame you, yeah. Now, so the first thing I checked, as I said
|
||
|
|
from, like the PDF is a text-based PDF, it's not like an image, or that's going to the printers,
|
||
|
|
and it wasn't, it was genuinely a text-based PDF, which is important, of course, because you can't
|
||
|
|
parse an image, well, you have to use optical character recognition, of which, of which actually
|
||
|
|
Ken has just released an episode about using some kind of character recognition, hasn't he?
|
||
|
|
Yes, I saw that in the last couple of days, I think I did toy with that once, but it's very
|
||
|
|
difficult to get right, no, much better if it's a text-based PDF, so that was the first win,
|
||
|
|
that it was a text-based PDF, not the figures, but as in the graphs, but all the text-based elements
|
||
|
|
were in fact text, so my first job was, well, how can I turn that into a text file that I can
|
||
|
|
then search, because I want it as text, I want to get rid of anything that's not text, and just
|
||
|
|
keep the words, because then, and the page numbers, you know, I need to know what the words are,
|
||
|
|
and what page they're on, so I actually did quite a lot of hunting of different tools, and eventually,
|
||
|
|
in Slackware, it comes with some PDF tools, I can't remember the package that they're in,
|
||
|
|
but the command that I found on Slackware, pre-installed, part of the Slackware install,
|
||
|
|
was pdftotext, PDF, T-O, text, and it did everything I wanted, and I had to
|
||
|
|
fiddle with the command line switches and read the man page a bit, but essentially what it can do
|
||
|
|
is you can give it a page range, and of the PDF file, it will suck in the PDF file and spit out that
|
||
|
|
page as text, and so that's the ideal thing, because if I can just produce one page of the PDF
|
||
|
|
at a time of text, I know which page this is on, I've got the words, I can then do a script where I
|
||
|
|
search through for a search term, and then I know that search term, let's say the search term is
|
||
|
|
GDP, for example, gross domestic product, GDP, I can then search for GDP in caps,
|
||
|
|
and if I find it on that page, I then have an entry for the index, so I essentially just wrote a
|
||
|
|
bash script that working in that principle went, it reads in a set of a text file, I think it's
|
||
|
|
called terms.text, and these are words that should appear in the index like GDP, or economy,
|
||
|
|
or population, that kind of thing, and of course there are times when you might want the word,
|
||
|
|
you might want to find the plural, you might want an acronym, so the way I set it up is that each
|
||
|
|
line of the text file had GDP, maybe I think I used a pipe character, then gross domestic product,
|
||
|
|
and then I think I had pipe, then I had keywords for plural that it would identify a plural,
|
||
|
|
and the code is actually quite simple, but it can distinguish words that should have a plural
|
||
|
|
where it's "s", "es", or "ies", that kind of stuff, so I just literally take a text file, and I
|
||
|
|
in this very simple syntax, write down every search term that I want, and then this bash script
|
||
|
|
uses pdftotext, and then a bit of some kind of regular expression searching to check
|
||
|
|
which terms are on each page, and then at the end of course you then have to sort the terms into
|
||
|
|
alphabetical order, and then put the list of page numbers, or page ranges, because if you've got
|
||
|
|
a hit on for a GDP on page 101, 102, 103, you don't want to list all of these individually,
|
||
|
|
you want it to be 101 to 103, or whatever, you know, is the style that's used in the book,
|
||
|
|
or a conversation to separate the list of pages, and actually there was a few gotchas,
|
||
|
|
there was a few weird characters that upset it, invisible characters, but I was able to catch all
|
||
|
|
of them, you know, just a bit of grepping, search and replace, regexes crafted for the job,
|
||
|
|
and it worked extremely well with a very small amount of fettling at the end, there was a few
|
||
|
|
times where it really went to town on certain terms, like EU was a particular problem, I
|
||
|
|
seem to remember, because though you wouldn't think that EU kept appearing inside other words, and
|
||
|
|
I can't remember, there was a problem, I had a problem with EU, and it wasn't reg X's,
|
||
|
|
not the EU's party, I don't mean that, but I remember some reason that generated a huge number
|
||
|
|
of hits to EU, more than made sense, and I couldn't quite get to the bottom of why,
|
||
|
|
so I think it was a very short acronym, and that was basically the problem,
|
||
|
|
so I had to go in and do a bit of fettling and improve the script a little bit,
|
||
|
|
but then I sent within a day, this was all turned around within a day, went back to the publisher,
|
||
|
|
and they were astounded, they'd never seen an index turned around that fast before, and said,
|
||
|
|
oh, could you share that script with us please? And I'm thinking, really, don't publishers have
|
||
|
|
a standard tool for this job? I mean, I know they're a quite small publisher, these guys, but
|
||
|
|
you know, it was like, as if it did feel like I'd discovered some kind of gold to them,
|
||
|
|
unfortunately though, I couldn't actually get it working the script, wouldn't work reliably
|
||
|
|
on Windows at the time, there was a few problems, I don't know what they were, never got it working,
|
||
|
|
and there was also problems with it working in Mac, which surprised me, because I thought
|
||
|
|
that would be closer to Unix, and so those problems I never was quite able to solve, I should
|
||
|
|
go back, but I think the main problem I couldn't solve in the Mac, which they were using is that
|
||
|
|
I didn't have a Mac to test out on, so that would be an interesting project for someone else in the
|
||
|
|
Mac. Yes, yes, so you pointed me at the GitHub repository that contained the tool to do this,
|
||
|
|
which is a Python script, so is that the sort of later development? Yes, sorry, it's a Python
|
||
|
|
script, I said bash script, there is a bash script as well, that was an earlier version of it,
|
||
|
|
and then I found the better way to do it, and as you do, I'm doing it in bash and then going,
|
||
|
|
oh no, I can't get this to work, I need something with a bit more oomph to it, so yeah, quite a reasonable
|
||
|
|
thing to do, so yes, I've had a look at it, I've actually tried running your Python script,
|
||
|
|
and it does a great job, really good, nice idea, and as you said, it's fairly simple in concept,
|
||
|
|
but there's a whole bunch of things you need to cater for, but the principle of taking the page,
|
||
|
|
looking for particular keywords, and then keeping a record of what you found, and then consolidating
|
||
|
|
it all and printing it out at the end is great, it's perfect. Yeah, no, it seems to, you know,
|
||
|
|
actually, it's one of these things where when you start it, you think, oh god, I think we've
|
||
|
|
got a bit carried away when I said I could do this, and when you finally get it working, you think,
|
||
|
|
oh, that wasn't so bad, but there wasn't, along the way, there was quite a lot of gotchas,
|
||
|
|
which don't, you know, like you don't see them, and when I look at the script, it doesn't look that
|
||
|
|
complicated, but I have to remind myself, I think a lot of the gotchas I got around by selecting
|
||
|
|
command line options to pdftotext, and I can see there that I've got the URL end of line option,
|
||
|
|
the Unix option, in particular, are the two magic ones that solved a lot of my problems.
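[Editor's sketch] The indexing approach described above (extract one page of text at a time, search it for each term, then sort the terms and merge consecutive pages into ranges) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not Andrew's published script: the function names, the "heading|alternative|alternative" term format, and the "101-103" range style are all assumed here. The pdftotext options shown (-f, -l, -eol unix, and "-" for standard output) are real Poppler options, but they may not be exactly the ones he used.

```python
import re
import subprocess


def page_text(pdf_path, page):
    """Return the text of a single page (pdftotext must be on the PATH)."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page),
         "-eol", "unix", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_terms(lines):
    """Parse lines such as 'GDP|gross domestic product' (assumed format).

    The first field is the index heading; every field counts as a match.
    """
    terms = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|") if field.strip()]
        if fields:
            terms[fields[0]] = fields
    return terms


def find_pages(terms, pages):
    """Map each heading to the sorted page numbers on which it appears.

    pages maps page number -> text. Matching is case-sensitive and on
    whole words, so 'EU' does not match inside a longer word.
    """
    hits = {}
    for heading, variants in terms.items():
        alternatives = "|".join(re.escape(v) for v in variants)
        pattern = re.compile(rf"\b(?:{alternatives})\b")
        found = [n for n, text in sorted(pages.items()) if pattern.search(text)]
        if found:
            hits[heading] = found
    return hits


def collapse(page_numbers):
    """Turn a sorted list like [101, 102, 103, 110] into '101-103, 110'."""
    ranges = []
    start = prev = page_numbers[0]
    for page in page_numbers[1:]:
        if page != prev + 1:
            ranges.append((start, prev))
            start = page
        prev = page
    ranges.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def format_index(hits):
    """Return index lines sorted alphabetically, ignoring case."""
    ordered = sorted(hits.items(), key=lambda item: item[0].lower())
    return [f"{heading}: {collapse(found)}" for heading, found in ordered]
```

Matching on word boundaries is one way to avoid the kind of over-matching mentioned for short acronyms such as EU; whether that was the actual cause of the problem in the episode is not stated.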
|
||
|
|
Yeah, pdftotext is a bit of an odd thing, isn't it? I have used it myself, and not fully understood
|
||
|
|
all the options, bit trial and error was needed. Yes, I think the thing that, I mean,
|
||
|
|
I mean, there was a lot of trial and error, but the trial and error was I tried different tools,
|
||
|
|
and pdftotext was just the one that threw up the least number of problems, they all had problems,
|
||
|
|
but quite a few of them were really couldn't handle boxes, you know, I don't know how, I don't
|
||
|
|
really understand how PDFs work, but some of them really just couldn't handle figures and tables
|
||
|
|
that like broke up the text, and they would just go a bit mental, and throw a wobbly at that point,
|
||
|
|
and the rest of the text on that page was garbage, but pdftotext just ignored them,
|
||
|
|
completely, which I quite liked, you know, it said like, this isn't text, I'm not interested in this.
|
||
|
|
On we go. Yeah, yeah, PDF is a strange, strange beast, isn't it? I think, in some forms,
|
||
|
|
it's effectively post-script embedded in a sort of container thing, I think, and that can be
|
||
|
|
pretty hairy. Yeah, post-script, no, that's something I've not tangled with. I mean, post-script is almost
|
||
|
|
like a language, isn't it? Yes, it's a Turing-complete language. Yeah, I mean, I remember being able
|
||
|
|
to, in my LaTeX days, in the 90s, being able to read PostScript and troubleshooting, I mean,
|
||
|
|
not like, I could really understand it, but it always looks really arcane and strange.
|
||
|
|
Yeah, yeah, yeah, it is. It needs a completely different mindset, it's quite fun in its way,
|
||
|
|
if you've got nothing better to do. But yeah, yeah, it's, I think it's a great solution.
|
||
|
|
My problem was that I had looked at doing this with EPUB because EPUB is a whole different issue
|
||
|
|
because there aren't any pages as such in it, well, not sort of locked-down pages, is that right?
|
||
|
|
I mean, it basically is an HTML document in a container, isn't it? Yes, it's just, there's no
|
||
|
|
concept of pages as far as I'm aware by default, and it just reflows the text depending on what
|
||
|
|
size your screen reader wants to display. Now, now, having said that, I've got a feeling that I
|
||
|
|
have read books that somehow have some notion of page numbers inside them, but I don't really
|
||
|
|
understood, it seems to depend on how the EPUB or whatever format was created. So I don't, I
|
||
|
|
wouldn't swear to EPUBs being unable to mark real page numbers. I can see it being a useful thing
|
||
|
|
in a textbook, you might want to refer to an actual physical page in the print book, but might
|
||
|
|
have only access to the EPUB. So it would seem to me that that would be a useful feature for EPUBs
|
||
|
|
to support. It's, having said that, it's probably worth going and unpacking a, you know, a textbook
|
||
|
|
some sort, I'm sure I have some EPUB textbooks knocking around that, and if you, it's a zip
|
||
|
|
g-sip thing, you can just explode it and then look at all the bits. I know, this is another
|
||
|
|
John culpism, by the way, he was the first person I ever heard who explained what was inside
|
||
|
|
any oven, I never knew. So he's done a lot of hacking of EPUBs over the years, I think.
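Since an EPUB is a ZIP container, "exploding it and looking at the bits" can be sketched with Python's standard `zipfile` module. The function below counts the files by extension and does a crude text search for print-page markers: `page-list` and `pagebreak` are the values EPUB 3 uses for a page list and for page-break markers, and `<pageList` is the NCX element from EPUB 2. The function name is invented for this example, and a plain substring search is only a first approximation, not a proper parse.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarise_epub(epub_file):
    """Count files in an EPUB by extension and look for print-page markers."""
    extensions = Counter()
    has_page_markers = False
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix.lower() or "(none)"
            extensions[suffix] += 1
            if suffix in {".xhtml", ".html", ".htm", ".ncx"}:
                text = archive.read(name).decode("utf-8", errors="replace")
                # Crude check: any of these strings suggests page information.
                if "page-list" in text or "pagebreak" in text or "<pageList" in text:
                    has_page_markers = True
    return dict(extensions), has_page_markers


if __name__ == "__main__":
    # "textbook.epub" is a hypothetical file name.
    print(summarise_epub("textbook.epub"))
```

If the second value comes back `False`, that EPUB most likely carries no record of the print page numbers.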
Yes, I think you're right. Let me just look back at my script. I think
HTML, CSS, OPF, NCX, those are the files that you will find inside an EPUB. And if so, then,
at least in the EPUBs I created of my book, you're correct,
there will be no way of tracking the pages in the print version. That would,
if it is possible, and I don't know if it is, need some other clever tools to come
along and compare the PDF or whatever was generated for print with the EPUB. You know, I don't
even know if that's possible. I mean, I could see how to do it, actually, in principle, but I don't
know if EPUB in any way supports that. No, no. I think the Pandoc processor for Markdown has got some
sort of a, you know, it's got a rough, cross-reference type of feature to it, I think.
And I'm a bit vague about this because I haven't dug into this in detail. But I think the principle
is that you put an anchor against a word, which is something you want to index. And then you make
an index table, effectively, that refers to those instances through their anchors. Does that make sense?
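The anchor-and-index-table principle can be sketched in a few lines of Python. This is not Pandoc's own mechanism, just an illustration of the idea under simple assumptions: the input is plain text, index terms are matched case-sensitively as whole words, and every occurrence gets its own numbered anchor. The function names and the `idx-` prefix are invented for the example.

```python
import re


def add_index_anchors(text, terms):
    """Wrap every occurrence of each index term in a uniquely numbered anchor."""
    if not terms:
        return text, {}
    index = {term: [] for term in terms}
    # Try longer terms first so that a longer term wins over a shorter prefix.
    alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b")

    def wrap(match):
        term = match.group(1)
        anchor = f"idx-{term.lower().replace(' ', '-')}-{len(index[term]) + 1}"
        index[term].append(anchor)
        return f'<span id="{anchor}">{term}</span>'

    return pattern.sub(wrap, text), index


def render_index(index):
    """Build an HTML list that links each term to all of its anchors."""
    items = []
    for term in sorted(index):
        links = ", ".join(
            f'<a href="#{anchor}">{number}</a>'
            for number, anchor in enumerate(index[term], start=1)
        )
        items.append(f"<li>{term}: {links}</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    marked, index = add_index_anchors("GDP rose. Later GDP fell.", ["GDP"])
    print(marked)
    print(render_index(index))
```

Because each anchor carries a running number, a term that appears many times simply gets many entries against it in the index table.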
No, it doesn't really. Well, you've got GDP everywhere. Could you anchor it multiple times?
I guess. I don't know. Actually, it's a good point you bring up. When I was
researching how to do this at the beginning, I did look at Pandoc. I mean, Pandoc is fantastic.
You know, is it the one that calls itself the Swiss Army knife of something or other?
I don't know. It's brilliant though. It's very clever. It is brilliant. What's it written in?
Is it? It's Haskell. Haskell, that's right. Yes, I had two problems with Pandoc. The first one
is it's Haskell. Nothing against Haskell, but just the way Haskell comes in about a bazillion
different packages. Yes, install Pandoc from scratch, and you wait a long time for
all of the Haskell stuff. And that is one of the really rare times that I've found the package management
on Slackware to be a problem: with a package like that, you have so many dependencies
just because of the way it's packaged. Anyway, that's one thing. But the other, more fundamental, problem
that I had with it is, you know, you don't want to eat your dinner with a Swiss Army knife.
I mean, you could, in principle, with two Swiss Army knives, turn them into a knife and fork
and eat, but you'd rather just use a knife and fork. So yes, this is my problem with Pandoc here.
It did a lot of the things I wanted to do, perhaps all of them. But I just felt it was
cumbersome. It got away from the simplicity of, you know, using a knife and fork. That would do me. My
project was not complicated enough or tricky enough to deserve Pandoc, I don't think.
Had I already been acquainted with Pandoc, I might have used it, actually, but I wasn't well enough
acquainted with it. I would have had to install everything from scratch on this occasion because
I haven't used it for a while. So Pandoc is great, but I just felt, you know, it was too
complicated. Yes, not a sledgehammer to crack a nut, but using a Swiss Army knife to eat your dinner
would be another way of putting it. Yeah, you might leave one of those blades out and cut
yourself on the nose. Yes, that's it. Yeah, yeah, the bottle opener would probably spring open and poke you
in the eye or something. Or you'd be trimming your nails with the scissors, or
something like that. Yeah, yeah. No, I take your point. I have been playing with Pandoc for a
long time now, so I don't feel too uncomfortable using it, but it does change quite
often, and you think, "oh, that doesn't seem to work, I wonder why", and you go and look at it: "oh, we
improved it". Yeah, okay. The simpler approach is less full of surprises. Indeed, yeah.
It's just personal preference. It's just how I wanted to do it, you know. I mean, I
quite like understanding what I'm doing, and I really just like this
Unix philosophy of lots of simple tools that are focused on one particular job.
So yeah, I find that works best for me generally. You know, I'm losing my hair naturally
as I age anyway, but it means I don't have to pull any more out while I'm frustrated.
No, fair enough, fair enough. Just a couple of points on the subject of tools.
I've had two approaches made apparent to me. One is from Jeroen, who's a contributor to
HPR, who has written quite a number of books, and he has become very, very much enamored
of AsciiDoc, or Asciidoctor, which is a rewrite of the original AsciiDoc, and he
reckons that that is better than Markdown, et cetera. I do use it a little bit, but I couldn't
say what my opinion was on, you know, bookmaking with it. The second one is my son, who did an
Open University maths course, where you got extra points for submitting your stuff in
LaTeX, and he's currently doing an MSc in computer science, where you do get a few brownie points
if you send stuff in in LaTeX. So he is really quite knowledgeable about it, and he said
it's easy to make an index with LaTeX, which it wasn't in the days when I used it, back in the 1980s
or something, the late 80s, but apparently it is. And, you know, it doesn't look like it used to
back in the day anymore, because there's lots of, whatever they are, extensions that let you
produce really nice looking documents. So they both have quite long learning curves, I think. So
just for the record, it's worth knowing that these two possibilities exist.
Yeah, absolutely. I did look at AsciiDoc. You know, I think I could have
happily used that instead of Markdown. Why did I go with Markdown? I just already knew it,
and, I mean, there's hardly any formatting that I use from Markdown. It's very little;
it's mainly just text that I'm using, so it didn't really matter. I didn't want to use HTML,
because, you know, that is cumbersome to write compared to Markdown. When I'm writing,
I just like to write in plain text. When I'm writing the book, the prose, writing in
plain text in a simple text editor is my preference. As for LaTeX, well, yeah, I used it.
I mean, I wrote my PhD thesis in LaTeX, and lots of papers when I was an academic. If you want to
do maths, to this day, really, LaTeX is the way to go. It just produces the most beautifully
typeset maths. There's no maths in my current textbook, well, in the book I'm publishing here.
If there had been, I would have gone down the LaTeX route just like that. It would have been
the obvious choice. But because there wasn't, I don't know, I just felt, yeah, I could have
easily taken it; I really could have done. I haven't really even thought of why I didn't,
other than there was no maths in it. It was the case that whenever you produced something with
LaTeX, it just looked the same. It had enormous margins, and the font used was,
well, I'm talking about my day, which was the early days of LaTeX, perhaps from about, yeah, 87 onwards.
But some people got a little bit prejudiced against it because everything they
produced looked the same, whether it was a paper or a, you know, shopping list or something.
So, but I think that any of those feelings should be reviewed.
Yes, yes. I mean, yeah, I know exactly what you mean about LaTeX all having the same look.
But, yeah, I'm trying to think why that would have been back then. Why you didn't see,
yeah, you're right. I don't remember ever seeing anyone who used anything other than the standard font
that it gave you. It looked really nice, but, you know, everything was the same. You couldn't
make a fancy looking book where, you know, you had margin notes or, you know, interesting chapter
headings and stuff. Well, you could do it, but it all was in the same font and looked like,
"oh, that's LaTeX". Yeah, obviously. You could, I mean, you could do it, because I submitted
papers to journals, and even at that time, this is in the 90s,
they preferred LaTeX; it was such a smooth process for them. It cut down on all their
typesetting overhead. They preferred to use it, but the final product did not look like LaTeX;
it looked like the house style for that journal. A few journals ended up looking like LaTeX, but the
longstanding professional academic journals, no, even then they were able to put it
in two-column format with custom fonts and heading styles. So I think it was probably always
there; it's just that you needed to, you know, open the door to the TeX underneath it and then
be a bit shocked by how complex it is. Yeah, I think that must be the point. Yeah, that's
I mean, back then, I would have been very up for doing that, but the only person I ever remember
going and getting involved in that was the same person who could pretty much speak
PostScript. Well, you'd need PostScript for that. I'm just saying that's the level of intellectual
geekery that he had ascended to, where, you know, I always joked
with him that he could go up to a PostScript printer and speak to it, and it would end up printing
a perfectly formatted graph or something. Yes, and "how are you?", "I'm fine, thank you".
Very good, very good. Yes, so that was, yeah, so I think that's covered all the stuff
I've done, so we'll look forward to talking to you next time, Dave. Yeah, yeah,
what you've been up to for your part. I'll give you a summary of a similar nature, and
you can maybe sort of kick around the ideas and come up with some different views, etc, etc.
That'll be fun, looking forward to it. Okay. Okay. So, well, we'll say goodbye to everybody and
see you next time. Great. Bye-bye. Bye-bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out
how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the
Infonomicon Computer Club, and it's part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website
or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the
Creative Commons Attribution-ShareAlike 3.0 license.