Episode: 2667
Title: HPR2667: Create PDF bookmarks with Pdftk
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2667/hpr2667.mp3
Transcribed: 2025-10-19 07:13:04

---

This is HPR episode 2667, entitled "Create PDF bookmarks with pdftk". It is hosted by Klaatu, is about 22 minutes long, and carries a clean flag. The summary is: a basic intro to pdftk functions. This episode of HPR is brought to you by archive.org. Support universal access to all knowledge by heading over to archive.org/donate.

Hi everyone, this is Klaatu, and you're listening to Hacker Public Radio. This is an episode about PDFs. Now, in past episodes I have gone on the record talking about how much I hate PDFs, and I would like to assure you that that has not changed: I still hate PDF. But that doesn't mean that I don't have to work with them. I tend to get involved in publishing projects of all kinds. The purpose of PDF originally, as I think I've said before, was and is to be a pre-flight technology. That means that when you're going out to print, when you're going to send PostScript data to a printing device, you can see exactly what that printer is going to receive by exporting it to a PDF first. That shows you what your printer is going to produce. Then you send that PDF to the printer, and the printer prints it out.

But anyway, I'm digressing before I even get to my point. The point is that there is a way to manipulate PDF metadata to make your PDFs better, and one of those ways is a really handy tool called pdftk. Now, I will tell you that pdftk on Fedora and Red Hat, and anything that uses their repositories, is not functional right now. I forget the reason; there's some incompatibility somewhere with something else in their repositories, rendering pdftk functionally functionless. So you cannot use it on those systems, but everywhere else in the world, as far as I know, you can.
For Fedora and Red Hat, if you're doing basic things, then pdf-stapler is something that you should check out. Now, obviously that might change by the time you hear this episode, depending on when you hear it. If you're in 2021 right now, then this might be a moot point. For some of the things I do in this episode you can use pdf-stapler, and for others you would need pdftk. So just be aware of that caveat.

Here's the problem that I was seeking to solve. I had large PDFs that were sent to me, and I needed to chop them up into parts which would then be distributed on a chapter-by-chapter basis. So I was extracting portions from a larger PDF and putting them into smaller, bite-size chunks. To do that I can use pdftk, and I guess I might as well go ahead and talk about how that's done, because that's vaguely useful.

So for instance, I can take this PDF here, I'll call it food.pdf. First I have to be in the right directory. Then I do pdftk, and as my input I put food.pdf, so: pdftk food.pdf. Then I tell it what I want to do with that PDF. In this case I'm going to chop the PDF up, so I'm going to do a cat function. What I want to cat is, let's say, page one for the cover and page two for the credits. I'm just going off of memory, but I could just as easily open up food.pdf in Okular or xpdf and look at what I want. So yeah, one is the cover, two is actually the credits, three is the table of contents (I don't really want that, actually), and then it looks like chapter one starts on page six. So then I'm going to do six, and then a dash, and chapter two starts on page forty-one. Nope, sorry, twenty-two, so that means this one ends on twenty-one. Then I'll do another space and look for the back cover, if there is a back cover. Yes, there is, so I will also cat page two hundred and fifty-eight.
Nope, two sixty-one. Okay, so now that I've catted those things into temporal demiplane space, I will do an output, and I'll just call it out.pdf. That might take a little while, depending on the size of the PDF that you're chopping up, but it does exactly what it sounds like it's doing. It goes through the input file, which was food.pdf, grabs page one, space, two, space, six through twenty-one, space, two sixty-one, and outputs all those pages to a new PDF called out.pdf.

So now if I open up out.pdf in something like Okular or xpdf, or whatever your favorite viewing program is, that's great. Here I'm using Okular, which I don't always use, but it's set as the default. Sometimes xpdf tends to render faster for me, but right now I'm leaving it on Okular because it's got the nice side panel that shows the thumbnails of each page, and that's useful.

But what's missing is this: in the original PDF I had a nice contents bar. I had bookmarks, is what they're called in PDF language. If I wanted to go to the cover page, I could click on it. If I wanted to go to chapter one, I could click on it, and it tells me page six, and in fact if I expand that, it shows me the different sections within that chapter, with little labels, you know, with section names. So that's really, really useful, and when you chop up a PDF you lose that information.

Now, that's not the only time you lose that information. Sometimes that information never exists. Maybe you got a PDF that didn't have bookmarks, or maybe you're creating a PDF from some program. It doesn't matter from what. Well, it does matter from what, because some programs might write bookmarks in for you.
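The page extraction walked through above can be written out as a small shell function. This is only a sketch: the function name `extract_pages` is a hypothetical helper, not something pdftk provides, and `food.pdf`, `out.pdf`, and the page list are simply the values mentioned in the episode. It assumes pdftk is installed and on the PATH.

```shell
# Sketch of the page extraction described in the episode.
# extract_pages is a hypothetical helper name; only pdftk itself is real.
# Usage: extract_pages INPUT.pdf OUTPUT.pdf PAGE_OR_RANGE...
extract_pages() {
    input=$1
    output=$2
    shift 2
    # pdftk syntax: <input> cat <pages and ranges> output <new file>
    pdftk "$input" cat "$@" output "$output"
}

# The example from the episode: cover, credits, chapter one, back cover.
# extract_pages food.pdf out.pdf 1 2 6-21 261
```

Wrapping the command in a function keeps the page list as ordinary arguments, so the same call shape works for any chapter.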
But if it's just a straight export to PDF or a print to PDF (you know that trick, where instead of exporting something as a PDF you go to print, and you print to a file, and you point it to someplace on your hard drive and tell it to save as, you know, my-great-book.pdf, and it produces a PDF), that works. It works great. I mean, it does what you want it to do, but there are certainly no handy, human-readable section names or chapter names or page numbers that you can quickly refer to.

So that's what we want to solve. We want to get those bookmarks back, or create bookmarks, for that matter, in the case of their absence. If you're working with something that had bookmarks and you have lost them in your transferal process, then you can extract them with pdftk. So it's pdftk, then food.pdf, and then you do a dump_data (that's dump, underscore, data). So instead of catting pages, you're dumping data. With cat, of course, we gave a page range of the stuff that we wanted to cat. In the case of dump_data, there is nothing else that it requires, so it's just dump_data, and then we need to define our output, so that's the word output, followed by a file name.
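The dump step just described might look like the sketch below. The wrapper names `dump_metadata` and `show_bookmarks` are hypothetical and exist only for illustration; the file name `book.mark` is the one the host uses. The `grep` filter assumes the bookmark lines in the dump all start with `Bookmark`, which matches the keys read aloud in the episode.

```shell
# dump_metadata is a hypothetical wrapper around pdftk's dump_data.
# Usage: dump_metadata INPUT.pdf OUTPUT_TEXT_FILE
dump_metadata() {
    # dump_data takes no arguments of its own; "output" names the text file.
    pdftk "$1" dump_data output "$2"
}

# show_bookmarks (also hypothetical) prints only the bookmark lines of a dump.
# Usage: show_bookmarks DUMP_FILE
show_bookmarks() {
    grep '^Bookmark' "$1"
}

# Example, using the file names from the episode:
# dump_metadata food.pdf book.mark
# show_bookmarks book.mark
```

Filtering the dump is optional; it is just a quick way to check whether a PDF has any bookmarks before opening the file in an editor.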
I'll just call it book.mark. Again, that takes a little bit of time, but it's scraping all of the bookmark metadata, and a little bit more, actually. Now I'll open up book.mark in my favorite text editor, which is Emacs. If you want to learn how to use Emacs for yourself, you can go back and listen to episodes 852 and 856 and a couple of episodes around that time.

Okay, so it looks like this has opened. This is just kind of the way that I've been doing it, which probably isn't the best way to learn how to do PDF bookmarking. Surely there is technical documentation somewhere about what all of this data actually means. But for the basic usage, which is what I've been doing, just creating basic bookmarks as needed, this is pretty straightforward.

The first 15 lines are the header information for the bookmark data. The first couple of values are pretty obvious, and you start to get a feel for the format. First is InfoBegin on one line, then InfoKey: ModDate, and then InfoValue: D:20150122141951-05 and so on. Then InfoBegin again. In other words, you have begin blocks and no end block. You just have the name of whatever's in that block, followed by a colon, followed by a value. So it's a key-value type, almost an INI format, except without the main block being delimited by brackets.

So yeah, I've got InfoBegin, InfoKey: ModDate, and then InfoValue. I've got InfoBegin again, InfoKey: CreationDate, and a different date is given for that one. InfoBegin, InfoKey: Creator, and the InfoValue for that I'll change right now, and put down that it was pdftk (Linux). InfoBegin, InfoKey: Producer, and again, that's probably talking about a library, and I don't actually know what pdftk uses on the back end. I did an ldd in hopes of uncovering, I don't know, a lib-something-or-other-pdf, but
it's not there, so pdftk just does it itself. So I'm just going to put pdftk on that line as well.

Then there's something that has been in every PDF I've looked at so far, and I'm talking about, I don't know, over the course of the past year I've seen lots, a hundred or more, at least one a week and then some. That's PdfID0 and PdfID1. I don't know what these are derived from. They're clearly MD5 sums, but I don't know what they're calculated on. I don't know what their function is. I don't know if it has to be unique, like if you're concatenating two PDFs. So you can either gloss over that and just ignore it, or change it if you want, whichever you prefer. I have not run into problems with it yet, but I also don't know its actual function.

Then it's got line 15, which I consider sort of the end of the header portion, and that is NumberOfPages. Since I've dumped this from the original PDF, it differs drastically from the PDF that I've carved out of the original. So I'll open up my out.pdf again and scroll down to the bottom here, and it looks like it's got 19 pages. So I will change this number from 261 to 19.

Then it becomes a pretty repetitious process of this sequence. BookmarkBegin is the start of the block. So BookmarkBegin, and then BookmarkTitle, colon, and then some string. In this case it's Cover, because the first page of this thing is its front cover. BookmarkLevel is 1, BookmarkPageNumber is 1, and that's the block. Then it starts again: BookmarkBegin, BookmarkTitle: Credits, BookmarkLevel: 1, BookmarkPageNumber: 2. BookmarkBegin, BookmarkTitle: Table of Contents, BookmarkLevel: 1, BookmarkPageNumber: 3, and so on. So it's that block over and over and over again. If you copy and paste that block, you can give yourself as many blocks as you want. So what I've
been doing generally is going to my open PDF, which is out.pdf, and seeing that page one is indeed the cover, so I'll skip that one. Okay, great. The next one is the credits, or what we would really call the front matter, or at least what I've seen called the front matter, and that is indeed page two, so I can skip that one too. Credits, page two.

Now it starts to differ at this point, because I skipped the table of contents. It's kind of useless to have the table of contents when it's referring to a 261-page book and I'm only delivering a 19-page book. So the third page of the PDF is what I'll call chapter one. I will delete Table of Contents and type in Chapter One. BookmarkLevel is 1, and the page number is 3.

Now, if there's a section within this chapter, and let's say for argument's sake that there is, then I could define a new section. Let's say there's a very significant section that starts on page five. Here it says Foreword on page five, but I'm not using that; I've diverged drastically now. So I'll just erase that, but I'll reuse the block. BookmarkTitle is, we'll call it, Section One. BookmarkLevel now becomes 2, because we want this to be indented one space, or one block, in our PDF viewer. The chapter and the credits and the cover were all level one, so the section, a subsection, becomes level two. Now, if there's a section within this section, and I'll say that there is for argument's sake, and I'll call it Subsection One, then the level would become 3. You get the idea. And we'll say that that was on page six. We'll leave it there.

Okay, so you can bookmark pretty much whatever you want to bookmark. I'm just going to go to the very end of my document, which is 2,484 lines long, by the way, and I'll talk about why that is in a moment. I'll delete all that, and then I'll just copy
some BookmarkBegin sections and just kind of overwrite them. We know that our chapter ends on page 18 of this particular PDF, because it's a 19-page PDF and the 19th page is the back cover. So I'm just going to skip down and say Back Cover. It's just BookmarkTitle: Back Cover. There's no fancy way that you have to put that; it's just Back, space, Cover. You don't have to escape anything or quote anything. It's whatever you want the viewer to see when they're looking at the PDF. BookmarkLevel goes back to 1, because we're no longer in a subsection, and the BookmarkPageNumber is 19.

So there we go. We've got a 39-line file: 15 lines defining the header information, and then another fifteen, eighteen, whatever it is, defining the bookmarks themselves. I'll save this as book.mark, and then I'll go back to my terminal here and run pdftk.

Now my input has changed from food.pdf, which was the original, to my new one, the one that I want to apply my bookmarks to. So it's pdftk, space, out.pdf, and in this case I'm going to do update_info. Now, I always get it confused, because it's dump_data but update_info. I tend to mix those up in all kinds of interesting ways. Just don't do that. update_info is the thing that you want. I mean, if you accidentally type a command that's not valid, it won't do anything. It'll just fail, but you'll wonder why. So it's update, underscore, info, and now we need to tell it where the info lives. That is, of course, book.mark, the file that I created. Now, it still wants a new output file, so I'll put output, and I'll call this one chapter-one.pdf.

That happens really fast, especially for such a small PDF. So now I'll click on my chapter-one.pdf, and sure enough, in Okular I have got a table of contents with human-readable chapter titles
and sections, and Chapter One, Section One, and the back cover. They're collapsible, because I'm looking at it in Okular and it's got normal functionality like that. But I also still have my thumbnails, so you don't lose anything this way. You're just updating your info to actually have bookmarks. On your system you might not have to do this, but I'm also opening it in xpdf, just to see what it looks like, and yeah, it looks pretty much the same. I've got all the bookmarks over here on the left. They're collapsible, they're human readable, and it's easy to understand what the bookmarks are of. And that's it. That's pdftk being used to create bookmarks.

Now, there's other information that you can extract from a PDF, which I've not really played around with yet. Let me get another copy of this thing, because I overwrote all of the information. I'll just do a dump_data again, with output book.mark again, just to get the full thing. You can see this for yourself if you ever try this at home, but I just want to make mention of it.

If you look at the dump data from something with pre-existing bookmarks, you get a lot more information than just bookmarks. I mean, there are a lot of bookmarks, like, lots. But at some point it shifts from all those bookmarks to page media data. So for instance: PageMediaBegin, PageMediaNumber: 1, PageMediaRotation: 0, PageMediaRect: 0 0 612 774, PageMediaDimensions: 612 774. Then PageMediaBegin again, and now we're on the second page, and so on. It's really repetitious. In this file it never changes: PageMediaRotation is always zero, because none of the pages are rotated, then there's PageMediaRect, which presumably defines the active space of the page, and then
the dimensions. I have not really messed around with those yet. I did some initial tests on manipulating that data and then applying it back to a file, and it didn't have the drastic results I'd expected, such as rotating a page or cropping a page. I kind of half expected that to occur, but that's not what occurred. It just ignored that data, actually, is what it did. So when I'm dumping the data I'm getting that data, but I don't think applying it as a bookmark file then makes that happen in the PDF. I haven't really played around with that much, and I haven't read a thing about any of this. It's just something that I've been doing because it was pretty easy to do with pdftk. It was something that I had to do, so it gets done.

One thing I have done is work on some scripts to do offsets, such that when extracting large chunks of a PDF, I can get a data dump from the bigger PDF, the parent PDF if you will, extract the bookmarks from that parent, paste them into a bookmarks file for the child, and then run a script to offset all those page numbers so that I don't have to go through and change them manually. But it's not a very good script right now, because it doesn't deal with changing bookmark levels, which I would want it to do. So it's still kind of a work in progress, depending on how much time I want to spend on a given day working on a script versus working on the PDF project that I'm involved in.

So there you go, that is pdftk, a little bit of an intro. It's a great little application, and it's great to know, I think, how to make a proper table of contents for your digital paper, what is it, paperless digital format, or whatever it is, paperless document format, that thing. Thanks for listening. Talk to you later.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are
a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.
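The page-offset script mentioned near the end of the episode is not shown, so the following is only a guess at its general shape, not the host's script. Both function names, `offset_bookmarks` and `apply_bookmarks`, are hypothetical. The first subtracts a fixed number from every `BookmarkPageNumber` line (for example 3, since chapter one moved from page 6 in the parent to page 3 in the child); the second applies the edited file with `update_info` as described earlier. Like the script described in the episode, it does not adjust bookmark levels.

```shell
# offset_bookmarks is a hypothetical helper, not the host's script.
# Usage: offset_bookmarks OFFSET < parent_bookmarks > child_bookmarks
# Subtracts OFFSET from every BookmarkPageNumber; other lines pass through.
offset_bookmarks() {
    awk -v off="$1" '
        /^BookmarkPageNumber:/ { print "BookmarkPageNumber: " ($2 - off); next }
        { print }
    '
}

# apply_bookmarks is a hypothetical wrapper around pdftk's update_info.
# Usage: apply_bookmarks INPUT.pdf BOOKMARK_FILE OUTPUT.pdf
apply_bookmarks() {
    pdftk "$1" update_info "$2" output "$3"
}

# Example with placeholder file names:
# offset_bookmarks 3 < parent.mark > book.mark
# apply_bookmarks out.pdf book.mark chapter-one.pdf
```

A single fixed offset only works for a contiguous page range; the cover and back cover entries would still need editing by hand, as in the episode.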