hpr-knowledge-base/hpr_transcripts/hpr1760.txt

Episode: 1760
Title: HPR1760: pdftk: the PDF Toolkit
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1760/hpr1760.mp3
Transcribed: 2025-10-18 08:55:33

---

This is HPR episode 1,760 entitled "pdftk: the PDF Toolkit". It is hosted by Jon Kulp
and is about 21 minutes long. The summary is: intro to the command-line PDF Toolkit.
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hey folks, Jon Kulp in Lafayette, Louisiana here, and today I'm going to talk about the command pdftk.
This is an extremely powerful program that's short for the PDF Toolkit.
Now if you've listened to my previous episodes having to do with eBooks, you have probably heard me say bad things about PDF files,
inasmuch as I think they are not all that much better than paper in terms of their flexibility. They're completely inflexible.
They don't lend themselves well to different screen sizes and all that. And I still believe all of this.
However, we still have to deal with PDFs a lot of the time and sometimes they really are the best available solution.
For my purposes, the one situation where I can't seem to find a suitable alternative for PDF is in musical scores.
Because right now, while there are some options for flexible musical notation that can change line spacing and whatnot to fit different screens,
it's almost always a proprietary system and someone has to have taken the time to create a source file to make it do this.
So at least for now I am stuck with PDFs for my musical scores.
Now when you're dealing with PDFs, sometimes you need to do things to them to make them more useful to you.
For example, when I am putting together score anthologies for my music history classes, I very frequently have to take a large PDF file and extract a few pages from it to put in the anthology for my students.
Now there used to be a time where I thought you had to have the expensive Adobe Acrobat full edition to do this kind of thing.
But I discovered several years ago that there are ways to do it for free on Linux, and apparently on other platforms too.
My favorite way to do it is with the PDF Toolkit, PDFTK.
And while I was putting together the show notes, I went to the website for the people who make PDFTK and discovered that apparently you can use this program on Windows with the same commands, in the same way that you do on Linux.
So that's good news, I guess. I mean, I don't have a Windows machine anymore, but I'm glad that a tool like this is available for Windows users too.
So there are all kinds of things that you can do with PDFTK.
And if you're interested in seeing what they all are, you should just read the man page and see some of the examples that they have.
The things I do most of the time are things like extracting a few pages from a big PDF and saving them as a small one, or extracting, say, the cover page from an article and then only certain pages from it. Or maybe I have a bunch of short PDFs that I need to combine into one big one.
These are very easy, basic commands, and I have some examples in the show notes on how to do them.
For example, if I want to extract pages three through five from a file called fubar.pdf, I would do the following command: pdftk, space, fubar.pdf, space, cat (as in the Linux concatenate command), space,
three hyphen five, space, output, space, and then the file name that you want it to save as.
In the example in the show notes I put excerpt.pdf.
PDFTK, probably for good reason, is very sensitive about any kind of accidental overwriting of your source file.
If you try to save the output under exactly the same name as your source file, it will choke and tell you to try again.
So that's how you could extract just pages three through five from a file.
You can either give a range of pages, if all of your pages are going to be right in a row, or, where it says cat three hyphen five, you could do cat three space four space five, and it would have the same result.
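The extraction command just described can be sketched as a small shell function. This is only an illustration, not something from the show notes: the function name extract_pages and its argument order are assumptions, while the pdftk arguments (cat, a page range or list, output) are the ones given in the episode.

```shell
# Sketch only: a hypothetical wrapper around the extraction command.
# Usage: extract_pages INPUT OUTPUT PAGES...
extract_pages() {
    src=$1
    dst=$2
    shift 2
    # pdftk refuses to write over its own input; check early for a clearer message
    if [ "$src" = "$dst" ]; then
        echo "output file must differ from input file" >&2
        return 1
    fi
    # Remaining arguments are page numbers or ranges, e.g. 3-5 or 3 4 5
    pdftk "$src" cat "$@" output "$dst"
}

# extract_pages fubar.pdf excerpt.pdf 3-5     # a range
# extract_pages fubar.pdf excerpt.pdf 3 4 5   # the same pages, listed one by one
```

Both usage lines expand to the commands read out above, for example pdftk fubar.pdf cat 3-5 output excerpt.pdf.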
Now if I also wanted to grab, let's say, the first page and then also pages three through five, I would do the same command but change it like this: pdftk, space, fubar.pdf, space, cat, space, one, space, three dash five, space, output, space, excerpt.pdf.
So that would grab the first page, then pages three through five, and output it as excerpt.pdf.
To combine several PDFs together, you just list as arguments every PDF that you want to join, in the order that you want them, and then do the cat output thing.
So it would go like this: pdftk, space, file1.pdf, space, file2.pdf, space, file3.pdf, space, and then keep going for as many as you want.
And then once you have listed all the files you want to combine, you add on there cat, space, output, space, combined.pdf.
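Written out, the two commands just described look roughly like this. The function name merge_pdfs and its argument order are assumptions made for this sketch; the pdftk arguments themselves follow the episode.

```shell
# First page plus pages three through five, as read out in the episode:
#   pdftk fubar.pdf cat 1 3-5 output excerpt.pdf

# Sketch only: a hypothetical wrapper for combining files.
# Usage: merge_pdfs OUTPUT INPUT...
merge_pdfs() {
    dst=$1
    shift
    # Inputs are joined in the order given; "cat" with no page list keeps every page
    pdftk "$@" cat output "$dst"
}

# merge_pdfs combined.pdf file1.pdf file2.pdf file3.pdf
```

The usage line expands to pdftk file1.pdf file2.pdf file3.pdf cat output combined.pdf.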
So those are pretty simple, and there are actually some GUIs that people have written to go in front of PDFTK that can make this easier if you're someone who really would prefer to have a graphical interface.
The one that I have installed is called PDF chain and I do have it installed but I don't use it all that often. Normally I just find that it's easier to do the commands and it seems like there are more options when you do it from the command line as well.
Here's a cool example of one that I did for my wife a few years ago. She had scanned an article at the library and it turned out that all of the pages were in reverse order.
Now, when we were in graduate school and copying articles to read for our seminars or whatever,
we would always start by copying the last page and then progressively work toward the first page, because on the photocopier that's how it would come out the right way.
And so she, just by instinct, copied the article back to front and ended up with a file that started at page 50 and went progressively down to page one.
She was annoyed by this and was afraid she would have to go scan the thing again in order to get it in the right order.
And I said no, no, no, just send me the file and I'll take care of it.
And then I did this command that fixed everything for her in about one second.
pdftk, space, wrongorder.pdf, space, cat, space, 50-1, space, output, space, rightorder.pdf.
And so what that did, it tells it to cat the file from the last page to the first and then output it as a new file and it worked perfectly.
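As a sketch, the reversal can be wrapped so that it does not depend on knowing the page count. The function name reverse_pages is an assumption; the end keyword is part of pdftk's page-range syntax and stands for the last page, so end-1 plays the same role as 50-1 did for the 50-page scan. The exact spelling of the file names below is a guess from the audio.

```shell
# Sketch only: a hypothetical wrapper for reversing page order.
# Usage: reverse_pages INPUT OUTPUT
reverse_pages() {
    # "end-1" is a descending range: last page first, first page last
    pdftk "$1" cat end-1 output "$2"
}

# reverse_pages wrongorder.pdf rightorder.pdf
# Same result for a 50-page file, as in the episode:
#   pdftk wrongorder.pdf cat 50-1 output rightorder.pdf
```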
So there are tons of things that you can do of this sort, you know, extracting, reordering pages, rotating pages, things like that.
And the man page will give you lots of examples on that.
You can also do things for security, like apply password protection. You can also... what was it I was just seeing on the man page here?
You can compress and uncompress the file.
Here's an example where I used the uncompress command recently: I had to take some training, and for this training they gave us a PDF of the user manual for their company or whatever.
And on every page of this user manual was an extremely annoying watermark.
And it was not the kind of watermark where you could very easily go into the metadata and remove it.
It was really, really in there.
And I thought, well, I'm going to just try to see if I can go inside there with a text editor and remove it.
But to do that, you have to uncompress the PDF file first.
So I uncompressed it and then opened the file up in Vim and started just going through it to see if I could see which code it was that was making this watermark.
And after some experimentation, I found it.
There were, I believe, 513 lines of code that were applying the watermark around every page.
And I think the manual was something like 28 pages.
So there were 28 instances of these 513 lines of code in the file.
And so once I figured out exactly what code it was, I just went through and found the beginning of each instance and did... and this is something that Dave Morris will probably talk about in the Vim hints.
But I applied the number 513.
Or how did I do it? There are two ways that you could do this.
I think I did either 513dd, to delete one line 513 times,
or I did colon, what would it be?
Period, comma, plus 513, D.
That would tell it to start here and go down 513 lines and delete all of them.
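A note on the counts, with a small sketch. In Vim, 513dd deletes 513 lines starting at the cursor line. The ex range .,+N covers the current line plus the N lines after it, so :.,+512d removes exactly 513 lines, while :.,+513d removes 514. The sketch below is not Vim and not what was used in the episode: it swaps in awk to show the same "delete COUNT lines starting at line START" operation on plain text, and the function name delete_block is an assumption.

```shell
# Sketch only: awk stand-in for Vim's "{count}dd" on a text stream.
# Usage: delete_block START COUNT < input > output
delete_block() {
    # Keep lines before START and lines at or after START + COUNT
    awk -v start="$1" -v count="$2" 'NR < start || NR >= start + count'
}

# printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | delete_block 3 4
# removes lines 3, 4, 5 and 6, leaving 1 2 7 8 9 10
```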
But anyway, whatever it was, I did it and then when I was done, I recompressed the PDF file and it was gloriously clean without any watermark at all.
And it was much easier for me to deal with.
So you can edit these things in a text editor if you know what you're doing.
But that's honestly the only time I've ever done that.
But it was very, very satisfying as somebody who likes to be able to customize things.
It was great to be able to do that.
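The uncompress, edit, recompress round trip described above might look like the following sketch. The function names and the manual file names are hypothetical; uncompress and compress are pdftk output options that expand or restore the page streams so the file can be read in a text editor.

```shell
# Sketch only: hypothetical wrappers around pdftk's uncompress/compress options.
# Usage: uncompress_pdf INPUT OUTPUT
uncompress_pdf() {
    pdftk "$1" output "$2" uncompress
}

# Usage: compress_pdf INPUT OUTPUT
compress_pdf() {
    pdftk "$1" output "$2" compress
}

# uncompress_pdf manual.pdf manual_plain.pdf   # make the page code readable
# vim manual_plain.pdf                         # remove the watermark code by hand
# compress_pdf manual_plain.pdf manual_clean.pdf
```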
Alright, now one of the most useful things I've ever found how to do with PDFTK is embedding bookmarks in a document.
So basically you can create a table of contents in a PDF file that otherwise would not have one.
This is especially important if you've got a very large PDF with lots of separate parts that you might want to jump around to.
For example, I have for my music history class a score anthology.
So an anthology of musical scores that they are supposed to have access to.
And I concatenated all of the separate little PDF files into one great big one.
And the thing is about a thousand pages long.
So it's really, really big.
And you can imagine how unwieldy that would be if you have to just kind of hunt and pick as it were looking around trying to find what piece you want.
That would be totally unacceptable.
And so I learned how to embed a table of contents.
It's actually not that hard to do.
The first thing you need to do is get the metadata from the file that you want to work on.
And you do that with the dump data command.
So for example, if I wanted to get the metadata from the file fubar.pdf, the command would be pdftk, space, fubar.pdf, space, dump underscore data underscore utf8.
And when you run that command, it will send the contents of the metadata out to standard output.
If you want to save this as a file, which I normally do, you would just redirect it to a text file.
Now I used the UTF8 option because some of the special characters that I want to use in the titles of these foreign compositions have to be encoded in UTF8 or else they don't look right.
You can also just do dump underscore data without the underscore UTF8 and it will dump the data from the file for you.
But I'm in the habit of using the UTF8 option.
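A sketch of the dump step, with the redirect to a text file included. The function name dump_metadata is an assumption; fubar.info is the file name used later in the episode for the metadata file.

```shell
# Sketch only: a hypothetical wrapper that saves the metadata to a text file.
# Usage: dump_metadata INPUT INFO_FILE
dump_metadata() {
    # dump_data_utf8 writes the metadata to standard output; redirect it to a file
    pdftk "$1" dump_data_utf8 > "$2"
}

# dump_metadata fubar.pdf fubar.info
# Without the UTF-8 variant:
#   pdftk fubar.pdf dump_data > fubar.info
```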
So when you do that, if you open the file that you got out of there up in a text editor, you'll see a number of things.
Up at the top there will be some of the basic metadata like the date of its creation, the program used to create the file.
It might have things like title and author, but it might not.
But at a certain point, you will see the value number of pages.
Actually, I exaggerated a little bit on my musical score anthology. It does not have 1,000 pages. It's got 765.
Still a very high number.
And right after that number, the number of pages, I don't even know if it matters really where you put this stuff, but I normally put it right after the complete number of pages.
You can start adding bookmarks.
And the way you do it is that each bookmark has four lines of code.
The first one is BookmarkBegin.
The next line is BookmarkTitle, colon,
and that is followed by whatever you want the text to read on the bookmark.
Then BookmarkLevel, colon, on the next line.
I so far have only used two levels of bookmark:
the top level is one, and my second-level bookmark is two.
And then after the BookmarkLevel line, you complete the bookmark by telling it what page: BookmarkPageNumber, colon, and the page.
And these are done in camel caps. So the beginning of each word is capitalized.
And there are no spaces in these key words.
And I have an example in the show notes of how I did the bookmarks for Beethoven's Symphony number five in my score anthology,
where the BookmarkTitle is Beethoven's Symphony number five in C minor, opus 67, BookmarkLevel is one, and BookmarkPageNumber is 205.
And then another bookmark begins, where I have extra bookmarks at a second level of the table of contents so you can jump to the beginning of each of the four movements in the symphony.
And so the only thing that would change there is the bookmark level and the title.
And of course, well, I guess everything changes, actually.
The bookmark title I have here, Beethoven five, Roman numeral one, Allegro con brio, would be the name of the first movement.
And it's bookmark level two, page number 205.
And so forth for each of the individual things.
And you can put as many of these things in there as you want.
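The four-line bookmark records described above can be generated rather than typed by hand. This is a sketch: the function name bookmark is an assumption, and the exact wording of the two titles is a guess at how the show notes spell them. The keywords, levels and page number are the ones given in the episode.

```shell
# Sketch only: print one four-line bookmark record in pdftk's metadata format.
# Usage: bookmark TITLE LEVEL PAGE
bookmark() {
    printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' "$1" "$2" "$3"
}

# Top-level entry for the symphony, then a second-level entry for its first movement
bookmark "Beethoven's Symphony No. 5 in C minor, Op. 67" 1 205
bookmark "Beethoven 5, I. Allegro con brio" 2 205
```

The printed records would be pasted (or appended) into the dumped metadata file before running the update step.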
And once you have your metadata file edited the way you want, with all of the bookmarks coded, then you run a new command that will update the info on the PDF file.
And I have the command in the show notes for this example.
It goes pdftk, space, fubar.pdf, space, update underscore info underscore utf8, space, fubar.info, and fubar.info is the name that I gave to the text file that has all of this metadata.
So we've got, let me see, pdftk fubar.pdf update info utf8 fubar.info, space, output, space, fubar with table of contents dot pdf.
So it takes your input file updates the metadata with the file Fubar dot info and then outputs a new file that has this table of contents embedded making a much more useful file.
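The update step can be sketched the same way. The function name apply_metadata and the output name fubar_toc.pdf are hypothetical (the episode only describes the output as "fubar with table of contents"); update_info_utf8 is the counterpart of dump_data_utf8.

```shell
# Sketch only: a hypothetical wrapper for writing edited metadata into a new PDF.
# Usage: apply_metadata INPUT INFO_FILE OUTPUT
apply_metadata() {
    # Reads bookmarks and other metadata from INFO_FILE and writes a new PDF
    pdftk "$1" update_info_utf8 "$2" output "$3"
}

# apply_metadata fubar.pdf fubar.info fubar_toc.pdf
```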
And I've done this not only with my anthology of scores, but also with a number of things that are related to my counterpoint projects,
like the complete Well-Tempered Clavier, Books One and Two, of J.S. Bach.
I've got embedded bookmarks where you can jump to any of the 48 preludes or fugues in either of the books.
I did it with his Two-Part Inventions and with a couple of other composers' scores,
so that both I and my students can immediately jump right to the piece we want without having to scroll up and down through a document that is often more than 100 pages, trying to find what we want.
So anyway, that's about all I want to talk about with PDFTK. I hope you'll try it out. It's really one of the most useful programs that you can ever have for dealing with PDF files.
I'm going to have at least two links here. One is to the PDFTK Linux man page and the other is to the documentation at PDF Labs, which is the creator of PDFTK.
I hope you have found that useful. Go hack yourself some PDFs. Bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.