Episode: 2708
Title: HPR2708: Ghostscript
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2708/hpr2708.mp3
Transcribed: 2025-10-19 07:54:37
---
This is HPR episode 2708 titled Ghostscript and is part of the series Privacy and Security. It is hosted by Klaatu, is about 23 minutes long, and carries a clean flag. The summary is: Klaatu talks about manipulating PDF with gs and pdf-stapler.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello folks, Kay Wisher here to remind you that it's that time of year again. Time for the Hacker Public Radio New Year's Eve Show. For those who don't know, on New Year's Eve, December 31, 2018, at 10am UTC, that is 5am Eastern Standard Time, we will have a recording going on the HPR Mumble server for anyone to come on and say happy New Year and talk about whatever they want. We will leave the recording going until January 1, 2019, 12pm UTC, which will be 7am Eastern Standard Time, or until the conversation stops. Please visit hackerpublicradio.org to find all the details and links about how to set up the PC Mumble client or your favorite mobile app, the Mumble server connection details, our Etherpad show notes, and the live audio stream if you only prefer to listen in on the lively banter. So please stop by and say hi, and maybe join in the conversation with other HPR listeners and contributors. It's always a good time.

You're listening to Hacker Public Radio. My name is Klaatu. In this episode, I want to talk a little bit about PDFs. Specifically, how I manage to live with them. I've done an episode, I'm pretty sure, with Lostnbronx about why PDFs are some of the most important pieces of code to ever come your way. And I feel that way very strongly.
However, that doesn't change the fact that I deal with them all the time, whether I'm purchasing them online under the guise of "oh, these are ebooks" (PDFs are not ebooks at all), or whether it's because I'm outputting to PDF at work. Whatever the case, I have to deal with PDFs a lot. And I just want to talk about some of the random observations and tricks that I've come up with when having to do things with PDFs.

So the first thing that I want to talk about, and I've talked about this on my show, GNU World Order, before, but I think it deserves not really another mention, but some additional information. And that is the Ghostscript command, or as it is typed, the gs command. Ghostscript is the free and open-source implementation of PostScript, PostScript being the syntax and code used to describe how a printer is going to produce whatever it produces. You might have dealt with PostScript directly as an EPS. That's an Encapsulated, I think, PostScript file. PostScript is the back end for PDFs, and it is the back end for many printers. So the vectorized versions, and the code that goes into ensuring that what you print is the same thing as the stuff that you see in your PDF, that's PostScript. And you can manipulate that a little bit. We'll look in a moment at just how ugly PDFs are, and how difficult that makes it to really do anything useful to one after it's been generated. But there are a couple of quick hacks that you can do to help yourself manage some of the PDFs in your life.

So the first problem that I often have to solve, and this I've covered on my show before, but not on Hacker Public Radio, so I might as well talk about it. The first thing is that a lot of PDFs are really, really large. And that is because PDFs are intended as printer input, really. You send a PDF to a printer, and that produces that PDF as a physical thing, as a physical document.
That's what a PDF is, which means that a lot of times when people create a PDF, they go, for instance, if I'm in Scribus, I'll go to Export, Save as PDF. And if I go to Color, output intended for screen or web, okay, that's one thing. Now I could go to printer. The printer output, typically, all the defaults, actually that didn't reset the defaults, but anyway, I can set these defaults. So the resolution for graphics, let's say 300 DPI maximum image resolution. Compression method, lossy or lossless? Yeah, let's go lossless. Compression quality, let's go maximum quality. So the defaults get set very high for the typical output of a PDF.

The resulting file size is large. For instance, the sample PDF that I did for my episode on Scribus is about nine megabytes for the printer version. And that's quite a hefty file size for one page. It's a one-page document, and it's 9.1 megabytes. Then the smaller version of that is like 900 kilobytes, less than a megabyte. And that's output for the web.

So there are a couple of different profiles that PostScript, or Ghostscript at least (I don't know exactly what the PostScript terminology is), can accept for its output. And you can apply them yourself to something that already exists. So for instance, if I have this example file from my Scribus episode, I can do gs for Ghostscript. And then dash s, lowercase s, DEVICE in all capitals, equals pdfwrite. So I'm just outputting back out to the PDF writer. I'm not actually printing. Dash d CompatibilityLevel equals, and I'm going to set it really low because I like backward compatibility, so that's 1.4. Then dash d PDFSETTINGS equals, and this is the profile. There are five different profiles that Ghostscript can understand. One is forward slash screen, which is intended for screen viewing only. So it's 72 DPI maximum for images, and anything greater than that gets down-res'd.
Slash ebook, that's forward slash ebook, is a low-quality 150 DPI image. So that's not bad, but you probably wouldn't want to print from it. I mean, you honestly probably could, but you wouldn't send it to a professional printer, probably. Forward slash printer is high quality, 300 DPI. Forward slash prepress is 300 DPI images with the color space being managed. And then forward slash default is something else, apparently super similar to screen. I'm not clear on the difference there. So those are the different profiles that you can leverage.

So if we just go for forward slash screen for a nine megabyte file, that should have a pretty dramatic result, which is what I'm looking for for the sake of this proof of concept. Then I'm going to do another option, dash d and then BATCH. And these options, I've never seen them typed any other way, so I'm assuming there can be no space between the option and its argument. So dash d BATCH, all one word, with BATCH being all capitals. And then dash s OutputFile equals output.pdf. And then I'm going to point it at this example+bleed.pdf, which is in the current directory. The dash d BATCH makes sure that Ghostscript does not drop down into an interactive prompt, which it does by default otherwise. So don't leave that out.

And yes, so here's an output.pdf at 142 kilobytes, which, down from nine megabytes, is well over an order of magnitude smaller. And the difference is really only in the images. The only optimization that it has available to it is to down-res images. That's really all we can do. Well, there's something else, but in what we're speaking about right now, it's just the images. And you know, the text is still text. So you can zoom in on that forever, and it will recalculate how it's anti-aliasing the text.
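The command read out above can be sketched as a small Python helper that only assembles the argument list. This is a minimal sketch, not a definitive wrapper: the function name `gs_shrink_args` and the file names are illustrative assumptions, and the flags are just the ones spoken in the episode.

```python
import shlex

# The five profile names mentioned in the episode.
PROFILES = ("/screen", "/ebook", "/printer", "/prepress", "/default")


def gs_shrink_args(src, dst, profile="/screen"):
    """Build the gs argument list described in the episode.

    gs_shrink_args is a hypothetical helper name; it does not run gs,
    it only returns the arguments you would pass to it.
    """
    if profile not in PROFILES:
        raise ValueError(f"unknown profile {profile!r}; expected one of {PROFILES}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",          # write back out to a PDF, not a printer
        "-dCompatibilityLevel=1.4",   # low PDF version for backward compatibility
        f"-dPDFSETTINGS={profile}",   # the quality profile
        "-dBATCH",                    # exit instead of dropping to the gs prompt
        f"-sOutputFile={dst}",
        src,
    ]


if __name__ == "__main__":
    # File names follow the episode's example; adjust to your own files.
    print(shlex.join(gs_shrink_args("example+bleed.pdf", "output.pdf")))
```

To actually run it, the list could be handed to `subprocess.run(args, check=True)` on a machine where `gs` is installed.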
And it'll look great no matter what. You could print that and be perfectly happy with it. It's just the graphics that got down-res'd. Not a big deal, really.

So now the other thing that I've done in the past to shrink the size and complexity of PDFs, and that's a big one, to be honest. Sometimes I can kind of handle a PDF on several devices, whether it's my little e-ink ebook reader or a mobile phone or something like it. It's a pain, because you still have to scroll around to try to read, and it doesn't really do that well. But the real problem for me, a lot of times, is that it'll spend so much time trying to render these graphics on this slow device that it slows down the reading process to just being too annoying. So half the time my issue is not even necessarily the resolution of graphics. It is the presence of graphics. I just don't need to spend any cycles on generating the graphics half the time. That's not always true. Sometimes the graphics are integral to what you're reading, so you need them there. But other times you don't.

And as it turns out, this is, I guess, a common enough problem, because there is a filter in Ghostscript to filter out images. The filter is dash d and then FILTERIMAGE, all capitals. Now that filters out very specifically raster images. If you need to get rid of vector images as well, there's a separate filter for that. I find in practice that I don't really have to deal with the vector images very often. It's almost always raster images that are in PDFs, and they are huge.

So, adding the dash d FILTERIMAGE to the same command. I guess I'll read that out again. So that's Ghostscript, or gs, space, dash s DEVICE equals pdfwrite, that's where we're going to. Dash d CompatibilityLevel equals 1.4.
That's the version of PDF readers that will be able to open this, which is, I think, as far back as you can go. I haven't seen, in recent years, anything farther back than that. Dash d PDFSETTINGS equals forward slash screen. I'm just keeping it small, especially because we're not even going to have images in here, so it doesn't really matter. Dash d BATCH, dash d FILTERIMAGE, dash s OutputFile equals output.pdf, and then the example+bleed.pdf, which is the big nine megabyte file that we're going off of here. So you do that, and it processes and dumps output.pdf into the current directory.

Now I'm doing ls -lh on output.pdf, and it is down from nine megabytes to 40 kilobytes. That's a lot more reasonable. And if I open the thing up, then I see on my screen a perfect representation of that PDF, except there's just no graphic there. So we're not spending any file size on the graphics, and we're not spending any CPU cycles trying to render those graphics for no good reason. So that's a huge one for me. That's really taken me from not being able to read a PDF on some device to actually being able to read the PDF on that device. It's made all the difference.

Now the place where that's also made a difference is when printing. Sometimes I'll have a PDF and I want to print something for reference on actual paper. It does happen sometimes. And a lot of times they'll have background images, for style, really. I mean, it's a background image to evoke some kind of mood or just to look cool, and then some other images here and there. The other images I could usually stand, but to print 50 pages of background floral prints behind my text just doesn't make any sense.
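The image-stripping variant of the command can be sketched the same way. This is a hedged sketch: `gs_filter_args` is a hypothetical helper name, and the optional `-dFILTERVECTOR` switch is an assumption on my part about which "separate filter" for vector graphics is meant, since the episode does not name it.

```python
import shlex


def gs_filter_args(src, dst, drop_vectors=False):
    """Build a gs argument list that drops raster images from a PDF.

    Hypothetical helper; it only assembles arguments and does not run gs.
    """
    args = [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/screen",
        "-dBATCH",
        "-dFILTERIMAGE",              # drop raster images, as in the episode
    ]
    if drop_vectors:
        # Assumption: the unnamed vector filter is -dFILTERVECTOR.
        args.append("-dFILTERVECTOR")
    args += [f"-sOutputFile={dst}", src]
    return args


if __name__ == "__main__":
    print(shlex.join(gs_filter_args("example+bleed.pdf", "output.pdf")))
```

Keeping the input file as the last argument mirrors the order in which the command is read out.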
So if you do this command, the Ghostscript command, and filter out all those images, that gets rid of those background images. I mean, it gets rid of the foreground ones too, which is a little bit annoying, but really the background images are the ones that matter to me. I don't even mind printing without the foreground images, usually. I usually don't want the foreground images, or if I do, it's just a couple of them, and those I could screenshot and print separately, or maybe not print at all and just have them on a screen as a single file, that sort of thing. So Ghostscript FILTERIMAGE is really, really useful if you, like me, sometimes need to print a PDF and don't want to spend all of your ink on fanciful background images, or if PDFs are simply too large for you.

Now, in a past episode I've talked about bookmarks: retaining and editing and applying bookmarks to a PDF file. I've also done an episode on PDFtk, which is the program that I generally use to chop up PDFs when I need to just extract a page from a PDF here or there for whatever reason, or maybe I need to extract a couple of pages and then merge them back together, so basically taking a subset of a larger PDF. And I realized that I probably should mention a related program, because I don't think I mentioned it (I may have), but it's called pdf-stapler. PDF stapler is an application that sort of takes the place of PDFtk. Not exactly; it doesn't have one-to-one parity of features. It doesn't quite have everything that PDFtk does, but it's got that magical 80 or 90% of stuff. What it doesn't do all that well is the bookmarking stuff; that's PDFtk, really. But PDF stapler, and I have seen it generally called pdf dash stapler, is, I think, Python based, as far as I remember. And its syntax is similar. It's not the same; it's actually just similar enough to confuse me half the time, but it's
kind of similar to PDFtk. So for instance, if you're going to cat a bunch of files into one big PDF. A common use case for this, for me, at one point I used to have to do this a lot: I would take a collection of images, convert them to PDF, and then concatenate them into a big PDF. That was a fairly typical thing to do for some artists. They would want their things in a PDF, but they couldn't figure out the easy and quick way to get 100 photos or whatever into one file. And that was very frequently doing a convert command on all the PNGs or whatever in the current directory, resizing them and outputting them as, like, JPEGs, and then running some command to concatenate all those things into a big PDF.

So for instance, if I was doing that with PDF stapler, it would be pdf dash stapler, space, cat, that's the command, and then space, and then I guess I would just do a wildcard dot pdf, because I would have done a convert on all those JPEGs to PDF. So wildcard dot pdf, and then space, and, I don't know, output.pdf. And it puts all of the files that you pointed it at into one big PDF that will open and people can flip through.

So it's cat or sel, for some reason. I'm not really sure why they do that; I'm not sure if there's a difference. There's cat to concatenate pages, and there's also something called sel, s-e-l, for select the given page range. Again, I'm not 100% sure if there's going to be some other function for that or if it's just the same thing, but as far as I can tell it's the same thing. Anyway, there's also del, d-e-l, for delete. You can delete a page or a range of pages. There's burst, or split, which is creating one file per page for an input PDF, which is something that I think people probably would need to do. I've definitely
heard people needing to do that. I personally can't imagine having to do that. No, I can: for a printer spread, totally, I can see doing that. And then there's also zip, which is merge or collate the given input files interleaved, so it's odds and evens, that sort of thing. There's also info, which displays PDF metadata, but as far as I've been able to find in the command, there's nothing to reapply the metadata to a PDF. So you can get the data from something, but whether you can reapply it to your new PDF, or to another PDF for some reason, as far as I can tell there is no way in PDF stapler for that to happen.

The site that you can download that at is github.com/hellerbarde/stapler, and I will put a link to that in the show notes. H-E-L-L-E-R-B-A-R-D-E is the username, and it's just called stapler there. I don't know if I'm using an older version or if the command simply has remained pdf dash stapler. I don't really remember where I got this thing. It's just one of those things that I have on my work computer and have been using as is, with great success. So that's another tool that I use.

It's really interesting if you look at PDF files. It's kind of shocking. If you go to emacs, space, and then output.pdf (I'm just doing output.pdf because that's what I just made with my Ghostscript thing that removed the images), then I hit return. Now, in Emacs, it actually renders the PDF for me, which I don't actually want at this particular moment. So we're going to hit Ctrl-C Ctrl-C, and that gets me to the source view, if you will. And you can see what goes into making a PDF a PDF, and it is horrible to look at. It really is. It's honestly just dismal. You really can't make heads or tails of it. But what's funny is that you kind of get this cadence, and there's this line here
called stream, s-t-r-e-a-m, and that appears to begin a block of binary data; it's nothing that you can actually read. And then at the end of all that there's an endstream tag, I guess you could call it, or declaration, and then an endobj, and then a declaration of the object number. I don't know where the object numbers come from. I don't know what's generating those. It's pretty mysterious to look at.

But what's really funny is, if you go into these streams and just start deleting things, it's kind of entertaining to see exactly how little effect you have on the PDF output. Like, I just deleted a bunch of stuff from a stream, and it took away the v in the word "gave" and the m in the word "fanaticism" in the PDF that I generated, and that's all it did. And it was this huge chunk of code that I just got rid of. You can do that, and the PDF still opens. It's really kind of frightening in a way, because you think: what could someone just put into a PDF file and post online for people to download? Because apparently the PDF would just open, and you have no idea, really, what's in the PDF. It's really, really strange.

Now, I have broken it enough at one point to where it wouldn't open, but it isn't really something that you find often. There's a lot of flex. It's not very strict, is what I'm trying to say. You can delete all kinds of things. Sometimes there will be no visible change whatsoever. Other times there'll be little quirks, like maybe a font will disappear, so you're just left with a normal font instead of something that was supposed to be italicized or whatever. So yeah, it just kind of depends on what you're deleting. But it is quite interesting to have a look behind the scenes, and like I say, you can do that in Emacs.
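The stream, endstream, and endobj cadence described above can also be poked at from a script instead of an editor. The following is a rough sketch under loud assumptions: it is a naive byte scan, not a real PDF parser, the `SAMPLE` bytes are a made-up PDF-like fragment rather than a valid PDF, and the helper names are hypothetical. Binary stream data in a real file could confuse it.

```python
import re

# Made-up, PDF-like bytes for illustration only (not a valid PDF).
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Length 11 >>\nstream\nhello world\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"3 0 obj\n<< /Length 4 >>\nstream\n\x00\x01\x02\x03\nendstream\nendobj\n"
    b"%%EOF\n"
)

# "N G obj" headers: object number, generation number, keyword.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj")
# Everything between a "stream" line and the following "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def object_numbers(data):
    """Return the object numbers found by a naive scan of the bytes."""
    return [int(m.group(1)) for m in OBJ_RE.finditer(data)]


def stream_lengths(data):
    """Return the byte length of each stream body found by a naive scan."""
    return [len(m.group(1)) for m in STREAM_RE.finditer(data)]


if __name__ == "__main__":
    print("objects:", object_numbers(SAMPLE))
    print("stream sizes in bytes:", stream_lengths(SAMPLE))
```

Pointing the same two functions at the bytes of a real file, read with `open(path, "rb").read()`, gives a quick feel for how much of a PDF is opaque stream data.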
When you open Emacs, it'll render the PDF for you, so just hit Ctrl-C Ctrl-C to get to the text view, and you can kind of poke around and see what's in a PDF. And you should. It's surprising what you can just put into PDFs. It really is. It's very shocking, and it kind of makes me think that maybe a file format with a little bit more transparency, and also stricter syntax checking, would be a good idea. Because with these PDFs, as far as I can tell, you could just put whatever you wanted into them and then send them around, and no one would ever really know. I mean, I guess it would depend. Maybe you'd have to put, for instance, a GPG-encoded something or other in there. Maybe you'd want to encode it. But certainly it wouldn't be the first place for people to look, I wouldn't imagine. Now, could you do that if there are MD5 sums being taken and so on? No, obviously not. But it is fascinating to see just how lazy the PDF format really is, and how bloated, apparently, it is. Because, I kid you not, I've deleted screenfuls of information and reopened the PDF with no apparent change in display. It's pretty shocking.

So there you go. That's PDFs for you. Hopefully I've given you some ways to reduce their size, to simplify them, to make them a little bit more portable, which is funny, because I think that's what it used to stand for: portable. Maybe it was paperless all along. I forget. Either way, that's PDFs, that's Ghostscript, and that's PDF stapler. Hope it's helpful. Talk to you next time.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound
and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.