Episode: 2708
Title: HPR2708: Ghostscript
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2708/hpr2708.mp3
Transcribed: 2025-10-19 07:54:37
---
This is HPR episode 2708 titled Ghostscript and is part of the series Privacy and Security.
It is hosted by Klaatu and is about 23 minutes long and carries a clean flag.
The summary is: Klaatu talks about manipulating PDFs with gs and pdf-stapler.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello folks, Kay Wisher here to remind you that it's that time of year again.
Time for the Hacker Public Radio New Year's Eve Show.
For those who don't know, on New Year's Eve, December 31, 2018, at 10am UTC,
that is 5am Eastern Standard Time,
we will have a recording going on the HPR Mumble server for anyone to come on and say happy
New Year and talk about whatever they want.
We will leave the recording going until January 1, 2019, 12pm UTC.
That will be 7am Eastern Standard Time, or until the conversation stops.
Please visit hackerpublicradio.org to find all the details and links
about how to set up the PC Mumble client or your favorite mobile app,
the Mumble server connection details,
our Etherpad show notes, and the live audio stream if you only prefer to listen in on the
lively banter. So please stop in and say hi and maybe join in the conversation with other HPR
listeners and contributors. It's always a good time.
You're listening to Hacker Public Radio. My name is Klaatu.
In this episode, I want to talk a little bit about PDFs.
Specifically, how I manage to live with them.
And I've done an episode, I'm pretty sure, with Lostnbronx about why PDFs are some of
the most important pieces of code to ever come your way.
And I feel that way very strongly.
However, that doesn't change the fact that I deal with them all the time, whether I'm
purchasing them online under the guise of, oh, these are ebooks, when PDFs are not ebooks at all,
or whether it's because I'm using them at work,
outputting to PDF at work. Whatever the case, I have to deal with PDFs a lot.
And I just kind of want to talk about some of the random observations and tricks that I've come
up with when having to do things with PDFs. So the first thing that I want to talk about,
and I've talked about this on my show, GNU World Order, before, but I think it deserves
not really another mention, but some additional information, is the Ghostscript command,
or as it is typed, the gs command. So Ghostscript is the free and open-source implementation of
PostScript, PostScript being the syntax and code used to describe how a printer is going to
produce whatever it produces. So you might have dealt with PostScript directly as an EPS.
That's an Encapsulated, I think, PostScript file. PostScript is the back end for PDFs,
and it is the back end for many printers. So the vectorized versions and the code that goes into
ensuring that what you print is the same thing as the stuff that you see in your PDF,
that's PostScript. And you can manipulate that a little bit. We'll look in a moment at just how
ugly PDFs are and how difficult that makes it to really do anything useful to one after it's been
generated. But there are a couple of quick hacks that you can do to help yourself manage some of
the PDFs in your life. So the first problem that I often have to solve, and this I've covered
on my show before, but not on Hacker Public Radio, so I might as well talk about it,
is that a lot of PDFs are really, really large. And that is because PDFs are intended as
printer input, really. You send a PDF to a printer, and that produces that PDF as a physical
thing, as a physical document. That's what a PDF is, which means that a lot of times when people
create a PDF, they go, for instance, if I'm in Scribus, I'll go to export, save as PDF. And if I
go to color, output intended for screen or web, okay, that's one thing. Now I could go to printer.
The printer output typically, all the defaults, actually that didn't reset the defaults, but anyway,
I can set these defaults. So the resolution for graphics, let's say 300 DPI; maximum image
resolution, 300 DPI; compression method, lossy or lossless. Yeah, let's go lossless. Compression
quality, let's go maximum quality. So the defaults get set very high for the typical output of a PDF.
The resulting file size is large. For instance, the sample PDF that I did for my episode on
Scribus is about nine megabytes for the printer version. And that's quite a hefty file size for
one page. It's a one-page document. It's 9.1 megabytes. Then the smaller version of that
is like 900 kilobytes, less than a megabyte. And that's output for the web. So there are a
couple of different profiles that PostScript, or Ghostscript at least (I don't know exactly what
the PostScript terminology is), can accept for its output. And you can apply those yourself to
a PDF that already exists. So for instance,
if I have this example file from my Scribus episode, I can do gs, for Ghostscript, and then dash
s, lowercase s, DEVICE, all capitals, equals pdfwrite. So I'm just outputting back out
to the PDF writer. I'm not actually printing. Then dash d CompatibilityLevel equals, and I'm going
to set it really low because I like backward compatibility, so that's 1.4. Then dash d
PDFSETTINGS equals, and this is the profile. There are five different profiles that Ghostscript
can understand. One is forward slash screen, which is intended for screen viewing only. So it's
72 DPI maximum for images, and anything greater than that gets downres'd. Forward slash
ebook is a low-quality 150 DPI image. So that's not bad, but you probably wouldn't
want to print from it. I mean, you honestly probably could, but let's say
you wouldn't send it to a professional printer, probably. Forward slash printer is high-quality
300 DPI. Forward slash prepress is a 300 DPI image with the color space being managed. And then
forward slash default is something else, apparently super similar to screen. I'm not clear on
the difference there. So those are the different profiles that you can leverage. So
if we just go for forward slash screen for a nine megabyte file, that should have a pretty dramatic
result, which is what I'm looking for for the sake of this proof of concept. So then I'm going
to do another option, dash d and then BATCH. And these options, I've never seen them
typed any other way, so I'm assuming there can be no space between the option and the
argument. So dash d BATCH, all one word, with BATCH being all capitals.
And then dash s OutputFile equals output dot pdf. And then I'm going to point it at this example
plus bleed dot pdf, which is in the current directory. The dash d BATCH makes sure that
Ghostscript does not drop down into an interactive prompt, which it does by default
otherwise. So don't leave that out. And yes, so here's an output dot pdf at 142
kilobytes, which, I mean, down from nine megabytes is orders of magnitude, literally. And the
difference is really only in the images. So the only optimization that it has available to
it is to downres images. That's really all we can do. Well, there's something else. But
in this case, in what we're speaking about right now, it's just the images.
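As a rough sketch of the command read out above, here is a small Python helper that only builds the gs argument list and prints it as a shell line. The gs switches are the ones named in the episode; the function name shrink_command, the PROFILES table, and the placeholder file names input.pdf and output.pdf are assumptions made for illustration.

```python
import shlex

# Profile names from the episode; the notes are the host's own summary.
PROFILES = {
    "screen": "72 DPI images, screen viewing only",
    "ebook": "150 DPI images, low quality",
    "printer": "300 DPI images, high quality",
    "prepress": "300 DPI images, color space managed",
    "default": "something else (the host is unsure how it differs)",
}


def shrink_command(src, dst, profile="screen", compat="1.4"):
    """Build, but do not run, the gs argument list for downsampling a PDF."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",               # write a PDF rather than print
        f"-dCompatibilityLevel={compat}",  # target PDF version
        f"-dPDFSETTINGS=/{profile}",       # one of the five profiles
        "-dBATCH",                         # exit instead of prompting
        f"-sOutputFile={dst}",
        src,
    ]


if __name__ == "__main__":
    # input.pdf and output.pdf are placeholder names.
    print(shlex.join(shrink_command("input.pdf", "output.pdf")))
```

Passing the returned list to subprocess.run would actually invoke Ghostscript, assuming gs is installed and on the PATH.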
And you know, the text is still text. So you can zoom in on that forever, and it will recalibrate
how it's aliasing the text, and it'll look great no matter what. You could print that and be
perfectly happy with it. It's just the graphics that got downres'd. Not a big deal really. So now
the other thing that I've done in the past to shrink the size and complexity of PDFs, and that's
a big one, to be honest. Sometimes I can kind of handle a PDF on several
devices, whether it's my little e-ink ebook reader or a mobile phone or
something like that. It's a pain because you still have to scroll around to try to read, and
it doesn't really do that well. But the real problem for me is a lot of
times that it'll spend so much time trying to render these graphics on this slow device that it
slows down the reading process to just being too annoying. So half the time my issue is not
even necessarily the resolution of graphics. It is the presence of graphics. I just don't
need to spend any cycles on generating the graphics half the time. That's not always true.
Sometimes the graphics are integral to what you're reading, so you need them there. But other
times you don't. And as it turns out, this is, I guess, a common enough problem, because there is
a filter in Ghostscript to filter out images. And the filter is dash d and then FILTERIMAGE, all
capitals. Now that filters out very specifically raster images. So if you need to get
rid of vector images as well, there's a separate filter for that. I find in practice that I don't
really have to deal with the vector images very often. It's almost always raster images
that are in PDFs, and they are huge. So adding the dash d FILTERIMAGE to the same command, I
guess I'll read that out again. So that's Ghostscript, or gs, space, dash s DEVICE equals
pdfwrite. That's where we're going to. Dash d CompatibilityLevel equals 1.4. That's the
version of PDF readers that will be able to open this, which is I think as far back as you can go.
I've never seen anything earlier than that. I mean, I haven't seen in recent years
anything farther back than that. Dash d PDFSETTINGS equals forward slash screen. I'm just keeping
it small, especially because we're not even going to have images in here, so
it doesn't really matter. Dash d BATCH, dash d FILTERIMAGE, dash s OutputFile equals output dot
pdf, and then the example plus bleed dot pdf, which is the big nine megabyte file that we're
going off of here. So you do that and it processes and dumps output dot pdf into the
current directory. Now I'm doing ls dash lh on output dot pdf and it is down from nine megabytes
to 40 kilobytes. That's a lot more reasonable. And if I open the thing up, then I see on my screen
a perfect representation of that PDF, except there's just no graphic there. So we're not spending
any file size on the graphics and we're not spending any CPU cycles trying to render those
graphics for no good reason. So that's a huge one for me. That's really taken me from
not being able to read a PDF on some device to actually being able to read the
PDF on a device. It's made all the difference. Now the place where that's also made a difference
is when printing. Sometimes I'll have a PDF and I want to print something for reference
on actual paper. It does happen sometimes. And a lot of times they'll have background images,
for style really. I mean, it's a background image to evoke
some kind of mood or just to look cool, and then some other images here and there. And maybe
the images I could usually stand, but to print 50 pages of background floral prints over my
text, or behind the text ostensibly, just doesn't make any sense. So if you do this
Ghostscript command and filter out all those images, that gets rid of those background images.
I mean, it gets rid of the foreground ones too, which is a little bit annoying, but really the
background images are the ones that matter to me, and I don't even mind printing without
the foreground images usually. I usually don't want the foreground images, or if I do it's just
a couple of them, and those I could screenshot and print separately, or maybe not print at all
and just have them on a screen as a single file, that sort of thing. So Ghostscript FILTERIMAGE is
really, really useful if you, like me, sometimes need to print a PDF and don't want to spend all
of your ink on fanciful background images, or if PDFs are simply too large for you.
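The image-stripping variant of the command can be sketched the same way. This is an illustrative Python helper that only assembles the argument list. -dFILTERIMAGE is the switch named in the episode, and -dFILTERVECTOR is Ghostscript's separate vector filter that the host alludes to without naming. The function name strip_command, the FILTERS table, and the file names are assumptions.

```python
import shlex

# "image" is the filter named in the episode; "vector" is the separate
# filter for vector drawings that the host mentions but does not spell out.
FILTERS = {"image": "-dFILTERIMAGE", "vector": "-dFILTERVECTOR"}


def strip_command(src, dst, drop=("image",)):
    """Build, but do not run, a gs argument list that omits whole classes
    of content from the output PDF."""
    flags = [FILTERS[kind] for kind in drop]
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/screen",
        "-dBATCH",
        *flags,
        f"-sOutputFile={dst}",
        src,
    ]


if __name__ == "__main__":
    # input.pdf and output.pdf are placeholder names.
    print(shlex.join(strip_command("input.pdf", "output.pdf")))
```

Calling strip_command with drop=("image", "vector") would add both filters, which may be worth trying when a background turns out to be a vector drawing rather than a raster image.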
In a past episode I've talked about bookmarks: retaining, editing, and applying bookmarks to a
PDF file. I've also done an episode on pdftk, which is the program that I generally use to chop
up PDFs when I need to just extract a page from a PDF here or there for
whatever reason, or maybe I need to extract a couple of pages and then merge them back together,
so basically taking a subset of a larger PDF. And I realized that I probably
should mention a related program, because I don't think I mentioned it (I may have),
but it's called pdf-stapler. And pdf-stapler is an application that sort of takes the place of
pdftk. Not exactly; it doesn't have one-to-one parity of features. It doesn't quite have
everything that pdftk does, but it's got that magical 80 or 90% of stuff. And what it doesn't
do all that well is the bookmarking stuff; that's pdftk, really. But pdf-stapler, and I have
seen it generally called pdf dash stapler, is, I think, Python-based, as far
as I remember, and its syntax is similar. It's not the same; it's actually just similar enough to
confuse me half the time, but it's kind of similar to pdftk. So for instance, if you're
going to cat a bunch of files into one big PDF, and a common use case for this, for me,
at one point I used to have to do this a lot: I would take a collection of images and then
convert them to PDF and then concatenate them into a big PDF. That was a fairly typical thing
to do for some artists. They would want their things in a PDF, but
they couldn't figure out the easy and quick way to get, you know, 100 photos or whatever into
one file. And that was very frequently doing a convert command on all the PNGs or whatever
in the current directory, resizing them and outputting them as, like, JPEGs,
and then running some command to then concatenate all those things into a big PDF. So for
instance, if I was doing that with pdf-stapler, it would be pdf dash stapler space cat, that's the
command, and then space, and then I guess I would just do a wildcard dot pdf, because I would
have done a convert on all those JPEGs to PDF, and then
space and, I don't know, output dot pdf. And it puts all of the files that you pointed it at into
one big PDF that will open and people can flip through. So it's cat or sel; for some reason, I'm
not really sure why they do that. I'm not sure if there's a difference, but there's cat
to concatenate pages, and there's also something called sel, s-e-l, for select the given page
range. And again, I'm not 100% sure if there's going to be some other function
for that or if it's just the same thing, but as far as I can tell it's the
same thing. But anyway, there's also del, d-e-l, for delete; you can delete a page or a range of
pages. There's burst or split, which is creating one file per page for an input PDF, which is
something that I've definitely heard people
needing to do. I personally can't imagine having to do that. No, I can: for a printer spread,
totally, I can see doing that. And then there's also zip, which is merge or collate the given
input files interleaved, so it's odds and evens, that sort of thing. There's also info,
which displays PDF metadata, but as far as I've been able to
find in the command there's nothing to reapply the metadata to a PDF. So
you can get the data from something, but whether you can reapply it to your new PDF or
to another PDF, as far as I can tell there is no way in pdf-stapler for that to
happen. The site that you can download that at is github.com slash hellerbarde slash stapler, and I
will put a link to that in the show notes. H-E-L-L-E-R-B-A-R-D-E is the username, and it's just
called stapler there. I don't know if I'm using an older version or if the command simply has
remained pdf dash stapler. I don't really remember where I got this thing. It's just
one of those things that I have on my work computer and have been using as is with great success.
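The cat step described above might look like the following sketch, which again only builds the command. It assumes the tool is installed under the name pdf-stapler, as on the host's machine (the upstream project is called stapler), and that inputs are listed before the output file, as read out in the episode. The helper names pdfs_in and cat_command are hypothetical.

```python
import shlex
from pathlib import Path


def pdfs_in(directory, exclude="output.pdf"):
    """Return the PDFs in a directory in name order, skipping the output
    file so that a previous run is not concatenated into the next one."""
    folder = Path(directory)
    skip = folder / exclude
    return [str(p) for p in sorted(folder.glob("*.pdf")) if p != skip]


def cat_command(inputs, output, tool="pdf-stapler"):
    """Build, but do not run, a 'cat' invocation: inputs first, output last."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return [tool, "cat", *inputs, output]


if __name__ == "__main__":
    # page1.pdf, page2.pdf and output.pdf are placeholder names.
    print(shlex.join(cat_command(["page1.pdf", "page2.pdf"], "output.pdf")))
```

Unlike a bare wildcard in the shell, pdfs_in leaves out an existing output.pdf. That is a design choice of this sketch, not something the episode covers.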
So that's another tool that I use. It's really interesting if you look at PDF files.
It's kind of shocking. You can look in a PDF: if you go to
emacs space and then output dot pdf (I'm just doing output dot pdf because that's what I just made
with my Ghostscript command that removed the images), then I hit return. Now Emacs actually renders
the PDF for me, which I don't actually want at this particular moment, so we're going to
hit control-c control-c, and that gets me to the source view, if you will. And you can see what
goes into making a PDF a PDF, and it is horrible to look at. It really is. It's honestly just
dismal. You really can't make heads or tails of it, but what's funny is that you kind of get
this cadence. There's this line here called stream, s-t-r-e-a-m, and that
seems to begin a block of binary data that is nothing that you
can actually read. And then at the end of all that there's an endstream tag, I guess you could
call it, or declaration, and then an endobj, and then a declaration of the object number, which,
I don't know where the object numbers come from. I don't know what's generating those.
It's pretty mysterious to look at. But what's really funny is if you go into these
streams and just start deleting things, it's kind of entertaining to see exactly how little
effect you have on the PDF output. Like, I just deleted a bunch of stuff from a stream and it took
away the v in the word gave and the m in the word fanaticism in the PDF that I generated,
and that's all it did. And it was like this huge chunk of code that I just got rid of. And you can
do that and the PDF still opens. It's really kind of frightening in a way
because you think, what could someone just put into a PDF file and post online for
people to download? Because apparently the PDF would just open and you'd have no idea,
really, what's in the PDF. It's really strange. I don't think I've ever
quite seen, now, I have broken it enough at one point where it wouldn't open, but
it isn't really something that you find; there's a lot of
flex. It's not very strict, is what I'm trying to say. You can delete all kinds of things.
Sometimes there will be no visible change whatsoever. Other times there'll be
just little quirks, like maybe a font will disappear, so you're just left with
a normal font instead of something that was supposed to be italicized or whatever. So yeah, it just
kind of depends on what you're deleting, but it is quite interesting to have a look behind the
scenes. And like I say, you can do that in Emacs. When you open it in Emacs, it'll render the PDF
for you, so just hit control-c control-c to get to the text view, and you can kind of poke around
and see what's in a PDF. And yeah, it's surprising what you can just
put into PDFs. It really is. It's very shocking, and it kind of makes me think that maybe
a file format with a little bit more transparency and also stricter
syntax checking would be a good idea, because these PDFs, as far as I can tell, you could just put
whatever you wanted into them and then send them around and no one would ever really know. I mean,
I guess it would depend. Maybe you'd have to put, for instance, a GPG-encoded something or
other in there, you know, maybe you'd want to encode it, but certainly it wouldn't be the
first place for people to look, I wouldn't imagine. Now, could you do that if there are
MD5 sums being taken and so on? No, obviously not. But it is fascinating to see just how lazy
the PDF format really is, and how bloated apparently it is, because I kid you not, I've deleted
screenfuls of information and reopened the PDF with no apparent change in display. It's pretty
shocking. So there you go, that's PDFs for you. Hopefully I've given you some ways to reduce
their size, to simplify them, to make them a little bit more portable, which is funny because I
think that's what it used to stand for: portable. Maybe it was paperless all along, I forget.
Either way, that's PDFs, that's Ghostscript, it's pdf-stapler. Hope it's helpful. Talk to you next
time.
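For anyone who wants to poke at the stream, endstream, endobj cadence described in the episode without Emacs, here is a naive Python sketch. In the PDF format an object opens with a line of the form "N G obj", where N is the object number and G the generation number, and both are assigned by whatever program wrote the file. The SAMPLE bytes are a hand-written fragment, not a complete PDF, and list_objects is a hypothetical helper that can be fooled by binary stream data that happens to contain the same keywords.

```python
import re

# A tiny hand-written fragment in the shape described in the episode.
# It is not a complete, openable PDF.
SAMPLE = (
    b"4 0 obj\n<< /Length 11 >>\nstream\nhello world\nendstream\nendobj\n"
    b"5 0 obj\n<< /Type /Page >>\nendobj\n"
)

# "<number> <generation> obj ... endobj", matched as lazily as possible.
OBJ = re.compile(rb"(\d+) (\d+) obj\b(.*?)endobj", re.DOTALL)


def list_objects(data):
    """Return (object number, generation, has_stream) for each object found."""
    found = []
    for match in OBJ.finditer(data):
        body = match.group(3)
        has_stream = b"stream" in body and b"endstream" in body
        found.append((int(match.group(1)), int(match.group(2)), has_stream))
    return found


if __name__ == "__main__":
    for number, generation, has_stream in list_objects(SAMPLE):
        kind = "with stream" if has_stream else "no stream"
        print(f"object {number} generation {generation}: {kind}")
```

To try it on a real file, read the bytes with open("output.pdf", "rb").read() and pass them to list_objects, where output.pdf is a placeholder name.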
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast
network that releases shows every weekday, Monday through Friday. Today's show, like all our shows,
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded
by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution
at binrev.com. If you have comments on today's show, please email the host directly, leave a
comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's
show is released under a Creative Commons Attribution-ShareAlike 3.0 license.