Episode: 2708
Title: HPR2708: Ghostscript
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2708/hpr2708.mp3
Transcribed: 2025-10-19 07:54:37

---

This is HPR episode 2708, titled Ghostscript, and is part of the series Privacy and Security.
It is hosted by Klaatu, is about 23 minutes long, and carries a clean flag.
The summary is: Klaatu talks about manipulating PDFs with gs and pdf-stapler.
This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello folks, Kay Wisher here to remind you that it's that time of year again: time for the Hacker Public Radio New Year's Eve Show.
For those who don't know, on New Year's Eve, December 31, 2018, at 10 AM UTC (that is 5 AM Eastern Standard Time), we will have a recording going on the HPR Mumble server for anyone to come on, say happy New Year, and talk about whatever they want. We will leave the recording going until January 1, 2019, at 12 PM UTC (that will be 7 AM Eastern Standard Time) or until the conversation stops.
Please visit hackerpublicradio.org to find all the details and links about how to set up the PC Mumble client or your favorite mobile app, the Mumble server connection details, our Etherpad show notes, and the live audio stream if you prefer only to listen in on the lively banter.
So please stop in, say hi, and maybe join in the conversation with other HPR listeners and contributors. It's always a good time.
You're listening to Hacker Public Radio. My name is Klaatu. In this episode, I want to talk a little bit about PDFs. Specifically, how I manage to live with them.
I'm pretty sure I've done an episode with Lostnbronx about why PDFs are some of the most important pieces of code to ever come your way, and I feel that way very strongly. However, that doesn't change the fact that I deal with them all the time, whether I'm purchasing them online under the guise of "oh, these are ebooks" (PDFs are not ebooks at all), or whether I'm using them at work, outputting to PDF at work. Whatever the case, I have to deal with PDFs a lot.
I just want to talk about some of the random observations and tricks that I've come up with when having to do things with PDFs. The first thing that I want to talk about, and I've talked about this on my show, GNU World Order, before, but I think it deserves not really another mention but some additional information, is the Ghostscript command, or as it is typed, the gs command.
Ghostscript is a free and open-source implementation of PostScript, PostScript being the syntax and code used to describe what a printer is going to produce. You might have dealt with PostScript directly as an EPS.
That's an Encapsulated, I think, PostScript file. PostScript is the back end for PDFs, and it is the back end for many printers. So the vectorized versions, and the code that goes into ensuring that what you print is the same thing as the stuff that you see in your PDF, that's PostScript. And you can manipulate that a little bit. We'll look in a moment at just how ugly PDFs are, and how difficult that makes it to really do anything useful to one after it's been generated. But there are a couple of quick hacks that you can do to help yourself manage some of the PDFs in your life.
The first problem that I often have to solve, and this I've covered on my show before but not on Hacker Public Radio, so I might as well talk about it, is that a lot of PDFs are really, really large. That is because PDFs are intended as printer input, really. You send a PDF to a printer, and that produces that PDF as a physical document. That's what a PDF is.
Which means that a lot of times when people create a PDF, the settings are high. For instance, if I'm in Scribus, I'll go to Export, Save as PDF. If I go to Color and set the output intended for Screen/Web, okay, that's one thing. Now I could go to Printer.
The printer output typically... actually, that didn't reset the defaults, but anyway, I can set these defaults. So: resolution for graphics, let's say 300 DPI. Maximum image resolution, 300 DPI. Compression method, lossy or lossless? Let's go lossless. Compression quality, let's go maximum. So the defaults get set very high for the typical output of a PDF.
The resulting file size is, indeed, large. For instance, the sample PDF that I did for my episode on Scribus is about nine megabytes for the printer version. That's quite a hefty file size for a one-page document: 9.1 megabytes. The smaller version of that is about 900 kilobytes, less than a megabyte, and that's output for the web.
So there are a couple of different profiles that PostScript, or Ghostscript at least (I don't know exactly what the PostScript terminology is), can accept for its output. And you can apply them yourself to something that already exists.
For instance, if I have this example file from my Scribus episode, I can do gs for Ghostscript, and then dash s (lowercase s), DEVICE in all capitals, equals pdfwrite. So I'm just outputting back out to the PDF writer; I'm not actually printing. Then dash d, CompatibilityLevel, equals 1.4. I set it really low because I like backward compatibility. Then dash d, PDFSETTINGS, equals, and this is the profile.
There are five different profiles that Ghostscript can understand. One is forward slash screen, which is intended for screen viewing only, so it's 72 DPI maximum for images; anything greater than that gets downres'd. Forward slash ebook is low-quality, 150 DPI images. That's not bad, but you probably wouldn't want to print from it. I mean, you honestly probably could, but you wouldn't send it to a professional printer. Forward slash printer is high quality, 300 DPI. Forward slash prepress is 300 DPI images with the color space being managed. And then forward slash default is something else, apparently super similar to screen; I'm not clear on the difference there. So those are the different profiles that you can leverage.
If we just go for forward slash screen on a nine-megabyte file, that should have a pretty dramatic result, which is what I'm looking for, for the sake of this proof of concept. Then I'm going to add another option, dash d and then BATCH. I've never seen these options typed any other way, so I'm assuming there can be no space between the option and its argument. So dash d BATCH, all one word, with BATCH in all capitals.
And then dash s OutputFile equals output.pdf, and then I'm going to point it at this example+bleed.pdf, which is in the current directory. The dash d BATCH makes sure that Ghostscript doesn't drop down into an interactive prompt, which it otherwise does by default, so don't leave that out.
And yes, here's an output.pdf at 142 kilobytes, which, down from nine megabytes, is orders of magnitude, literally. The difference is really only in the images. The only optimization that it has available to it is to downres images. That's really all we can do. Well, there's something else, but in what we're speaking about right now, it's just the images.
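As a written-out sketch of the command read aloud above (not a copy of the host's terminal): the flags are the ones named in the episode, except -dNOPAUSE, which is an addition commonly paired with -dBATCH. The shrink_pdf function name and its default profile are illustrative assumptions.

```shell
#!/bin/sh
# shrink_pdf INPUT OUTPUT [PROFILE]
# Re-renders INPUT through Ghostscript's pdfwrite device using one of the
# profiles from the episode: screen, ebook, printer, prepress, default.
shrink_pdf() {
    src=$1
    dst=$2
    profile=${3:-screen}    # assumption: fall back to the smallest profile
    gs -sDEVICE=pdfwrite \
       -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS="/$profile" \
       -dNOPAUSE -dBATCH \
       -sOutputFile="$dst" \
       "$src"
}

# Example call, using the file names from the episode:
#   shrink_pdf example+bleed.pdf output.pdf screen
#   ls -lh example+bleed.pdf output.pdf
```

Swapping the third argument for ebook, printer, or prepress should give a feel for how much each profile trades size against image quality on a given file.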
And, you know, the text is still text, so you can zoom in on that forever, and it will recalibrate how it's aliasing the text, and it'll look great no matter what. You could print that and be perfectly happy with it. It's just the graphics that got downres'd. Not a big deal, really.
Now, the other thing that I've done in the past is to shrink the size and complexity of PDFs, and that's a big one, to be honest. Sometimes I can kind of handle a PDF on several devices, whether it's my little e-ink ebook reader or a mobile phone or something like that. It's a pain, because you still have to scroll around to try to read, and it doesn't really do that well. But the real problem for me, a lot of times, is that it'll spend so much time trying to render these graphics on a slow device that it slows down the reading process to the point of being too annoying. So half the time my issue is not even necessarily the resolution of graphics; it is the presence of graphics. I just don't need to spend any cycles on generating the graphics half the time. That's not always true.
Sometimes the graphics are integral to what you're reading, so you need them there. But other times you don't. And as it turns out, this is, I guess, a common enough problem, because there is a filter in Ghostscript to filter out images. The filter is dash d and then FILTERIMAGE, all capitals. Now, that filters out very specifically raster images. If you need to get rid of vector images as well, there's a separate filter for that. I find in practice that I don't really have to deal with the vector images very often; it's almost always raster images that are in PDFs, and they are huge.
So, adding the dash d FILTERIMAGE to the same command, I guess I'll read that out again. That's gs, space, dash s DEVICE equals pdfwrite; that's where we're going. Dash d CompatibilityLevel equals 1.4; that's the version of PDF readers that will be able to open this, which is, I think, as far back as you can go.
I've never seen anything earlier than that. I mean, I haven't seen in recent years anything farther back than that. Dash d PDFSETTINGS equals forward slash screen; I'm just keeping it small, although since we're not even going to have images in here, it doesn't really matter. Dash d BATCH, dash d FILTERIMAGE, dash s OutputFile equals output.pdf, and then example+bleed.pdf, which is the big nine-megabyte file that we're going off of here.
So you do that, and it processes and dumps output.pdf into the current directory. Now I'm doing ls dash lh on output.pdf, and it is down from nine megabytes to 40 kilobytes. That's a lot more reasonable. And if I open the thing up, I see on my screen a perfect representation of that PDF, except there's just no graphic there. So we're not spending any file size on the graphics, and we're not spending any CPU cycles trying to render those graphics for no good reason. That's a huge one for me. That has really taken me from not being able to read a PDF on some device to actually being able to read the PDF on that device. It's made all the difference.
The place where that's also made a difference is when printing. Sometimes I'll have a PDF and I want to print something for reference on actual paper. It does happen sometimes. And a lot of times they'll have background images, for style really. It's a background image to evoke some kind of mood or just to look cool, and then some other images here and there. The other images I could usually stand, but to print 50 pages of background floral prints over my text, or behind the text ostensibly, just doesn't make any sense. So if you run this Ghostscript command and filter out all those images, that gets rid of those background images. It gets rid of the foreground ones too, which is a little bit annoying, but the background images are the ones that really matter to me. I don't even mind printing without the foreground images, usually. I usually don't want them, or if I do, it's just a couple of them, and those I could screenshot and print separately, or maybe not print at all and just have them on a screen as a single file.
So Ghostscript's FILTERIMAGE is really, really useful if you, like me, sometimes need to print a PDF and don't want to spend all of your ink on fanciful background images, or if PDFs are simply too large for you.
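Written out, the image-stripping command described above might look like the following sketch. The strip_images name is hypothetical, -dNOPAUSE is an addition not mentioned in the episode, and the filter options assume a Ghostscript release recent enough to support them. The "separate filter" for vector images is Ghostscript's -dFILTERVECTOR, which can be passed as an optional extra argument here.

```shell
#!/bin/sh
# strip_images INPUT OUTPUT [EXTRA_GS_FLAGS...]
# Rewrites INPUT without its raster images. Any further arguments are handed
# to gs unchanged, e.g. -dFILTERVECTOR to drop vector drawings as well.
strip_images() {
    src=$1
    dst=$2
    shift 2
    gs -sDEVICE=pdfwrite \
       -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS=/screen \
       -dNOPAUSE -dBATCH \
       -dFILTERIMAGE "$@" \
       -sOutputFile="$dst" \
       "$src"
}

# Example calls, using the file names from the episode:
#   strip_images example+bleed.pdf output.pdf
#   strip_images example+bleed.pdf output.pdf -dFILTERVECTOR
```

Because the text stays as text, a copy made this way should still print sharply; only the pictures are gone.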
In a past episode, I've talked about bookmarks: retaining, editing, and applying bookmarks to a PDF file. I've also done an episode on pdftk, which is the program that I generally use to chop up PDFs when I need to extract a page from a PDF here or there for whatever reason, or maybe extract a couple of pages and then merge them back together, basically taking a subset of a larger PDF.
I realized that I probably should mention a related program, because I don't think I mentioned it (I may have). It's called pdf-stapler, and it's an application that sort of takes the place of pdftk. Not exactly; it doesn't have one-to-one feature parity, and it doesn't quite have everything that pdftk does, but it's got that magical 80 or 90% of stuff. What it doesn't do all that well is the bookmarking stuff; that's pdftk, really. I have generally seen it called pdf dash stapler. I think it's Python-based, as far as I remember, and its syntax is similar to pdftk. It's not the same; it's actually just similar enough to confuse me half the time.
So, for instance, say you're going to cat a bunch of files into one big PDF. A common use case for me, and at one point I used to have to do this a lot, was to take a collection of images, convert them to PDF, and then concatenate them into a big PDF. That was a fairly typical thing to do for some artists. They would want their things in a PDF, but they couldn't figure out the easy and quick way to get 100 photos or whatever into one file. That was very frequently a matter of running a convert command on all the PNGs or whatever in the current directory to resize them and output them as, like, JPEGs, and then running some command to concatenate all those things into a big PDF.
If I was doing that with pdf-stapler, it would be pdf dash stapler, space, cat (that's the command), then space, and then I guess I would just do wildcard dot pdf, because I would have done a convert on all those JPEGs to PDF. Then space and, I don't know, output.pdf. It puts all of the files that you pointed it at into one big PDF that will open and that people can flip through. So it's cat, or sel, for some reason. I'm not really sure why they do that, or whether there's a difference, but there's cat to concatenate pages, and there's also something called sel, s-e-l, for "select the given page range".
Again, I'm not 100% sure whether they mean for that to have some other function or whether it's just the same thing, but as far as I can tell it's the same thing. Anyway, there's also del, d-e-l, for delete; you can delete a page or a range of pages. There's burst, or split, which creates one file per page of an input PDF, which is something that I think people probably need to do. I've definitely heard of people needing to do that. Personally, I can't imagine having to do that... no, I can: for a printer spread, totally, I can see doing that. Then there's also zip, which is to merge or collate the given input files interleaved, so odds and evens, that sort of thing.
There's also info, which displays PDF metadata. But as far as I've been able to find in the command, there's nothing to reapply that metadata to a PDF. So you can get the data from something, but as far as I can tell there is no way in pdf-stapler to reapply it to your new PDF, or to another PDF.
The site that you can download it from is github.com slash hellerbarde slash stapler, and I will put a link to that in the show notes. H-E-L-L-E-R-B-A-R-D-E is the username, and it's just called stapler there. I don't know if I'm using an older version or if the command has simply remained pdf dash stapler. I don't really remember where I got this thing; it's just one of those things that I have on my work computer and have been using as is, with great success.
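The images-to-one-PDF workflow described above could be sketched roughly like this. It assumes ImageMagick's convert and pdf-stapler are both installed; the images_to_pdf name, the STAPLER override variable, and the 1600x1600 size are illustrative choices, not anything from the episode. Listing the pages explicitly, rather than using a bare wildcard, keeps an existing output.pdf from being swept into its own input.

```shell
#!/bin/sh
# images_to_pdf OUTPUT
# Turn every PNG in the current directory into a resized JPEG, wrap each JPEG
# in a one-page PDF, then join the pages in glob (alphabetical) order.
images_to_pdf() {
    result=$1
    stapler=${STAPLER:-pdf-stapler}    # may be plain "stapler" on some installs
    set --                             # reuse the argument list to collect pages
    for png in *.png; do
        [ -e "$png" ] || continue      # no PNGs: skip the unexpanded pattern
        base=${png%.png}
        convert "$png" -resize 1600x1600 "$base.jpg"    # size is arbitrary
        convert "$base.jpg" "$base.pdf"
        set -- "$@" "$base.pdf"
    done
    if [ "$#" -gt 0 ]; then
        "$stapler" cat "$@" "$result"
    fi
}

# Example call:
#   images_to_pdf portfolio.pdf
```

If the command on a given system is named stapler rather than pdf-stapler, setting STAPLER=stapler before the call should be enough.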
So that's another tool that I use. It's really interesting, kind of shocking, if you look inside PDF files. If you go to emacs, space, and then output.pdf (I'm just using output.pdf because that's what I just made with my Ghostscript command that removed the images), and hit return, Emacs actually renders the PDF for me, which I don't actually want at this particular moment. So we're going to hit Control-C Control-C, and that gets me to the source view, if you will. You can see what goes into making a PDF a PDF, and it is horrible to look at. It really is. It's honestly just dismal. You really can't make heads or tails of it.
But what's funny is that you kind of get this cadence. There's this line here called stream, s-t-r-e-a-m, and that seems to begin a block of binary data that is nothing you can actually read. Then at the end of all that there's an endstream tag, I guess you could call it, or declaration, and then an endobj, and then a declaration of the object number. I don't know where the object numbers come from; I don't know what's generating those. It's pretty mysterious to look at.
But what's really funny is, if you go into these streams and just start deleting things, it's kind of entertaining to see exactly how little effect you have on the PDF output. I just deleted a bunch of stuff from a stream, and it took away the v in the word "gave" and the m in the word "fanaticism" in the PDF that I generated. That's all it did, and it was this huge chunk of code that I just got rid of. You can do that and the PDF still opens. It's really kind of frightening in a way, because you think: what could someone just put into a PDF file and post online for people to download? Because apparently the PDF would just open, and you have no idea, really, what's in the PDF. It's really strange. Now, I have broken it enough at one point that it wouldn't open, but there's a lot of flex. It's not very strict, is what I'm trying to say. You can delete all kinds of things.
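The stream, endstream, endobj cadence described above can also be listed from a shell without opening an editor. This is only a sketch: pdf_markers is a hypothetical helper, and the sample is a hand-written fragment rather than a complete PDF. In the PDF format each object opens with a line such as "4 0 obj", which is the object-number declaration that shows up right after the previous endobj; the numbers are assigned by whatever program wrote the file.

```shell
#!/bin/sh
# pdf_markers FILE
# Print, with line numbers, the lines that open or close objects and streams.
# -a tells grep to treat a partly binary file as text.
pdf_markers() {
    grep -a -n -E '^([0-9]+ [0-9]+ obj|stream|endstream|endobj)' "$1"
}

# A hand-written fragment (not a complete PDF) showing the shape of one object.
sample=$(mktemp)
cat > "$sample" <<'EOF'
4 0 obj
<< /Length 15 >>
stream
BT (gave) Tj ET
endstream
endobj
EOF

pdf_markers "$sample"
```

On the fragment this prints lines 1, 3, 5, and 6: the object header, stream, endstream, and endobj. On a real file such as output.pdf the same call should show the pattern repeating once per object, though markers that share a line with other data will not be listed.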
Sometimes there will be no visible change whatsoever. Other times there'll be little quirks; maybe a font will disappear, so you're just left with a normal font instead of something that was supposed to be italicized, or whatever. It just kind of depends on what you're deleting. But it is quite interesting to have a look behind the scenes, and, like I say, you can do that in Emacs. When you open it in Emacs, it'll render the PDF for you, so just hit Control-C Control-C to get to the text view, and you can poke around and see what's in a PDF.
It's surprising what you can just put into PDFs. It really is; it's very shocking. And it makes me think that maybe a file format with a little bit more transparency, and also stricter syntax checking, would be a good idea, because as far as I can tell you could just put whatever you wanted into these PDFs and then send them around, and no one would ever really know. I guess it would depend. Maybe you'd have to put, for instance, a GPG-encoded something or other in there; maybe you'd want to encode it. But certainly it wouldn't be the first place for people to look, I wouldn't imagine. Now, could you do that if there are MD5 sums being taken and so on? No, obviously not. But it is fascinating to see just how lazy the PDF format really is, and how bloated, apparently, because, I kid you not, I've deleted screenfuls of information and reopened the PDF with no apparent change in display. It's pretty shocking.
So there you go. That's PDFs for you. Hopefully I've given you some ways to reduce their size and to simplify them, to make them a little bit more portable, which is funny, because I think that's what it used to stand for: portable. Maybe it was paperless all along; I forget. Either way, that's PDFs, that's Ghostscript, and that's pdf-stapler. Hope it's helpful. Talk to you next time.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.