Episode: 2767
Title: HPR2767: Djvu and other paperless document formats
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2767/hpr2767.mp3
Transcribed: 2025-10-19 16:34:38

---

This is HPR episode 2767 entitled DjVu and other paperless document formats.
It is hosted by Klaatu and is about 32 minutes long and carries a clean flag.
The summary is a tutorial on how to read and generate DjVu files.
This episode of HPR is brought to you by archive.org.
Support universal access to all knowledge by heading over to archive.org forward slash donate.
Hey everybody, you're listening to Hacker Public Radio. This is Klaatu and this is an episode
about an interesting file format called DjVu.
With a brief mention of CBZ as well. As it happens, I was looking for a file format that would
allow me to take a series of, for instance, scanned images and dump them into a single file.
So I wanted to bundle these things up. I didn't want them just to be in a directory.
I wanted them in a single file that was sort of user-facing, if you will. In other words,
I didn't want to just put them into a document and then send the person the document full of images.
I wanted the document to be a collection of images.
The first thing that came to mind was the CBZ format, which is pretty popular among comic books.
I think CBZ actually stands for comic book archive. It's a great format and it's super easy to make.
So for instance, if you've never made one, this will be an easy demo.
So if you go into a folder with a bunch of images in it, which I happen to have,
because that's a project that I've been working on lately, and you do a zip,
and then you create, let's say, mybook.zip.
And I'm doing a .zip because the zip terminal command will complain if you're not going out to zip.
It doesn't understand that you might want to go to a different file extension.
So we'll rename that later and then we'll do dash r and then we'll just do,
or actually I don't even think we have to do a dash r, we'll just do asterisk dot jpeg,
or whatever file format, whatever image format you're putting into this thing.
Hit return, let it bring all of those files into the zip archive, and once it is finished,
do a move, mv, space, mybook.zip, space, mybook.cbz.
And now you have a comic book archive that you can open with a comic book reading application,
or sometimes just any random document viewer on Linux.
Okular, for example, opens up comic book archives quite easily in KDE.
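The zip-and-rename steps described above can be sketched in Python with the standard zipfile module. This is only an illustration under stated assumptions: the function name make_cbz and the default "*.jpeg" pattern are made up for this sketch, and storing the files uncompressed is a choice made here because JPEG data is already compressed.

```python
import zipfile
from pathlib import Path


def make_cbz(image_dir, cbz_path, pattern="*.jpeg"):
    """Bundle matching images into a CBZ, which is a ZIP archive with a .cbz name.

    Rough shell equivalent: zip mybook.zip *.jpeg && mv mybook.zip mybook.cbz
    """
    # Sort by file name so the page order inside the archive is predictable.
    pages = sorted(Path(image_dir).glob(pattern))
    # ZIP_STORED adds the files without recompressing them.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

Calling make_cbz(".", "mybook.cbz") in a folder of JPEG files would write mybook.cbz and return the page names in the order they were added.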
So I thought of that at first, and then I made a couple of archives, and I realized that the archives
were almost always as large as the sum of their parts, which I know sounds obvious.
But I was, I think, I guess I was looking for something with a little bit more compression.
And to be fair, I could manually compress the files myself, and I did that a couple of times.
And that worked a little bit better for me.
But I was still missing certain features, specifically with regards to metadata.
So for instance, if you have, say, an EPUB, you can have a table of contents.
In certain clients, you can make annotations in the EPUB, and so on.
Whereas the comic book archives, they tend not to really specialize in that.
They really are just a very convenient way of looking at a zip file in a way.
So I was looking for something with a little bit more features towards finding stuff
within a potentially large document with a lot of text in it.
But not necessarily searchable text, because my use case has been so far, at least,
to simply scan pages of whatever it might be.
It might be an old Unix manual from AT&T, or it might be a comic book,
or it might be a historical document, I guess.
So a book published at the turn of the century, the previous century,
that's getting pretty old, and probably could use some preservation, that sort of thing.
I'm not going to sit there and transcribe all that text.
I'm not going to have the ASCII text of the content, but I do have the pages,
and I might want to refer to it.
So I might want to find chapter 5, for instance.
And I might not want to scroll through a bunch of thumbnails,
or just flip through a bunch of pages until I find chapter 5.
Seems like since it's on a computer, I should have that kind of data available to me more easily,
than actually manually going through it and looking for it.
So I thought, well, why not use an EPUB?
That seems like a good, this seems like a great idea.
It's not a bloated format, it's quite nice.
It's a good format, I like it, I've never had a problem with it.
It is a little bit weird, though, to take a bunch of images,
and put it into an EPUB, and then get all the overhead of the EPUB.
When the EPUB itself, I think, generally, expects to contain text.
That's not to say that I couldn't abuse it, and just put images into it.
I'm sure it's probably been done, but I felt like that wasn't the best,
potentially the best use of the EPUB format,
and I didn't feel, even more importantly, that the clients,
the EPUB applications that I'm using to view the resulting documents,
will really expect me to have a bunch of images in it.
And this isn't, like, I don't want to offend someone, you know, it's not,
I don't care what the clients, the applications expect,
I don't care what EPUB is intended for.
I'm just saying, if I want to zoom in quickly on an image,
because I want to inspect some detail of some art,
then I don't want to have to work around the fact that this EPUB
really thought all I was ever going to ask for was a larger font,
and so I have to sort of hack around it to zoom in on an image easily and conveniently.
Well, the answer that I eventually fell upon was DjVu.
DjVu is a digital document format designed with compression included
and the ability to contain metadata.
In other words, it is exactly what I was looking for,
so I sat down to figure out how I could leverage DjVu
in my everyday computing life, and this is what I came up with.
So first of all, I will warn you that DjVu is highly dependent upon your use case.
It's a pretty flexible format in a sense,
because what you need to put into it
will determine how you generate the file.
But before we get into generating a DjVu file,
let's talk a little bit about reading a DjVu file
to make sure that it's a format that's well enough supported for us
to resort to on a daily basis.
Turns out it's actually quite well supported,
so if you're on a Linux desktop, and why wouldn't you be,
then you have a DjVu reader probably installed already.
Certainly, the GNOME desktop ships with, I think, Evince
as their default document reader,
and it reads DjVu files.
KDE ships with Okular, which I've already said reads CBZ files,
comic book files.
It also reads DjVu files.
If you don't have either of those available to you for some reason,
then you can go get DjView, that is D-J and V-I-E-W.
It's part of a package called DjVuLibre,
which is the tool set for all the DjVu stuff
that you'll be doing if you start using that as a format.
Should be in your repository, and if not, it's on SourceForge.
DjView should probably also be in your repository,
maybe not, again, it's on SourceForge.
It is cross-platform, so if you're not on Linux,
then this might be perfect for you.
It's easy to compile.
It does require the Qt framework,
either Qt 4 or Qt 5.
I compiled it on Slackware with Qt 5 with no problems.
And it works quite well.
Now, if none of those things are available to you for whatever reason,
maybe you're on a computer that just doesn't permit
that level of application management for you for whatever reason,
then there are options for your web browser as well.
There are online sites
where you can upload documents and look at them.
There's a JavaScript library called DjVu.js
that you can check out at djvu.js.org.
And then, finally, there's a Firefox browser plugin,
or rather add-on, extension,
whatever they call them these days,
called DjVu.js, which is a local copy of DjVu.js
so that it runs as a plugin in your browser.
You just click on the icon.
It presents you with an empty tab.
You drag your DjVu file onto the tab.
Then it opens it up in your browser.
It doesn't upload it to the internet or anything.
It's local. It's just using your browser as the engine.
So that's pretty easy as well.
On mobile, there are document viewers for your phone as well.
I don't really know anything about the iPhone platform,
so I can't begin to guess
what might be available for it.
But certainly on Android, from F-Droid
even, you can get an application called Document Viewer,
which is a viewer for many document formats.
And it supports DjVu, it supports EPUB,
it supports comic book, the CBZ,
FictionBook, FB2, and a couple of others.
In other words, there are lots of options
for reading DjVu files.
And no matter what kind of device you're on,
the chances are really high
that there is a DjVu viewer for you.
You should go get a DjVu file
and test some of these things out to see if you like it.
I think you probably will.
If you need a good demo DjVu file,
you can go to djvu.org,
go to the Downloads and Resources section.
And at the bottom of that page,
they have some white papers and tech documents.
And any of those you can download
and look at, they're all in the DjVu format.
I thought it was kind of cool of the project actually
to put their reference document
and their specification document in DjVu.
So you have to have DjVu
in order to read about DjVu, it's quite slick.
So there you go, that's the consumption side of things.
It's pretty easy, it's a lot more available
than you might first have thought.
If you're not really aware of DjVu,
these probably kind of just passed you by unnoticed.
But they are there, they're there,
and they work quite well,
and they're highly compatible
with lots of different platforms and devices.
So do check them out.
So now let's talk about how to create a DjVu file,
because certainly if you're going to use this,
and certainly the reason I'm using DjVu,
is because you want to put documents into that format.
Now I'll admit, this can be a little bit tricky
in some ways, and by that I mean
that the process isn't actually difficult,
but there are certain conveniences
that just don't exist.
For instance, if you're trying to quickly export something
from, I don't know, Google Docs or something,
you're not gonna go up to the export menu
and find a DjVu format,
at least I don't think you are, I don't know,
but I'm assuming you're not gonna go to
Google Docs and find an export format of DjVu.
You'll probably find other formats like ODT,
I know that's in there, PDF, that's definitely in there,
and maybe some other stuff,
but you're not going to just inherit the capability
to export as DjVu, generally speaking.
It's gonna depend on the application obviously,
but I'm just saying, in the real world out there,
obviously a lot of us had never even heard of DjVu
before this episode, so it's kind of self-evident
that it's not gonna fall into your lap.
You will have to decide, I'm going to be a DjVu user,
and then you have to go get the tools
to generate DjVu,
and then you may have to work around some workflow
that you have already established
to create DjVu files.
Luckily, I have some answers for that,
but it's still, it's gonna be a little bit different, right?
Very rarely
are you going to find a file menu
where you can go to file, print, print to DjVu.
That just doesn't exist,
whereas if you go to file, print to PDF, that exists.
The difference is, of course,
that PDF is a horrible format,
and DjVu is actually quite nice.
Let's look at it, shall we?
So the DjVu toolset, as I've said,
is DjVuLibre,
that's the open source implementation
of the DjVu spec.
DjVuLibre is, as its name suggests,
free and open source software.
So you can grab it and use it,
and it is completely open.
You can learn all about everything that you need to know
about DjVu from both djvu.org and DjVuLibre.
There's some really good documentation
in the DjVuLibre source package,
or maybe it's the DjView source package.
One of those two, it has some good documentation
that gives you an overview
of all the different commands
that come with DjVuLibre.
And I have to say the commands, there are many.
And that is, again, because the way that you want
to build a DjVu file will dictate
how you do that,
what tools you use for the job.
I'm not gonna go through all of them.
I will go through some of the major ones.
First of all, we need a series of documents
that we want to convert to DjVu.
Now DjVu is interesting.
Now remember, I'm saying that it's a document format
into which you can put lots of images, for instance.
And then you'll have this file that seems like a book,
and so it'll be like a paperless book.
And that's great, but that's only one use of DjVu.
DjVu itself is perfectly happy to be a single file,
like a single thing, a single entity.
So for instance, if I have any random photo
from a phone or something, then I can convert that.
I'm gonna go over to my pictures, graphics,
whatever it's called, graphic folder here.
And yeah, here's a TIFF.
So I'm gonna go, I'm gonna, so a TIFF is a file,
it's a pretty high quality, or potentially high quality,
graphic file, and it is not necessarily,
but generally speaking, it's like a color document,
probably fairly high detail.
So for that, we would want to use
the sort of high end converter for DjVu,
which is called c44.
If I type in c44 dash h, so space dash h for help,
then I get a little bit of a blurb about it.
It says it's an image compression utility
using IW44 wavelets.
Now I don't know what that means.
There are a couple of different options here.
The only one I care about is the dash dpi,
because it sets the image resolution.
So I'm gonna do c44, and c44 again
is included with the DjVuLibre package
that you presumably downloaded and installed,
or got from your repository. On Slackware,
it's already installed.
So c44, and then it says in the help,
it says to do options, okay, so that's dash dpi,
and I'm gonna keep this at, let's say 300 DPI,
and then it says to give a PNM or JPEG file.
Okay, so it only accepts PNM or JPEG.
This is a quirk about the toolset
that I never really understood or got used to,
but apparently for higher quality documents,
the PNM or JPEG formats are supported,
but for lower quality documents,
the formats traditionally associated
with high quality graphics are also supported, so TIFF.
So for this, I cannot use a TIFF file,
so I'm gonna zero in on a different,
well actually I don't have a different one,
so I'm just gonna convert this thing to a JPEG,
I'll convert it actually to a PNM,
so I'm gonna do a convert, that's an ImageMagick command.
If you don't have ImageMagick or GraphicsMagick installed,
just install that, and then you can do convert,
or gm convert if you've got GraphicsMagick.
So convert in MK1, that's the name of this image,
I should probably look at this image
to see what on earth it is.
Okay, so it is a, actually it is a really basic black
and white logo, that's funny.
Okay, so I'm gonna actually convert a different one,
this penguin picture that I have.
So I'm gonna do a convert penguin dot PNG
to penguin dot, what am I saying, oh PNM, right?
Okay, and that happened, that's done, that was very quick.
And now I'm gonna use the c44 tool,
c44 dash dpi 300, and I'm gonna feed it
the penguin dot PNM file, and then it tells me
to define a DjVu file into which
the conversion should be placed,
so I'll just do penguin dot djvu, DJVU is the file extension
by default, and that's finished, that's done.
So now I'll open up a graphical file manager here,
just for testing to see what happens
when I click on things, and it looks like I've got,
actually I'm gonna look at file sizes as well,
so the source of this penguin was 43.8 kilobytes.
When I converted it up to the PNM format,
I got a 1.8 megabyte file,
so that's obviously going sort of in the wrong direction,
right, if compression is one of the things
that we care about, going from 43 kilobytes to 1.8 megabytes,
not a good thing, but wait a minute,
the DjVu version, looking at that,
is 18.1 kilobytes, that's quite a lot smaller
than the original PNG, 43.8, so I'll click on that
and try and take a look at it, it looks good,
looks really nice, no problems really,
no complaints about this, except a couple of different things,
and that is that the background is black,
and that's because I brought it in from a PNG,
so I'm gonna go back up to my convert command,
and in addition to doing the convert from PNG to PNM,
I'm gonna add dash background, quote white, and then dash flatten.
So that will take any kind of alpha channel
that I inherit from the PNG, and
it will cause the background behind that alpha channel,
if you think of it that way, to be white,
and then dash flatten, flatten the image,
so that there is no alpha channel, so now we'll do that,
and then we'll do the c44 dash dpi 300 penguin dot PNM,
actually you know what I'm gonna even do,
I'm gonna drop the dash dpi 300 and let the c44 thing
go with its defaults.
Okay, so now we have a DjVu file,
which when I open it, yeah, it has white in the background,
so that's good, that's a little bit better.
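As a rough sketch of the two steps just described (flatten the PNG onto white while converting to PNM, then encode with c44), the command lines can be assembled in Python like this. The function name and its arguments are hypothetical. The sketch only builds the argument lists, and it assumes ImageMagick's convert and DjVuLibre's c44 are on your PATH if you go on to run them.

```python
def png_to_djvu_commands(png, pnm, djvu, dpi=None):
    """Return the convert and c44 command lines as argument lists."""
    # Shell form: convert penguin.png -background white -flatten penguin.pnm
    convert = ["convert", png, "-background", "white", "-flatten", pnm]
    # Shell form: c44 [-dpi N] penguin.pnm penguin.djvu
    c44 = ["c44"]
    if dpi is not None:
        # Leaving dpi unset lets c44 fall back on its own default.
        c44 += ["-dpi", str(dpi)]
    c44 += [pnm, djvu]
    return [convert, c44]


for command in png_to_djvu_commands("penguin.png", "penguin.pnm", "penguin.djvu"):
    print(" ".join(command))
```

Each list could be handed to subprocess.run(command, check=True) to actually perform the conversion.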
So that works, now of course that's only a single file,
that's one single image, and it's not really a document,
it's just an image, but we can create digital books,
we can create sort of e-books out of DjVu
by combining several DjVu files into one bigger DjVu file.
Now for that, we'll need another DjVu file,
and I do have this TIFF, and it is a black and white logo
completely coincidentally.
It turns out that if you have a low quality image,
or I should say a simple image,
it doesn't have to be low quality,
but it is expected to be simple,
and in fact it is expected to be,
what do they say, bitonal I think,
and the tool that DjVuLibre provides for that
is cjb2. I realize that neither of these commands,
c44 or cjb2, make any sense,
or have anything apparently to do with DjVu,
and that doesn't annoy me,
but it's just one of those things
that you kind of remember after a while,
or you put it into a script,
and you never have to remember it yourself.
So cjb2 dash h gives me a couple of options,
and again, I can specify my DPI.
It says it defaults to 300.
That seems reasonable to me.
There's some cleaning up that you can do.
A dash clean apparently cleans up the image
by removing small flyspecks.
You can make it lossy, and you can set the loss level.
I'm not gonna do anything that fancy.
I'm just gonna give it an input file,
and it says that the input that it accepts
is either a PBM or a TIFF.
Okay, so cjb2 in MK1 dot TIFF,
and then I'll do in MK1 dot djvu,
and that converted it pretty quickly as well.
So once again, the source TIFF was 182 kilobytes.
The DjVu version of that is 3.8,
so quite the difference.
I'll open it up here in Okular, it looks fine,
it looks like a very accurate representation
of the simple graphic, and that's good.
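The cjb2 call for a bitonal image can be sketched the same way. The function name is made up, and logo.tiff is a hypothetical file name standing in for the logo used in the episode. The extension check only mirrors what the help text was said to accept (PBM or TIFF) and is not something cjb2 itself requires you to do.

```python
from pathlib import Path


def cjb2_command(source, djvu, dpi=None, clean=False):
    """Return a cjb2 command line for a bitonal PBM or TIFF source."""
    if Path(source).suffix.lower() not in {".pbm", ".tif", ".tiff"}:
        raise ValueError(f"cjb2 expects a PBM or TIFF file, got {source!r}")
    command = ["cjb2"]
    if dpi is not None:
        command += ["-dpi", str(dpi)]
    if clean:
        # -clean asks cjb2 to remove small flyspecks from the image.
        command.append("-clean")
    command += [source, djvu]
    return command


# Shell form: cjb2 logo.tiff logo.djvu
print(" ".join(cjb2_command("logo.tiff", "logo.djvu")))
```

Passing a PNG here raises ValueError, which is a reminder to convert it first or to use c44 instead.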
Okay, so now if we wanna make a DjVu file
that contains both of these images as a page one and a page two,
we can do that with the command djvm,
and I'll just do a dash h again,
and it spells it out for me.
So it says to compose a multi-page document,
you can do djvm dash c for create,
and then the file, the destination file.
So I'm just gonna call this output dot djvu,
and then finally you end with all of the pages
that you want to put into this document.
So alphabetically, it looks like in MK1 should come first,
so I'm gonna do something crazy here,
and set penguin first, and then in MK1 dot djvu.
So I've got output dot djvu as my target,
and then penguin dot djvu, and then in MK1 dot djvu.
Return, and it produces an output file for me,
and just because it's always fascinating to look at file sizes,
it does look like this is about 25.6 kilobytes.
So again, just keeping track of these things,
I've got this penguin that was 43 kilobytes,
and I've got this logo that was 182 kilobytes
in one document at 25 kilobytes.
So literally both of them combined in a DjVu file
is smaller than either of them separate, pretty cool.
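To put those file sizes side by side, here is a small arithmetic sketch using only the numbers quoted in the episode (43.8 KB PNG, 182 KB TIFF, 18.1 KB and 3.8 KB single-page DjVu files, 25.6 KB combined). The helper name percent_saved is made up for illustration.

```python
def percent_saved(original_kb, converted_kb):
    """Percentage by which the converted file is smaller than the original."""
    return 100.0 * (1.0 - converted_kb / original_kb)


penguin = percent_saved(43.8, 18.1)            # PNG to DjVu via c44
logo = percent_saved(182.0, 3.8)               # TIFF to DjVu via cjb2
combined = percent_saved(43.8 + 182.0, 25.6)   # both sources vs. the two-page file

print(f"penguin {penguin:.1f}%  logo {logo:.1f}%  combined {combined:.1f}%")
```

These two images are small and simple, so the figures say nothing about what scanned pages or existing PDFs will do.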
And it does look as though the images are in the order
that I defined, so the penguin comes first,
and then the in MK1 comes second.
Now if I had just done like a wildcard,
the DjVu files in the directory
would have just gone in MK1, and then penguin,
because that's alphabetical.
But I did wanna demonstrate that
you can manually set the order of the pages in your command.
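The page-order point can be sketched as follows: the order of the arguments after the output file is the page order, while a shell wildcard expands alphabetically. The function name is hypothetical, and logo.djvu again stands in for the logo file from the episode.

```python
def djvm_create_command(output, pages):
    """Return a djvm command line that bundles pages, in the given order."""
    if not pages:
        raise ValueError("at least one page file is needed")
    # Shell form: djvm -c output.djvu page1.djvu page2.djvu ...
    return ["djvm", "-c", output, *pages]


explicit = djvm_create_command("output.djvu", ["penguin.djvu", "logo.djvu"])
# sorted() imitates the alphabetical order a shell wildcard would produce.
alphabetical = djvm_create_command("output.djvu", sorted(["penguin.djvu", "logo.djvu"]))

print(" ".join(explicit))
print(" ".join(alphabetical))
```

In the first command the penguin is page one. In the second the logo comes first.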
Okay, so now we've got this document,
and it's a self-contained document,
you can open it up in Okular, or in DjView,
or in DjVu.js, and read it, and look at it,
and it's great, but how can you find stuff in that document?
Well, it turns out that doing metadata is pretty easy,
and we can create a bookmarks file for this.
I'm just gonna make one called book.marks in the same folder.
It's just a text file, and you open it up
with an opening parenthesis, or a bracket, whatever you call it.
It's a half circle, so that thing.
And then bookmarks, that's the word bookmarks.
Next line, I'm gonna do another parentheses,
and I have not closed the parentheses.
So we've got an open parentheses, bookmarks,
and then next line, open another parentheses,
and I'm gonna put in, let's do the word penguin.
Quote penguin, closed quote space, quote hash one,
closed quote, closed parentheses once.
Okay, next line, open parentheses, quote,
what was the other one, oh yeah, a logo,
closed quote space, quote hash two,
closed quote, closed parentheses,
and then finally closing the main parentheses,
the big parentheses.
So it's bookmarks, penguin, logo,
or rather it's bookmarks, and then penguin,
and then the page number, the DjVu page number,
and then the next line, the next thing that you want to locate,
and then the DjVu page number,
and then you close out the whole parenthesis.
The parentheses delineate the level of everything.
So bookmarks is your main, that's the main entity, right?
So you don't close that bookmark,
you don't close that parentheses
until the very end of your bookmarks.
That makes sense.
Then each line itself is a new entry,
and it needs to have a human readable title,
and then the reference to the DjVu page number.
Now if you don't know what that is offhand,
you can just open up the DjVu file in a viewer and look,
because you're doing this separately.
You're doing this in a text file.
So you look at that and you say, oh, the penguin is on page one.
OK, so quote hash one, closed quote, closed parentheses.
Now if this was a more complex document,
and we wanted subheadings,
then just don't close penguin, and have like logo, page two,
and then close the parenthesis, and then close the bookmarks.
So if you leave a parentheses open,
then everything below it, or everything within that,
becomes sub headings, which is handy,
because if you have a chapter and then a section,
and then maybe a subsection, and then you close, close, close,
and then you go back to a chapter level.
So that's your level setting.
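The bookmarks file described above is a nested, parenthesized list, so it can be generated instead of typed by hand. This is a minimal sketch: the (title, page, children) tuple layout and both function names are assumptions made for illustration, and titles are assumed to contain no double quotes or backslashes.

```python
def render_outline(entries, depth=1):
    """Render (title, page, children) tuples as outline lines, one entry per line."""
    lines = []
    for title, page, children in entries:
        pad = " " * depth
        if children:
            # Leave this entry's parenthesis open so the children nest inside it.
            lines.append(f'{pad}("{title}" "#{page}"')
            lines.extend(render_outline(children, depth + 1))
            lines[-1] += ")"  # now close the parent entry
        else:
            lines.append(f'{pad}("{title}" "#{page}")')
    return lines


def bookmarks_text(entries):
    """Wrap the rendered entries in the outer (bookmarks ...) list."""
    return "(bookmarks\n" + "\n".join(render_outline(entries)) + ")\n"


flat = bookmarks_text([("Penguin", 1, []), ("Logo", 2, [])])
nested = bookmarks_text([("Penguin", 1, [("Logo", 2, [])])])
print(flat)
print(nested)
```

The flat version lists Penguin and Logo side by side. The nested version makes Logo a child of Penguin, which is what happens when the Penguin parenthesis is left open. Either string could be written to book.marks.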
OK, once you have your bookmarks defined in a text file,
you use a command called djvused,
or maybe it's, who knows, DjVu-used.
Maybe it's DjVu sed, I don't know.
It's D-J-V-U-S-E-D, and then dash e, and then quote set dash outline,
and then the name of the bookmarks file.
So that's book.marks, close quote, and then dash s for save.
If I don't have a dash s, it's a dry run.
It'll apply the outline.
It'll sort of validate the outline, really,
is what it's doing.
And then it will not save what it's just done,
and your output dot djvu will not have bookmarks.
So dash s means save.
So dash s output dot djvu, because that's
the name of the file that I gave it, output dot djvu.
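The djvused invocation spelled out above can be sketched like this. The function names are hypothetical. The print-outline command in the second helper is not mentioned in the episode; it is added here as a way to check the result, and the outline file name is assumed to contain no spaces.

```python
def set_outline_command(djvu, outline_file, save=True):
    """Return a djvused command line that applies an outline file to a document."""
    # Shell form: djvused -e 'set-outline book.marks' -s output.djvu
    command = ["djvused", "-e", f"set-outline {outline_file}"]
    if save:
        # Without -s the outline is parsed but the document is not rewritten.
        command.append("-s")
    command.append(djvu)
    return command


def print_outline_command(djvu):
    """Return a djvused command line that prints the outline stored in a document."""
    return ["djvused", "-e", "print-outline", djvu]


print(set_outline_command("output.djvu", "book.marks"))
print(print_outline_command("output.djvu"))
```

Running the first command with save=False gives the dry run described above. Running the second afterwards should echo back the bookmarks if they were saved.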
All right, so now let's take a look.
It doesn't take any time to apply an outline.
That's really a fast one.
There we go.
And now, yeah, it looks like in Okular,
I've got a table of contents on the left with penguin
and logo, logo being a child of penguin.
I left the parentheses open so I can click on the little
disclosure symbol there and get to see logo.
And it's got the page number that it corresponds to
over on the right.
Now, that's just been a very basic example.
In real life, I have found that the file size savings
amount has varied pretty wildly.
It really depends on where your images are coming from,
what you're converting from, and how much you're
willing to compress them in the DjVu document.
I would say, typically, I see about maybe a 20% savings.
That's what I would guess.
It's just a little bit shaved off the top.
It starts to add up the more you do it.
But I wouldn't expect, for instance, to take a PDF that
you downloaded from somewhere, convert it
to DjVu, and have it be 50% smaller,
or 60% smaller, anything like that.
It would more likely be either the same size or 10% or 20% smaller.
The benefit, possibly, for you, is that DjVu is a sane
and open format.
It's fun to manipulate, and it's easy to find information on
because all of its specs and information are open and online.
There aren't really any hidden glitches disguised as features
or features disguised as glitches in DjVu.
Not at least in the way that there are in PDFs, which
sometimes just are so confusing that even when you figure
something out, you can't really figure out
if it should even be opening.
Like, should it still be working?
Shouldn't I have just broken this?
So yeah, I'm really enjoying DjVu.
Now, you can also embed text, and I've never done that yet.
I have not had the occasion to, I've not converted a document,
for instance, to which I have embedded text,
like the way that PDFs have.
Those are not documents that I have bothered converting.
Or if I have, it's been for quick reference on the go.
And it's not one of those things where I'm thinking,
oh, I need to select this text.
I need to search for this exact string in the document.
Obviously, for that sort of thing, I would
want to have the text there, but so far for the way
that I'm using it, the text just isn't available
in the things that I'm converting to DjVu.
If I wanted text to be embedded in the DjVu,
I would have to transcribe it, looking at the screen,
typing it all out, and that would be silly.
So I've not gone in that direction yet.
I'm not saying I never will.
I may well do that.
And maybe I'll look into easy and quick ways
to take a PDF with embedded text and convert it
to DjVu, while retaining the embedded text.
Who knows, maybe, I've done crazier things.
So if I ever do that, I'm sure you will hear more
about it on Hacker Public Radio.
But until such a time, this has been an introduction
to DjVu.
Hopefully it's been informative to you.
Maybe it's even useful.
I suggest you try it out.
If you've never used DjVu, give it a go.
Make a document or get one from online.
See what it's like.
It's actually quite nice.
I think you'll probably like it.
And if you intend to use it seriously,
then sit down and kind of think about your workflow too.
Because I know that for some people,
PDF is a very easy, well, for most everyone.
PDF is a very easy output target.
Because as I said earlier, it's probably
in your file menu.
It's like two clicks away.
So if that stands in your way of using DjVu,
then sit down and kind of think of what ways
you might be able to pull a couple of commands together
to make that process easier.
And frankly, I don't know that it is
a great format.
It might not be your first format for a paper
that you've just written in LibreOffice.
Would it make sense to go out to DjVu?
I mean, arguably it would, but arguably not at all.
Because you're really, you would just
be basically generating a raster file
of a representation of text and then embedding text.
And that seems really odd.
So there are probably better formats
if you're just typing stuff up and you
want something on the go.
Maybe EPUB is the best answer for you.
But if you're doing archival work or you're scanning documents,
I mean, archival makes it sound fancy.
If you're scanning stuff in, because you like them,
but you want to throw the physical copy of it out.
Or maybe you like it and you see that the physical copy
is decaying, so you want to preserve it,
scan it in, throw it into a DjVu file,
and see how it treats you.
Thank you for listening.
I will talk to you next time.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that
releases shows every weekday Monday through Friday.
Today's show, like all our shows, was contributed
by an HPR listener like yourself.
If you ever thought of recording a podcast,
then click on our contribute link to find out
how easy it really is.
Hacker Public Radio was founded by the Digital Dog
Pound and the Infonomicon Computer Club.
And it's part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly,
leave a comment on the website or record
a follow-up episode yourself.
Unless otherwise stated, today's show is released
under a Creative Commons Attribution-ShareAlike 3.0 license.