Episode: 2767
Title: HPR2767: Djvu and other paperless document formats
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2767/hpr2767.mp3
Transcribed: 2025-10-19 16:34:38
---
This is HPR episode 2767 entitled DjVu and other paperless document formats. It is posted by Klaatu and is about 32 minutes long and carrying a clean flag. The summary is a tutorial on how to read and generate DjVu files. This episode of HPR is brought to you by archive.org. Support universal access to all knowledge by heading over to archive.org forward slash donate.

Hey everybody, you're listening to Hacker Public Radio. This is Klaatu and this is an episode about an interesting file format called DjVu, with a brief mention of CBZ as well.

As it happens, I was looking for a file format that would allow me to take a series of, for instance, scanned images and dump them into a single file. So I wanted to bundle these things up. I didn't want them just to be in a directory. I wanted them in a single file that was sort of user-facing, if you will. In other words, I didn't want to just put them into a document and then send the person the document full of images. I wanted the document to be a collection of images.

The first thing that came to mind was the CBZ format, which is pretty popular among comic books. I think CBZ actually stands for comic book archive. It's a great format and it's super easy to make. If you've never made one, this will be an easy demo. So you go into a folder with a bunch of images in it, which I happen to have, because that's a project that I've been working on lately, and you do a zip, and then you create, let's say, mybook.zip. And I'm doing a .zip because the zip terminal command will complain if you're not going out to a .zip. It doesn't understand that you might want to go to a different file extension.
So we'll rename that later. And then we'll do dash r, or actually I don't even think we have to do a dash r, we'll just do asterisk dot jpeg, or whatever image format you're putting into this thing. Hit return, let it bring all of those files into the zip archive, and once it is finished, do a move: mv, space, mybook.zip, space, mybook.cbz. And now you have a comic book archive that you can open with a comic book reading application, or sometimes just any random document viewer on Linux. Okular, for example, opens up comic book archives quite easily in KDE.

So I thought of that at first, and then I made a couple of archives, and I realized that the archives were almost always as large as the sum of their parts, which I know sounds obvious. But I guess I was looking for something with a little bit more compression. And to be fair, I could manually compress the files myself, and I did that a couple of times, and that worked a little bit better for me. But I was still missing certain features, specifically with regards to metadata. So for instance, if you have, say, an EPUB, you can have a table of contents. In certain clients, you can make annotations in the EPUB, and so on. Whereas the comic book archives tend not to really specialize in that. They really are just a very convenient way of looking at a zip file, in a way.

So I was looking for something with a few more features geared towards finding stuff within a potentially large document with a lot of text in it. But not necessarily searchable text, because my use case so far, at least, has been to simply scan pages of whatever it might be. It might be an old Unix manual from AT&T, or it might be a comic book, or it might be a historical document, I guess. So a book published at the turn of the century, the previous century, that's getting pretty old, and probably could use some preservation, that sort of thing.
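Going back to the CBZ recipe from a moment ago: here is a minimal Python sketch of the same idea, using the standard library's zipfile module in place of the zip and mv commands. The function name make_cbz and the default *.jpeg pattern are illustrative assumptions, not something from the episode. Because it writes straight to the .cbz name, no rename step is needed.

```python
import zipfile
from pathlib import Path


def make_cbz(image_dir, cbz_path, pattern="*.jpeg"):
    """Bundle every image matching `pattern` into a CBZ file.

    A CBZ is an ordinary zip archive with a .cbz extension, so this
    mirrors `zip mybook.zip *.jpeg` followed by the rename.
    make_cbz is a hypothetical helper name used for illustration.
    """
    # Sort by filename so the pages land in a predictable order.
    images = sorted(Path(image_dir).glob(pattern))
    with zipfile.ZipFile(cbz_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for image in images:
            # Store only the bare filename, not the directory path.
            archive.write(image, arcname=image.name)
    return [image.name for image in images]
```

Calling make_cbz(".", "mybook.cbz") in a folder of JPEGs should give a file that a comic book reader can open. As noted in the episode, already-compressed images will not shrink much inside the archive.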
I'm not going to sit there and transcribe all that text. I'm not going to have the ASCII text of the content, but I do have the pages, and I might want to refer to it. So I might want to find chapter 5, for instance. And I might not want to scroll through a bunch of thumbnails, or just flip through a bunch of pages until I find chapter 5. Seems like since it's on a computer, I should have that kind of data available to me more easily than actually manually going through it and looking for it.

So I thought, well, why not use an EPUB? That seems like a great idea. It's not a bloated format, it's quite nice. It's a good format, I like it, I've never had a problem with it. It is a little bit weird, though, to take a bunch of images and put them into an EPUB, and then get all the overhead of the EPUB, when the EPUB itself, I think, generally expects to contain text. That's not to say that I couldn't abuse it and just put images into it. I'm sure it's probably been done, but I felt like that wasn't potentially the best use of the EPUB format. And even more importantly, I didn't feel that the clients, the EPUB applications that I'm using to view the resulting documents, would really expect me to have a bunch of images in it. And I don't want to offend anyone. I don't care what the applications expect, I don't care what EPUB is intended for. I'm just saying, if I want to zoom in quickly on an image, because I want to inspect some detail of some art, then I don't want to have to work around the fact that this EPUB really thought all I was ever going to ask for was a larger font, so that I have to sort of hack around it to zoom in on an image easily and conveniently.

Well, the answer that I eventually fell upon was DjVu. DjVu is a digital document format designed with compression included and the ability to contain metadata.
In other words, it is exactly what I was looking for, so I sat down to figure out how I could leverage DjVu in my everyday computing life, and this is what I came up with.

So first of all, I will warn you that DjVu is highly dependent upon your use case. It's a pretty flexible format in a sense, because what you need to put into it will determine how you generate the files. But before we get into generating a DjVu file, let's talk a little bit about reading a DjVu file, to make sure that it's a format that's well enough supported for us to resort to on a daily basis.

Turns out it's actually quite well supported. So if you're on a Linux desktop, and why wouldn't you be, then you probably have a DjVu reader installed already. Certainly the GNOME desktop ships with, I think, Evince as their default document reader, and it reads DjVu files. KDE ships with Okular, which I've already said reads CBZ files, comic book files. It also reads DjVu files. If you don't have either of those available to you for some reason, then you can go get DjView, that is D-J and then view. It's part of a package called DjVuLibre, which is the toolset for all the DjVu stuff that you'll be doing if you start using that as a format. It should be in your repository, and if not, it's on SourceForge. DjView should probably also be in your repository. Maybe not, and again, it's on SourceForge. It is cross-platform, so if you're not on Linux, then this might be perfect for you. It's easy to compile. It does require the Qt framework, either Qt 4 or Qt 5. I compiled it on Slackware with Qt 5 with no problems, and it works quite well.

Now, if none of those things are available to you for whatever reason, maybe you're on a computer that just doesn't permit that level of application management, then there are options for your web browser as well. There are online sites where you can upload documents and look at them.
There's a JavaScript library called djvu.js that you can check out at djvu.js.org. And then, finally, there's a Firefox browser plugin, or add-on, or extension, whatever they call them these days, called djvu.js, which is a local copy of djvu.js so that it runs as a plugin in your browser. You click on the icon, it presents you with an empty tab, you drag your DjVu file onto the tab, and then it opens it up in your browser. It doesn't upload it to the internet or anything. It's local. It's just using your browser as the engine. So that's pretty easy as well.

On mobile, there are document viewers as well. I don't really know anything about the iPhone platform, so I can't begin to guess what might be available for it. But certainly on Android, from F-Droid even, you can get an application called Document Viewer, which is a viewer for many document formats. It supports DjVu, it supports EPUB, it supports comic book, the CBZ, FictionBook, FB2, and a couple of others.

In other words, there are lots of options for reading DjVu files. No matter what kind of device you're on, the chances are really high that there is a DjVu viewer for you. You should go get a DjVu file and test some of these things out to see if you like it. I think you probably will. If you need a good demo DjVu file, you can go to djvu.org and go to the Downloads and Resources section. At the bottom of that page, they have some white papers and tech documents, and any of those you can download and look at. They're all in the DjVu format. I thought it was kind of cool of the project, actually, to put their reference document and their specification document in DjVu. So you have to have DjVu in order to read about DjVu. It's quite slick.

So there you go, that's the consumption side of things. It's pretty easy, and it's a lot more available than you might first have thought.
If you're not really aware of DjVu, these probably kind of just passed you by unnoticed. But they are there, and they work quite well, and they're highly compatible with lots of different platforms and devices. So do check them out.

So now let's talk about how to create a DjVu file, because certainly if you're going to use this, and certainly the reason I'm using DjVu, is because you want to put documents into that format. Now I'll admit, this can be a little bit tricky in some ways, and by that I mean that the process isn't actually difficult, but there are certain conveniences that just don't exist. For instance, if you're trying to quickly export something from, I don't know, Google Docs or something, you're not gonna go up to the export menu and find a DjVu format. At least I don't think you are. I don't know, but I'm assuming you're not gonna go to Google Docs and find an export format of DjVu. You'll probably find other formats like ODT, I know that's in there, PDF, that's definitely in there, and maybe some other stuff, but you're not going to just inherit the capability to export as DjVu, generally speaking. It's gonna depend on the application, obviously, but I'm just saying, in the real world out there, a lot of us had never even heard of DjVu before this episode, so it's kind of self-evident that it's not gonna fall into your lap.

You will have to decide, I'm going to be a DjVu user, and then you have to go get the tools to generate DjVu, and then you may have to work around some workflow that you have already established in order to create DjVu files. Luckily, I have some answers for that, but it's still gonna be a little bit different, right? Very rarely are you going to find a file menu where you can go to file, print, print to DjVu. That just doesn't exist, whereas if you go to file, print to PDF, that exists.
The difference is, of course, that PDF is a horrible format, and DjVu is actually quite nice. Let's look at it, shall we?

So the DjVu toolset, as I've said, is DjVuLibre. That's the open source implementation of the DjVu spec. DjVuLibre is, as its name suggests, free and open source software. So you can grab it and use it, and it is completely open. You can learn everything that you need to know about DjVu from both djvu.org and DjVuLibre. There's some really good documentation in the DjVuLibre source package, or maybe it's the DjVu source package. One of those two has some good documentation that gives you an overview of all the different commands that come with DjVuLibre. And I have to say, the commands, there are many. And that is, again, because the way that you want to build a DjVu file will dictate how you do that and what tools you use for the job. I'm not gonna go through all of them. I will go through some of the major ones.

First of all, we need a series of documents that we want to convert to DjVu. Now DjVu is interesting. Remember, I'm saying that it's a document format into which you can put lots of images, for instance, and then you'll have this file that seems like a book, so it'll be like a paperless book. That's great, but that's only one use of DjVu. DjVu itself is perfectly happy to be a single file, a single entity. So for instance, if I have any random photo from a phone or something, then I can convert that. I'm gonna go over to my pictures, graphics, whatever it's called, graphic folder here. And yeah, here's a TIFF. So a TIFF is a pretty high quality, or potentially high quality, graphic file, and not necessarily, but generally speaking, it's a color document, probably fairly high detail.
So for that, we would want to use the sort of high-end converter for DjVu, which is called c44. If I type in c44 dash h, or rather c44, space, dash h, for help, then I get a little bit of a blurb about it. It says it's an image compression utility using IW44 wavelets. Now I don't know what that means. There are a couple of different options here. The only one I care about is dash dpi, because it sets the image resolution. And c44, again, is included with the DjVuLibre package that you presumably downloaded and installed, or got from your repository. On Slackware, it's already installed. So c44, and then in the help it says to give options, so that's dash dpi, and I'm gonna keep this at, let's say, 300 DPI, and then it says to give a PNM or JPEG file.

Okay, so it only accepts PNM or JPEG. This is a quirk of the toolset that I never really understood or got used to, but apparently for higher quality documents, the PNM or JPEG formats are supported, but for lower quality documents, the formats traditionally associated with high quality graphics are also supported, so TIFF. So for this, I cannot use a TIFF file. So I'm gonna zero in on a different, well, actually I don't have a different one, so I'm just gonna convert this thing. I'll convert it to a PNM, so I'm gonna do a convert, that's an ImageMagick command. If you don't have ImageMagick or GraphicsMagick installed, just install one of them, and then you can do convert, or gm convert if you've got GraphicsMagick. So convert in MK1, that's the name of this image. I should probably look at this image to see what on earth it is. Okay, so actually it is a really basic black and white logo, that's funny. Okay, so I'm gonna actually convert a different one, this penguin picture that I have. So I'm gonna do a convert penguin.png to penguin dot, what am I saying, oh, PNM, right? Okay, and that happened, that's done, that was very quick.
And now I'm gonna use the c44 tool, c44 dash dpi 300, and I'm gonna feed it the penguin.pnm file, and then it tells me to define a DjVu file into which the conversion should be placed, so I'll just do penguin dot djvu, D-J-V-U is the file extension by default. And that's finished, that's done.

So now I'll open up a graphical file manager here, just for testing, to see what happens when I click on things. Actually, I'm gonna look at file sizes as well. The source of this penguin was 43.8 kilobytes. When I converted it to the PNM format, I got a 1.8 megabyte file, so that's obviously going sort of in the wrong direction, right? If compression is one of the things that we care about, going from 43 kilobytes to 1.8 megabytes is not a good thing. But wait a minute, the DjVu version, looking at that, is 18.1 kilobytes. That's quite a lot smaller than the original PNG at 43.8. So I'll click on that and take a look at it. It looks good, looks really nice, no complaints about this, except for one thing, and that is that the background is black, and that's because I brought it in from a PNG.

So I'm gonna go back up to my convert command, and in addition to doing the convert from PNG to PNM, I'm gonna add dash background, quote white, and then dash flatten. So that will take any kind of alpha channel that I inherit from the PNG, and it will cause the background behind that alpha channel, if you think of it that way, to be white, and then dash flatten flattens the image so that there is no alpha channel. So now we'll do that, and then we'll do the c44 dash dpi 300 penguin.pnm. Actually, you know what, I'm gonna drop the dash dpi 300 and let c44 go with its defaults. Okay, so now we have a DjVu file, which when I open it, yeah, it has white in the background, so that's good, that's a little bit better.
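Since those two commands were spelled out a word at a time, here they are written down as a small Python sketch. It only builds and prints the command lines rather than running them, so it works even without ImageMagick or DjVuLibre installed. The helper names flatten_command and c44_command are made up for illustration; the flags are the ones read out above.

```python
import shlex


def flatten_command(source, pnm_path):
    """ImageMagick call: paint white behind any transparency, then drop the alpha channel."""
    return ["convert", source, "-background", "white", "-flatten", pnm_path]


def c44_command(pnm_path, djvu_path, dpi=None):
    """c44 call for a PNM or JPEG input; dpi=None leaves c44 on its default."""
    command = ["c44"]
    if dpi is not None:
        command += ["-dpi", str(dpi)]
    return command + [pnm_path, djvu_path]


if __name__ == "__main__":
    # Print the commands instead of executing them. To run them for real,
    # pass each list to subprocess.run(command, check=True).
    for command in (
        flatten_command("penguin.png", "penguin.pnm"),
        c44_command("penguin.pnm", "penguin.djvu"),
    ):
        print(shlex.join(command))
```

Keeping the commands as argument lists, rather than one long string, avoids shell quoting problems if a filename ever contains a space.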
So that works. Now of course that's only a single file, one single image, and it's not really a document, it's just an image. But we can create digital books, sort of e-books, out of DjVu by combining several DjVu files into one bigger DjVu file. Now for that, we'll need another DjVu file, and I do have this TIFF, and it is a black and white logo, completely coincidentally.

It turns out that if you have a low quality image, or I should say a simple image, it doesn't have to be low quality, but it is expected to be simple, and in fact it is expected to be, what do they say, bitonal I think, then the tool that DjVuLibre provides for that is cjb2. I realize that neither of these commands, c44 or cjb2, make any sense, or apparently have anything to do with DjVu, and that doesn't annoy me, but it's just one of those things that you kind of remember after a while, or you put it into a script, and you never have to remember it yourself.

So cjb2 dash h gives me a couple of options, and again, I can specify my DPI. It says it defaults to 300. That seems reasonable to me. There's some cleaning up that you can do. A dash clean apparently cleans up the image by removing small flyspecks. You can make it lossy, and you can set the loss level. I'm not gonna do anything that fancy. I'm just gonna give it an input file, and it says that the input that it accepts is either a PBM or a TIFF. Okay, so cjb2 in MK1.tiff, and then I'll do in MK1.djvu, and that converted it pretty quickly as well. So once again, the source TIFF was 182 kilobytes. The DjVu version of that is 3.8, so quite the difference. I'll open it up here in Okular. It looks fine, it looks like a very accurate representation of the simple graphic, and that's good.
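The "put it into a script so you never have to remember it" idea could look something like the following Python sketch. It picks c44 for color or detailed images and cjb2 for bitonal ones, and refuses inputs the chosen tool was not said to accept. The ACCEPTED table only covers the formats named in the episode, the .jpg and .tif spellings are an assumption about common extensions, and encoder_command is a hypothetical helper name. The filename logo.tiff is a stand-in, not the file used in the episode.

```python
from pathlib import Path

# Input formats each encoder accepts, going by the help text read out in
# the episode. Extension spellings such as .jpg and .tif are assumptions.
ACCEPTED = {
    "c44": {".pnm", ".jpg", ".jpeg"},
    "cjb2": {".pbm", ".tif", ".tiff"},
}


def encoder_command(source, bitonal):
    """Return the c44 or cjb2 command line for `source` as a list of arguments."""
    tool = "cjb2" if bitonal else "c44"
    suffix = Path(source).suffix.lower()
    if suffix not in ACCEPTED[tool]:
        raise ValueError(
            f"{tool} is not listed as accepting {suffix or 'extensionless'} input; "
            "convert the image first"
        )
    # Same base name, .djvu extension, for the output file.
    return [tool, source, str(Path(source).with_suffix(".djvu"))]
```

For example, encoder_command("logo.tiff", bitonal=True) gives a cjb2 command, while asking for c44 on the same TIFF raises an error, which is the quirk described above.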
Okay, so now if we wanna make a DjVu file that contains both of these images as a page one and a page two, we can do that with the command djvm. I'll just do a dash h again, and it spells it out for me. It says that to compose a multi-page document, you can do djvm dash c for create, and then the destination file. So I'm just gonna call this output.djvu. And then finally you end with all of the pages that you want to put into this document. Alphabetically, it looks like in MK1 should come first, so I'm gonna do something crazy here and set penguin first, and then in MK1. So I've got output.djvu as my target, and then penguin.djvu, and then in MK1.djvu. Return, and it produces an output file for me.

And just because it's always fascinating to look at file sizes, it does look like this is about 25.6 kilobytes. So again, just keeping track of these things, I've got this penguin that was 43 kilobytes, and I've got this logo that was 182 kilobytes, in one document at 25 kilobytes. So literally both of them combined in a DjVu file is smaller than either of them separately. Pretty cool. And it does look as though the images are in the order that I defined, so the penguin comes first, and then the in MK1 comes second. Now if I had just done a wildcard, the DjVu files in the directory would have come in as in MK1 and then penguin, because that's alphabetical. But I did wanna demonstrate that you can manually set the order of the pages in your command.

Okay, so now we've got this document, and it's a self-contained document. You can open it up in Okular, or in DjView, or in djvu.js, and read it, and look at it, and it's great. But how can you find stuff in that document? Well, it turns out that doing metadata is pretty easy, and we can create a bookmarks file for this. I'm just gonna make one called book.marks in the same folder.
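Before getting into that bookmarks file, here is the djvm step from a moment ago as a small Python sketch, mainly to show the difference between an explicit page order and the alphabetical order a wildcard would give. djvm_create_command is a made-up helper name, and logo.djvu is a stand-in filename for the second page; the sketch builds the command without running it.

```python
def djvm_create_command(output, pages):
    """Build `djvm -c output page1 page2 ...`; pages appear in the order given."""
    if not pages:
        raise ValueError("at least one page file is required")
    return ["djvm", "-c", output, *pages]


# Explicit order: penguin first, then the logo.
explicit = djvm_create_command("output.djvu", ["penguin.djvu", "logo.djvu"])

# What a shell wildcard such as *.djvu would typically give: alphabetical order.
alphabetical = djvm_create_command(
    "output.djvu", sorted(["penguin.djvu", "logo.djvu"])
)

print(" ".join(explicit))
print(" ".join(alphabetical))
```

For a scanned book, zero-padded page filenames such as page-001.djvu would let the alphabetical order match the reading order, which is an assumption about naming rather than anything the tools require.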
It's just a text file, and you open it up with a parenthesis, an opening parenthesis, or a bracket, whatever you call it. It's a half circle, that thing. And then bookmarks, that's the word bookmarks. So we've got an open parenthesis, bookmarks, and I have not closed the parenthesis. Next line, open another parenthesis, and I'm gonna put in, let's do the word penguin. Quote, penguin, close quote, space, quote, hash one, close quote, close parenthesis, once. Okay, next line, open parenthesis, quote, what was the other one, oh yeah, a logo, close quote, space, quote, hash two, close quote, close parenthesis, and then finally close the main parenthesis, the big parenthesis.

So it's bookmarks, and then penguin, and then the DjVu page number, and then on the next line, the next thing that you want to locate, and then its DjVu page number, and then you close out the whole parenthesis. The parentheses delineate the level of everything. So bookmarks is the main entity, right? So you don't close that parenthesis until the very end of your bookmarks. That makes sense. Then each line itself is a new entry, and it needs to have a human readable title, and then the reference to the DjVu page number. Now if you don't know what that is offhand, you can just open up the DjVu file in a viewer and look, because you're doing this separately. You're doing this in a text file. So you look at that and you say, oh, the penguin is on page one. OK, so quote, hash one, close quote, close parenthesis.

Now if this was a more complex document, and we wanted subheadings, then just don't close penguin, and have logo, page two, as its own entry inside it, and then close the penguin parenthesis, and then close the bookmarks.
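Parentheses are hard to follow by ear, so here is a small Python sketch that writes out the bookmarks text described above from a nested list. The (title, page, children) tuple layout and the function names are illustrative assumptions, and the quote and backslash escaping is a cautious guess at what titles with those characters would need.

```python
def escape(text):
    """Escape backslashes and double quotes inside a quoted title."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def render_outline(entries, depth=1):
    """Render (title, page, children) tuples as indented outline lines."""
    lines = []
    for title, page, children in entries:
        head = f'{" " * depth}("{escape(title)}" "#{page}"'
        if children:
            # Leave this entry open, nest the children, then close it.
            lines.append(head)
            lines.extend(render_outline(children, depth + 1))
            lines[-1] += ")"
        else:
            lines.append(head + ")")
    return lines


def bookmarks_file(entries):
    """Wrap the rendered entries in the outer (bookmarks ...) parenthesis."""
    return "(bookmarks\n" + "\n".join(render_outline(entries)) + ")\n"


# Logo as a subheading of Penguin, as in the nested example.
print(bookmarks_file([("Penguin", 1, [("Logo", 2, [])])]), end="")
```

With the nested input shown, this prints:

```
(bookmarks
 ("Penguin" "#1"
  ("Logo" "#2")))
```

Passing two entries with empty child lists instead gives the flat version, where each entry closes on its own line.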
So if you leave a parenthesis open, then everything within that becomes subheadings, which is handy, because you can have a chapter and then a section, and then maybe a subsection, and then you close, close, close, and then you go back to the chapter level. So that's your level setting.

OK, once you have your bookmarks defined in a text file, you use a command called djvused, or maybe it's DjVu sed, I don't know how you say it. It's D-J-V-U-S-E-D, and then dash e, and then quote, set-outline, and then the name of the bookmarks file, so that's book.marks, close quote, and then dash s for save. If I don't have a dash s, it's a dry run. It'll apply the outline, it'll sort of validate the outline, really, is what it's doing, and then it will not save what it has just done, and your output.djvu will not have bookmarks. So dash s means save. So dash s, output.djvu, because that's the name of the file that I gave it.

All right, so now let's take a look. It doesn't take any time to apply an outline. That's really a fast one. There we go. And now, yeah, it looks like in Okular I've got a table of contents on the left with penguin and logo, logo being a child of penguin. I left the parenthesis open, so I can click on the little disclosure symbol there and get to see logo. And it's got the page number that it corresponds to over on the right.

Now, that's just been a very basic example. In real life, I have found that the file size savings have varied pretty wildly. It really depends on where your images are coming from, what you're converting from, and how much you're willing to compress them in the DjVu document. I would say, typically, I see about maybe a 20% savings. That's what I would guess. It's just a little bit shaved off the top. It starts to add up the more you do it.
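For reference, the djvused call described a little earlier can be written down as a short Python sketch that builds the command with and without the save flag. set_outline_command is a hypothetical helper name. The sketch assumes the outline filename contains no spaces, since it is placed inside the set-outline command string as-is.

```python
import shlex


def set_outline_command(djvu_path, outline_path, save=True):
    """Build the djvused call that applies an outline file to a DjVu document.

    With save=False the -s flag is left out, which matches the dry run
    described in the episode: the outline is applied but nothing is written.
    """
    command = ["djvused", "-e", f"set-outline {outline_path}"]
    if save:
        command.append("-s")
    return command + [djvu_path]


if __name__ == "__main__":
    # Print rather than execute; hand the list to subprocess.run to apply it.
    print(shlex.join(set_outline_command("output.djvu", "book.marks")))
```

Running the dry-run form first is a cheap way to catch an unbalanced parenthesis in the bookmarks file before touching the document.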
But I wouldn't expect, for instance, to take a PDF that you downloaded from somewhere, convert it to DjVu, and have it be 50% smaller, or 60% smaller, anything like that. It would more likely be either the same size or 10% or 20% smaller. The benefit, possibly, for you, is that DjVu is a sane and open format. It's fun to manipulate, and it's easy to find information on, because all of its specs and information are open and online. There aren't really any hidden glitches disguised as features, or features disguised as glitches, in DjVu. Not at least in the way that there are in PDFs, which sometimes are just so confusing that even when you figure something out, you can't really figure out if it should even be opening. Like, should it still be working? Shouldn't I have just broken this? So yeah, I'm really enjoying DjVu.

Now, you can also embed text, and I've never done that yet. I have not had the occasion to. I've not converted a document that has embedded text, the way that PDFs have. Those are not documents that I have bothered converting. Or if I have, it's been for quick reference on the go. And it's not one of those things where I'm thinking, oh, I need to select this text, I need to search for this exact string in the document. Obviously, for that sort of thing, I would want to have the text there, but so far, for the way that I'm using it, the text just isn't available in the things that I'm converting to DjVu. If I wanted text to be embedded in the DjVu, I would have to transcribe it, looking at the screen, typing it all out, and that would be silly. So I've not gone in that direction yet. I'm not saying I never will. I may well do that. And maybe I'll look into easy and quick ways to take a PDF with embedded text and convert it to DjVu while retaining the embedded text. Who knows, maybe. I've done crazier things. So if I ever do that, I'm sure you will hear more about it on Hacker Public Radio.
But until such a time, this has been an introduction to DjVu. Hopefully it's been informative to you. Maybe it's even useful. I suggest you try it out. If you've never used DjVu, give it a go. Make a document or get one from online. See what it's like. It's actually quite nice. I think you'll probably like it.

And if you intend to use it seriously, then sit down and kind of think about your workflow too. Because I know that for some people, well, for most everyone, PDF is a very easy output target. Because as I said earlier, it's probably in your file menu. It's like two clicks away. So if that stands in your way of using DjVu, then sit down and think of what ways you might be able to pull a couple of commands together to make that process easier.

And frankly, I don't know that it is a great format for everything. It might not be your first format for a paper that you've just written in LibreOffice. Would it make sense to go out to DjVu? I mean, arguably it would, but arguably not at all. Because you would just be basically generating a raster file of a representation of text and then embedding text. And that seems really odd. So there are probably better formats if you're just typing stuff up and you want something on the go. Maybe EPUB is the best answer for you. But if you're doing archival work or you're scanning documents, I mean, archival makes it sound fancy. If you're scanning stuff in because you like it, but you want to throw the physical copy of it out. Or maybe you like it and you see that the physical copy is decaying, so you want to preserve it. Scan it in, throw it into a DjVu file, and see how it treats you.

Thank you for listening. I will talk to you next time.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.