Episode: 2133
Title: HPR2133: Compression technology part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2133/hpr2133.mp3
Transcribed: 2025-10-18 14:43:21

---
This is HPR episode 2133 entitled "Compression Technology Part 1". It is hosted by first-time host Subishov and is about 20 minutes long. The summary is: an introduction to data reduction methods and run-length encoding.

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair, at AnHonestHost.com.
Howdy folks, this is 5150. If you're going to attend the Ohio Linux Fest this weekend, October 7th through 8th, be sure to seek out the Linux podcasters booth. It is a collaboration between Hacker Public Radio, the Podnutz network, the Chronopathic Oncast, the LinuxLUGCast, and your other favorite shows. The host of the new single board computer and virtual private server show is graciously providing swag in the form of mugs, stickers, and t-shirts. And I am bringing an unworn t-shirt from Kansas Linux Fest 2015, an upgrade year. That's going to be first come, first served, under the desk, so you're going to have to ask for it, and for stickers. We'd love to meet all our fans in Columbus. See you there.
Hello, HPR listeners. My name is Subishov and this is my first podcast for HPR. I come from Berlin in Germany, and I want to speak about compression technology. The plan is to make more than one episode out of it, because it's a wide field. In this first episode I want to lay out some foundations about information theory, redundancy, and transformations, and give an example of a very simple compressor from the old times called RLE, the run-length encoder, or run-length encoding. Okay, let's go on.
First, a short bit on information theory. The first word is entropy. This means chaos in data. For instance, if you take bytes out of a random generator, then you get very chaotic, or ideally totally chaotic, data, which has no structure and in which every bit carries information. This is incompressible by any method. But nowadays we have a lot of data which is not full entropy.

So there is a second word, redundancy. For text written in a language, this means, for instance, that words repeat from time to time, like "this" and "that", and there are rules for how the language is built; this counts as redundancy. Redundancy is what compressors aim to reduce, to make data storage more efficient, or to transfer less over the internet with compressed data versus the raw data.
Then the next terms: lossless versus lossy compression. Program code or text has to be compressed losslessly, so that the decompressor can reconstruct the original data bit by bit, because programs would just crash if some bits were flipped. For other data types, like the color information of pictures or the audio wave from your receiver, lossy compression, like MP3 audio or JPEG images, can be okay, because lossy compression exploits the way the human ear and the human eye work: they cannot recognize every different bit in a sample, in a sound or in a picture.

I'm talking about an ordered data stream: usually when you compress a file, the file starts at position zero, then you have a longer stream of bytes or bits, and at some point the stream ends. The compressor usually creates a bit or byte stream with the compressed result, and the decompressor's aim is to parse the compressed stream and reconstruct the original data.
Okay, transformations. It can sometimes be useful to transform the data before compressing, to make the compressor more efficient. For instance, suppose you have a temperature sensor and you record the sensor data. Usually the temperature doesn't spike and drop very fast over the day. So you can use a delta transformation: you take a temperature reading and the next temperature reading and subtract them from each other; then you have the delta, the difference between the two values. So you record the first temperature value, and then you only transmit or record the differences to the last reading. Usually you get mostly zeros, plus ones, and minus ones, because temperature ramps rather slowly. So you have only three different values, and this can be compressed much more efficiently than when you record 40 degrees, 41 degrees, and so on.
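As a minimal sketch of that idea, here is a delta transformation in Python; the temperature readings are made-up example data:

```python
# A minimal sketch of the delta transformation described above.
# The readings below are hypothetical example data.

def delta_encode(readings):
    """Keep the first value, then store only differences to the previous one."""
    if not readings:
        return []
    return [readings[0]] + [b - a for a, b in zip(readings, readings[1:])]

def delta_decode(deltas):
    """Reverse the transformation by summing the differences back up."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out

temps = [40, 40, 41, 41, 41, 42, 41]
encoded = delta_encode(temps)          # [40, 0, 1, 0, 0, 1, -1]
assert delta_decode(encoded) == temps  # lossless: the original comes back
```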
For image or audio compression, you can exploit much more compressibility by using a fast Fourier transformation, like what is done in JPEG and MP3 files. Because you change the structure of the data, the compressor can save more bits, and on the other side the decompressor reconstructs the data field and the inverse fast Fourier transformation reconstructs the original samples or the part of the picture.
With that, MP3 can usually cut the data size by about 1 to 10, and pictures saved in JPEG format are usually much, much smaller than the original BMP-style file on disk. Then there is one transformation which is good for general data, or especially text compression: the Burrows-Wheeler transformation, or BWT, which is exploited by the bzip2 program; you may have seen the file ending .bz2 before. The special property of the Burrows-Wheeler transformation is that it does not work on a bit stream or byte stream, but on whole blocks of data at a time, and it is lossless as well.
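To make the block-wise idea concrete, here is a minimal sketch of the forward transform using the naive sorted-rotations method; real bzip2 adds an end-of-block marker and further coding stages on top:

```python
# A minimal sketch of the forward Burrows-Wheeler transform on one block.
# This only shows the lossless, block-wise reordering the talk refers to.

def bwt(block: bytes):
    """Sort all rotations of the block and output their last column, plus
    the index of the original block, which the inverse transform needs."""
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    last_column = bytes(rot[-1] for rot in rotations)
    return last_column, rotations.index(block)

print(bwt(b"banana"))  # (b'nnbaaa', 3) -- equal letters cluster together
```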
Okay, now going more to the RLE part of this talk. For example, fax machines: you may know them. They're not so common anymore now that we have the internet, but they are still used to send documents to other companies and such. With a standard fax resolution of 100 dots per inch, a whole A4 sheet of paper amounts to 826 by 1169 pixels; that makes nearly a million pixels. Because fax machines only transmit black and white, every pixel is one bit, so that makes about 120 kilobytes. The fax modems communicate with other fax machines at 9600 bps, so that would be around 125 seconds, or even more with protocol overhead, if you transmitted these fax pictures uncompressed. Therefore the fax G3 compression was invented, and so standard pages, like letters with not so much text and many white areas on the paper, can be transmitted in 10 to 20 seconds per page. But in the worst case, when you put a piece of paper with a gray noise pattern into the fax machine, the transmission can last much longer than 125 seconds, because for very noisy data the G3 compression even increases the data size; it's not made for such chaotic data. But for the usual stuff you put into a fax machine, it works quite well.
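As a rough check of the arithmetic in that example (the figures are approximate, as in the talk):

```python
# Back-of-the-envelope numbers for the fax example above.
width, height = 826, 1169    # A4 page at roughly 100 dpi
bits = width * height        # one bit per pixel, black and white
print(bits)                  # 965594 -- nearly a million pixels
print(bits / 8 / 1000)       # ~120.7 kB uncompressed
print(bits / 9600)           # ~100.6 s raw at 9600 bps; protocol overhead adds more
```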
Okay, another example. In old times, on Windows 3.x and Windows 95 and 98 and so on, there was a startup logo in 16 colors at 640 by 480 pixel size. This boot or startup logo had a lot of black area around it, and in the middle of the screen there was the Microsoft Windows logo. In 16 colors, every pixel is half a byte, a nibble, so uncompressed this picture took about 300 kilonibbles; in bytes, that would be 150 kilobytes to put on a disk in a file. With RLE compression these logos were only 10 or 20 kilobytes big, because RLE, like the fax compression, shines when you have long runs of identical bytes.
So, okay, after some introduction, now going on to the part explaining the RLE compression algorithms. There are many different variants, but I will talk about a simple RLE without prefix codes and a simple RLE with prefix codes. In my example I work with 8-bit symbols, so every byte in the data stream is one symbol; I have 256 different symbols, and every one of these symbols can arise in the input data. Because I want to make one output stream out of the input stream, I use in-band signaling: I have to use a trick to put in the repeat count, which is what realizes the data compression.
Okay, let's first try an RLE compression without prefix. I take the first byte from the input and I put it out, because on the first byte I can't really do anything; I carry it on. Then I look at the second byte, and if the second byte is the same... For instance, let's start with this Windows boot logo: the first lines are big runs of zero bytes, because the logo is in the middle of the screen. So I see a zero byte and put it out, I see a second zero byte and put it out, and then I start counting: byte three, byte four, byte five, and so on, byte 99, byte 100. By the way, when I am at byte 100 the counter is at 98, because I don't count the first two bytes; they are already put out. Now, say, I see a different byte. Then I write out the counter and start over: I read this different byte at position 101 and put it out, and I read byte number 102, and it is different too, so I put it out as well. Byte 103 is the same as 102, so I put it out and start counting again like before, and so on. When I hit end of file, I write out the remaining counter or the last byte I didn't put out yet, and then I'm finished. The decompressor then reads the file byte by byte, and when it hits two consecutive identical bytes, the next byte is read as a counter, and the former byte is repeated that many times.
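Here is a minimal sketch of this scheme in Python. The byte-level layout, a doubled byte followed by a counter for up to 255 extra repeats, follows the description above, but details like the counter width are my assumption:

```python
# A minimal sketch of the RLE variant without prefix codes described above:
# after two identical bytes, the next byte is a counter for how many further
# repeats follow (0..255, so very long runs are split up).

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        out.append(byte)
        if i + 1 < len(data) and data[i + 1] == byte:
            out.append(byte)            # second copy signals "counter follows"
            run = 2
            while i + run < len(data) and data[i + run] == byte and run - 2 < 255:
                run += 1
            out.append(run - 2)         # extra repeats beyond the two copies
            i += run
        else:
            i += 1
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        out.append(byte)
        if i + 1 < len(data) and data[i + 1] == byte:
            out.append(byte)
            count = data[i + 2]         # counter byte after the double
            out.extend(bytes([byte]) * count)
            i += 3
        else:
            i += 1
    return bytes(out)

sample = b"\x00" * 100 + b"ABB"         # a long zero run, then mixed bytes
assert rle_decode(rle_encode(sample)) == sample
```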
Okay, so far so good. But, for instance, when I use RLE on ASCII text, in many languages we have double-L or double-S letters or similar, or double-T. In that case, for instance in "lossy", I have two S's: I read the first S and put it out, I read the second S and put it out and start counting, and then I see the Y. So my counter is still 0, and I lost a byte in compression, because I have to put a length byte after every two identical bytes. This can hurt the compression if I use RLE on data that is not very regular.

But there is a different approach to RLE, which I call RLE with prefix. That means I have two types of counters: in the counter byte I use one bit as a flag for whether compressible or uncompressible data follows, and the remaining seven bits I use as a counter from 1 to 128. When the compressor reads the first bytes, it checks if there are identical bytes. If more than two, or more than three or four, depending on the layout, identical bytes are found, the former bytes, which were uncompressible, go to the output first: a prefix byte is written with the flag "not compressible" and the count of bytes, and then those bytes are copied from input to output. Then it counts the identical bytes to be compressed, stopping at the first non-identical byte, and a prefix with the flag "repetition" is put out, followed by the byte of the data to be repeated. That means under bad circumstances, with uncompressible data, the compressed data will be up to 1/128th of the file size bigger; but that's not so much, and usually you use RLE on well compressible, very systematic data, so you usually don't hit this 1/128th file size increase.
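Here is a minimal sketch of such a prefix-code RLE in Python. The exact layout, with the top bit as the flag, the count stored as count minus one, and runs only encoded from length 3 up, is my assumption; the talk leaves those details open:

```python
# A minimal sketch of the prefix-code RLE described above: the top bit of
# each prefix byte flags a run vs. a stretch of literal bytes, and the low
# 7 bits hold a count from 1 to 128 (stored as count - 1).

RUN_FLAG = 0x80

def rle_prefix_encode(data: bytes) -> bytes:
    out = bytearray()
    literals = bytearray()

    def flush_literals():
        # emit pending uncompressible bytes in chunks of at most 128
        while literals:
            chunk = literals[:128]
            out.append(len(chunk) - 1)          # flag bit clear: literals
            out.extend(chunk)
            del literals[:128]

    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 128:
            run += 1
        if run >= 3:                            # runs shorter than 3 don't pay off
            flush_literals()
            out.append(RUN_FLAG | (run - 1))    # flag bit set: a run follows
            out.append(data[i])
        else:
            literals.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)

def rle_prefix_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        prefix = data[i]
        count = (prefix & 0x7F) + 1
        if prefix & RUN_FLAG:                   # a run: repeat one byte
            out.extend(bytes([data[i + 1]]) * count)
            i += 2
        else:                                   # literals: copy count bytes
            out.extend(data[i + 1:i + 1 + count])
            i += 1 + count
    return bytes(out)

sample = b"lossy text" + b"\x00" * 300
assert rle_prefix_decode(rle_prefix_encode(sample)) == sample
```

With this layout the worst case is one extra prefix byte per 128 literal bytes, which matches the 1/128th growth bound mentioned above.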
Yeah, that's so far what I wanted to say already. It's very simple; maybe try the scripting or programming language of your choice to write a simple RLE compressor and decompressor, and try it out with different data sets to get a feeling for how RLE works. My pro tip: then take a hex viewer or hex editor and look into the original data and the compressed data; that should be very easy to understand. Next time I will cover the next, more advanced compression algorithm. So far, have fun, and hear you later.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our "contributing" link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.