Episode: 2133
Title: HPR2133: Compression technology part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2133/hpr2133.mp3
Transcribed: 2025-10-18 14:43:21

---
This is HPR episode 2133 entitled "Compression Technology Part 1". It is hosted by first-time host Subishov and is about 20 minutes long. The summary is: an introduction to data reduction methods and run-length encoding.

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair, at AnHonestHost.com.
Howdy folks, this is 5150. If you're going to attend the Ohio Linux Fest this weekend, October 7th through 8th, be sure to seek out the Linux podcasters booth. It is a collaboration between Hacker Public Radio, the Podnutz network, the Chronopathic Oncast, the LinuxLUGCast, and your other favorite shows. The host of the new single board computer and virtual private server show is graciously providing swag in the form of mugs, stickers, and t-shirts. And I am bringing an unworn t-shirt from Kansas Linux Fest 2015, an upgrade year. That's going to be first come, first served, under the desk, so you're going to have to ask for it, and for stickers. We'd love to meet all our fans in Columbus. See you there.
Hello, HPR listeners. My name is Subishov and this is my first podcast for HPR. I come from Berlin in Germany, and I want to speak about compression technology. The plan is to make more than one episode out of it, because it's a wide field. In this first episode I want to lay out some foundations about information theory, redundancy, and transformations, and give an example of a very simple compressor from the old times called RLE, the run-length encoder, or run-length encoding. Okay, let's go on.
First, a short bit on information theory. The first word is entropy. This means chaos in data. For instance, if you take bytes out of a random generator, then you get very chaotic, or ideally totally chaotic, data, which has no structure and in which every bit carries information. This is incompressible by any method. But nowadays we have a lot of data which is not full entropy.

So there is a second word, redundancy. For text written in a language, this means, for instance, that words repeat from time to time, like "this" and "that", and there are rules for how the language is built; this counts as redundancy. Redundancy is what compressors aim to reduce, to make data storage more efficient, or to transfer less over the internet with compressed data versus the raw data.
Then the next terms: lossless versus lossy compression. Program code or text has to be compressed losslessly, so that the decompressor can reconstruct the original data bit by bit, because programs would just crash if some bits were flipped. For other data types, like the color information of pictures or the audio wave from your receiver, lossy compression, like MP3 audio or JPEG images, can be okay, because lossy compression exploits the way the human ear and the human eye work: they cannot recognize every different bit in a sample, in a sound or in a picture.

I'm talking about an ordered data stream: usually when you compress a file, the file starts at position zero, then you have a longer stream of bytes or bits, and at some point the stream ends. The compressor usually creates a bit or byte stream with the compressed result, and the decompressor's aim is to parse the compressed stream and reconstruct the original data.
Okay, transformations. It can sometimes be useful to transform the data before compressing, to make the compressor more efficient. For instance, suppose you have a temperature sensor and you record the sensor data. Usually the temperature doesn't spike and drop very fast over the day. So you can use a delta transformation: you take a temperature reading and the next temperature reading and subtract them from each other; then you have the delta, the difference between the two values. So you record the first temperature value, and then you only transmit or record the differences to the last reading. Usually you get mostly zeros, plus ones, and minus ones, because temperature ramps rather slowly. So you have only three different values, and this can be compressed much more efficiently than when you record 40 degrees, 41 degrees, and so on.
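As a minimal sketch of that idea, here is a delta transformation in Python; the temperature readings are made-up example data:

```python
# A minimal sketch of the delta transformation described above.
# The readings below are hypothetical example data.

def delta_encode(readings):
    """Keep the first value, then store only differences to the previous one."""
    if not readings:
        return []
    return [readings[0]] + [b - a for a, b in zip(readings, readings[1:])]

def delta_decode(deltas):
    """Reverse the transformation by summing the differences back up."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out

temps = [40, 40, 41, 41, 41, 42, 41]
encoded = delta_encode(temps)          # [40, 0, 1, 0, 0, 1, -1]
assert delta_decode(encoded) == temps  # lossless: the original comes back
```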
For image or audio compression, you can exploit much more compressibility by using a fast Fourier transformation, like what is done in JPEG and MP3 files. Because you change the structure of the data, the compressor can save more bits, and on the other side the decompressor reconstructs the data field and the inverse fast Fourier transformation reconstructs the original samples or the part of the picture.
With that, MP3 can usually cut the data size by about 1 to 10, and pictures saved in JPEG format are usually much, much smaller than the original BMP-style file on disk. Then there is one transformation which is good for general data, or especially text compression: the Burrows-Wheeler transformation, or BWT, which is exploited by the bzip2 program; you may have seen the file ending .bz2 before. The special property of the Burrows-Wheeler transformation is that it does not work on a bit stream or byte stream, but on whole blocks of data at a time, and it is lossless as well.
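To make the block-wise idea concrete, here is a minimal sketch of the forward transform using the naive sorted-rotations method; real bzip2 adds an end-of-block marker and further coding stages on top:

```python
# A minimal sketch of the forward Burrows-Wheeler transform on one block.
# This only shows the lossless, block-wise reordering the talk refers to.

def bwt(block: bytes):
    """Sort all rotations of the block and output their last column, plus
    the index of the original block, which the inverse transform needs."""
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    last_column = bytes(rot[-1] for rot in rotations)
    return last_column, rotations.index(block)

print(bwt(b"banana"))  # (b'nnbaaa', 3) -- equal letters cluster together
```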
Okay, now going more to the RLE part of this talk. For example, fax machines: you may know them. They're not so common anymore now that we have the internet, but they are still used to send documents to other companies and such. With a standard fax resolution of 100 dots per inch, a whole A4 sheet of paper amounts to 826 by 1169 pixels; that makes nearly a million pixels. Because fax machines only transmit black and white, every pixel is one bit, so that makes about 120 kilobytes. The fax modems communicate with other fax machines at 9600 bps, so that would be around 125 seconds, or even more with protocol overhead, if you transmitted these fax pictures uncompressed. Therefore the fax G3 compression was invented, and so standard pages, like letters with not so much text and many white areas on the paper, can be transmitted in 10 to 20 seconds per page. But in the worst case, when you put a piece of paper with a gray noise pattern into the fax machine, the transmission can last much longer than 125 seconds, because for very noisy data the G3 compression even increases the data size; it's not made for such chaotic data. But for the usual stuff you put into a fax machine, it works quite well.
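As a rough check of the arithmetic in that example (the figures are approximate, as in the talk):

```python
# Back-of-the-envelope numbers for the fax example above.
width, height = 826, 1169    # A4 page at roughly 100 dpi
bits = width * height        # one bit per pixel, black and white
print(bits)                  # 965594 -- nearly a million pixels
print(bits / 8 / 1000)       # ~120.7 kB uncompressed
print(bits / 9600)           # ~100.6 s raw at 9600 bps; protocol overhead adds more
```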
Okay, another example. In old times, on Windows 3.x and Windows 95 and 98 and so on, there was a startup logo in 16 colors at 640 by 480 pixel size. This boot or startup logo had a lot of black area around it, and in the middle of the screen there was the Microsoft Windows logo. In 16 colors, every pixel is half a byte, a nibble, so uncompressed this picture took about 300 kilonibbles; in bytes, that would be 150 kilobytes to put on a disk in a file. With RLE compression these logos were only 10 or 20 kilobytes big, because RLE, like the fax compression, shines when you have long runs of identical bytes.
So, okay, after some introduction, now going on to the part explaining the RLE compression algorithms. There are many different variants, but I will talk about a simple RLE without prefix codes and a simple RLE with prefix codes. In my example I work with 8-bit symbols, so every byte in the data stream is one symbol; I have 256 different symbols, and every one of these symbols can arise in the input data. Because I want to make one output stream out of the input stream, I use in-band signaling: I have to use a trick to put in the repeat count, which is what realizes the data compression.
Okay, let's first try an RLE compression without prefix. I take the first byte from the input and I put it out, because on the first byte I can't really do anything; I carry it on. Then I look at the second byte, and if the second byte is the same... For instance, let's start with this Windows boot logo: the first lines are big runs of zero bytes, because the logo is in the middle of the screen. So I see a zero byte and put it out, I see a second zero byte and put it out, and then I start counting: byte three, byte four, byte five, and so on, byte 99, byte 100. By the way, when I am at byte 100 the counter is at 98, because I don't count the first two bytes; they are already put out. Now, say, I see a different byte. Then I write out the counter and start over: I read this different byte at position 101 and put it out, and I read byte number 102, and it is different too, so I put it out as well. Byte 103 is the same as 102, so I put it out and start counting again like before, and so on. When I hit end of file, I write out the remaining counter or the last byte I didn't put out yet, and then I'm finished. The decompressor then reads the file byte by byte, and when it hits two consecutive identical bytes, the next byte is read as a counter, and the former byte is repeated that many times.
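Here is a minimal sketch of this scheme in Python. The byte-level layout, a doubled byte followed by a counter for up to 255 extra repeats, follows the description above, but details like the counter width are my assumption:

```python
# A minimal sketch of the RLE variant without prefix codes described above:
# after two identical bytes, the next byte is a counter for how many further
# repeats follow (0..255, so very long runs are split up).

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        out.append(byte)
        if i + 1 < len(data) and data[i + 1] == byte:
            out.append(byte)            # second copy signals "counter follows"
            run = 2
            while i + run < len(data) and data[i + run] == byte and run - 2 < 255:
                run += 1
            out.append(run - 2)         # extra repeats beyond the two copies
            i += run
        else:
            i += 1
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        out.append(byte)
        if i + 1 < len(data) and data[i + 1] == byte:
            out.append(byte)
            count = data[i + 2]         # counter byte after the double
            out.extend(bytes([byte]) * count)
            i += 3
        else:
            i += 1
    return bytes(out)

sample = b"\x00" * 100 + b"ABB"         # a long zero run, then mixed bytes
assert rle_decode(rle_encode(sample)) == sample
```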
Okay, so far so good. But, for instance, when I use RLE on ASCII text, in many languages we have double-L or double-S letters or similar, or double-T. In that case, for instance in "lossy", I have two S's: I read the first S and put it out, I read the second S and put it out and start counting, and then I see the Y. So my counter is still 0, and I lost a byte in compression, because I have to put a length byte after every two identical bytes. This can hurt the compression if I use RLE on data that is not very regular.

But there is a different approach to RLE, which I call RLE with prefix. That means I have two types of counters: in the counter byte I use one bit as a flag for whether compressible or uncompressible data follows, and the remaining seven bits I use as a counter from 1 to 128. When the compressor reads the first bytes, it checks if there are identical bytes. If more than two, or more than three or four, depending on the layout, identical bytes are found, the former bytes, which were uncompressible, go to the output first: a prefix byte is written with the flag "not compressible" and the count of bytes, and then those bytes are copied from input to output. Then it counts the identical bytes to be compressed, stopping at the first non-identical byte, and a prefix with the flag "repetition" is put out, followed by the byte of the data to be repeated. That means under bad circumstances, with uncompressible data, the compressed data will be up to 1/128th of the file size bigger; but that's not so much, and usually you use RLE on well compressible, very systematic data, so you usually don't hit this 1/128th file size increase.
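Here is a minimal sketch of such a prefix-code RLE in Python. The exact layout, with the top bit as the flag, the count stored as count minus one, and runs only encoded from length 3 up, is my assumption; the talk leaves those details open:

```python
# A minimal sketch of the prefix-code RLE described above: the top bit of
# each prefix byte flags a run vs. a stretch of literal bytes, and the low
# 7 bits hold a count from 1 to 128 (stored as count - 1).

RUN_FLAG = 0x80

def rle_prefix_encode(data: bytes) -> bytes:
    out = bytearray()
    literals = bytearray()

    def flush_literals():
        # emit pending uncompressible bytes in chunks of at most 128
        while literals:
            chunk = literals[:128]
            out.append(len(chunk) - 1)          # flag bit clear: literals
            out.extend(chunk)
            del literals[:128]

    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 128:
            run += 1
        if run >= 3:                            # runs shorter than 3 don't pay off
            flush_literals()
            out.append(RUN_FLAG | (run - 1))    # flag bit set: a run follows
            out.append(data[i])
        else:
            literals.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)

def rle_prefix_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        prefix = data[i]
        count = (prefix & 0x7F) + 1
        if prefix & RUN_FLAG:                   # a run: repeat one byte
            out.extend(bytes([data[i + 1]]) * count)
            i += 2
        else:                                   # literals: copy count bytes
            out.extend(data[i + 1:i + 1 + count])
            i += 1 + count
    return bytes(out)

sample = b"lossy text" + b"\x00" * 300
assert rle_prefix_decode(rle_prefix_encode(sample)) == sample
```

With this layout the worst case is one extra prefix byte per 128 literal bytes, which matches the 1/128th growth bound mentioned above.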
Yeah, that's so far what I wanted to say already. It's very simple; maybe try the scripting or programming language of your choice to write a simple RLE compressor and decompressor, and try it out with different data sets to get a feeling for how RLE works. My pro tip: then take a hex viewer or hex editor and look into the original data and the compressed data; that should be very easy to understand. Next time I will cover the next, more advanced compression algorithm. So far, have fun, and hear you later.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our "contributing" link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.