Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr2133.txt (new file, 154 lines)
@@ -0,0 +1,154 @@
Episode: 2133
Title: HPR2133: Compression technology part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2133/hpr2133.mp3
Transcribed: 2025-10-18 14:43:21

---
This is HPR episode 2133 entitled "Compression technology part 1".
It is hosted by first-time host Subishov and is about 20 minutes long.
The summary is: an introduction to data reduction methods, run length encoding.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Howdy folks, this is 5150.
If you're going to attend the Ohio Linux Fest this weekend, October 7th through 8th, be sure to seek out the Linux podcasters booth.
It is a collaboration between Hacker Public Radio, the Podnutz network, the Kernel Panic Oggcast, the Linux LUGcast, and your other favorite shows.
The host of the new single board computer and virtual private server show is graciously providing swag in the form of mugs, stickers, and t-shirts.
And I am bringing an unworn t-shirt from Kansas Linux Fest 2015, an upgrade year.
That's going to be first come, first served, under the desk, so you're going to have to ask for it, as well as for the stickers.
And we'd love to meet all our fans in Columbus.
See you there.
Hello, HPR listeners.
My name is Subishov and this is my first podcast for HPR.
I come from Berlin in Germany and I want to speak about compression technology.
The plan is to make more than one episode out of it, because it's a wide field.
And in this first episode, I want to lay out some foundations about information theory, redundancy and transformations, and give an example of a very simple compressor from the old times called RLE: run length encoder, or run length encoding.
Okay, let's go on.
First, a short bit on information theory.
The first word is entropy.
This means chaos in data.
For instance, if you take bytes out of a random generator, then you get very chaotic, or optimally totally chaotic, data, which has no structure and in which every bit carries information.
This is uncompressible by any method.
But nowadays we have a lot of data which is not full entropy.
So there is a second word: redundancy.
This means, for instance for text written in a language, that the words repeat from time to time, like "this" and "that".
And there are rules for how the language is built, and this counts as redundancy.
And redundancy is what compressors aim to reduce, to make data storage more efficient, or to transfer less over the internet with compressed data versus the raw data.
Then the next terms: lossless versus lossy compression.
Code or text has to be compressed losslessly, so that the decompressor can reconstruct the original data bit by bit, because programs would just crash if some bits were flipped.
And for other data types, like the color information of pictures or the audio wave from your receiver, lossy compression like MP3 or JPEG images can be okay, because this lossy compression exploits properties of the human ear and the human eye, which cannot recognize every different bit in a sample, in a sound or in a picture.
And I'm talking about an ordered data stream: usually when you compress a file, the file starts at position zero, then you have a longer stream of bytes or bits, and at some point the stream ends.
So the compressor usually creates a bit or byte stream with the compressed result, and the decompressor's aim is to parse the compressed stream and reconstruct the original data.
Okay, transformations.
Sometimes it can be useful to transform the data before compressing, to make the compressor more efficient.
For instance, when you have a temperature sensor and you record the sensor data, usually over the day the temperatures don't spike and drop very fast.
So when you use a delta transformation, you take a temperature reading and the next temperature reading and subtract them from each other; then you have the delta, the difference between these two values.
So then you record the first temperature value, and afterwards you only transmit or record the differences to the last reading.
So usually you get many zeros, plus ones and minus ones, because temperature ramps rather slowly.
And so you have only three different values, and this can be compressed much more efficiently than when you record 40 degrees, 41 degrees and so on.
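A minimal Python sketch of this delta transformation (an illustration, not code from the episode):

def delta_encode(readings):
    # Keep the first value, then store only differences to the previous reading.
    out = readings[:1]
    for prev, cur in zip(readings, readings[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    # Reconstruct the original readings by summing the differences back up.
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Slowly ramping temperatures collapse to mostly 0, +1 and -1:
temps = [40, 40, 41, 41, 41, 42, 41, 41]
print(delta_encode(temps))  # [40, 0, 1, 0, 0, 1, -1, 0]
assert delta_decode(delta_encode(temps)) == temps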
For image or audio compression, you can exploit much more compressibility by using a fast Fourier transformation, like what is done in JPEG and MP3 files.
And because you change the structure of the data, the compressor can spare more bits; the decompressor then reconstructs the data field, and the inverse fast Fourier transformation reconstructs the original samples or the part of the picture.
With that, MP3 can usually cut the data size down to about a tenth, and pictures saved in JPEG format are usually much, much smaller than the original BMP-style file on the disk.
And then, good for general data or especially text compression, there is the Burrows-Wheeler transformation, or BWT, which is exploited by the bzip2 program; you may have seen the file ending .bz2 before.
The special property of the Burrows-Wheeler transformation is that it does not work on a bitstream or bytestream but on whole blocks of data at a time, and it is lossless as well.
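For a feel of why this helps, here is a toy Burrows-Wheeler transform in Python (a naive sketch using sorted rotations; bzip2's real implementation is far more sophisticated):

def bwt(s, eof="\0"):
    # Naive BWT: sort all rotations of the text, take the last column.
    s += eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(t, eof="\0"):
    # Invert by repeatedly prepending the transform and re-sorting.
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    return next(r for r in table if r.endswith(eof))[:-1]

print(repr(bwt("banana")))  # 'annb\x00aa': identical letters cluster together
assert ibwt(bwt("banana")) == "banana"

The clustered runs in the output are exactly what a run-length style stage can then squeeze.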
Okay, now going more to the RLE part of this talk.
For example, fax machines: you may know them, they're not so common anymore now that we have the internet, but they are still used to send documents to other companies and such.
With a standard fax resolution of 100 dots per inch, a whole A4 sheet of paper amounts to 826 by 1169 pixels; that makes nearly a million pixels.
Because fax machines only transmit black and white, every pixel is one bit, so that makes about 120 kilobytes.
And fax modems communicate with other fax machines at 9600 bps, so that would be around 125 seconds, or even more with protocol overhead, if you transmitted these fax pictures uncompressed.
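Checking those numbers in Python (the 10 bits on the wire per byte, as start/stop framing, is my assumption to make the 125 seconds come out; it is not stated in the episode):

pixels = 826 * 1169        # A4 at 100 dpi: 965,594 pixels, nearly a million
size_bytes = pixels // 8   # one bit per black/white pixel: ~120,699 bytes, about 120 KB
seconds = size_bytes * 10 / 9600   # assumed 10 bits per byte at 9600 bps
print(pixels, size_bytes, round(seconds))  # 965594 120699 126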
Therefore the fax G3 compression was invented, and so standard pages, like letters with not so much text and many white areas on the paper, can be transmitted in 10 to 20 seconds per page.
But in the worst case, when you put a piece of paper with a gray noise pattern into the fax machine, the transmission can last much longer than 125 seconds, because for very noisy data the G3 compression will even increase the data size by compressing, because it's not made for such chaotic data.
But for the usual stuff you put in a fax machine, it works
quite well.
Okay, another example from the old times: on Windows 3.x and Windows 95, 98 and so on, there was a startup logo in 16 colors at 640 by 480 pixel size, and this boot or startup logo had a lot of black area around it; in the middle of the screen there was this Microsoft Windows logo.
Uncompressed, this picture took 300 kilo-nibbles, or half-bytes; that would be 150 kilobytes to put it on a disk in a file.
And with the RLE compression, these logos were only like 10 or 20 kilobytes big, because RLE, like the fax compression, shines when you have long runs of identical bytes.
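The logo numbers check out the same way (again just a sketch of the arithmetic):

pixels = 640 * 480   # 307,200 pixels
nibbles = pixels     # 16 colors = 4 bits = one nibble per pixel, ~300 kilo-nibbles
print(nibbles // 2)  # 153,600 bytes, about 150 kilobytes uncompressed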
So okay, now, after some introduction, going on to the part explaining the RLE compression algorithms, of which there are many different brands; but I will talk about a simple RLE with prefix codes and a simple RLE compression without prefix codes.
In my example I work with 8 bit symbols, so every byte in the data stream is one symbol; so I have 256 different symbols, and every one of these symbols can arise in the input data.
And because I want to make one output stream out of the input stream, I use in-band signaling: I have to use a trick to put in the repeat count, which is what realizes the data compression.
Okay, first let's try an RLE compression without prefix.
I take the first byte from the input and I put it out, because with the first byte I can't really do anything; I carry it on.
And then I look at the second byte, and check if the second byte is the same.
For instance, say we start with this Windows boot logo: the first lines are big runs of zero bytes, because the logo is in the middle of the screen.
So I see a zero byte and put it out, and I see a second zero byte and put it out, and then I start counting: byte three, byte four, byte five, and so on, byte 99, byte 100.
By the way, when I am at byte 100, the counter is at 98, because I don't count the first two bytes; they are already put out.
Okay, and now, for instance, I see a different byte; then I write out the counter, and then I start over.
I read this different byte at position 101 and put it out, and I read byte number 102, and this is different from byte 101, so I put it out as well.
And byte 103 is the same as byte 102, so I put it out and start counting again like before, and so on.
And when I hit end of file, I write out the remaining counter, or the last byte I didn't put out yet, and then I'm finished.
And the decompressor then reads the file byte by byte, and when it hits two consecutive identical bytes, the next byte is read as a counter, and then the former byte is repeated that many times.
Okay, so far so good.
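Here is a small Python sketch of exactly this scheme, two identical bytes followed by a count of further repeats (my interpretation of the description, with the count capped at 255 extra repeats per pair):

def rle_encode(data):
    # After two identical bytes, emit a count byte saying how many MORE
    # copies followed (0..255); longer runs simply restart the scheme.
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        out.append(data[i])
        if i + 1 < n and data[i + 1] == data[i]:
            out.append(data[i])
            j = i + 2
            while j < n and data[j] == data[i] and j - (i + 2) < 255:
                j += 1
            out.append(j - (i + 2))  # repeats beyond the first two bytes
            i = j
        else:
            i += 1
    return bytes(out)

def rle_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        out.append(data[i])
        if i + 1 < len(data) and data[i + 1] == data[i]:
            # two consecutive identical bytes: the next byte is the count
            out.append(data[i])
            out.extend(bytes([data[i]]) * data[i + 2])
            i += 3
        else:
            i += 1
    return bytes(out)

logo_line = bytes(100) + b"\x07\x07\x07"  # 100 zero bytes, then a short run
packed = rle_encode(logo_line)
assert rle_decode(packed) == logo_line
print(len(logo_line), "->", len(packed))  # 103 -> 6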
But for instance, when I use RLE on ASCII text: in many languages we have double-L or double-S letters or similar, or double-T.
And in this case, for instance in "lossy", I have two S's: I read the first S and put it out, I read the second S and put it out and start counting, and then I see the Y.
Okay, so my counter is still 0, and I lost a byte in compression, because I have to put a length byte after two identical bytes.
This can hurt the compression if I don't use RLE on very regular data.
But there's a different approach to RLE, which I call RLE with prefix.
That means I have two types of counters: in the counter byte I use one bit as a flag for whether compressible data follows or uncompressible data follows, and the remaining seven bits I use as a counter, from 1 to 128.
So when the compressor reads the first bytes, it checks if there are identical bytes, and if more than two, or more than three or more than four (depending on the layout) identical bytes are found, the former bytes, which were uncompressible, are prefixed in the output: the prefix byte is written with the flag "not compressible" and the count of bytes, and then those bytes are copied from input to output.
Then it counts the identical bytes to be compressed and stops at the first non-identical byte again; a prefix with the flag "repetition" is put out, and then the byte of the data to be repeated.
So that means under bad circumstances, with uncompressible data, the compressed data will be up to 1/128th of the file size bigger; but that's not so much, and usually you use RLE on data with good compressibility, very systematic data, so you usually don't hit this 1/128th file size increase.
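And a sketch of this prefix variant (the bit layout, a flag in the top bit and counts of 1 to 128, follows the description; the choice to only encode runs of three or more is my assumption):

FLAG_RUN = 0x80  # top bit set: a run follows; clear: literal bytes follow.
                 # Low 7 bits hold count - 1, giving counts from 1 to 128.

def rle_prefix_encode(data):
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        j = i + 1
        while j < n and data[j] == data[i] and j - i < 128:
            j += 1
        if j - i >= 3:
            # a run worth compressing: prefix byte plus the repeated byte
            out.append(FLAG_RUN | (j - i - 1))
            out.append(data[i])
            i = j
        else:
            # collect uncompressible literals until the next run of 3+
            start = i
            while i < n and i - start < 128:
                if i + 2 < n and data[i] == data[i + 1] == data[i + 2]:
                    break
                i += 1
            out.append(i - start - 1)
            out.extend(data[start:i])
    return bytes(out)

def rle_prefix_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        count = (data[i] & 0x7F) + 1
        if data[i] & FLAG_RUN:
            out.extend(bytes([data[i + 1]]) * count)  # repeated byte
            i += 2
        else:
            out.extend(data[i + 1 : i + 1 + count])   # literal block
            i += 1 + count
    return bytes(out)

sample = b"lossy" + bytes(50)  # double letters stay literal, the zero run packs
packed = rle_prefix_encode(sample)
assert rle_prefix_decode(packed) == sample
print(len(sample), "->", len(packed))  # 55 -> 8

Note that the two S's in "lossy" cost nothing extra here, unlike in the variant without prefix.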
Yeah, that's so far what I wanted to say; it's very simple.
Maybe try, in a scripting or programming language of your choice, to write a simple RLE compressor and decompressor, and try it out with different data sets to get a feeling for how RLE works.
And my pro tip is: then take a hex viewer or hex editor and look into the original data and into the compressed data, and yeah, that should be very easy to understand.
And next time I will cover the next, more advanced compression algorithm.
So far, have fun, and hear you later.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our Contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at BinRev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.