Episode: 3309
Title: HPR3309: Linux Inlaws S01E27: The Big Uncertainties in Life and beyond
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3309/hpr3309.mp3
Transcribed: 2025-10-24 20:35:39

---
This is Hacker Public Radio Episode 3309 for Thursday, the 8th of April 2021.

Today's show is entitled "Linux Inlaws S01E27: The Big Uncertainties in Life and beyond". It is hosted by monochromec, is about 57 minutes long, and carries an explicit flag. The summary is: the two chaps discuss uncertainties and beyond in this episode on probabilistic data structures.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

This is Linux Inlaws, a podcast on topics around free and open source software, any associated contraband, communism, the revolution in general, and whatever else fancies your tickle. Please note that this and other episodes may contain strong language, offensive humor, and other certainly not politically correct language. You have been warned. Our parents insisted on this disclaimer. Happy, mum? Thus the content is not suitable for consumption in the workplace, especially when played back on a speaker in an open plan office or similar environments, by any minors under the age of 35, or by any pets, including fluffy little killer bunnies, your trusted guide dog (unless on speed), and cute T-Rexes or other associated dinosaurs.
|
||
|
|
Welcome to Linux Inlaws, season one, episode... I can't even remember which episode it is. Episode... where are we, vaguely? Maybe it's episode forty-something. Which is of course a segue to today's subject, namely probabilistic data structures. And for this subject we have none other than Olga — sorry, Elena. You probably remember her as the peasant girl back in the Halloween special of something called Linux Inlaws. If you haven't listened to it, dear listeners, please go back; you'll find the link on the website. It's an episode not to be missed, necessarily. Yes, exactly.

So, when Elena is not working as a voice double, she's actually working at a company called... well — maybe, Olga, especially Elena, why don't you introduce yourself properly? Wow, you already know Olga, so I'm just going to summon my Elena persona. Please do, yes. Okay, anything else we should be aware of here? That's me — my Elena persona is pretty, pretty stable. Although I have different roles: well, I have my professional role, of course, I have the role of a mom, I have the role of a friend — but in this context, let's talk about the professional Elena. So I work for Redis Labs as a technical enablement... oh no, sorry, now senior technical enablement architect. Well done, by the way, before I forget this. Thank you, thank you very much.

Technical enablement architect basically means we help the technical field — everyone working in the field in technical roles — to, first of all, get up to speed with the system and overall architecture when they join the company, and then also keep their knowledge fresh by releasing new trainings about all the different technical aspects of our system. So that's what I do professionally. Outside of that, a thing that's really close to my heart is the women in tech question. But personally, I am more the type of woman who, instead of going on Twitter and complaining about what someone did, likes to roll up her sleeves and get some work done — do some coding, or just become a better engineer in order to be a stronger woman in tech, a woman in technology. So I had done some work previously in that field, and a while ago I was invited by the Women in Tech global organization to be the global education program director. So now we're working on that. We have some partnerships with Cisco and Microsoft, and we're going to be doing some free courses for women and people who identify as female, or minorities in technology. Wow. So yeah, coding. That's my thing.
|
||
|
|
Excellent. Okay. For those of you who don't know: I first met Elena, I think it was in November or December 2019 in London, when you gave a presentation on something called probabilistic data structures for an in-memory NoSQL tool called Redis. Yes. And this will be today's subject, as in: what really are probabilistic data structures, and why are they important? Not just in a Redis context, but generally speaking. But before we go into these technical details, maybe it's worth explaining what Redis is in an in-memory NoSQL context, given that two-thirds of today's speakers on this podcast work or have worked for Redis — one defected quite some time ago. So let's do this jointly.

Redis is about ten years old. I think it holds the seventh rank on a website called DB-Engines, if I'm not completely mistaken. And if Stack Overflow is anything to go by, it's been the most loved database for, I think, at least four years in a row in terms of votes. The main differentiator — that's probably the term I'm looking for — is that, in contrast to other databases, Redis does it all in main memory. Yes, it supports persistence, but the main focus of the processing of data is doing it in memory, and hence this kind of playground of real-time performance. But why should I do all the explaining?
|
||
|
|
Elena, Martin, would you care to chime in? Well, we started really well. Redis is very fast, it's very loved by developers, and I can speak for myself at least: one of the things that made me fall in love with Redis, even before I knew about Redis Labs — so before I joined Redis Labs — was the efficiency, and the fact that it doesn't necessarily stick to all the academic talk. Many times it just uses some approximations to get things done and make them work very well in 99% of the cases for people. So, this being an episode on probabilistic data structures: even in Redis itself we have a lot of approximations — if we talk about the LFU and LRU eviction policies in Redis, they use approximations too. And I kind of like that efficiency: a small memory footprint, very fast; it does things well and fast. Martin, anything to add? Yeah, the one thing I think, for people not familiar with Redis, is that it's basically a bunch of data structures that you use for different purposes. As opposed to your relational database, which has tables and those kinds of structures, it's a bunch of different data structures which are very close to programming paradigms instead. So it's more of a building-block type of piece of technology, right? Interesting observation there. Yes. A bunch of Legos for us to play with. Yeah, that's actually a very good image. Okay, but enough about Redis details.
|
||
|
|
Elena, what exactly are probabilistic data structures, and why are they so important? Okay, so what exactly are probabilistic data structures? It's a group of data structures that give a reasonable approximation, but using just a fraction of the time and memory that the deterministic data structure would use. They usually use hash functions to randomize and compactly represent the set of items, and then collisions are ignored, which usually leads to some margin of error.

Before we go any — sorry, I mean before we go any further, I think we cannot assume that everybody knows what hash functions and collisions are. So, hash functions: let's say you have a value and you hash it — I don't know how deep we should go into explaining how hash functions work — but you have that value, and many different values can have the same hash. And that would be a hash collision. I like to use this analogy, for people who don't know what a hash function is: if the object that we have is, let's say, a real-life object, a hash of that object is its shadow. So the shadow of that object would be its hash. It kind of represents, it kind of is a silhouette of what was there. We can tell something about it, but you cannot know exactly what it is, and many different objects — let's say a ball, someone's head, or a lamp — can have the same shape of shadow, yet be, in that way, different objects. Very much so. I like the image, I like the comparison there.
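To make the analogy concrete, here is a small Python sketch (not from the episode): a hash "shadow" is deterministic but lossy, so many distinct inputs can fall into the same small bucket.

```python
import hashlib

def shadow(obj: str, buckets: int = 8) -> int:
    """Project an object onto a tiny 'canvas' of `buckets` slots.

    The projection is deterministic (same input, same output) but lossy:
    distinct objects can share a slot -- that is a hash collision.
    """
    digest = hashlib.sha256(obj.encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

# Deterministic: hashing the same value twice gives the same shadow.
assert shadow("ball") == shadow("ball")

# With only 8 buckets, some of these 9 distinct objects are bound to
# collide (pigeonhole principle: 9 objects, 8 slots).
objects = ["ball", "head", "lamp", "dog", "cat", "tree", "car", "book", "cup"]
shadows = [shadow(o) for o in objects]
assert len(set(shadows)) < len(objects)  # at least one collision
```

A larger canvas (more buckets) makes collisions rarer but never impossible, which is exactly the trade-off the probabilistic structures below exploit.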
|
||
|
|
That's — I like the analogy, let's put it this way. Beautiful. Thank you. So yeah, in probabilistic data structures you use that kind of thing: you represent the elements with their shadows, basically.

Now, by the way, before I forget, a very practical use of hashes: if you look into a Linux system, the passwords stored in /etc/passwd or in /etc/shadow are actually salted hashes. I won't go into the gory details, but this is the primary touch point: on a Linux-based system, when you log in and type in your password, the password is not stored in clear text in /etc/passwd or in /etc/shadow, but rather as a hash.
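A sketch (not from the episode) of the login comparison Chris describes: hash the typed password with the stored salt and compare hashes; the clear text is never kept. The password, salt, and iteration count here are illustrative values.

```python
import hashlib
import os

def hash_password(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a salted hash; only this (plus the salt) is stored on disk."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)

# Account creation: store (salt, hash), never the password itself.
salt = os.urandom(16)
stored = hash_password("correct horse", salt)

# Login: hash whatever the user typed, with the same salt, and compare.
assert hash_password("correct horse", salt) == stored   # right password
assert hash_password("wrong password", salt) != stored  # wrong password
```

The salt ensures two users with the same password still get different stored hashes, which is why it lives alongside the hash in /etc/shadow.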
|
||
|
|
Not that this has much to do with probabilistic data structures. Thank you, Martin — so let's cut this short: this is just another application of hashes. The idea is basically: you enter a password, this password is then converted by a hash function into a hash, and then the hashes are compared — these hashes should simply match. Sorry — yeah, it's still connected: nothing to do with probabilistic data structures as such, but they still use a deterministic function. A deterministic hash function means that every time you run it with the same input, it's going to give you the same result, the same output. Actually, I'm not sure that is true for passwords, to be honest — it depends on the system. It should be, at least for PDS — can I say PDS? Probabilistic data structures. Thank you. Yes. There it is definitely true, and crucial to their functioning.

Now, why would someone use probabilistic data structures? You might ask: okay, well, why would I want to sacrifice precision? We are developers. We work with exact things. It's an exact science. I want my results always correct. But there are cases where you would sacrifice some precision, some accuracy, if you can gain space or time. So there is this triangle — the triangle of space, time, and accuracy in data processing — where you have space, accuracy, and time on the three corners of the triangle. And you cannot have all three with one data structure. Either you have accuracy and low memory, but then you sacrifice time, so you don't get real-time results; or you have space and time, as in the case of probabilistic data structures: you save on space, you get good time performance, but you sacrifice accuracy. You cannot have all three.
|
||
|
|
Exactly. This is also known as the Kolevska theorem, right, if I'm not completely mistaken? No, I don't think so. Maybe I'm wrong, I don't know. Maybe there were some Kolevskas who lived long before me. There are mysterious people in the PDS world, though. We can talk about that — the Bloom guy, behind one of the most famous structures, the Bloom filter. He's completely mysterious. You can't find anything about him online. There are no photos. We don't know if he's still alive or not. He's mysterious.

Yes — let's park that for about two and a half minutes before we go into the subject. But okay, Martin: any thoughts on PDSs before we move on? Well, we're not moving off PDSs, are we? I meant in terms of the principles and stuff and whatever — the theory,
|
||
|
|
that is. Yeah, no, that's very well put about the trade-offs, right? It's a very different approach: if you look at it from the angle of other databases, other technologies, here you effectively compute a data structure at storage time, rather than storing all the data as-is and then calculating a result out of it afterwards. And that goes especially for use cases where you quickly run into either time or space problems — having to store terabytes, petabytes, exabytes of data, where simply main memory, never mind secondary storage, doesn't measure up. The same goes for run time. And this is where PDSs, if you are willing to trade accuracy against these kinds of metrics, really make sense. Exactly, exactly. And it's becoming even more relevant because nowadays we have the rise of big data. So we need to store a lot of data, but we also want it in real time. Now, how can we do that? I have an idea: use a GPU database. Oh, dear. Anyway, let's not go there today.
|
||
|
|
Full disclosure: Martin works at Berkeley, the GPU database shop. Richard, if you're listening, the email address is sponsor@linuxinlaws.eu — Martin will send you the details. Okay, sorry — end of commercial break; please do continue.

So yeah, another use case is the exact opposite: not big data this time, but very limited memory on some devices, like routers or maybe some IoT devices. You have a very, very limited memory space, so you cannot store a bunch of data there against which to compare things — to check, let's say, whether a member is present in a set. And I guess also, for some questions you don't need the exact answer: you don't need to know whether it was one million and five or one million and twenty-three; you care whether it's a million or a hundred thousand. Yeah, exactly. If you want, we can already talk about Bloom filters, which are one of the simpler ones, but they're pretty cool, and they have some really nice use cases, so it becomes a little bit more obvious to the audience.
|
||
|
|
By all means — and there's no right or wrong here, yes. Okay. So, there are four main families of probabilistic data structures: membership, cardinality, frequency, and similarity. Membership asks: is this member present in a set? That's all it does. The Bloom filter is one of the probabilistic data structures in this family — together with cuckoo filters and the different variants of Bloom filters, these are its most prominent members.

So, a Bloom filter can answer a question like, for example, in the financial vertical: has this user paid from this location before? Or: has this credit card been reported as stolen? Imagine how many transactions are done in the world every day, every minute, every second, with how many different card numbers. Every one of those transactions, I'm assuming, needs to be checked against whether the card has been reported as stolen. If all of those checks have to go to some main database, some relational database — first of all, imagine the size of the database where we store all those stolen credit cards. Then that database has to be shared between the different payment processors in the world. It needs to be updated all the time — in real time, essentially, right? And on top of that you even have the problem of storing those numbers, so the problem of security: you need to actually store those credit card numbers so you can compare whether this new card that just paid is part of those numbers or not.
|
||
|
|
So in this specific use case, Bloom filters are a perfect match. You can populate a Bloom filter by adding: you take your list of all the cards that have been reported as stolen, you hash them, and you put them in the Bloom filter. And now, every time someone wants to check whether a card number has been stolen, they just take that number, hash it, and compare it to what we have in the Bloom filter. The Bloom filter can give you two responses: yes or no. If it gives you the response no, you can definitely trust that response — that card has definitely not been reported stolen. If it gives you the response yes, it means: okay, this might have been stolen, but maybe not. If we go back to the shadows analogy: you are asking, is this ball present in the set? The filter looks, finds a shadow that looks like a ball, and says: yeah, I see a shadow of a ball, I think it is present — but actually it was not a ball, it was a lamp, or it was someone's head. Exactly. But if it cannot find any shape that looks like a ball, it can definitely say: no, for sure, there are no balls in this set. So, no false negatives — if you throw a triangle at the set, it will definitely say: sorry, not present. Exactly. So for this specific use case, when you're checking whether a card has been stolen, the most valuable response to you is no, because in most of the transactions that you're going to make, the response is going to be no, right? Okay. And just by knowing that for sure, whenever your Bloom filter answers no, you're prevented from having to go to the main database to check. So we save a lot of load on some main database somewhere.
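A minimal Bloom filter sketch in Python (an illustration, not the RedisBloom implementation): k hash functions set k bits per added item; a lookup can return a false positive, but never a false negative. The card numbers are made up.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 4):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive k independent positions by salting the hash with the index.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False => definitely absent; True => only *possibly* present.
        return all(self.bits[pos] for pos in self._positions(item))

stolen = BloomFilter()
for card in ["4929-1111", "4929-2222", "4929-3333"]:
    stolen.add(card)

assert stolen.might_contain("4929-2222")  # an added item always answers True
if not stolen.might_contain("4929-9999"):
    print("definitely not stolen -- skip the trip to the main database")
```

Note the `if` on the last check: a card that was never added will almost always come back `False`, but the filter is allowed to say `True` by mistake, which is exactly the yes-means-maybe behavior described above.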
|
||
|
|
Another use case would be Netflix, right? Netflix — it's one of these hipster video networks, yes. Netflix uses caches all over the place, funnily enough, and if I understand this correctly, what Netflix essentially does in that context is cache chunks of video information in local caches. So when an endpoint, like a mobile device or a browser, goes back to the middle tier asking, "hey, I need this chunk of video, is it present in your cache?", all you have to do is apply a Bloom filter. If it's not present, you simply go upstream and get it; otherwise you can stream it directly from the cache. Exactly.

I have a question. How would that lookup compare to, say, an index lookup on a radix tree, in terms of time cost? It's O(1), if I remember correctly, because you just have to compute the hash function, right? Exactly — it doesn't depend on anything; you just need to compute the hash function. And not just that: the difference between the memory needed to store the Bloom filter and the memory needed to store all of the stolen cards as full strings or numbers, whatever it is, is very big. And imagine you then also need to propagate this further: if your full data structure containing all of the stolen cards is two megabytes or two hundred gigabytes, and you need to sync that between regions so everyone can have good latency — it really does make a huge difference, right? Makes sense. Makes sense, Martin? Yes, of course. Sorry. It's a lot of stolen cards, by the way. Martin is pondering the possibility of going into credit card fraud wholesale. Martin, keep it going — got to know the variables here.
|
||
|
|
That's proprietary, I'm afraid. Okay — very, very interesting and useful information there, especially with regard to the use case, because fraud detection is becoming much more of a problem than, say, 10, 15, 20 years ago, because of the rapid move into online business — never mind the other big driver there. Corona, sorry, yes. Corona or not, more and more people are buying stuff online, and most of the time they use some sort of credit card or debit card payment method. So fraud detection is becoming more and more important as we speak, at this very point in time, because if you're a merchant or a payment processing company, you want to make sure that you are not being ripped off. Goes without saying.

Yes, exactly. And there are many other use cases. Mostly, I think it's the better fit when the meaningful answer to your question is no. If you get a yes from the Bloom filter, that's still okay — you just know that in this case you're going to have to go and hit the main database, whatever that may mean. But all those no answers prevented you from going to the main database all the time. Better latency, of course. Just curious, Elena — because the whole thing
|
||
|
|
rides on the hashing function: what would be the typical probabilities, let's put it this way, of your average PDS implementation with regard to accuracy? As I said, at the end of the day it depends on the implementation of the hashing function — but what is your experience with regard to the probabilities coming back, if you can talk about this? Yeah, of course. That should be totally configurable. Everyone can implement their own Bloom filter; that's really not a problem.
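For anyone who does roll their own, the standard Bloom filter sizing formulas (textbook analysis, nothing Redis-specific) tie the expected number of items n and the target false-positive rate p to the bit-array size m and the number of hash functions k:

```python
import math

def bloom_parameters(n_items: int, error_rate: float):
    """Optimal Bloom filter sizing for n items at a target false-positive rate.

    m = -n * ln(p) / (ln 2)^2   bits in the array
    k =  (m / n) * ln 2         number of hash functions
    """
    m = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# One million card numbers at a 1% false-positive rate:
m, k = bloom_parameters(1_000_000, 0.01)
print(f"{m} bits (~{m // 8 // 1024} KiB), {k} hash functions")
```

Roughly 9.6 bits per item buys a 1% error rate regardless of how long the card numbers themselves are, which is where the memory savings discussed above come from; a tighter error rate grows the array only logarithmically.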
|
||
|
|
The algorithm is out there; people can do it. And the accuracy is going to depend on the number of hash functions you use — which, of course, will in turn influence CPU usage — and on the size of your Bloom filter: the bigger the bit array, the more precise, the more accurate a filter you have; the smaller the bit array, the less so. Yeah, we should probably explain how a bit array comes into the picture. The bit array is the field, the canvas, onto which we project the shadows. If our canvas is one by one meter and we are projecting 20 shadows onto that one square meter, it's going to get blurry, and after a while you won't be able to recognize anything — everything will look like everything. If you project the same 20 elements onto a 10 by 10 canvas, they're going to be much more spaced out, and you'll be able to recognize them much more easily. Okay. Yeah — just to make the point clear, the size of the bit array, in this analogy, is the size of the canvas. Okay. You alluded to the whole thing earlier
|
||
|
|
on, but why are they called Bloom filters, and why is Bloom that much of a mystery? For me — I don't know why he chose, or whether he ever chose, to be a mystery — for me he's a mystery because I couldn't find anything about him on Google. He's the chap who essentially invented the algorithm behind this. Yeah. Yeah, I think in the seventies or something like that. First it was used for a dictionary, actually — to check words against the dictionary. Yeah. Full details, of course, including the quantum superposition of the Bloom filter, will be in the show notes — but do carry on, Elena. Well, that's it. I don't have anything else about the mysterious... Burton Howard Bloom, I think. Howard Bloom. Okay. Is he still alive, or...? I don't know. No one knows. Okay. When I was researching this a year or a year and a half ago, I couldn't find out anything. But he is alive, is he? I don't know — I don't know if I ever knew that, actually.

On the off chance — on the off chance, Mr. Bloom, if you're listening, the email address is feedback@linuxinlaws.eu. Please get in touch, so we can plan you in for an episode some time in the future. Okay. Going back to the topic at hand now.
|
||
|
|
The way I see it, essentially, is that there are a couple of implementations of Bloom filters — maybe you can elaborate on these. We have one implementation of Bloom filters, although we do have something quite specific to us — I don't know if it's patented, but among the people who worked on it are Ariel, and Carlos Baquero, who worked on a really striking paper on scalable Bloom filters. With these scalable Bloom filters you can just specify an error rate you want to keep. So you say: okay, I want my error rate to always stay below this value. And then, as you add more elements to the Bloom filter, it scales up — it keeps stacking filters one on top of the other — so it can keep your error rate at what you requested. That's pretty specific to Redis Labs, to the RedisBloom implementation.

But just to go one step back first: RedisBloom is a module for Redis. Redis implements a module API, where anyone can go ahead and create a module, you see, and extend the data structures. What's a module? Please be more specific. It's a piece of code, written in C, that uses the Redis module API, and with it you can extend Redis — you can even hijack some commands that are sent to the Redis server. So you load your Redis server with your module, and you can implement your own data structures. In Redis you have strings, you have hashes, lists — you have a few data structures; with a Redis module you can implement your own. And in the case of the RedisBloom module, there are four new data structures that were implemented in that module: the Bloom filter; the cuckoo filter, which is similar to Bloom but implemented quite differently, let's say, with a similar use case — it's from the membership family; the Count-Min Sketch; and the Top-K heavy keeper. Those are the four data structures we have. You mentioned there were four different families, for different purposes — so the Bloom filters are membership, I take it.
|
||
|
|
What are the others for? Yes. So, cardinality — that is another family. Cardinality is for estimating the cardinality of a set. Membership determines whether a member is present in a set; cardinality determines the cardinality of a set, the number of distinct elements. Frequency, obviously, estimates the frequency of elements in a stream, and similarity determines the degree of similarity between elements. When would you use the other ones? Do you have some examples?

Okay, so for cardinality: in the RedisBloom module we don't have anything for cardinality, but we have the HyperLogLog, which is a native Redis data structure. We can talk a little bit about that, because it's another super cool one. With the HyperLogLog, in only 12 kilobytes of memory, you can estimate the cardinality of huge sets. Let's explain that with a use case.
|
||
|
|
Let's go with YouTube videos, for example. A YouTube video has a number of views. How would counting those views work conceptually? Well, let's go with IP addresses, or whatever kind of unique user identifier — maybe a user ID, maybe some combination of an IP with some identifier, browser, cookie, or something. So you have some kind of unique user ID. And every time that user ID views a video, you need to decide whether that user has viewed that video before. For you to be able to do that, you have to think: well, how can I know if they viewed this video before — unless I have a list of everyone who has ever viewed it, and I compare against that list? And the more viewers I have, the longer this list becomes; it becomes pretty much unmaintainable in the end. With the HyperLogLog, every time a new user comes in, you get their ID and just stick it in the HyperLogLog data structure. And every time you query that data structure — okay, how many unique elements have you seen? — it's going to tell you. And you can add huge numbers — at the moment, unfortunately, I don't remember the exact figures — and it still guarantees low errors. I think the margin of error was under one percent, but it's going to tell you: okay, I've seen this many unique elements. It's very interesting how it's implemented inside, but I'm not sure I'd be able to explain it very well here — it would be just words and not a visual explanation. Yeah, I think we're going to need that for the listeners to our podcast as well. The details, of course, will be in the show notes.
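In the meantime, here is a toy HyperLogLog in Python — an illustration only, far cruder than the Redis implementation: a few bits of each item's hash pick a register, each register remembers the deepest "rare" bit pattern it has seen, and a harmonic mean over the registers yields the estimate. The register count (256) and the "user-N" IDs are arbitrary choices for the demo.

```python
import hashlib
import math

class ToyHyperLogLog:
    def __init__(self, b: int = 8):
        self.b = b
        self.m = 1 << b                      # number of registers (256 here)
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)               # low b bits pick a register
        w = h >> self.b                      # remaining bits
        # rho = 1-based position of the lowest set bit: P(rho = k) = 2^-k
        rho = (w & -w).bit_length() if w else 57
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self) -> float:
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:    # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est

hll = ToyHyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")                     # 10,000 distinct IDs
print(round(hll.count()))                    # roughly 10,000, give or take a few percent
```

Adding the same ID twice changes no register, which is exactly why duplicate views don't inflate the count — and the whole state is 256 small integers, no matter how many IDs flow through.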
|
||
|
|
To keep track of all of them: there is actually a presentation I gave at UbuntuCon, where I spend a few minutes explaining how the HyperLogLog works internally — maybe we can link that for anyone who's interested. Yes, the details, of course, will be in the show notes, dear listeners.

Okay, so that's the cardinality family. The frequency family estimates with what frequency some element occurs in a stream, without having to store all the elements that have ever occurred in that stream. In the Redis domain we have the Count-Min Sketch and the Top-K, with its heavy-keeper algorithm — the Top-K is the more performant one. What it does is: you watch all the elements that show up in some stream, and you can see which elements appear the most — the top five, the top ten. A nice use case for this can be gaming: who are the key players with the high scores, when you can afford to sacrifice some precision. So the flow is the incoming game scores, and you could even store a separate sorted set holding the top K users. Every time a user scores points, it's added to the Top-K list.
|
||
|
|
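A rough sketch of that flow (not the actual Heavy Keeper algorithm from RedisBloom) can combine a Count-Min Sketch for frequency estimates with a small candidate dict holding the current top K. The stream contents and the width/depth parameters below are made up for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency estimator: may overcount, never undercounts."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a row-seeded hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

# Feed every score event through the sketch, keep only K candidates around.
sketch, top, K = CountMinSketch(), {}, 3
stream = ["alice"] * 50 + ["bob"] * 30 + ["carol"] * 20 + ["dave"] * 2
for player in stream:
    sketch.add(player)
    top[player] = sketch.estimate(player)
    if len(top) > K:                   # evict the weakest candidate
        top.pop(min(top, key=top.get))

print(sorted(top, key=top.get, reverse=True))
```

The point of the sketch is that its memory stays at width times depth counters regardless of how many events flow through it.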
Or trending hashtags: for social media platforms or news distribution networks, you can ask, what are the K hashtags people have mentioned the most in the last X hours? Imagine the flow of information, how many hashtags someone would otherwise have to store. But if you pass every new post and every new hashtag through this Top-K Heavy Keeper, then the Top-K is going to actually store and give you the top K hashtags that have appeared.

Now, I think what is important in this context is the observation that all these data structures, as well as others, are scalable, because essentially what they store are hashes. So never mind whether you throw 10K values at them or a petabyte of data, the memory and CPU consumption is predictable to some extent. This is a property you do not necessarily have with ordinary data structures; the sets and the sorted sets that Elena just mentioned are probably the two primary examples here. And sorry, a sorted set is not just a Redis-internal data structure; it is essentially a set where each and every element has a score attached to it. Think of it like a leaderboard or a recommendation list: it gives you an ordered sequence of the elements in a set. Just for the few listeners who do not know what a sorted set is. Okay.
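For listeners who have not met sorted sets, here is a toy Python stand-in mimicking a few Redis command names (ZADD, ZINCRBY, and ZREVRANGE are real Redis commands; the class itself is only an illustration, not how Redis implements sorted sets internally):

```python
# A toy stand-in for a Redis sorted set: every member carries a score,
# and members can always be read back ordered by that score.
class SortedSet:
    def __init__(self):
        self.scores = {}                 # member -> score

    def zadd(self, member, score):       # like Redis ZADD
        self.scores[member] = score

    def zincrby(self, member, delta):    # like Redis ZINCRBY
        self.scores[member] = self.scores.get(member, 0) + delta

    def zrevrange(self, start, stop):    # like Redis ZREVRANGE: highest first
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

board = SortedSet()
board.zadd("alice", 120)
board.zadd("bob", 95)
board.zincrby("carol", 150)
print(board.zrevrange(0, 2))   # leaderboard order: carol, alice, bob
```

Real Redis keeps a skiplist plus a hash map so that both lookups and ranked reads stay fast; this sketch re-sorts on every read, which is fine for a demonstration but not for production.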
Sorry, sorry, one question. I was going to ask: does this one have an error rate attached to it as well? The Top-K? Yeah, it does. It's an estimation, because with the Top-K there is also a decay algorithm. So it's not the top K since the Top-K was instantiated; it's the heavy hitters over whatever recent window. There is a decay algorithm; at the moment I don't remember exactly how it works. I would have to reread my own article, because it's been over half a year since I checked.
But the analogy that I had for this is, they call them elephant flows, the heavy hitters. Imagine you have a field and five elephants passing through it, one behind the other; they're going to leave a trail. But then you have ten elephants going another way; they're going to leave another trail. And then maybe you have a whole herd that goes in the opposite direction. After they're gone, the tracks you're going to see in the field are mostly the ones from the full herd. The ones from the five or ten elephants are going to be mostly erased and not visible anymore, because of the full herd that passed there. So it's not so much time-related as volume-related. But then, after a while, the wind blows, you have sandstorms or any kind of storms, and even the track of the whole herd starts to fade out, and newer tracks start to form. Top-K can be visualized like that, for the people who like to visualize things.
And you mentioned that there are actually two types of probabilistic filters in Redis, namely the Bloom one and the Cuckoo one. Maybe you can shed some light on the commonalities and the differences. Yeah. Well, Cuckoo filters also enable you to check if an element is present in a set, also using a very small memory space of fixed size; they sacrifice some precision for it. But they're implemented completely differently. For some cases they are maybe faster for checking, but not faster for adding elements to the filter; there are some particularities. But the biggest difference here is that from a Cuckoo filter you can delete elements. You cannot delete elements from a Bloom filter. So that is one big difference. The second, maybe more subtle difference is that for Cuckoo filters you only get discrete error rates that you can set, so maybe 0.04 or 0.07, depending on the implementation, but specifically for the implementation in the RedisBloom module. You cannot choose any rate you want; you have a set of discrete error rates that you can choose from. Another thing is that, even though it's maybe faster for some use cases, it's going to fill up faster if you add the same element to it twice. In Bloom filters, if you add the same element a second, third, and nth time, the Bloom filter is not going to change; nothing is going to change inside of it. With the Cuckoo filter, it will, and you're going to end up with that element twice or three times.
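The behavioural difference just described, that repeated adds are a no-op for a Bloom filter while a Cuckoo filter stores the fingerprint again, is easy to see with a minimal Bloom filter. This is a toy sketch, not the RedisBloom code; the bit count and hash scheme are arbitrary choices for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership with false positives, no deletion."""

    def __init__(self, bits=8192, hashes=5):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)

    def _positions(self, item):
        # One bit position per seeded hash of the item.
        for seed in range(self.hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos] = 1

    def might_contain(self, item):
        return all(self.array[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user@example.com")
snapshot = bytes(bf.array)
bf.add("user@example.com")           # re-adding sets the same bits again...
assert bytes(bf.array) == snapshot   # ...so the filter does not fill up
print(bf.might_contain("user@example.com"))   # True
print(bf.might_contain("other@example.com"))  # almost certainly False
```

Deletion is impossible here precisely because bits are shared: clearing the bits of one element could silently remove others, which is the gap the Cuckoo filter's per-element fingerprints close.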
Okay. I assume all the implementations of these modules are open source. Not all of them; I think there's a source-available license, not open source, on the RedisBloom module, but the difference is there. I don't want to go into that discussion. Well, I don't know if you've ever talked about the source-available license. No, I haven't, and there's a reason for that: this is just not a licensing podcast. But, full disclosure, there actually will be an upcoming episode on open source licenses, summer time frame, so stay tuned, people, if you're interested in open source licenses, or if you can't get to sleep at night; just don't miss that episode. But this is not something that we will go into tonight. The main point is that the source code of these implementations is available on GitHub, whatever
the use case, of course, may be a different one, but if you want to take a look at how it's done internally and how it's implemented, feel free. And I think, speaking of modules, the main implementation language is still C, right? But I think there's one module implemented in Rust these days, and I think it's the JSON one, if I'm not completely mistaken. Yes, I heard so. Okay. Has this been released? Yes, it has, and I think some people already use it in production. This is, of course, a shameless plug for Rust. Full disclosure: yes, there will be a Linux Inlaws podcast episode on Rust, but that's now enough with the commercial breaks. Let's continue with the probabilistic data structures. Okay, Martin, I wouldn't say final thoughts, but any thoughts or questions on them? Well, I'd like to go back to the opening statement that there were four different types.
Have we covered all of them? We gave an example each for membership, cardinality, and frequency; we didn't give one for similarity. I don't know any implementation personally, and that's why I don't want to go there, but similarity in general can estimate how similar different elements in a set are. Okay. So, are you familiar with the roadmap for RedisBloom? Perhaps. I mean, it's been around for a couple of years at least, and I think it's pretty stable; I haven't seen any big developments in there. Well, full disclosure, I've gotten in touch with regards to quantum hashing integration for this, but I'm not too sure what the current status is; better check with product management. Full disclosure, people: that was a joke.
An active area of research, if I understood the scientists correctly. And Martin, anything beyond the GPU database field? I'm afraid not. Okay, the last question that I have: this whole thing, although it sounds very theoretical, never mind the maths behind it and all the rest of it, Netflix and friends must be just the tip of the iceberg. I reckon that from a practical use case perspective there must be a multitude of applications basically making use of this. I think so. I don't know how much I can talk about what I've seen in real life, but we just touched on a few use cases. Imagine a Google, let's say the Google sign-up flow: every time you try to create a new Google address, it has to check if that email already exists. There you have it, another use case for a Bloom filter: has this email address been used already? Or a dictionary: is this word that I just wrote in the dictionary? Is it correctly spelled? That was actually one of the primary use cases for the Bloom filter.
Which is quite an example, because as we all know, languages can become quite expansive. Yes, even natural languages, outside the programming languages realm. Yeah. I mean, for correct spelling it's at least a very basic way to highlight a word that doesn't exist in the dictionary, if you look at it like that. Sounds like a very good use case indeed. Any final thoughts before we go into the Pox and the Anti-Pox?
No, just this, about the probabilistic data structures realm: once you get to learn what they can do, you really start to fall in love with them. You start to see applications everywhere. And it's good; it's a very good tool to have in your developer tool belt. But I don't know, where do you see this going? I mean, I'm not talking about the details now. I'm just talking about it from an application perspective, or maybe even from an implementation perspective, in terms of the development of this Bloom module. Anything goes.
Where humanity is going, for example: to Mars. Or whether the cartels will flip a coin; you never know. No, I'm joking. But if you have any kind of visionary thoughts, now is the time.
Well, if the direction is more and more data and more and more real-time requirements, then I think that we are going to see many more new algorithms for probabilistic data structures. We're going to see new implementations that we hadn't thought possible, and probably even more uses for the current algorithms, and improvements. Even the Bloom filter algorithm has already been improved multiple times; there are multiple variants of it. I mean, this is the beauty of math, right? Similar to computer science, it never sleeps, because people are basically just wondering about the current state of things, taking it apart, putting it back together again, and improving it.
Ah, engineers. Funnily enough, this is the very core idea of something called open source, if I'm not completely mistaken. Okay, can we get some of those engineers on Jitsi, perhaps? Jitsi, if you're listening, please get in touch. Yeah, it didn't break down this time. Yes, but you'll probably have to cut this out.
Okay, guys, that has been a very fun episode. Never mind the missing math, but we don't want to bore all the listeners to death, so this was actually very light on math. Needless to say, there is a complex mathematical foundation behind all of this; the details, of course, are in the show notes. Probably tricky to do on a podcast. Yes, exactly. And with that thought, as tradition demands, now's the time to go into the Pox. And Elena, I'm not sure if you're familiar with the concept: the Pox are the picks of the week. This acronym is a registered trademark of Linux Inlaws.
So anybody trying to rip this off: our lawyers will be in touch. And anything goes, anything that crossed your mind worth mentioning, Elena, for example something that you really liked that has crossed your path recently, and then we take turns. Can you say it again? What is a Pox? Yes, anything that you found, anything that came across positively. Okay. I mean, we normally confine ourselves to movies these days, but anything goes: it can be a movie, a TV series, music, a book, you name it.
Hey, you know what? It was the sun. I hadn't been out; I'm in Portugal and we have been in lockdown for a while. And it's crazy, but after a few weeks of not going out for walks in nature at all, I finally went out for a walk in the sun. And it was finally also a sunny day, after many weeks of rain. And it was weird: I just felt that I sprang right back to life, and I realized how much the sun and being out in nature make a really big difference in our lives. Certainly a valid Pox. No, that's fine, anything goes, as much as that. Does it ever rain where you are, in Portugal?
We'll come to it in a minute. Okay, full disclosure: I had the very same experience today as Elena, because I took literally an hour-long break and actually did some cycling in Frankfurt. It was about 15 degrees, but the sun was shining, and that in the current situation; Germany is still under lockdown. Yes, a welcome diversion. But over to you, Martin. Well, after today, I think my Pox of the week is Redis probabilistic data structures, because it's a lot of fun. It's a lot of things I can use, and it's a beautiful piece of, I guess, algorithm work, most of these. Yeah. Okay, Anti-Pox, if there are any.
I'm going to say rain. Rain. Great minds think alike, because I was just about to say the weather forecast: the forecast actually predicts a return of winter, come tomorrow, for central Frankfurt where I'm living. So okay, fair enough. And any thoughts on what would possibly be an Anti-Pox? Puppies. Sorry, say again? Puppies. Everyone loves puppies. Okay, you probably have to explain why. The flowers or the dogs? No, the dogs, the big dogs, puppies. I don't know if I want to explain exactly why; it might be too much information. I'll take it, and we don't have to open that up. You still have only one, right? Yeah. Okay, we will probably leave it at that. Okay, guys,
Elena, that has been really fun. Well, thank you very much for joining this. You're more than welcome back; we hope to have you back, or Olga for that matter, in an upcoming episode very soon, and we're really looking forward to it. Thank you. Thank you, guys. Yeah, thanks. Thank you. This is Linux Inlaws. You come for the knowledge, but stay for the madness. Thank you for listening.
This podcast is licensed under the latest version of the Creative Commons license, type Attribution ShareAlike. Credits for the intro music go to Blue Zero Stirs for the song Market, to Twin Flames for their piece called The Flow, used for the second intros, and finally to Select Your Ground for the songs used by the dark side. You find these and other ditties licensed under CC at Jamendo, a website dedicated to liberating the music industry from choking copyright legislation and other crap concepts.
You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.