Episode: 3309
Title: HPR3309: Linux Inlaws S01E27: The Big Uncertainties in Life and beyond
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3309/hpr3309.mp3
Transcribed: 2025-10-24 20:35:39

---
This is Hacker Public Radio Episode 3309 for Thursday, the 8th of April 2021.

Today's show is entitled "Linux Inlaws S01E27: The Big Uncertainties in Life and beyond". It is hosted by monochromec, is about 57 minutes long, and carries an explicit flag. The summary is: the two chaps discuss uncertainties and beyond in this episode on probabilistic data structures.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

This is Linux Inlaws, a podcast on topics around free and open source software, any associated contraband, communism, the revolution in general, and whatever else fancies your tickle. Please note that this and other episodes may contain strong language, offensive humor, and other certainly not politically correct language. You have been warned. Our parents insisted on this disclaimer. Happy, mum? Thus the content is not suitable for consumption in the workplace, especially when played back on a speaker in an open plan office or similar environments, by any minors under the age of 35, or by any pets, including fluffy little killer bunnies, your trusted guide dog (unless on speed), and cute T-Rexes or other associated dinosaurs.
|
||
|
|
Welcome to Linux Inlaws, season one, episode... I can't even remember which episode it is. Episode... where are we, vaguely? Maybe it's episode forty-something. Which is of course a segue to today's subject, namely probabilistic data structures. And for this subject we have none other than Olga — sorry, Elena. You probably remember her as the peasant girl back in the Halloween special of something called Linux Inlaws. If you haven't listened to it, dear listeners, please go back; you'll find the link on the website. It's an episode not to be missed, necessarily. Yes, exactly.

So, when Elena is not working as a voice double, she's actually working at a company called... well — maybe, Olga, especially Elena, why don't you introduce yourself properly? Wow, you already know Olga, so I'm just going to summon my Elena persona. Please do, yes. Okay, anything else we should be aware of here? That's me — my Elena persona is pretty, pretty stable. Although I have different roles: well, I have my professional role, of course, I have the role of a mom, I have the role of a friend — but in this context, let's talk about the professional Elena. So I work for Redis Labs as a technical enablement... oh no, sorry, now senior technical enablement architect. Well done, by the way, before I forget this. Thank you, thank you very much.

Technical enablement architect basically means we help the technical field — everyone working in the field in technical roles — to, first of all, get up to speed with the system and overall architecture when they join the company, and then also keep their knowledge fresh by releasing new trainings about all the different technical aspects of our system. So that's what I do professionally. Outside of that, a thing that's really close to my heart is the women in tech question. But personally, I am more the type of woman who, instead of going on Twitter and complaining about what someone did, likes to roll up her sleeves and get some work done — do some coding, or just become a better engineer in order to be a stronger woman in tech, a woman in technology. So I had done some work previously in that field, and a while ago I was invited by the Women in Tech global organization to be the global education program director. So now we're working on that. We have some partnerships with Cisco and Microsoft, and we're going to be doing some free courses for women and people who identify as female, or minorities in technology. Wow. So yeah, coding. That's my thing.
|
||
|
|
Excellent. Okay. For those of you who don't know: I first met Elena, I think it was in November or December 2019 in London, when you gave a presentation on something called probabilistic data structures for an in-memory NoSQL tool called Redis. Yes. And this will be today's subject, as in: what really are probabilistic data structures, and why are they important? Not just in a Redis context, but generally speaking. But before we go into these technical details, maybe it's worth explaining what Redis is in an in-memory NoSQL context, given that two-thirds of today's speakers on this podcast work or have worked for Redis — one defected quite some time ago. So let's do this jointly.

Redis is about ten years old. I think it holds the seventh rank on a website called DB-Engines, if I'm not completely mistaken. And if Stack Overflow is anything to go by, it's been the most loved database for, I think, at least four years in a row in terms of votes. The main differentiator — that's probably the term I'm looking for — is that, in contrast to other databases, Redis does it all in main memory. Yes, it supports persistence, but the main focus of the processing of data is doing it in memory, and hence this kind of playground of real-time performance. But why should I do all the explaining?
|
||
|
|
Elena, Martin, would you care to chime in? Well, we started really well. Redis is very fast, it's very loved by developers, and I can speak for myself at least: one of the things that made me fall in love with Redis, even before I knew about Redis Labs — so before I joined Redis Labs — was the efficiency, and the fact that it doesn't necessarily stick to all the academic talk. Many times it just uses some approximations to get things done and make them work very well in 99% of the cases for people. So, this being an episode on probabilistic data structures: even in Redis itself we have a lot of approximations — if we talk about the LFU and LRU eviction policies in Redis, they use approximations too. And I kind of like that efficiency: a small memory footprint, very fast; it does things well and fast. Martin, anything to add? Yeah, the one thing I think, for people not familiar with Redis, is that it's basically a bunch of data structures that you use for different purposes. As opposed to your relational database, which has tables and those kinds of structures, it's a bunch of different data structures which are very close to programming paradigms instead. So it's more of a building-block type of piece of technology, right? Interesting observation there. Yes. A bunch of Legos for us to play with. Yeah, that's actually a very good image. Okay, but enough about Redis details.
|
||
|
|
Elena, what exactly are probabilistic data structures, and why are they so important? Okay, so what exactly are probabilistic data structures? It's a group of data structures that give a reasonable approximation, but using just a fraction of the time and memory that the deterministic data structure would use. They usually use hash functions to randomize and compactly represent the set of items, and then collisions are ignored, which usually leads to some margin of error.

Before we go any — sorry, I mean before we go any further, I think we cannot assume that everybody knows what hash functions and collisions are. So, hash functions: let's say you have a value and you hash it — I don't know how deep we should go into explaining how hash functions work — but you have that value, and many different values can have the same hash. And that would be a hash collision. I like to use this analogy, for people who don't know what a hash function is: if the object that we have is, let's say, a real-life object, a hash of that object is its shadow. So the shadow of that object would be its hash. It kind of represents, it kind of is a silhouette of what was there. We can tell something about it, but you cannot know exactly what it is, and many different objects — let's say a ball, someone's head, or a lamp — can have the same shape of shadow, yet be, in that way, different objects. Very much so. I like the image, I like the comparison there.
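To make the analogy concrete, here is a small Python sketch (not from the episode): a hash "shadow" is deterministic but lossy, so many distinct inputs can fall into the same small bucket.

```python
import hashlib

def shadow(obj: str, buckets: int = 8) -> int:
    """Project an object onto a tiny 'canvas' of `buckets` slots.

    The projection is deterministic (same input, same output) but lossy:
    distinct objects can share a slot -- that is a hash collision.
    """
    digest = hashlib.sha256(obj.encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

# Deterministic: hashing the same value twice gives the same shadow.
assert shadow("ball") == shadow("ball")

# With only 8 buckets, some of these 9 distinct objects are bound to
# collide (pigeonhole principle: 9 objects, 8 slots).
objects = ["ball", "head", "lamp", "dog", "cat", "tree", "car", "book", "cup"]
shadows = [shadow(o) for o in objects]
assert len(set(shadows)) < len(objects)  # at least one collision
```

A larger canvas (more buckets) makes collisions rarer but never impossible, which is exactly the trade-off the probabilistic structures below exploit.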
|
||
|
|
That's — I like the analogy, let's put it this way. Beautiful. Thank you. So yeah, in probabilistic data structures you use that kind of thing: you represent the elements with their shadows, basically.

Now, by the way, before I forget, a very practical use of hashes: if you look into a Linux system, the passwords stored in /etc/passwd or in /etc/shadow are actually salted hashes. I won't go into the gory details, but this is the primary touch point: on a Linux-based system, when you log in and type in your password, the password is not stored in clear text in /etc/passwd or in /etc/shadow, but rather as a hash.
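A sketch (not from the episode) of the login comparison Chris describes: hash the typed password with the stored salt and compare hashes; the clear text is never kept. The password, salt, and iteration count here are illustrative values.

```python
import hashlib
import os

def hash_password(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a salted hash; only this (plus the salt) is stored on disk."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)

# Account creation: store (salt, hash), never the password itself.
salt = os.urandom(16)
stored = hash_password("correct horse", salt)

# Login: hash whatever the user typed, with the same salt, and compare.
assert hash_password("correct horse", salt) == stored   # right password
assert hash_password("wrong password", salt) != stored  # wrong password
```

The salt ensures two users with the same password still get different stored hashes, which is why it lives alongside the hash in /etc/shadow.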
|
||
|
|
Not that this has much to do with probabilistic data structures. Thank you, Martin — so let's cut this short: this is just another application of hashes. The idea is basically: you enter a password, this password is then converted by a hash function into a hash, and then the hashes are compared — these hashes should simply match. Sorry — yeah, it's still connected: nothing to do with probabilistic data structures as such, but they still use a deterministic function. A deterministic hash function means that every time you run it with the same input, it's going to give you the same result, the same output. Actually, I'm not sure that is true for passwords, to be honest — it depends on the system. It should be, at least for PDS — can I say PDS? Probabilistic data structures. Thank you. Yes. There it is definitely true, and crucial to their functioning.

Now, why would someone use probabilistic data structures? You might ask: okay, well, why would I want to sacrifice precision? We are developers. We work with exact things. It's an exact science. I want my results always correct. But there are cases where you would sacrifice some precision, some accuracy, if you can gain space or time. So there is this triangle — the triangle of space, time, and accuracy in data processing — where you have space, accuracy, and time on the three corners of the triangle. And you cannot have all three with one data structure. Either you have accuracy and low memory, but then you sacrifice time, so you don't get real-time results; or you have space and time, as in the case of probabilistic data structures: you save on space, you get good time performance, but you sacrifice accuracy. You cannot have all three.
|
||
|
|
Exactly. This is also known as the Kolevska theorem, right, if I'm not completely mistaken? No, I don't think so. Maybe I'm wrong, I don't know. Maybe there were some Kolevskas who lived long before me. There are mysterious people in the PDS world, though. We can talk about that — the Bloom guy, behind one of the most famous structures, the Bloom filter. He's completely mysterious. You can't find anything about him online. There are no photos. We don't know if he's still alive or not. He's mysterious.

Yes — let's park that for about two and a half minutes before we go into the subject. But okay, Martin: any thoughts on PDSs before we move on? Well, we're not moving off PDSs, are we? I meant in terms of the principles and stuff and whatever — the theory,
|
||
|
|
that is. Yeah, no, that's very well put about the trade-offs, right? It's a very different approach: if you look at it from the angle of other databases, other technologies, here you effectively compute a data structure at storage time, rather than storing all the data as-is and then calculating a result out of it afterwards. And that goes especially for use cases where you quickly run into either time or space problems — having to store terabytes, petabytes, exabytes of data, where simply main memory, never mind secondary storage, doesn't measure up. The same goes for run time. And this is where PDSs, if you are willing to trade accuracy against these kinds of metrics, really make sense. Exactly, exactly. And it's becoming even more relevant because nowadays we have the rise of big data. So we need to store a lot of data, but we also want it in real time. Now, how can we do that? I have an idea: use a GPU database. Oh, dear. Anyway, let's not go there today.
|
||
|
|
Full disclosure: Martin works at Berkeley, the GPU database shop. Richard, if you're listening, the email address is sponsor@linuxinlaws.eu — Martin will send you the details. Okay, sorry — end of commercial break; please do continue.

So yeah, another use case is the exact opposite: not big data this time, but very limited memory on some devices, like routers or maybe some IoT devices. You have a very, very limited memory space, so you cannot store a bunch of data there against which to compare things — to check, let's say, whether a member is present in a set. And I guess also, for some questions you don't need the exact answer: you don't need to know whether it was one million and five or one million and twenty-three; you care whether it's a million or a hundred thousand. Yeah, exactly. If you want, we can already talk about Bloom filters, which are one of the simpler ones, but they're pretty cool, and they have some really nice use cases, so it becomes a little bit more obvious to the audience.
|
||
|
|
By all means — and there's no right or wrong here, yes. Okay. So, there are four main families of probabilistic data structures: membership, cardinality, frequency, and similarity. Membership asks: is this member present in a set? That's all it does. The Bloom filter is one of the probabilistic data structures in this family — together with cuckoo filters and the different variants of Bloom filters, these are its most prominent members.

So, a Bloom filter can answer a question like, for example, in the financial vertical: has this user paid from this location before? Or: has this credit card been reported as stolen? Imagine how many transactions are done in the world every day, every minute, every second, with how many different card numbers. Every one of those transactions, I'm assuming, needs to be checked against whether the card has been reported as stolen. If all of those checks have to go to some main database, some relational database — first of all, imagine the size of the database where we store all those stolen credit cards. Then that database has to be shared between the different payment processors in the world. It needs to be updated all the time — in real time, essentially, right? And on top of that you even have the problem of storing those numbers, so the problem of security: you need to actually store those credit card numbers so you can compare whether this new card that just paid is part of those numbers or not.
|
||
|
|
So in this specific use case, Bloom filters are a perfect match. You can populate a Bloom filter by adding: you take your list of all the cards that have been reported as stolen, you hash them, and you put them in the Bloom filter. And now, every time someone wants to check whether a card number has been stolen, they just take that number, hash it, and compare it to what we have in the Bloom filter. The Bloom filter can give you two responses: yes or no. If it gives you the response no, you can definitely trust that response — that card has definitely not been reported stolen. If it gives you the response yes, it means: okay, this might have been stolen, but maybe not. If we go back to the shadows analogy: you are asking, is this ball present in the set? The filter looks, finds a shadow that looks like a ball, and says: yeah, I see a shadow of a ball, I think it is present — but actually it was not a ball, it was a lamp, or it was someone's head. Exactly. But if it cannot find any shape that looks like a ball, it can definitely say: no, for sure, there are no balls in this set. So, no false negatives — if you throw a triangle at the set, it will definitely say: sorry, not present. Exactly. So for this specific use case, when you're checking whether a card has been stolen, the most valuable response to you is no, because in most of the transactions that you're going to make, the response is going to be no, right? Okay. And just by knowing that for sure, whenever your Bloom filter answers no, you're prevented from having to go to the main database to check. So we save a lot of load on some main database somewhere.
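A minimal Bloom filter sketch in Python (an illustration, not the RedisBloom implementation): k hash functions set k bits per added item; a lookup can return a false positive, but never a false negative. The card numbers are made up.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 4):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive k independent positions by salting the hash with the index.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False => definitely absent; True => only *possibly* present.
        return all(self.bits[pos] for pos in self._positions(item))

stolen = BloomFilter()
for card in ["4929-1111", "4929-2222", "4929-3333"]:
    stolen.add(card)

assert stolen.might_contain("4929-2222")  # an added item always answers True
if not stolen.might_contain("4929-9999"):
    print("definitely not stolen -- skip the trip to the main database")
```

Note the `if` on the last check: a card that was never added will almost always come back `False`, but the filter is allowed to say `True` by mistake, which is exactly the yes-means-maybe behavior described above.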
|
||
|
|
Another use case would be Netflix, right? Netflix — it's one of these hipster video networks, yes. Netflix uses caches all over the place, funnily enough, and if I understand this correctly, what Netflix essentially does in that context is cache chunks of video information in local caches. So when an endpoint, like a mobile device or a browser, goes back to the middle tier asking, "hey, I need this chunk of video, is it present in your cache?", all you have to do is apply a Bloom filter. If it's not present, you simply go upstream and get it; otherwise you can stream it directly from the cache. Exactly.

I have a question. How would that lookup compare to, say, an index lookup on a radix tree, in terms of time cost? It's O(1), if I remember correctly, because you just have to compute the hash function, right? Exactly — it doesn't depend on anything; you just need to compute the hash function. And not just that: the difference between the memory needed to store the Bloom filter and the memory needed to store all of the stolen cards as full strings or numbers, whatever it is, is very big. And imagine you then also need to propagate this further: if your full data structure containing all of the stolen cards is two megabytes or two hundred gigabytes, and you need to sync that between regions so everyone can have good latency — it really does make a huge difference, right? Makes sense. Makes sense, Martin? Yes, of course. Sorry. It's a lot of stolen cards, by the way. Martin is pondering the possibility of going into credit card fraud wholesale. Martin, keep it going — got to know the variables here.
|
||
|
|
That's proprietary, I'm afraid. Okay — very, very interesting and useful information there, especially with regard to the use case, because fraud detection is becoming much more of a problem than, say, 10, 15, 20 years ago, because of the rapid move into online business — never mind the other big driver there. Corona, sorry, yes. Corona or not, more and more people are buying stuff online, and most of the time they use some sort of credit card or debit card payment method. So fraud detection is becoming more and more important as we speak, at this very point in time, because if you're a merchant or a payment processing company, you want to make sure that you are not being ripped off. Goes without saying.

Yes, exactly. And there are many other use cases. Mostly, I think it's the better fit when the meaningful answer to your question is no. If you get a yes from the Bloom filter, that's still okay — you just know that in this case you're going to have to go and hit the main database, whatever that may mean. But all those no answers prevented you from going to the main database all the time. Better latency, of course. Just curious, Elena — because the whole thing
|
||
|
|
rides on the hashing function: what would be the typical probabilities, let's put it this way, of your average PDS implementation with regard to accuracy? As I said, at the end of the day it depends on the implementation of the hashing function — but what is your experience with regard to the probabilities coming back, if you can talk about this? Yeah, of course. That should be totally configurable. Everyone can implement their own Bloom filter; that's really not a problem.
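For anyone who does roll their own, the standard Bloom filter sizing formulas (textbook analysis, nothing Redis-specific) tie the expected number of items n and the target false-positive rate p to the bit-array size m and the number of hash functions k:

```python
import math

def bloom_parameters(n_items: int, error_rate: float):
    """Optimal Bloom filter sizing for n items at a target false-positive rate.

    m = -n * ln(p) / (ln 2)^2   bits in the array
    k =  (m / n) * ln 2         number of hash functions
    """
    m = math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# One million card numbers at a 1% false-positive rate:
m, k = bloom_parameters(1_000_000, 0.01)
print(f"{m} bits (~{m // 8 // 1024} KiB), {k} hash functions")
```

Roughly 9.6 bits per item buys a 1% error rate regardless of how long the card numbers themselves are, which is where the memory savings discussed above come from; a tighter error rate grows the array only logarithmically.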
|
||
|
|
The algorithm is out there; people can do it. And the accuracy is going to depend on the number of hash functions you use — which, of course, will in turn influence CPU usage — and on the size of your Bloom filter: the bigger the bit array, the more precise, the more accurate a filter you have; the smaller the bit array, the less so. Yeah, we should probably explain how a bit array comes into the picture. The bit array is the field, the canvas, onto which we project the shadows. If our canvas is one by one meter and we are projecting 20 shadows onto that one square meter, it's going to get blurry, and after a while you won't be able to recognize anything — everything will look like everything. If you project the same 20 elements onto a 10 by 10 canvas, they're going to be much more spaced out, and you'll be able to recognize them much more easily. Okay. Yeah — just to make the point clear, the size of the bit array, in this analogy, is the size of the canvas. Okay. You alluded to the whole thing earlier
|
||
|
|
on, but why are they called Bloom filters, and why is Bloom that much of a mystery? For me — I don't know why he chose, or whether he ever chose, to be a mystery — for me he's a mystery because I couldn't find anything about him on Google. He's the chap who essentially invented the algorithm behind this. Yeah. Yeah, I think in the seventies or something like that. First it was used for a dictionary, actually — to check words against the dictionary. Yeah. Full details, of course, including the quantum superposition of the Bloom filter, will be in the show notes — but do carry on, Elena. Well, that's it. I don't have anything else about the mysterious... Burton Howard Bloom, I think. Howard Bloom. Okay. Is he still alive, or...? I don't know. No one knows. Okay. When I was researching this a year or a year and a half ago, I couldn't find out anything. But he is alive, is he? I don't know — I don't know if I ever knew that, actually.

On the off chance — on the off chance, Mr. Bloom, if you're listening, the email address is feedback@linuxinlaws.eu. Please get in touch, so we can plan you in for an episode some time in the future. Okay. Going back to the topic at hand now.
|
||
|
|
The way I see it, essentially, is that there are a couple of implementations of Bloom filters — maybe you can elaborate on these. We have one implementation of Bloom filters, although we do have something quite specific to us — I don't know if it's patented, but among the people who worked on it are Ariel, and Carlos Baquero, who worked on a really striking paper on scalable Bloom filters. With these scalable Bloom filters you can just specify an error rate you want to keep. So you say: okay, I want my error rate to always stay below this value. And then, as you add more elements to the Bloom filter, it scales up — it keeps stacking filters one on top of the other — so it can keep your error rate at what you requested. That's pretty specific to Redis Labs, to the RedisBloom implementation.

But just to go one step back first: RedisBloom is a module for Redis. Redis implements a module API, where anyone can go ahead and create a module, you see, and extend the data structures. What's a module? Please be more specific. It's a piece of code, written in C, that uses the Redis module API, and with it you can extend Redis — you can even hijack some commands that are sent to the Redis server. So you load your Redis server with your module, and you can implement your own data structures. In Redis you have strings, you have hashes, lists — you have a few data structures; with a Redis module you can implement your own. And in the case of the RedisBloom module, there are four new data structures that were implemented in that module: the Bloom filter; the cuckoo filter, which is similar to Bloom but implemented quite differently, let's say, with a similar use case — it's from the membership family; the Count-Min Sketch; and the Top-K heavy keeper. Those are the four data structures we have. You mentioned there were four different families, for different purposes — so the Bloom filters are membership, I take it.
|
||
|
|
What are the others for? Yes. So, cardinality — that is another family. Cardinality is for estimating the cardinality of a set. Membership determines whether a member is present in a set; cardinality determines the cardinality of a set, the number of distinct elements. Frequency, obviously, estimates the frequency of elements in a stream, and similarity determines the degree of similarity between elements. When would you use the other ones? Do you have some examples?

Okay, so for cardinality: in the RedisBloom module we don't have anything for cardinality, but we have the HyperLogLog, which is a native Redis data structure. We can talk a little bit about that, because it's another super cool one. With the HyperLogLog, in only 12 kilobytes of memory, you can estimate the cardinality of huge sets. Let's explain that with a use case.
|
||
|
|
Let's go with YouTube videos, for example. A YouTube video has a number of views. How would counting those views work conceptually? Well, let's go with IP addresses, or whatever kind of unique user identifier — maybe a user ID, maybe some combination of an IP with some identifier, browser, cookie, or something. So you have some kind of unique user ID. And every time that user ID views a video, you need to decide whether that user has viewed that video before. For you to be able to do that, you have to think: well, how can I know if they viewed this video before — unless I have a list of everyone who has ever viewed it, and I compare against that list? And the more viewers I have, the longer this list becomes; it becomes pretty much unmaintainable in the end. With the HyperLogLog, every time a new user comes in, you get their ID and just stick it in the HyperLogLog data structure. And every time you query that data structure — okay, how many unique elements have you seen? — it's going to tell you. And you can add huge numbers — at the moment, unfortunately, I don't remember the exact figures — and it still guarantees low errors. I think the margin of error was under one percent, but it's going to tell you: okay, I've seen this many unique elements. It's very interesting how it's implemented inside, but I'm not sure I'd be able to explain it very well here — it would be just words and not a visual explanation. Yeah, I think we're going to need that for the listeners to our podcast as well. The details, of course, will be in the show notes.
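In the meantime, here is a toy HyperLogLog in Python — an illustration only, far cruder than the Redis implementation: a few bits of each item's hash pick a register, each register remembers the deepest "rare" bit pattern it has seen, and a harmonic mean over the registers yields the estimate. The register count (256) and the "user-N" IDs are arbitrary choices for the demo.

```python
import hashlib
import math

class ToyHyperLogLog:
    def __init__(self, b: int = 8):
        self.b = b
        self.m = 1 << b                      # number of registers (256 here)
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)               # low b bits pick a register
        w = h >> self.b                      # remaining bits
        # rho = 1-based position of the lowest set bit: P(rho = k) = 2^-k
        rho = (w & -w).bit_length() if w else 57
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self) -> float:
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:    # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est

hll = ToyHyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")                     # 10,000 distinct IDs
print(round(hll.count()))                    # roughly 10,000, give or take a few percent
```

Adding the same ID twice changes no register, which is exactly why duplicate views don't inflate the count — and the whole state is 256 small integers, no matter how many IDs flow through.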
|
||
|
|
To keep track of all of them: there is actually a presentation I gave at UbuntuCon, where I spend a few minutes explaining how the HyperLogLog works internally — maybe we can link that for anyone who's interested. Yes, the details, of course, will be in the show notes, dear listeners.

Okay, so that's the cardinality family. The frequency family estimates with what frequency some element occurs in a stream, without having to store all the elements that have ever occurred in that stream. In the Redis domain we have the Count-Min Sketch and the Top-K, with its heavy-keeper algorithm — the Top-K is the more performant one. What it does is: you watch all the elements that show up in some stream, and you can see which elements appear the most — the top five, the top ten. A nice use case for this can be gaming: who are the key players with the high scores, when you can afford to sacrifice some precision. So the flow is the incoming game scores, and you could even store a separate sorted set holding the top K users. Every time a user scores points, it's added to the Top-K list.
|
||
|
|
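A rough sketch of that flow (not the actual Heavy Keeper algorithm from RedisBloom) can combine a Count-Min Sketch for frequency estimates with a small candidate dict holding the current top K. The stream contents and the width/depth parameters below are made up for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency estimator: may overcount, never undercounts."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a row-seeded hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

# Feed every score event through the sketch, keep only K candidates around.
sketch, top, K = CountMinSketch(), {}, 3
stream = ["alice"] * 50 + ["bob"] * 30 + ["carol"] * 20 + ["dave"] * 2
for player in stream:
    sketch.add(player)
    top[player] = sketch.estimate(player)
    if len(top) > K:                   # evict the weakest candidate
        top.pop(min(top, key=top.get))

print(sorted(top, key=top.get, reverse=True))
```

The point of the sketch is that its memory stays at width times depth counters regardless of how many events flow through it.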
Or trending hashtags: for social media platforms or news distribution networks, you can ask, what are the K hashtags people have mentioned the most in the last X hours? Imagine the flow of information, how many hashtags someone would otherwise have to store. But if you pass every new post and every new hashtag through this Top-K Heavy Keeper, then the Top-K is going to actually store and give you the top K hashtags that have appeared.

Now, I think what is important in this context is the observation that all these data structures, as well as others, are scalable, because essentially what they store are hashes. So never mind whether you throw 10K values at them or a petabyte of data, the memory and CPU consumption is predictable to some extent. This is a property you do not necessarily have with ordinary data structures; the sets and the sorted sets that Elena just mentioned are probably the two primary examples here. And sorry, a sorted set is not just a Redis-internal data structure; it is essentially a set where each and every element has a score attached to it. Think of it like a leaderboard or a recommendation list: it gives you an ordered sequence of the elements in a set. Just for the few listeners who do not know what a sorted set is. Okay.
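For listeners who have not met sorted sets, here is a toy Python stand-in mimicking a few Redis command names (ZADD, ZINCRBY, and ZREVRANGE are real Redis commands; the class itself is only an illustration, not how Redis implements sorted sets internally):

```python
# A toy stand-in for a Redis sorted set: every member carries a score,
# and members can always be read back ordered by that score.
class SortedSet:
    def __init__(self):
        self.scores = {}                 # member -> score

    def zadd(self, member, score):       # like Redis ZADD
        self.scores[member] = score

    def zincrby(self, member, delta):    # like Redis ZINCRBY
        self.scores[member] = self.scores.get(member, 0) + delta

    def zrevrange(self, start, stop):    # like Redis ZREVRANGE: highest first
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

board = SortedSet()
board.zadd("alice", 120)
board.zadd("bob", 95)
board.zincrby("carol", 150)
print(board.zrevrange(0, 2))   # leaderboard order: carol, alice, bob
```

Real Redis keeps a skiplist plus a hash map so that both lookups and ranked reads stay fast; this sketch re-sorts on every read, which is fine for a demonstration but not for production.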
Sorry, sorry, one question. I was going to ask: does this one have an error rate attached to it as well? The Top-K? Yeah, it does. It's an estimation, because with the Top-K there is also a decay algorithm. So it's not the top K since the Top-K was instantiated; it's the heavy hitters over whatever recent window. There is a decay algorithm; at the moment I don't remember exactly how it works. I would have to reread my own article, because it's been over half a year since I checked.
But the analogy that I had for this is, they call them elephant flows, the heavy hitters. Imagine you have a field and five elephants passing through it, one behind the other; they're going to leave a trail. But then you have ten elephants going another way; they're going to leave another trail. And then maybe you have a whole herd that goes in the opposite direction. After they're gone, the tracks you're going to see in the field are mostly the ones from the full herd. The ones from the five or ten elephants are going to be mostly erased and not visible anymore, because of the full herd that passed there. So it's not so much time-related as volume-related. But then, after a while, the wind blows, you have sandstorms or any kind of storms, and even the track of the whole herd starts to fade out, and newer tracks start to form. Top-K can be visualized like that, for the people who like to visualize things.
And you mentioned that there are actually two types of probabilistic filters in Redis, namely the Bloom one and the Cuckoo one. Maybe you can shed some light on the commonalities and the differences. Yeah. Well, Cuckoo filters also enable you to check if an element is present in a set, also using a very small memory space of fixed size; they sacrifice some precision for it. But they're implemented completely differently. For some cases they are maybe faster for checking, but not faster for adding elements to the filter; there are some particularities. But the biggest difference here is that from a Cuckoo filter you can delete elements. You cannot delete elements from a Bloom filter. So that is one big difference. The second, maybe more subtle difference is that for Cuckoo filters you only get discrete error rates that you can set, so maybe 0.04 or 0.07, depending on the implementation, but specifically for the implementation in the RedisBloom module. You cannot choose any rate you want; you have a set of discrete error rates that you can choose from. Another thing is that, even though it's maybe faster for some use cases, it's going to fill up faster if you add the same element to it twice. In Bloom filters, if you add the same element a second, third, and nth time, the Bloom filter is not going to change; nothing is going to change inside of it. With the Cuckoo filter, it will, and you're going to end up with that element twice or three times.
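The behavioural difference just described, that repeated adds are a no-op for a Bloom filter while a Cuckoo filter stores the fingerprint again, is easy to see with a minimal Bloom filter. This is a toy sketch, not the RedisBloom code; the bit count and hash scheme are arbitrary choices for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership with false positives, no deletion."""

    def __init__(self, bits=8192, hashes=5):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)

    def _positions(self, item):
        # One bit position per seeded hash of the item.
        for seed in range(self.hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos] = 1

    def might_contain(self, item):
        return all(self.array[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user@example.com")
snapshot = bytes(bf.array)
bf.add("user@example.com")           # re-adding sets the same bits again...
assert bytes(bf.array) == snapshot   # ...so the filter does not fill up
print(bf.might_contain("user@example.com"))   # True
print(bf.might_contain("other@example.com"))  # almost certainly False
```

Deletion is impossible here precisely because bits are shared: clearing the bits of one element could silently remove others, which is the gap the Cuckoo filter's per-element fingerprints close.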
Okay. I assume all the implementations of these modules are open source. Not all of them; I think there's a source-available license, not open source, on the RedisBloom module, but the difference is there. I don't want to go into that discussion. Well, I don't know if you've ever talked about the source-available license. No, I haven't, and there's a reason for that: this is just not a licensing podcast. But, full disclosure, there actually will be an upcoming episode on open source licenses, summer time frame, so stay tuned, people, if you're interested in open source licenses, or if you can't get to sleep at night; just don't miss that episode. But this is not something that we will go into tonight. The main point is that the source code of these implementations is available on GitHub, whatever
the use case, of course, may be a different one, but if you want to take a look at how it's done internally and how it's implemented, feel free. And I think, speaking of modules, the main implementation language is still C, right? But I think there's one module implemented in Rust these days, and I think it's the JSON one, if I'm not completely mistaken. Yes, I heard so. Okay. Has this been released? Yes, it has, and I think some people already use it in production. This is, of course, a shameless plug for Rust. Full disclosure: yes, there will be a Linux Inlaws podcast episode on Rust, but that's now enough with the commercial breaks. Let's continue with the probabilistic data structures. Okay, Martin, I wouldn't say final thoughts, but any thoughts or questions on them? Well, I'd like to go back to the opening statement that there were four different types.
Have we covered all of them? We gave an example each for membership, cardinality, and frequency; we didn't give one for similarity. I don't know any implementation personally, and that's why I don't want to go there, but similarity in general can estimate how similar different elements in a set are. Okay. So, are you familiar with the roadmap for RedisBloom? Perhaps. I mean, it's been around for a couple of years at least, and I think it's pretty stable; I haven't seen any big developments in there. Well, full disclosure, I've gotten in touch with regards to quantum hashing integration for this, but I'm not too sure what the current status is; better check with product management. Full disclosure, people: that was a joke.
An active area of research, if I understood the scientists correctly. And Martin, anything beyond the GPU database field? I'm afraid not. Okay, the last question that I have: this whole thing, although it sounds very theoretical, never mind the maths behind it and all the rest of it, Netflix and friends must be just the tip of the iceberg. I reckon that from a practical use case perspective there must be a multitude of applications basically making use of this. I think so. I don't know how much I can talk about what I've seen in real life, but we just touched on a few use cases. Imagine a Google, let's say the Google sign-up flow: every time you try to create a new Google address, it has to check if that email already exists. There you have it, another use case for a Bloom filter: has this email address been used already? Or a dictionary: is this word that I just wrote in the dictionary? Is it correctly spelled? That was actually one of the primary use cases for the Bloom filter.
Which is quite an example, because as we all know, languages can become quite expansive. Yes, even natural languages, outside the programming languages realm. Yeah. I mean, for correct spelling it's at least a very basic way to highlight a word that doesn't exist in the dictionary, if you look at it like that. Sounds like a very good use case indeed. Any final thoughts before we go into the Pox and the Anti-Pox?
No, just this, about the probabilistic data structures realm: once you get to learn what they can do, you really start to fall in love with them. You start to see applications everywhere. And it's good; it's a very good tool to have in your developer tool belt. But I don't know, where do you see this going? I mean, I'm not talking about the details now. I'm just talking about it from an application perspective, or maybe even from an implementation perspective, in terms of the development of this Bloom module. Anything goes.
Where humanity is going, for example: to Mars. Or whether the cartels will flip a coin; you never know. No, I'm joking. But if you have any kind of visionary thoughts, now is the time.
Well, if the direction is more and more data and more and more real-time requirements, then I think that we are going to see many more new algorithms for probabilistic data structures. We're going to see new implementations that we hadn't thought possible, and probably even more uses for the current algorithms, and improvements. Even the Bloom filter algorithm has already been improved multiple times; there are multiple variants of it. I mean, this is the beauty of math, right? Similar to computer science, it never sleeps, because people are basically just wondering about the current state of things, taking it apart, putting it back together again, and improving it.
Ah, engineers. Funnily enough, this is the very core idea of something called open source, if I'm not completely mistaken. Okay, can we get some of those engineers on Jitsi, perhaps? Jitsi, if you're listening, please get in touch. Yeah, it didn't break down this time. Yes, but you'll probably have to cut this out.
Okay, guys, that has been a very fun episode. Never mind the missing math, but we don't want to bore all the listeners to death, so this was actually very light on math. Needless to say, there is a complex mathematical foundation behind all of this; the details, of course, are in the show notes. Probably tricky to do on a podcast. Yes, exactly. And with that thought, as tradition demands, now's the time to go into the Pox. And Elena, I'm not sure if you're familiar with the concept: the Pox are the picks of the week. This acronym is a registered trademark of Linux Inlaws.
So anybody trying to rip this off: our lawyers will be in touch. And anything goes, anything that crossed your mind worth mentioning, Elena, for example something that you really liked that has crossed your path recently, and then we take turns. Can you say it again? What is a Pox? Yes, anything that you found, anything that came across positively. Okay. I mean, we normally confine ourselves to movies these days, but anything goes: it can be a movie, a TV series, music, a book, you name it.
Hey, you know what? It was the sun. I hadn't been out; I'm in Portugal and we have been in lockdown for a while. And it's crazy, but after a few weeks of not going out for walks in nature at all, I finally went out for a walk in the sun. And it was finally also a sunny day, after many weeks of rain. And it was weird: I just felt that I sprang right back to life, and I realized how much the sun and being out in nature make a really big difference in our lives. Certainly a valid Pox. No, that's fine, anything goes, as much as that. Does it ever rain where you are, in Portugal?
We'll come to it in a minute. Okay, full disclosure: I had the very same experience today as Elena, because I took literally an hour-long break and actually did some cycling in Frankfurt. It was about 15 degrees, but the sun was shining, and that in the current situation; Germany is still under lockdown. Yes, a welcome diversion. But over to you, Martin. Well, after today, I think my Pox of the week is Redis probabilistic data structures, because it's a lot of fun. It's a lot of things I can use, and it's a beautiful piece of, I guess, algorithm work, most of these. Yeah. Okay, Anti-Pox, if there are any.
I'm going to say rain. Rain. Great minds think alike, because I was just about to say the weather forecast: the forecast actually predicts a return of winter, come tomorrow, for central Frankfurt where I'm living. So okay, fair enough. And any thoughts on what would possibly be an Anti-Pox? Puppies. Sorry, say again? Puppies. Everyone loves puppies. Okay, you probably have to explain why. The flowers or the dogs? No, the dogs, the big dogs, puppies. I don't know if I want to explain exactly why; it might be too much information. I'll take it, and we don't have to open that up. You still have only one, right? Yeah. Okay, we will probably leave it at that. Okay, guys,
Elena, that has been really fun. Well, thank you very much for joining this. You're more than welcome back; we hope to have you back, or Olga for that matter, in an upcoming episode very soon, and we're really looking forward to it. Thank you. Thank you, guys. Yeah, thanks. Thank you. This is Linux Inlaws. You come for the knowledge, but stay for the madness. Thank you for listening.
This podcast is licensed under the latest version of the Creative Commons license, type Attribution ShareAlike. Credits for the intro music go to Blue Zero Stirs for the song Market, to Twin Flames for their piece called The Flow, used for the second intros, and finally to Select Your Ground for the songs used by the dark side. You find these and other ditties licensed under CC at Jamendo, a website dedicated to liberating the music industry from choking copyright legislation and other crap concepts.
You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.