Episode: 3309 Title: HPR3309: Linux Inlaws S01E27: The Big Uncertainties in Life and beyond Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3309/hpr3309.mp3 Transcribed: 2025-10-24 20:35:39 --- This is Hacker Public Radio Episode 3309 for Thursday, the 8th of April 2021. Today's show is entitled Linux Inlaws S01E27: The Big Uncertainties in Life and Beyond. It is hosted by monochromec, is about 57 minutes long and carries an explicit flag. The summary is: the two chaps discuss uncertainties and beyond in this episode on probabilistic data structures. This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com. This is Linux Inlaws, a podcast on topics around free and open source software, any associated contraband, communism, the revolution in general, and whatever else fancies your tickle. Please note that this and other episodes may contain strong language, offensive humor and other certainly not politically correct language. You have been warned. Our parents insisted on this disclaimer. Happy now, Mum? Thus the content is not suitable for consumption in the workplace, especially when played back on a speaker in an open-plan office or similar environments, by any minors under the age of 35 or any pets, including fluffy little killer bunnies, your trusted guide dog (unless on speed), and cute T-Rexes or other associated dinosaurs. Welcome to Linux Inlaws, season one, episode... I can't even remember which episode it is. Episode forty-something, maybe, vaguely. Which is of course a nice segue into today's subject, namely probabilistic data structures. And for this subject we have none other than Olga — sorry, Elena. You probably remember her as the peasant girl back in the Halloween special of something called Linux Inlaws. If you haven't listened to it, dear listeners, please go back; you'll find the link on the website. It's an episode not to be missed, necessarily. Yes, exactly. So when Elena is not working as a voice double, she's actually working at a company called Redis Labs. But maybe, Olga — sorry, Elena — why don't you introduce yourself properly? Wow, you already know Olga, so I'm just going to summon my Elena persona. Please do, yes. Okay, anyone else we should be aware of here? That's me — my Elena persona is pretty stable. Although I have different roles — well, I have my professional role, of course, I have the role of a mom, I have the role of a friend — but in this context let's talk about the professional Elena. So I work for Redis Labs as a technical enablement... oh no, sorry, now senior technical enablement architect. Well done, by the way, before I forget. Thank you, thank you very much. Technical enablement architect basically means we help the technical field — everyone working in the field in technical roles — to first of all get up to speed with the system and the overall architecture when they join the company, and then also keep their knowledge fresh by releasing new trainings about all the different technical aspects of our system. So that's what I do professionally.
Outside of that, a topic that's really close to my heart is the women in tech question. But personally I am more the type of woman who, instead of going on Twitter and complaining about what someone did, likes to roll up her sleeves and get some work done — do some coding, or just become a better engineer in order to be a stronger woman in tech, a woman in technology. So I had done some work previously in that field, and a while ago I was invited by the Women in Tech global organization to be their global education program director. So now we're working on that: we have some partnerships with Cisco and Microsoft, and we're going to be doing some free courses for women and people who identify as female or as a minority in technology. Wow. So yeah, coding — that's my thing. Excellent. Okay. For those of our listeners who don't know: I first met Elena, I think it was in November or December 2019 in London, when you gave a presentation on something called probabilistic data structures for an in-memory NoSQL database called Redis. Yes. And this would be today's subject: what exactly are probabilistic data structures and why are they important, not just in a Redis context but generally speaking. But before we go into these technical details, maybe it's worth explaining what Redis is in an in-memory NoSQL context, given the fact that two thirds of today's speakers on this podcast work or have worked for Redis — one defected quite some time ago. So let's do this jointly. Redis is about 10 years old. I think it holds the seventh rank on a website called DB-Engines, if I'm not completely mistaken. And if Stack Overflow is anything to go by, it has been voted the most loved database for, I think, at least four years in a row. The main differentiation — that's probably the best term I'm looking for — is that in contrast to other databases, Redis does it all in main memory. Yes, it supports persistence, but the main focus of the processing of data is actually doing this in memory, and hence this kind of playground of real-time performance. But why should I do all the explaining? Elena, Martin, care to chime in? Well, we started really well. Redis is very fast, it's very loved by developers, and speaking for myself, one of the things that made me fall in love with Redis even before I knew about Redis Labs — so before I joined Redis Labs — was the efficiency, and the fact that it doesn't necessarily stick to all the academic talk; many times it just uses approximations to get things done and make them work very well in 99% of the cases. So this is an episode on probabilistic data structures, but even in Redis itself we have a lot of approximations — if we talk about the LFU and LRU eviction policies in Redis, they use approximations too. And I kind of like that efficiency: a small memory footprint, very fast, it does things well and fast. Martin, anything to add? Yeah, I think the one thing, for people not familiar with Redis, is that it's basically a bunch of data structures that you use for different purposes. As opposed to your relational database, which has tables and those kinds of structures, it's a bunch of different data structures which are very close to programming paradigms instead. So it's more of a building-blocks type of piece of technology, right? Interesting observation there. Yes. A bunch of Legos for us to play with.
Yeah, that's actually a very good image there. Okay, but enough about Redis details. Elena, what exactly are probabilistic data structures and why are they so important? Okay, so what exactly are probabilistic data structures? It's a group of data structures that give a reasonable approximation of an answer, using just a fraction of the time and memory that a deterministic data structure would need. They usually use hash functions to randomize and compactly represent a set of items, and collisions are ignored, which usually leads to some margin of error. Before we go any further — sorry, I mean before we go further — I think we cannot assume that everybody knows what hash functions are and what collisions are. So, hash functions: let's say you take a value and you hash it — I don't know how deep we should go into explaining how hash functions work — but you have that value, and many different values can end up with the same hash. That would be a hash collision. I like to use this analogy, for people who don't know what a hash function is: if the object we have is, let's say, a real-life object, the hash of that is its shadow. So the shadow of that object is its hash. It is a kind of silhouette of what was there; we can tell something about it, but you cannot know exactly what it is, and many different objects — let's say a ball, someone's head, or a lamp — can cast the same shape of shadow while being very different objects. Very much so. I like the image, I like the comparison there — I like the analogy, let's put it this way. Beautiful. Thank you. So in probabilistic data structures you represent the elements with their shadows, basically. Now, by the way, before I forget, a very practical use of hashes: if you look into a Linux system, the passwords stored in /etc/passwd or in /etc/shadow are actually salted hashes. I won't go into the details, but this is the primary touch point: if you use a Linux-based system, when you log in and type in your password, the password is not stored in clear text in /etc/passwd or in /etc/shadow, but rather as a hash. Nothing to do with probabilistic data structures here, but to cut it short, this is just another application of a hash. The idea is basically: you enter a password, this password is then converted by a hash function to a hash, and then the hashes are compared. Should these hashes match, you're in, essentially. Sorry, but yeah, it's still connected — nothing to do with probabilistic data structures, but they still use a deterministic hash function. A deterministic hash function means that every time you run it with the same input, it's going to give you the same result, the same output. Actually, I'm not sure that is true for all password systems, to be honest — it depends on the system — but it should be, and it definitely is true for PDS. Can I use PDS? Probabilistic data structures. Thank you. Yes, for those it is definitely true and crucial to their functioning. Now, why would someone use probabilistic data structures, you would ask. Okay, well, why would I want to sacrifice precision, right? We are developers, we work with exact things, it's an exact science, I want my results to always be correct.
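To make the "shadow" analogy a bit more concrete, here is a tiny Python sketch (purely illustrative, not from the episode): a deterministic hash always maps the same input to the same output, and projecting it onto a small "canvas" of slots is exactly what makes collisions possible.

    # Illustrative only: deterministic hashing plus collisions on a small "canvas".
    import hashlib

    def shadow(item: str, canvas_size: int = 16) -> int:
        """Project an item onto a small canvas of canvas_size slots."""
        digest = hashlib.sha256(item.encode()).hexdigest()
        return int(digest, 16) % canvas_size  # many different items can land in the same slot

    assert shadow("ball") == shadow("ball")  # deterministic: same input, same output
    print(shadow("ball"), shadow("lamp"), shadow("head"))  # distinct objects may share a shadow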
But there are cases where you would sacrifice some precision, some accuracy, if you can gain space or time. So there is this thing called the triangle of space, time and accuracy in data processing, where you have space, accuracy and time as the three corners of the triangle, and you can only choose two. You cannot have all three with one data structure. So either you have accuracy and low memory, but then you sacrifice time, so you don't get real-time results; or, in the case of probabilistic data structures, you have space and time — you save on space and you get good time performance — but you sacrifice accuracy. You cannot have all three. Exactly. This is also known as the Kolevska theory, right, if I'm not completely mistaken? No, I don't think so. Maybe I'm wrong, I don't know — maybe there are some Kolevskas who lived long before me. Speaking of mysterious: there are mysterious people in the PDS world. But we can talk about that later — the Bloom guy, after whom one of the most famous structures is named, is completely mysterious. You can't find anything about him online, no photos; we don't know if he's still alive or not. He's mysterious. Yes, let's park that for about two and a half minutes before we go into that subject. But okay, Martin, any thoughts on PDSs before we move on? Well, we're not moving off PDSs, are we? I mean in terms of the principles and the theory, that is. Yeah, no, that's very well put about the trade-offs, right? It's a very different approach: if you look at it from the angle of different databases, different technologies, you're effectively, at storage time, computing a data structure rather than storing all the data as-is and then calculating a result out of it. Which goes especially for use cases where you quickly run into either time or space problems, in terms of having to store terabytes, gigabytes, petabytes, exabytes of data, where simply main memory — never mind secondary storage — doesn't measure up. Same goes for runtime, and this is where PDSs, if you are willing to trade accuracy against these kinds of metrics, really make sense. Exactly, exactly. And it's becoming even more relevant, because nowadays we have the rise of big data: we need to store a lot of data, but we also want it in real time. Now how can we do that? I have an idea: you use a GPU database. Oh, dear. Anyway, let's not go there today. Full disclosure: Martin works at Brytlyt, the GPU database shop. Richard, if you're listening, the email address is sponsor@linuxinlaws.eu — Martin will send you the details. Okay, sorry. Okay, but enough of the commercial break; please do continue. So yeah, another use case is the exact opposite: not big data, but where you have very limited memory on some devices, like routers or maybe some IoT devices. You have a very, very limited memory space, so you cannot store a bunch of data there to compare things against — let's say, to check whether a member is present in a set, for example. And I guess also for some questions you don't need the exact answer: you don't need to know whether it was one million five hundred and twenty-three; you just care whether it's a million or a hundred thousand. Yeah, exactly.
We can, if you want, already talk maybe about Bloom filters, which are one of the simpler ones, but they're pretty cool and they have some really nice use cases, so it becomes a little bit more obvious to the audience why this all matters. By all means, please go right ahead. Okay. So, there are four main families of probabilistic data structures: membership, cardinality, frequency and similarity. Membership asks: is this member present in a set? That's all it does. The Bloom filter is one of the probabilistic data structures in this family — maybe together with Cuckoo filters and the different variants of Bloom filters, those are the most prominent ones. So, a Bloom filter can answer questions like, for example, in the financial vertical: has this user paid from this location before? Or: has this credit card been reported as stolen? Imagine how many transactions are done in the world every day, every minute, every second, with how many different card numbers. Every one of those transactions needs to be checked against whether that card has been reported as stolen. If all of those checks need to go to some main database, some relational database — first of all, imagine the size of the database where we store all those stolen card numbers; then that database has to be shared between different payment processors in the world; it needs to be updated all the time. In real time, essentially, right? Because you're trying to catch it as it happens. And on top of that you even have the problem of storing those numbers — the problem of security — because you need to actually store the credit card numbers so you can compare whether this new card that just paid is part of that set or not. So in this specific use case, Bloom filters are a perfect match, because you can populate one Bloom filter by adding your list of all the cards that have been reported as stolen: you take them, hash them, put them in the Bloom filter. And now, every time someone wants to check whether a card number has been reported stolen, you just take that number, hash it, and compare it to what we have in the Bloom filter. The Bloom filter can give you two responses: yes or no. If it gives you the response no, you can definitely trust that response — that card has definitely not been reported stolen. If it gives you the response yes, it means that, okay, this might have been stolen, but maybe not. If we go back to the shadows analogy: you are asking, is this ball present in the set? The filter looks, finds a shadow that looks like a ball, and says: yeah, I see a shadow of a ball, I think it is present — but actually it was not a ball, it was a lamp, or it was someone's hand. Or a head or something, exactly. But if it cannot find any shape that looks like a ball, it can definitely say: no, for sure, there is no ball in this set. So no false negatives — if you throw a triangle at the set, it will definitely say: sorry, not present. Exactly. So in this case, for this specific use case of checking whether a card has been stolen, the most valuable response to you is no, because in most of the transactions you're going to make, the response is going to be no, right? Okay. And just knowing that for sure — if your Bloom filter answers no — prevents you from having to go to the main database to check.
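A minimal sketch of the data structure just described, in Python (illustrative only, not the RedisBloom implementation; the bit-array size, hash count and card numbers are made up): adding an item sets a handful of bit positions, and a lookup can only answer "maybe" or "definitely not".

    # Illustrative Bloom filter: "no" is always correct, "yes" may be a false positive.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1024, num_hashes: int = 4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = [False] * size_bits

        def _positions(self, item: str):
            # k independent "shadows" of the item on the bit-array canvas
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item: str) -> None:
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item: str) -> bool:
            return all(self.bits[pos] for pos in self._positions(item))

    stolen = BloomFilter()
    stolen.add("4111111111111111")
    print(stolen.might_contain("4111111111111111"))  # True: maybe stolen, verify in the main DB
    print(stolen.might_contain("5500000000000004"))  # False: definitely not in the set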
So we save a lot of load on the main database, some main database somewhere, whatever that may be. Another use case would be Netflix, right? Netflix, one of these hipster video networks — because Netflix, funnily enough, uses caches all over the place. And if I understand this correctly, what Netflix essentially does in that particular context is cache chunks of video information in local caches. So if an endpoint, like a mobile device or a browser, goes back to the middle tier asking, hey, I need this chunk of video, is it present in your cache? — all you have to do is apply a Bloom filter. If it's not present, you simply go upstream and get it; otherwise you can basically stream it directly from the cache, exactly. I have a question: how would that lookup compare to, say, an index lookup on a radix tree in terms of time cost? It's O(1), if I remember correctly. Because you just have to compute the hash function, right? Exactly — it doesn't depend on anything else, you just need to compute the hash functions. And not just that: the memory needed to store the Bloom filter, versus storing all of the stolen cards as full strings or numbers or whatever, differs enormously. And imagine you also need to propagate this further — whether your full data structure containing all of the stolen cards is two megabytes or two hundred gigabytes makes a real difference when you need to sync it between regions so everyone can have good latency. It really does make a huge difference, right? Makes sense. Makes sense, Martin? Yes, of course. Yes, sorry. Sorry. It's a lot of stolen cards, by the way. Martin is pondering the possibility of going into credit card fraud, I see — extra income, Martin? Keep it going; got to know the variables here. Sorry, this is proprietary, I'm afraid. Okay, very interesting and useful information there, especially with regards to the use case, because fraud detection is becoming much more of a problem than, say, 10, 15, 20 years ago, because of the rapid move into online business — never mind that other thing, what is that beer called again? Corona, sorry, yes. Corona or not, more and more people are buying stuff online, and most of the time they would use some sort of credit card or debit card payment method. So fraud detection is becoming more and more important as we speak, because if you're a merchant or a payment processing company, you want to make sure that you are not being ripped off. Goes without saying. Yes, exactly. And there are many other use cases. Mostly, I think it's the better fit when the meaningful answer to your question is no — whatever is going to save you the trip. If you get a yes from the Bloom filter, that's still okay: you just know that, in this case, I'm going to have to ping the main database, whatever that may be. But all those no answers prevented you from going to the main database all the time. Better latency, of course.
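For readers who want to see what this looks like in practice with the RedisBloom module discussed later in the episode, here is a small Python sketch using the redis-py client (it assumes a local Redis server with the RedisBloom module loaded; the key name, error rate, capacity and card numbers are invented for illustration):

    # Illustrative use of RedisBloom's Bloom filter commands via redis-py.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.execute_command("BF.RESERVE", "stolen_cards", 0.001, 1_000_000)  # target error rate, expected capacity
    r.execute_command("BF.ADD", "stolen_cards", "4111111111111111")
    print(r.execute_command("BF.EXISTS", "stolen_cards", "4111111111111111"))  # 1: maybe stolen, verify upstream
    print(r.execute_command("BF.EXISTS", "stolen_cards", "5500000000000004"))  # 0: definitely not reported stolen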
Just curious, Elena, because the whole thing rides on the hashing functions: what would be the typical probabilities, let's put it this way, of your average PDS implementation with regards to accuracy? As you said, at the end of the day it depends on the implementation of the hashing function, but what is your experience with the error rates you have seen so far, if you can talk about this? Yeah, of course. That should be totally configurable. Everyone can implement their own Bloom filter, that's really not a problem — the algorithm is out there, people can do it. And the accuracy is going to depend on the number of hash functions you use, which is of course going to influence CPU usage, and on the size of your Bloom filter. So the bigger the bit array, the more precise, the more accurate the filter; the smaller the bit array, the less so. Yeah, we should probably explain why a bit array now comes into the picture. The bit array is the field, the canvas, on which we are projecting the shadows. If our canvas is one by one meter and we are projecting 20 shadows onto that one-by-one-meter canvas, it's going to get blurry, and after a while you're not going to be able to recognize anything — everything is going to look like everything. If you project the same 20 elements onto a 10-by-10 canvas, they're going to be much more spaced out and you're going to be able to recognize them much more easily. Okay. Yeah, just to make the point clear: the size of the bit array, in this analogy, is the size of the canvas. Okay. You alluded to the whole thing earlier on, but why are they called Bloom filters, and why is Bloom that much of a mystery? For me, I don't know why he chose, or if he ever chose, to be a mystery; for me he's a mystery because I couldn't find anything about him on Google. He's the chap who invented essentially the algorithm behind this. Yeah, I think in the 70s or something like that. First it was used for a dictionary, actually, to check words against the dictionary. Yeah. Full details, of course, including the quantum superposition of the Bloom filter, will be in the show notes, but do carry on, Elena. Well, that's it — I don't have anything else about the mysterious, I think, Howard Bloom? Burton Howard Bloom. Okay. Is he still alive or isn't he? I don't know, no one knows. Okay. When I was researching that, a year or a year and a half ago, I couldn't find out anything. He's elusive, is he? Yeah, not responding at all. I don't know, maybe — I don't even know if I ever knew that. On the off chance — on the off chance, Mr. Bloom, if you're listening, the email address is feedback@linuxinlaws.eu, please get in touch, so we can plan you in for an episode coming along in the future. Okay. Going back to the corresponding Redis implementation now: the way I see it, essentially, there are a couple of implementations of Bloom filters — maybe you can elaborate on these. We have one implementation of Bloom filters, although we do have something quite specific to us. I don't know if it's patented, but the people who worked on it — Ariel, together with Dr. Carlos Baquero, who worked on the underlying research — based it on a really good paper on scalable Bloom filters. With these scalable Bloom filters you can just specify an error rate you want to keep. So you would say: okay, I want my error rate to always stay below this value. And then, as you add more elements to the Bloom filter, it's going to just scale up — it's going to keep stacking filters one on top of the other, so it can keep your error rate to what you requested. So that's pretty specific to Redis Labs, to the RedisBloom implementation.
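The relationship Elena describes — a bigger bit array and more hash functions give a lower error rate — follows the standard Bloom filter sizing formulas. A small Python sketch (generic textbook math, not the exact sizing logic of RedisBloom or its scalable variant):

    # Classic Bloom filter sizing: m bits and k hash functions for n items at error rate p.
    import math

    def bloom_parameters(n_items: int, target_error: float):
        """Return (m bits, k hash functions) for n items and a target false-positive rate."""
        m = math.ceil(-n_items * math.log(target_error) / (math.log(2) ** 2))
        k = max(1, round((m / n_items) * math.log(2)))
        return m, k

    m, k = bloom_parameters(1_000_000, 0.01)
    print(f"~{m} bits (~{m // 8 // 1024} KiB) and {k} hash functions for 1M items at 1% error")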
But just to go one step back: first of all, the RedisBloom module is a module for Redis. Redis implements a module API, where anyone can go ahead and create a module in C and extend the data structures. What's a module? Please be more specific. It's a piece of code, written in C, that uses the Redis module API, and with it you can extend Redis. You can even hijack some commands that are sent to the Redis server. So you load your Redis server with your module, and you can implement your own data structures. In Redis you have strings, you have hashes, lists — you have a few data structures; with a Redis module you can implement your own. And in the case of the RedisBloom module, there are four new data structures that were implemented in that module: the Bloom filter; the Cuckoo filter, which is similar to Bloom but implemented quite differently, let's say, with a similar use case — it's from the membership family; the Count-Min Sketch; and the Top-K heavy keeper. Those are the four data structures we have. You mentioned that there were four different families for different purposes. So the Bloom filters cover membership, I take it — what are the others for? Yes. So cardinality, that is another family: cardinality is to estimate the cardinality of a set. So membership determines whether a member is present in a set; cardinality estimates the cardinality of a set; frequency, obviously, the frequency of elements in a stream; and similarity determines the degree of similarity between elements, if that's the right term. When would you use the other ones? Do you have some examples? Okay, so for cardinality, in the RedisBloom module we don't have anything, but we have the HyperLogLog, which is a native Redis data structure. We can talk a little bit about that, because it's another super cool one. With the HyperLogLog, in only 12 kilobytes of memory, you can estimate the cardinality of huge sets. Let's explain that with a use case — let's go with a YouTube video, for example. On YouTube videos you have a number of views. How would that work conceptually? Well, let's go with IP addresses or whatever kind of unique user identifier — maybe a user ID, maybe some combination of an IP with some browser identifier or cookie or something. So you have some kind of unique user ID. And every time that user ID views a video, you need to decide whether that user has viewed the video before. In order to be able to do that, you need to think: well, how can I know if they viewed this video before, unless I have a list of everyone who's ever viewed it and compare against that list? And the more viewers you have, the longer that list becomes — it becomes pretty much unmaintainable after a while. With the HyperLogLog, every time you have a new user coming, you get their ID and you just stick it into the HyperLogLog data structure. And then whenever you query that data structure — okay, how many unique elements do you have inside? — it's going to tell you. You can add huge numbers of elements and it still guarantees a low error. Unfortunately, at the moment I don't remember the exact numbers — I think the margin of error is below one percent — but it's going to tell you: okay, I've seen this many unique elements.
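A quick sketch of the unique-viewers idea with Redis' native HyperLogLog, using the redis-py client (assumes a local Redis server; the key name and user IDs are made up for illustration):

    # Counting unique viewers approximately with HyperLogLog (~12 KB per counter).
    import redis

    r = redis.Redis(host="localhost", port=6379)
    for user_id in ["user:1", "user:2", "user:1", "user:3", "user:2"]:
        r.pfadd("video:42:viewers", user_id)   # re-adding a known ID doesn't change the estimate
    print(r.pfcount("video:42:viewers"))       # approximately 3 unique viewers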
And it's very interesting how it's implemented inside, but I'm not sure I would be able to explain it very well here — it would be just words and not a visual explanation. Yeah, I think we'd need that for the listeners of our podcast as well; the details, of course, will be in the show notes, to keep track of all of them. There is actually a presentation I gave at UbuCon where I spend a few minutes explaining how the HyperLogLog works internally — maybe we can link that too, if someone's interested. Yes, the details, of course, will be in the show notes, dear listeners. Okay, so that's the cardinality family. The frequency family estimates with what frequency some elements occur in a stream, without having to store all the elements that have ever occurred in that stream. And in the Redis domain we have the Count-Min Sketch and the Top-K with its heavy keeper algorithm, and the Top-K is the more performant one. What it does is: you watch all the elements that show up in some stream, and you can see which elements appear the most — the top five, top ten, whatever. A nice use case for this can be in gaming: who are the top players by score, if you can sacrifice some precision. So the flow is the incoming game scores, and you can even store a separate sorted set holding the top K users; every time a user scores points, it's added to the Top-K list. Or trending hashtags: for social media platforms or news distribution networks, you can ask, what are the K hashtags people have mentioned the most in the last X hours? Imagine the flow of information — how many hashtags someone would otherwise have to store. But if you pass every new post and every new hashtag through this Top-K heavy keeper, then the Top-K is going to actually store and give you the top K hashtags that have appeared. Now, I think what is important in this context is the observation that all these probabilistic data structures are scalable, because essentially what they store are just hashes. So never mind whether you throw 10K values at them or a petabyte of data: the memory and CPU consumption is predictable to some extent. This is a property you do not necessarily have with all native Redis data structures — and the sets and the sorted sets that Elena just mentioned are probably the two primary examples here. And sorry, a sorted set is just a native Redis data structure; it is essentially a set where each and every element has a score attached to it. Think of it like a leaderboard or a recommendation list: it gives you an ordering of the elements in a set — just for the few listeners who do not know what a sorted set is. Okay. Sorry, I have a question — I was going to ask: does this one have an error rate attached to it as well? The Top-K? Yeah, it does. It's an estimation, because with the Top-K there is also a decay algorithm — so it's not the top K since the data structure was instantiated, it's the heavy hitters over whatever recent period. At the moment I don't remember how exactly the decay works; I would have to re-read my own article, because it's been over half a year since I last checked.
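A small sketch of the trending-hashtags idea with the Top-K structure from the RedisBloom module, again via redis-py (assumes the module is loaded; the key name and hashtags are invented):

    # Tracking approximate heavy hitters with RedisBloom's Top-K commands.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.execute_command("TOPK.RESERVE", "trending", 3)                 # keep the top 3 heavy hitters
    for tag in ["#redis", "#rust", "#redis", "#linux", "#redis", "#rust"]:
        r.execute_command("TOPK.ADD", "trending", tag)
    print(r.execute_command("TOPK.LIST", "trending"))                # approximate top 3 hashtags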
But the analogy that I had for this is — they call them elephant flows, the heavy hitters — imagine you have a field, and five elephants pass through it, one behind the other; they're going to leave a trail. Then you have ten elephants going another way; they're going to leave another trail. And then maybe you have a whole herd that goes in the opposite direction. After they're gone, the tracks you're going to see in the field are mostly the ones from the full herd. The ones from the five elephants or the ten elephants are going to be mostly erased — they're not going to be visible anymore because of the full herd that passed through. So it's not so much time-related as volume-related. But then, after a while, wind blows, you have sandstorms or any kind of storms, and even the track of the whole herd starts to fade out, and newer tracks start to form. The Top-K can be visualized like that, for the people who like to visualize things. And you mentioned that there are actually two types of membership filters in Redis, namely the Bloom one and the Cuckoo one. Maybe you can shed some light on the commonalities and the differences. Yeah. Well, Cuckoo filters also enable you to check whether an element is present in a set, also using a very small, fixed-size memory space, and they also sacrifice some precision for it. But they're implemented completely differently, and for some cases they are maybe faster for checking, but not faster for adding elements to the filter — there are some particularities. The biggest difference is that from a Cuckoo filter you can delete elements; you cannot delete elements from a Bloom filter. That is one big difference. The second difference, maybe more subtle, is that for Cuckoo filters you only get discrete error rates that you can set — maybe something like 0.04 or 0.07, depending on the implementation — and specifically for the implementation in the RedisBloom module, you cannot choose any rate you want; you have a set of discrete error rates to choose from. Another thing is that, even though it may be faster for some use cases, it's going to fill up faster if you add the same element to it twice. In a Bloom filter, if you add the same element a second, third or nth time, nothing changes inside it. With the Cuckoo filter it will — you're going to end up with that element in there twice or three times. Okay. I assume all the implementations of these modules are open source? Not all of them — I think the RedisBloom module is under a source-available license, not open source, but the difference is there; I don't actually want to go into that discussion. Well, yeah. I don't know if you've ever talked about the source-available license? No, I haven't, and there's a reason for that — this is not a licensing podcast. But, well, full disclosure: there will actually be an upcoming episode on open source licenses, summer time frame, so stay tuned, people, if you're interested in open source licenses — or if you can't get to sleep at night, just don't miss that episode. But this is not something that we will go into tonight.
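Coming back to the deletion difference Elena describes between the two filters, here is how it might look with the Cuckoo filter commands from the RedisBloom module, via redis-py (assumes the module is loaded; the key and item names are made up):

    # Cuckoo filters support deletion; Bloom filters have no equivalent of CF.DEL.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.execute_command("CF.ADD", "seen", "item-1")
    print(r.execute_command("CF.EXISTS", "seen", "item-1"))   # 1: probably present
    r.execute_command("CF.DEL", "seen", "item-1")             # remove the item again
    print(r.execute_command("CF.EXISTS", "seen", "item-1"))   # 0: the only copy is gone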
The main point is that all the source code of these implementations is available on GitHub. The license may of course be a different one, but if you want to take a look at how it's done internally and how it's implemented, you're free to do so. And I think, speaking of modules, the main implementation language is still C, right? But I think there's one module implemented in Rust these days, and that's RedisJSON, if I'm not completely mistaken. Yes, I've heard so. Okay. Yeah. Has this been released? Yes, it has, and I think some people already use it in production. This is, of course, a shameless plug for Rust. Full disclosure: yes, there will be a Linux Inlaws episode on Rust, but that's now enough with the commercial breaks — let's continue with the PDSs. Okay. Martin, any — I wouldn't say final thoughts, but any thoughts or questions on PDSs? Well, I'd like to go back to the opening statement that there were four different types. Have we covered all of them? We covered and gave examples for membership, cardinality and frequency; we didn't give an example for similarity. I don't personally know of an implementation, that's why I don't want to go there, but similarity in general can estimate how similar different elements in a set are. Okay. So, are you familiar with the roadmap for RedisBloom, perhaps? I mean, it's been around for a couple of years at least, and yeah, I think it's pretty stable — I haven't seen any big developments in there. Well, full disclosure, I did get in touch with regards to quantum hashing integration for this, but I'm not too sure what the current status is — better check with product management. Full disclosure, people: that was a joke. An active area of research, if I'm informed correctly. And Martin, beyond the GPU database field, I'm afraid. Okay. Last question, really, that I have: this whole thing, although it sounds very theoretical — never mind the math behind it and all the rest of it — Netflix and friends must be just the tip of the iceberg. I reckon, from a practical use-case perspective, there must be a multitude of applications and algorithms basically making use of this. I think so, I think so. I don't know how much I can talk about what I've seen at Redis Labs — we just touched on a few use cases — but imagine Google, let's say the Google sign-up form: every time you try to create a new Google address, it has to check whether that email already exists. There you have it, another use case for a Bloom filter: has this email address been used already? Or a dictionary: is this word I just wrote in the dictionary, is it correctly spelled? That was actually one of the primary use cases for the Bloom filter. Which is quite an example, because as we all know, languages can be complex beasts. Yes — even our languages outside the programming languages realm. Yeah. I mean, for correct spelling it's at least a very basic way to highlight a word that doesn't exist in the dictionary, if you look at it like that. Sounds like a very good use case indeed. Any final thoughts before we go into the boxes and the anti-boxes? No — just one thing about the probabilistic data structures realm.
Once you get to learn what they can do, you really kind of start to fall in love with them. You start to see applications everywhere. And it's a very good tool to have in your developer tool belt. But — I don't know — where do you see this going? I'm just talking from an application perspective, maybe even from an implementation perspective, in terms of the development of this Bloom module. Anything goes: where humanity is going, for example — to Mars. Or whether the card fraud cartels will flip the coin and go into fraud detection — you never know. No, I'm joking. But if you have any kind of visionary thoughts, now is the time. Well, if the direction is more and more data and more and more real-time requirements, then I think we are going to see many more new algorithms for probabilistic data structures. We're going to see new implementations that we hadn't thought possible, and probably even more uses for the current algorithms, and improvements — even the Bloom filter algorithm has already been improved multiple times; there are multiple variants of it. I mean, this is the beauty of math, right? Similar to computer science, it never sleeps, because people are basically just wondering what is wrong with the current state of things, take it apart, put it back together again, and improve it. Ah, engineers. Funnily enough, this is the very core thought of something called open source, if I'm not completely mistaken. Okay, can we get some of those engineers onto Jitsi? Jitsi, if you're listening, please get in touch. Yeah, it didn't break down this time. Yes, but you'll probably have to cut this out. Okay, guys, that has been a very fun episode — never mind the missing math, but we don't want to bore all our listeners to death. So this was actually very light on math. Needless to say, there is a complex mathematical foundation beyond this; details, of course, in the show notes. Probably tricky to do on a podcast. Yes, exactly. And with that thought, I think it's now, as tradition demands, time to go into the boxes. Elena, I'm not sure if you're familiar with the concept: the boxes are the picks of the week. This concept is, of course, a registered trademark of Linux Inlaws, so anybody trying to rip this off — our lawyers will be in touch, don't even try. And anything goes, anything that crossed your mind worth mentioning, Elena, in terms of fond memories: something that you really liked that has crossed your path recently, and then we take turns. Can you say that again — boxes? What is a box? Yes, anything that you found positive, anything that came across positively. Okay. I mean, we normally confine ourselves to movies these days, but anything goes: can be a movie, can be a TV series, can be music, can be a book, you name it. Hey, you know what, it was a funny thing. I hadn't been out — I'm in Portugal and we have been in lockdown for a while — and it's crazy, but after a few weeks of not going out for walks in nature at all, I finally went out for a walk in the sun. It was finally also a sunny day after many weeks of rain. And it was weird.
I just felt that I sprang right back to life, and I realized how much the sun and being out in nature make a really big difference in our lives. Certainly a valid box. No, that's fine, anything goes, as we said. Does that mean rain is your anti-box? We'll come to that in a minute. Okay, full disclosure, I had the very same experience today as Elena, because I took literally an hour-long break and actually did some cycling in Frankfurt. It was about 15 degrees, but the sun was shining, and that, in the current situation — Germany is still under lockdown — was a welcome diversion. But over to you, Martin. I think, yeah, well, after today I think my box of the week is probabilistic data structures, because it's a lot of fun — lots of things to use, and a beautiful piece of, I guess, algorithmic work, most of these. Yeah. Okay, anti-boxes, if there are any? I'm going to say rain. Great minds think alike, because I was just about to say the weather forecast: the forecast actually predicts a return of winter for Frankfurt, where I'm living. So okay, fair enough. And, Elena, any thoughts on a possible anti-box? Puppies. Sorry, say again? Puppies. Everyone loves puppies. Okay, you probably have to explain why. The flowers or the dogs? No, the dogs — the big dogs, puppies. I don't know if I want to explain exactly why, it might be too much information. I'll take it that you don't want to open that up. You still have only one, right? Yeah. Okay, we will probably leave it at that. Okay — Elena, that has been really fun. Thank you very much for joining us. You're more than welcome back; we hope to have you — or Olga, for that matter — back in an upcoming episode very soon, and we're really looking forward to it. Thank you. Thank you, guys. Yeah, thanks. Thank you. This is Linux Inlaws. You come for the knowledge, but stay for the madness. Thank you for listening. This podcast is licensed under the latest version of the Creative Commons license, type Attribution ShareAlike. Credits for the intro music go to Blue Zero Stirs for the song Market, to Twin Flames for their piece called The Flow, used for the second intro, and finally to Select Your Ground for the song We Just Is, used by the dark side. You find these and other tunes licensed under CC at Jamendo, a website dedicated to liberating the music industry from choking copyright legislation and other crap concepts. You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution ShareAlike 3.0 license.