Episode: 1937 Title: HPR1937: Klaatu talks to Cloudera about Hadoop and Big Data Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1937/hpr1937.mp3 Transcribed: 2025-10-18 11:28:37

---

This is HPR Episode 1937 entitled "Klaatu talks to Cloudera about Hadoop and Big Data", and it is part of the series In The News. It is hosted by Klaatu and is about 11 minutes long. The summary is: Klaatu talks to Cloudera about Hadoop and Big Data.

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15. Better web hosting that is honest and fair, at AnHonestHost.com.

Hi everyone, this is Klaatu, and I'm at All Things Open 2015. I'm talking to Ricky from Cloudera. So Ricky, what is Cloudera?

We're an open source company that is trying to take Apache Hadoop to the enterprise level, to allow enterprises to extract information out of multi-terabyte data sets, even petabyte data sets.

When you say Hadoop, first of all, I know the name, and I know that it's got something to do with big data, but is it a file system, is it a framework, what is Hadoop?

That's a great question, and the answer has really evolved over time. Back in the day, when I first started at Cloudera three and a half years ago, Hadoop was really just a file system and a distributed processing framework called MapReduce. HDFS is the file system for Hadoop. It's a distributed file system which allows you to take many nodes, typically commodity nodes back in the day, just couple-thousand-dollar servers, and you get four or five of them, maybe a hundred, up to a thousand, and you put them together. Then you can store a large volume of data on those systems, and it brings it all together in one file system namespace.

So that's the file system; the other part is the distributed processing. Years ago, that processing framework was called MapReduce. You essentially write code in a very particular pattern, a pattern named by Google called MapReduce, and if you program in this paradigm, you're able to write some pretty basic code, submit it to the cluster, and analyze data in a distributed fashion. You don't have to worry about nodes failing, you don't have to worry about where the data is, and all that kind of stuff, so you can scale out your distributed processing.

Cloudera is obviously, I mean, they're the standard for production-ready Apache Hadoop, but I see in the white papers there that it lists half a dozen Apache projects. How does that all fit together?

Right, great question. So we take a lot of these projects that fall under the umbrella of the Hadoop ecosystem, and you name it, there are like 15 or 20 of them, right? What we do is actually similar to how Red Hat works: if you're going to run a production system, you don't go to kernel.org, download a kernel, and compile it. You go download Red Hat, or you download an Ubuntu or Debian system. We're very similar in that way. We employ a large majority of the committers and contributors to the open source projects that we ship. We provide patches and work upstream, and then we take the stable bits, pull them down, and do production releases where we test everything: integration tests, smoke tests, all that kind of stuff, so we can say, yes, we certify that this product is production ready, it contains all the features our customers want, and it's also a stable piece of software that we're confident you can run in an enterprise, in a production environment.
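To make the MapReduce pattern Ricky describes concrete, here is the classic word-count job, a minimal sketch against the standard Hadoop Java MapReduce API. The map step emits a (word, 1) pair for every token in its slice of the input, the reduce step sums the counts per word, and the framework supplies everything he says you don't have to worry about: scheduling, data locality, and retrying failed nodes. The input and output HDFS paths are just command-line arguments here.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: runs on whichever nodes hold the input blocks,
        // emitting (word, 1) for every token it sees.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: receives every count for a given word (the framework
        // does the shuffle and sort) and sums them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this as a jar and submit it with "hadoop jar wordcount.jar WordCount /input /output"; the same code runs unchanged whether the cluster has five nodes or a thousand.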
Who would be your customers? Obviously big data, but are we talking about websites with lots of logged-in users, or is it also internal intranets at big enterprises, or both?

Really, it's a lot of different verticals. One of the big verticals among our customers is financial companies. Credit card fraud, for example, is a huge use case for machine learning, and the more data you have to feed a machine learning library or algorithm, the better. It's actually better to have more data and a dumber algorithm than a really smart algorithm and a small amount of data. We have a lot of use cases in the financial vertical, and also e-commerce and social media websites that just need to store an abundant amount of data. We have a database that runs across Hadoop, called HBase, that essentially lets you store billions to trillions of rows and still find the needle in the haystack: you say, this user ID is this, I want to find that user. You might have hundreds of thousands or millions of users, and the data you keep about them is quite dense, and you need to find this one record. You just give it a key, and it goes in and gets it. A lot of companies use it, maybe not all our customers, but Pinterest, for example, is one of the largest HBase users, and Facebook is a huge HBase user.

That's what it's sounding like. It's like you were describing Facebook.

Yes, exactly.
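That needle-in-a-haystack lookup is a single keyed read in the HBase client API. Below is a minimal sketch in Java; the "users" table, the row key, and the "profile:email" column are hypothetical stand-ins for whatever schema a site like that would use, and the cluster location is assumed to come from an hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table users = conn.getTable(TableName.valueOf("users"))) {
                // A Get is a point read by row key: HBase routes it straight
                // to the region holding that key, however many billions of
                // rows the table has.
                Get get = new Get(Bytes.toBytes("user:12345")); // hypothetical key
                Result row = users.get(get);
                byte[] email = row.getValue(Bytes.toBytes("profile"),
                                            Bytes.toBytes("email"));
                System.out.println(email == null ? "not found"
                                                 : Bytes.toString(email));
            }
        }
    }

Because HBase keeps rows sorted by key, a Get goes directly to the one region server responsible for that key, which is what keeps a point lookup fast even across billions of rows.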
So something like Facebook, let's say, where they've got uploads of pictures and they're scanning for facial recognition, is that something Hadoop does? I don't know if Facebook actually uses Hadoop for that, but if they did, does that fit into that ecosystem? We have all this data being thrown at us: how do we process it, scale it down to thumbnails, and post it to the person's profile, all in one go?

Right, yeah, you can actually do that. One nice thing about Hadoop is that you can just shove any file, binary format or not, into the file system and decide later what to do with it. There was a company, one of our clients, whose name escapes me right now, but they did imaging of, like, the whole world. Every hour or whatever, they were taking pictures all around the world.

Was it the NSA?

Yeah, maybe. So they would take these images and upload them to Hadoop, and they had custom MapReduce code that would scan through the images and glean information from them. They could say, for example, that some sort of ship had moved through the Panama Canal between this time and this time, just based on the images.

Just based on the images. That almost seems like an inefficient way to do that, actually. Isn't there some other trigger?

Exactly, yeah, you would think, but there's all kinds of information you can pull out: maybe there's an army moving through an area, or a giant migration of animals going this way. One of our customers is Monsanto, which, you know, people have mixed feelings about, but they're able to use Hadoop to analyze how far apart to plant seeds to yield maximum crop growth, using big data and weather patterns. They're able to correlate all these different dispersed data sets and then make educated decisions about what to do.

Wow, that's cool. I used to do that with the Farmers' Almanac, but I guess this is simpler.

And right, like, 90% of the time, right?

Yeah. Well, that's really cool. So you're an open source project and a business, correct?

Yeah. We have an interesting business model, a hybrid business model. We are very much against vendor lock-in, and we have been for a long time; it's one of the principles we stand on. We don't want you to feel like you have to pay us in order to keep running our software. Everything that touches your data, everything that analyzes your data, the things you use on a day-to-day basis to run your business: if you decide to use Cloudera, that is 100% open source. It's not like some companies where, if you didn't pay the bill, they unplug your database and say you're done. But in order to keep other vendors or competitors from being able to just take everything, we do keep some things proprietary, things that don't touch your data.

Okay.

For example, Cloudera Manager was one of our biggest products for a very long time. It makes it very easy to install Hadoop and to actually manage Hadoop, which is really one of the hardest things to do, especially when you're dealing with tens to hundreds of nodes. Back when I first started, Cloudera Manager was in its infancy and we were still learning how to build these clusters by hand: every XML value handwritten, shipped out with little scripts, or with Puppet or something. It was a nightmare. So that was one piece of software we built, and we continue to add others to our portfolio, such as Navigator, which tracks data as it comes in and records its lineage. That way, when an executive gets a report, you can actually trace how that data came in and say concretely that, yes, this data is not false, it was not corrupted, and it was not joined against bad data.

So I don't think this is the sort of thing people could just try out, right? This is for an acquisitions person or something?

Sort of, yeah. Even though Hadoop has come a very long way, it's still kind of difficult to just step into the pond. But there are a lot of training resources online, way more than there were three years ago, so you can totally do that. I will say that if you are in the tech industry, it would definitely behoove you to start looking into Hadoop and learning a little bit about it, because it's only going to become more prevalent.
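For anyone dipping into the pond, the most basic interaction is the one that imaging customer relied on: write raw bytes into HDFS now, decide how to interpret them later. Here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the destination path is a made-up example, and the cluster address is assumed to come from the core-site.xml on the classpath.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ImageUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);

            // HDFS stores whatever bytes you give it; no schema is imposed
            // at write time ("decide later what to do with it").
            Path dest = new Path("/data/imagery/pass-01.jpg"); // hypothetical path
            try (FSDataOutputStream out = fs.create(dest)) {
                out.write(Files.readAllBytes(Paths.get(args[0]))); // local file to upload
            }

            // Later, a MapReduce job (or a quick check like this) decides
            // how to interpret those bytes.
            try (FSDataInputStream in = fs.open(dest)) {
                byte[] magic = new byte[4];
                in.readFully(magic); // e.g. inspect the file's magic bytes
                System.out.printf("first bytes: %02x %02x %02x %02x%n",
                        magic[0], magic[1], magic[2], magic[3]);
            }
        }
    }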
I've been at this for a long time, and more and more companies are talking about it. More and more companies want it, and more and more companies don't just want it, they need it now. Google hit this problem a long time ago, and now companies are realizing that to be a real company these days, you need to become information driven.

And your data set is not getting smaller.

Exactly. Your data set is only getting larger, and now, with this Internet of Things buzzword going around, everybody's collecting as much data as possible.

Yeah.

And they're realizing, oh, I can't just shove that into an Oracle database. I would say, go to our website. We have a QuickStart VM that you can just download; it's a VirtualBox machine, and you can just get started. There are also Docker images online that you can probably get going with too.

Cool. Okay. What is the site?

It's cloudera.com. The QuickStart bundle is in the resources section.

Cool. Thanks for giving me an overview. That was actually really interesting and informative.

Thank you. I appreciate it.

All right. Talk to you later.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you've ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.