Episode: 1937 Title: HPR1937: Klaatu talks to Cloudera about Hadoop and Big Data Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1937/hpr1937.mp3 Transcribed: 2025-10-18 11:28:37

---

This is HPR Episode 1937 entitled "Klaatu talks to Cloudera about Hadoop and Big Data", and it is part of the series In The News. It is hosted by Klaatu and is about 11 minutes long. The summary is: Klaatu talks to Cloudera about Hadoop and Big Data.

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15. Better web hosting that is honest and fair, at AnHonestHost.com.

Hi everyone, this is Klaatu, and I'm at All Things Open 2015. I'm talking to Ricky from Cloudera. So Ricky, what is Cloudera?

We're an open source company that is trying to take Apache Hadoop to the enterprise level, to allow enterprises to extract information out of multi-terabyte data sets, even petabyte data sets.

When you say Hadoop, first of all, I know the name, and I know that it's got something to do with big data, but is it a file system, is it a framework, what is Hadoop?

That's a great question, and the answer has really evolved over time. Back in the day, when I first started at Cloudera three and a half years ago, Hadoop was really just a file system and a distributed processing framework called MapReduce. HDFS is the file system for Hadoop. It's a distributed file system which allows you to take many nodes, typically commodity nodes back in the day, just couple-thousand-dollar servers, and you get four or five of them, maybe a hundred, up to a thousand, and you put them together. Then you can store a large volume of data on those systems, and it brings it all together in one file system namespace.

So that's the file system; the other part is the distributed processing. Years ago, that processing framework was called MapReduce. You essentially write code in a very particular pattern, a pattern named by Google called MapReduce, and if you program in this paradigm, you're able to write some pretty basic code, submit it to the cluster, and analyze data in a distributed fashion. You don't have to worry about nodes failing, you don't have to worry about where the data is, and all that kind of stuff, so you can scale out your distributed processing.

Cloudera is obviously, I mean, they're the standard for production-ready Apache Hadoop, but I see in the white papers there that it lists half a dozen Apache projects. How does that all fit together?

Right, great question. So we take a lot of these projects that fall under the umbrella of the Hadoop ecosystem, and you name it, there are like 15 or 20 of them, right? What we do is actually similar to how Red Hat works: if you're going to run a production system, you don't go to kernel.org, download a kernel, and compile it. You go download Red Hat, or you download an Ubuntu or Debian system. We're very similar in that way. We employ a large majority of the committers and contributors to the open source projects that we ship. We provide patches and work upstream, and then we take the stable bits, pull them down, and do production releases where we test everything: integration tests, smoke tests, all that kind of stuff, so we can say, yes, we certify that this product is production ready, it contains all the features our customers want, and it's also a stable piece of software that we're confident you can run in an enterprise, in a production environment.
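To make the MapReduce pattern Ricky describes concrete, here is the classic word-count job, a minimal sketch against the standard Hadoop Java MapReduce API. The map step emits a (word, 1) pair for every token in its slice of the input, the reduce step sums the counts per word, and the framework supplies everything he says you don't have to worry about: scheduling, data locality, and retrying failed nodes. The input and output HDFS paths are just command-line arguments here.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: runs on whichever nodes hold the input blocks,
        // emitting (word, 1) for every token it sees.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: receives every count for a given word (the framework
        // does the shuffle and sort) and sums them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this as a jar and submit it with "hadoop jar wordcount.jar WordCount /input /output"; the same code runs unchanged whether the cluster has five nodes or a thousand.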
Who would be your customers? Obviously big data, but are we talking about websites with lots of logged-in users, or is it also internal intranets at big enterprises, or both?

Really, it's a lot of different verticals. One of the big verticals among our customers is financial companies. Credit card fraud, for example, is a huge use case for machine learning, and the more data you have to feed a machine learning library or algorithm, the better. It's actually better to have more data and a dumber algorithm than a really smart algorithm and a small amount of data. We have a lot of use cases in the financial vertical, and also e-commerce and social media websites that just need to store an abundant amount of data. We have a database that runs across Hadoop, called HBase, that essentially lets you store billions to trillions of rows and still find the needle in the haystack: you say, this user ID is this, I want to find that user. You might have hundreds of thousands or millions of users, and the data you keep about them is quite dense, and you need to find this one record. You just give it a key, and it goes in and gets it. A lot of companies use it, maybe not all our customers, but Pinterest, for example, is one of the largest HBase users, and Facebook is a huge HBase user.

That's what it's sounding like. It's like you were describing Facebook.

Yes, exactly.
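That needle-in-a-haystack lookup is a single keyed read in the HBase client API. Below is a minimal sketch in Java; the "users" table, the row key, and the "profile:email" column are hypothetical stand-ins for whatever schema a site like that would use, and the cluster location is assumed to come from an hbase-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table users = conn.getTable(TableName.valueOf("users"))) {
                // A Get is a point read by row key: HBase routes it straight
                // to the region holding that key, however many billions of
                // rows the table has.
                Get get = new Get(Bytes.toBytes("user:12345")); // hypothetical key
                Result row = users.get(get);
                byte[] email = row.getValue(Bytes.toBytes("profile"),
                                            Bytes.toBytes("email"));
                System.out.println(email == null ? "not found"
                                                 : Bytes.toString(email));
            }
        }
    }

Because HBase keeps rows sorted by key, a Get goes directly to the one region server responsible for that key, which is what keeps a point lookup fast even across billions of rows.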
So something like Facebook, let's say, where they've got uploads of pictures and they're scanning for facial recognition, is that something Hadoop does? I don't know if Facebook actually uses Hadoop for that, but if they did, does that fit into that ecosystem? We have all this data being thrown at us: how do we process it, scale it down to thumbnails, and post it to the person's profile, all in one go?

Right, yeah, you can actually do that. One nice thing about Hadoop is that you can just shove any file, binary format or not, into the file system and decide later what to do with it. There was a company, one of our clients, whose name escapes me right now, but they did imaging of, like, the whole world. Every hour or whatever, they were taking pictures all around the world.

Was it the NSA?

Yeah, maybe. So they would take these images and upload them to Hadoop, and they had custom MapReduce code that would scan through the images and glean information from them. They could say, for example, that some sort of ship had moved through the Panama Canal between this time and this time, just based on the images.

Just based on the images. That almost seems like an inefficient way to do that, actually. Isn't there some other trigger?

Exactly, yeah, you would think, but there's all kinds of information you can pull out: maybe there's an army moving through an area, or a giant migration of animals going this way. One of our customers is Monsanto, which, you know, people have mixed feelings about, but they're able to use Hadoop to analyze how far apart to plant seeds to yield maximum crop growth, using big data and weather patterns. They're able to correlate all these different dispersed data sets and then make educated decisions about what to do.

Wow, that's cool. I used to do that with the Farmers' Almanac, but I guess this is simpler.

And right, like, 90% of the time, right?

Yeah. Well, that's really cool. So you're an open source project and a business, correct?

Yeah. We have an interesting business model, a hybrid business model. We are very much against vendor lock-in, and we have been for a long time; it's one of the principles we stand on. We don't want you to feel like you have to pay us in order to keep running our software. Everything that touches your data, everything that analyzes your data, the things you use on a day-to-day basis to run your business: if you decide to use Cloudera, that is 100% open source. It's not like some companies where, if you didn't pay the bill, they unplug your database and say you're done. But in order to keep other vendors or competitors from being able to just take everything, we do keep some things proprietary, things that don't touch your data.

Okay.

For example, Cloudera Manager was one of our biggest products for a very long time. It makes it very easy to install Hadoop and to actually manage Hadoop, which is really one of the hardest things to do, especially when you're dealing with tens to hundreds of nodes. Back when I first started, Cloudera Manager was in its infancy and we were still learning how to build these clusters by hand: every XML value handwritten, shipped out with little scripts, or with Puppet or something. It was a nightmare. So that was one piece of software we built, and we continue to add others to our portfolio, such as Navigator, which tracks data as it comes in and records its lineage. That way, when an executive gets a report, you can actually trace how that data came in and say concretely that, yes, this data is not false, it was not corrupted, and it was not joined against bad data.

So I don't think this is the sort of thing people could just try out, right? This is for an acquisitions person or something?

Sort of, yeah. Even though Hadoop has come a very long way, it's still kind of difficult to just step into the pond. But there are a lot of training resources online, way more than there were three years ago, so you can totally do that. I will say that if you are in the tech industry, it would definitely behoove you to start looking into Hadoop and learning a little bit about it, because it's only going to become more prevalent.
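For anyone dipping into the pond, the most basic interaction is the one that imaging customer relied on: write raw bytes into HDFS now, decide how to interpret them later. Here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the destination path is a made-up example, and the cluster address is assumed to come from the core-site.xml on the classpath.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ImageUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);

            // HDFS stores whatever bytes you give it; no schema is imposed
            // at write time ("decide later what to do with it").
            Path dest = new Path("/data/imagery/pass-01.jpg"); // hypothetical path
            try (FSDataOutputStream out = fs.create(dest)) {
                out.write(Files.readAllBytes(Paths.get(args[0]))); // local file to upload
            }

            // Later, a MapReduce job (or a quick check like this) decides
            // how to interpret those bytes.
            try (FSDataInputStream in = fs.open(dest)) {
                byte[] magic = new byte[4];
                in.readFully(magic); // e.g. inspect the file's magic bytes
                System.out.printf("first bytes: %02x %02x %02x %02x%n",
                        magic[0], magic[1], magic[2], magic[3]);
            }
        }
    }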
I've been at this for a long time, and more and more companies are talking about it. More and more companies want it, and more and more companies don't just want it, they need it now. Google hit this problem a long time ago, and now companies are realizing that to be a real company these days, you need to become information driven.

And your data set is not getting smaller.

Exactly. Your data set is only getting larger, and now, with this Internet of Things buzzword going around, everybody's collecting as much data as possible.

Yeah.

And they're realizing, oh, I can't just shove that into an Oracle database. I would say, go to our website. We have a QuickStart VM that you can just download; it's a VirtualBox machine, and you can just get started. There are also Docker images online that you can probably get going with too.

Cool. Okay. What is the site?

It's cloudera.com. The QuickStart bundle is in the resources section.

Cool. Thanks for giving me an overview. That was actually really interesting and informative.

Thank you. I appreciate it.

All right. Talk to you later.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you've ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.