Episode: 1937
Title: HPR1937: Klaatu talks to Cloudera about Hadoop and Big Data
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1937/hpr1937.mp3
Transcribed: 2025-10-18 11:28:37

---

This is HPR Episode 1937 entitled, Klaatu Talks to Cloudera about Hadoop and Big Data, and
is part of the series in the news.
It is hosted by Klaatu and is about 11 minutes long.
The summary is, Klaatu talks to Cloudera about Hadoop and Big Data.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
Better web hosting that is honest and fair at AnHonestHost.com.
Hi everyone, this is Klaatu and I'm at All Things Open 2015.
I'm talking to Ricky from Cloudera.
So Ricky, what is Cloudera?
We're an open source company that is trying to take Apache Hadoop to the enterprise level
to allow enterprises to extract information out of multi-terabyte to petabyte
data sets.
When you say Hadoop, first of all, I know the name, I know that it's got something to do
with big data. Is it a file system, is it a framework, what is Hadoop?
That's a great question.
That answer really kind of evolves over time.
Back in the day, when I first started at Cloudera like three and a half years ago, Hadoop
was really just a file system and a distributed processing framework called MapReduce.
HDFS is the file system for Hadoop, and it's a distributed file system which allows you
to take many nodes, typically back in the day we'd see commodity nodes, just a couple thousand
dollar servers, you get like four or five, maybe a hundred to a thousand of them, and you
put them together.
Then you can store a large volume of data on the systems and it kind of brings it all
together in one file system namespace.
So you mentioned the file system, the other part was the distributed processing.
Years ago, that processing framework was called MapReduce, and you can essentially write
code in this very certain pattern. It's just a pattern that was kind of labeled by
Google, called MapReduce, and if you program in this paradigm, you're able to write some
pretty basic code but you're able to submit it to the cluster and analyze data in a distributed
fashion, so you don't have to worry about nodes failing, you don't have to worry about where
the data is and all that kind of stuff, so you can scale out your distributed processing.
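As a rough illustration of the pattern described above, here is a minimal single-process sketch of the MapReduce paradigm in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example. On a real cluster the framework would run the map and reduce steps on many nodes and handle failures and data placement.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key (done by the framework on a cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (hypothetical driver for illustration)."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

The point of the pattern is that `map_phase` and `reduce_phase` only ever see one record or one key at a time, which is what lets a framework spread them across machines without the author of the job writing any distribution logic.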
Cloudera is obviously, like, I mean, they're the standard for production-ready
Apache Hadoop, but I know that, I mean, I see on the document, the white pages there, that
it lists like half a dozen Apache projects. Like, how does that all sort of fit together?
Right, great question.
So we try to take a lot of these projects out there that are kind of under the umbrella
of the Hadoop ecosystem.
So you have, I mean, you can name it, it's like 15, 20 of them, right?
So what we do is actually similar to kind of how Red Hat works. If you're
going to run a production system, you don't go to kernel.org and download a kernel
and compile it.
You go download Red Hat, or you download an Ubuntu or Debian system.
We're very similar in that way.
So we employ a large majority of the committers and contributors to these
open source projects that we ship.
We provide patches, they work upstream, and then we also take the stable bits, we pull
them down, and we do production releases where we test it, we do integration tests,
smoke tests, all that kind of stuff, to say, yes, we certify that this product is
production ready, it contains all the features that our customers want, but it's also a stable
piece of software, and we're confident that you can run it in an enterprise, in a production
environment.
Who would be your customers?
I mean, obviously big data, but I mean, like, what would be, are you talking about
like websites with lots of users, like with lots of logged-in users, or is it also in
like internal intranets and big enterprises, or both?
Really it's kind of a lot of different verticals.
So one of the big verticals among our customers is the financial companies.
So credit card fraud, for example, is a huge use case with machine learning, and the more
data you have to throw at a machine learning library or machine learning algorithm, the
better.
It's actually better to have more data and a dumber algorithm
than a really smart algorithm and a small amount of data.
We have a lot of use cases in the financial vertical, also e-commerce, social media websites
that just need to store an abundant amount of data.
We have a database that runs across Hadoop called HBase, that essentially allows you
to store billions to trillions of rows and be able to find, like, a needle in the haystack kind
of thing, where you say, this user ID is this, I want to find the user, but we have
hundreds of thousands of users or millions of users and the data that we keep about them
is quite dense.
And so when I need to find this data, you can just give it a key and it's able to just go in
and get it.
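The "give it a key and it goes in and gets it" idea can be sketched with a toy in-memory store that keeps rows sorted by key and finds them by binary search. This is only an illustration of key-based lookup, not the HBase client API or its storage format; the class name `SortedRowStore`, its methods, and the `user#0042` key format are all assumptions made up for the example.

```python
import bisect
from typing import Dict, List, Optional


class SortedRowStore:
    """Toy key-ordered row store (hypothetical; not HBase itself)."""

    def __init__(self) -> None:
        self._keys: List[str] = []               # row keys, kept in sorted order
        self._rows: List[Dict[str, str]] = []    # column data, parallel to _keys

    def put(self, row_key: str, columns: Dict[str, str]) -> None:
        """Insert a row, or merge columns into an existing row with the same key."""
        i = bisect.bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            self._rows[i].update(columns)
        else:
            self._keys.insert(i, row_key)
            self._rows.insert(i, dict(columns))

    def get(self, row_key: str) -> Optional[Dict[str, str]]:
        """Look up one row by key with a binary search instead of a full scan."""
        i = bisect.bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            return self._rows[i]
        return None


if __name__ == "__main__":
    store = SortedRowStore()
    store.put("user#0042", {"name": "ada", "country": "uk"})
    store.put("user#0007", {"name": "lin"})
    print(store.get("user#0042"))
    print(store.get("user#9999"))
```

Because the keys are ordered, a lookup touches only a handful of entries no matter how many rows exist, which is the property that makes the needle-in-a-haystack query cheap.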
So a lot of companies, maybe not our customers, but there are companies, like for example Pinterest
is one of the largest HBase users, Facebook is a huge HBase user.
That's what it's sounding like.
Yes.
It's like you were describing Facebook.
Yes, exactly.
Yeah.
So like something like Facebook, let's say, where they've got like the uploads of pictures
and they're scanning for facial recognition, is that something that Hadoop, or something
like, I mean, I don't know if Facebook actually uses Hadoop, but if they did, like, does
that fit into that sort of ecosystem of, we have all this data being thrown at us, how
do we process it, scale it down to thumbnails, and then post it to the person's profile
all in one go?
Right.
Yeah, you can actually do it.
So one nice thing about Hadoop is that you can kind of just shove any file, binary format
or not, into the file system and then later decide what to do with it.
There's a company that was one of our clients, the name escapes me right now for some reason,
but they did imaging of all kinds of, like, the world.
So every hour or whatever, they're taking pictures around the world and they're
like, was it the NSA? Does that really mean, yeah, maybe. So yeah, and they would take
these images and upload them to Hadoop, and they had custom MapReduce code that would
scan through these images and glean information from those images.
So they could actually, for example, say that some sort of ship had moved through
the Panama Canal, and know that, like, at this time and this time, just based on the images.
Just based on the images.
That almost seems like an inefficient way to do that actually, like, isn't there some
other trigger?
Exactly, yeah, you would think, but you know, like, there's all kinds of information
you could get, like maybe there's an army moving through an area, or maybe
a giant migration of animals going this way.
One of our customers is Monsanto, which may, you know, it's, you know, mixed feelings
for a lot of people, but they're able to use Hadoop to do analysis of
knowing how far apart to plant seeds to yield maximum crop growth,
for example, using big data, using weather patterns.
So they're able to correlate all these different dispersed data sets, correlate them together
and then make educated decisions on, like, what to do.
Wow, that's cool.
I used to do that with the Farmers' Almanac too, but I guess it's simpler.
It's just right like 90% of that, right?
Yeah, well, that's really cool.
So you're, I mean, you're an open source project and a business, correct?
Yeah.
Yeah.
So we have an interesting business model.
We have a hybrid business model.
We are very against vendor lock-in.
We have been for a long time.
It's just one of our principles that we stand on.
We don't want you to feel like you have to pay us in order to keep running our software.
So everything that touches your data, everything that analyzes your data, things that you
use on a day-to-day basis to run your business,
if you decide to use Cloudera, it's 100% open source.
It's not like we're going to just, you know, like some companies, say, you didn't pay
the bill,
so we unplug your database, you're done.
But in order to keep other vendors or competitors from being able to
just take everything, we have some stuff that we keep proprietary that doesn't touch your
data, for example.
Okay.
For example, Cloudera Manager was one of our biggest products for a
very long time, where it makes it very easy to install Hadoop and to actually manage Hadoop.
That's really one of the hardest things to do, to manage a Hadoop cluster, especially
when you're dealing with tens to twenties to hundreds of nodes.
Yeah.
It's very difficult to do.
Back when I first started, Cloudera Manager was in its infancy and we were still learning
how to build these clusters by hand.
Every XML value you had to handwrite, and ship them out, and write little scripts and use Puppet
or something.
And it was a nightmare.
So that was one piece of software that we did.
But we continue to add other software to our portfolio, such as Navigator, which
tracks data that's coming in, it tracks the lineage of it.
So that way, when an executive gets a report, you can actually track the lineage of how
that data came in.
And you can, you know, for example, say concretely that yes, this data is not false, or this
data is not corrupt, or wasn't joined against bad data or something.
So people, I mean, I don't think this is sort of the thing that people could just, like,
try out. Like, I mean, this is for an acquisitions person or something, right?
I mean, it's not.
Sort of.
Yeah.
I would say, even though Hadoop has come a very long way,
it's still kind of difficult to just step into the pond.
That is kind of difficult, but there are a lot of training resources online, way more
than there were three years ago.
So you can totally do that.
I will say that if you are in the tech industry, it would definitely behoove you to start
looking into Hadoop, learning a little bit about it, because it's going to start becoming,
it's becoming more and more, like, I've been here for a long time.
More and more companies are talking about it.
More and more companies want it.
More and more companies actually not just want it, but need it now.
Google hit this problem a long time ago, and now companies are kind of realizing that to
become a real company these days, you need to become information driven.
|
And your data set is not getting smaller.
|
|
Exactly.
|
|
Your data set's only getting larger.
|
|
And now with this internet of things buzzword going around, everybody's collecting as much
|
|
data as possible.
|
|
Yeah.
|
|
And they're realizing, oh, I can't just like shove that into Oracle database, you know what
|
|
it's like.
|
|
I would say, going our website, we have a quick start VM where you can just download and
|
|
it's a virtual box machine.
|
|
You can just get started.
|
|
There's also like Docker images online that you can probably get going with too.
|
|
Cool.
Okay.
Yeah.
What is the site?
It's cloudera.com.
There's a quick start bundle on there in the resources section.
Cool.
Thanks for giving me an overview.
It's actually really interesting and informational.
Thank you.
I appreciate it.
All right.
Talk to you later.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out
how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club
and is part of the Binary Revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on
the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-
ShareAlike 3.0 license.