Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr1937.txt
@@ -0,0 +1,211 @@
Episode: 1937
Title: HPR1937: Klaatu talks to Cloudera about Hadoop and Big Data
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1937/hpr1937.mp3
Transcribed: 2025-10-18 11:28:37

---
This is HPR Episode 1937 entitled "Klaatu talks to Cloudera about Hadoop and Big Data", and is part of the series In The News. It is hosted by Klaatu and is about 11 minutes long. The summary is: Klaatu talks to Cloudera about Hadoop and Big Data.

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15. Better web hosting that is honest and fair, at AnHonestHost.com.
Hi everyone, this is Klaatu, and I'm at All Things Open 2015. I'm talking to Ricky from Cloudera. So Ricky, what is Cloudera?
We're an open source company that is trying to take Apache Hadoop to the enterprise level, to allow enterprises to extract information out of multi-terabyte data sets, to petabyte data sets.
When you say Hadoop, first of all, I know the name, and I know that it's got something to do with big data, but is it a file system, is it a framework, what is Hadoop?
That's a great question. That answer really kind of evolves over time. Back in the day, when I first started at Cloudera, like three and a half years ago, Hadoop was really just a file system and a distributed processing framework called MapReduce. HDFS is the file system for Hadoop, and it's a distributed file system which allows you to take many nodes; typically, back in the day, we'd see commodity nodes, just couple-of-thousand-dollar servers, and you'd get like four or five, maybe a hundred to a thousand of them, and you put them together. Then you can store a large volume of data on those systems, and it kind of brings it all together in one file system namespace.
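
To make that namespace idea concrete, here is a minimal sketch, not from the interview, of writing a local file into HDFS with Hadoop's Java FileSystem API. The NameNode address and both paths are hypothetical.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster's single logical namespace
        // (hypothetical NameNode address).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy a local file in; HDFS splits it into blocks and decides
        // which DataNodes physically store the replicas.
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/data/events.log"));

        // The file is now addressable by one logical path, no matter
        // how many machines hold its blocks.
        System.out.println(fs.getFileStatus(new Path("/data/events.log")).getLen());
        fs.close();
    }
}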
So you mentioned the file system; the other part was the distributed processing. Years ago, that processing framework was called MapReduce. You can essentially write code in this very certain pattern, and it's just a pattern that was kind of labeled by Google, called MapReduce. If you program in this paradigm, you're able to write some pretty basic code, but you're able to submit it to the cluster and analyze data in a distributed fashion, so you don't have to worry about nodes failing, you don't have to worry about where the data is and all that kind of stuff, and you can scale out your distributed processing.
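
As a sketch of the pattern being described, here is the classic word count in Hadoop's Java MapReduce API; the input and output paths are hypothetical. The map function runs on the nodes that hold the data, the framework shuffles the intermediate (word, 1) pairs, and the reduce function sums them, with node failures handled by the framework.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in this node's slice of the input.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups every count for one word together; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}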
Cloudera is obviously, I mean, the standard for production-ready Apache Hadoop, but I see on the documents, the white papers there, that it lists like half a dozen Apache projects. How does that all sort of fit together?
Right, great question. So we try to take a lot of these projects out there that kind of fall under the umbrella of the Hadoop ecosystem. So you have, I mean, you can name it, it's like 15, 20 of them, right? What we do is actually similar to kind of how Red Hat works: if you're going to run a production system, you don't go to kernel.org and download a kernel and compile it. You go download Red Hat, or you download Ubuntu or a Debian system. We're very similar in that way. We employ a large majority of the committers and contributors to these open source projects that we ship. We provide patches, they work upstream, and then we also take the stable bits, pull them down, and do production releases where we test them: integration tests, smoke tests, all that kind of stuff, to say, yes, we certify that this product is production-ready, it contains all the features that our customers want, but it's also a stable piece of software, and we're confident that you can run it in an enterprise, in a production environment.
Who would be your customers? Obviously big data, but I mean, what would they be? Like, are you talking about websites with lots of users, like with lots of logged-in users, or is it also, like, internal intranets in big enterprises, or both?
Really, it's kind of a lot of different verticals. So one of the big verticals in our customer base is the financial companies. Credit card fraud, for example, is a huge use case with machine learning, and the more data you have to feed a machine learning library or machine learning algorithm, the better. It's actually better to have more data and a dumber algorithm than a really smart algorithm and a small amount of data. We have a lot of use cases in the financial vertical, also e-commerce, and social media websites that just need to store an abundant amount of data. We have a database that runs across Hadoop called HBase, which essentially allows you to do billions to trillions of rows and be able to find, like, a needle in the haystack, where you say, this user ID is this, I want to find that user. We might have hundreds of thousands of users or millions of users, and the data that we keep about them is quite dense. So when I need to find this data, you can just give it a key, and it's able to just go in and get it.
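
As a sketch of that kind of lookup, here is a single-row fetch with the HBase Java client; the table name, column family, and row key are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserLookup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table users = conn.getTable(TableName.valueOf("users"))) {
            // Point lookup by row key: HBase routes the request to the one
            // region holding this key, so no table scan happens even across
            // billions of rows.
            Result row = users.get(new Get(Bytes.toBytes("user:1234567")));
            byte[] email = row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
            System.out.println(email == null ? "not found" : Bytes.toString(email));
        }
    }
}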
So a lot of companies, maybe not our customers, but there are companies... for example, Pinterest is one of the largest HBase users, and Facebook is a huge HBase user.

That's what it's sounding like.

Yes.

It's like you were describing Facebook.

Yes, exactly. Yeah.
So, something like Facebook, let's say, where they've got, like, the uploads of pictures and they're scanning for facial recognition... is that something that Hadoop... I mean, I don't know if Facebook actually uses Hadoop, but if they did, does that fit into that sort of ecosystem of: we have all this data being thrown at us, how do we process it, scale it down to thumbnails, and then post it to the person's profile, all in one go?
Right. Yeah, you can actually do that. So one thing that's nice about Hadoop is that you can kind of just shove any file, binary format or not, into the file system and then later decide what to do with it. There was a company that was one of our clients, the name escapes me right now for some reason, but they did imaging of, like, all kinds of, like, the world. So every hour or whatever, they're taking pictures around the world, and they're like...

Was it the NSA? Does that really mean... yeah, maybe.

So yeah, they would take these images and upload them to Hadoop, and they had custom MapReduce code that would scan through these images and glean information using those images. So they could actually, for example, say that some sort of ship had moved through the Panama Canal, and know that, like, at this time and this time, just based on the images.
Just based on the images. That almost seems like an inefficient way to do that, actually. Like, isn't there some other trigger?
Exactly, yeah, you would think, but, you know, there's all kinds of information you could glean. Or, like, maybe there's an army moving through an area, or maybe there's a giant migration of animals going this way. One of our customers is Monsanto, which, you know, raises mixed feelings for a lot of people, but they're able to use Hadoop to do analysis of how far apart to plant seeds to yield maximum crop growth, for example, using big data, using weather patterns. So they're able to correlate all these different dispersed data sets, correlate them together, and then make educated decisions on, like, what to do.
Wow, that's cool. I used to do that with the Farmers' Almanac too, but I guess this is simpler. That's just right, like, 90% of the time, right? Yeah, well, that's really cool.
So you're, I mean, you're an open source project and a business, correct?

Yeah, yeah.
So we have an interesting business model, a hybrid business model. We are very against vendor lock-in; we have been for a long time. It's just one of our principles that we stand on. We don't want you to feel like you have to pay us in order to keep running our software. So everything that touches your data, everything that you use to analyze your data, the things that you use on a day-to-day basis to run your business: if you decide to use Cloudera, it's 100% open source. It's not like we're going to just, you know, like some companies, say you didn't pay the bill, so we unplug your database and you're done. But in order to keep other vendors or competitors from being able to just take everything, we have some stuff that we keep proprietary, stuff that doesn't touch your data, for example.
Okay.

For example, we have Cloudera Manager, which was one of our biggest products for a very long time; it makes it very easy to install Hadoop and to actually manage Hadoop. That's really one of the hardest things to do, managing a Hadoop cluster, especially when you're dealing with tens to twenties to hundreds of nodes.

Yeah.

It's very difficult to do. Back when I first started, Cloudera Manager was in its infancy, and we were still learning how to build these clusters by hand. Every XML value had to be handwritten, and you'd ship them out and write little scripts and use Puppet or something. And it was a nightmare.
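
For a flavor of what was being written by hand, Hadoop is configured through per-node XML files such as hdfs-site.xml, with one property element per setting; the values below are illustrative, not from the interview.

<?xml version="1.0"?>
<!-- hdfs-site.xml: one of several XML config files that used to be
     hand-edited and copied to every node. Values are illustrative. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- how many DataNodes hold a copy of each block -->
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/lib/hadoop/namenode</value>
  </property>
</configuration>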
So that was one piece of software that we did, but we continue to add other software to our portfolio, such as Navigator, which tracks data that's coming in; it tracks the lineage of it. That way, when an executive gets a report, you can actually trace the lineage of how that data came in, and you can, for example, say concretely that, yes, this data is not false, or this data is not corrupt, or wasn't joined against bad data or something.
So, I mean, I don't think this is the sort of thing that people could just try out... like, I mean, this is for an acquisitions person or something, right? I mean, it's not...

Sort of. Yeah.
I would say, even though Hadoop has come a very long way, it's still kind of difficult to just step into the pond. That is kind of difficult, but there are a lot of training resources online, way more than there were three years ago, so you can totally do that. I will say that if you are in the tech industry, it would definitely behoove you to start looking into Hadoop and learning a little bit about it, because it's becoming more and more... like, I've been here for a long time. More and more companies are talking about it. More and more companies want it. More and more companies actually not just want it, but need it now. Google hit this problem a long time ago, and now companies are realizing that to be a real company these days, you need to become information-driven.
And your data set is not getting smaller.

Exactly. Your data set's only getting larger. And now, with this Internet of Things buzzword going around, everybody's collecting as much data as possible.

Yeah.

And they're realizing, oh, I can't just, like, shove that into an Oracle database, you know what it's like.
I would say, go to our website; we have a QuickStart VM that you can just download, and it's a VirtualBox machine. You can just get started. There are also, like, Docker images online that you can probably get going with too.
Cool. Okay.

Yeah. What is the site?

It's cloudera.com. The QuickStart bundle is on there in the resources section.

Cool. Thanks for giving me an overview. It's actually really interesting and informational.

Thank you. I appreciate it.

All right. Talk to you later.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.