Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use - Search episodes, transcripts, hosts, and series - 4,511 episodes with metadata and transcripts - Data loader with in-memory JSON storage 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr2370.txt
Episode: 2370
Title: HPR2370: Who is HortonWorks?
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2370/hpr2370.mp3
Transcribed: 2025-10-19 01:48:10

---
This is HPR episode 2,370 entitled, Who is HortonWorks? It is hosted by JWP and is about 19 minutes long, and carries a clean flag. The summary is: who HortonWorks is, and what they do with Hadoop.

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair. It's AnHonestHost.com.
Good day everyone, my name is JWP, and I'm continuing my story about Hadoop. I'm now starting to go into what Hortonworks is, and I'll cover the quick facts and what they do in this podcast.
Okay, so Hortonworks, Incorporated (symbol HDP) is a leading innovator in the industry of creating, distributing, and supporting enterprise-ready open data platforms and modern data applications. Their mission is to manage the world's data. They have a single-minded focus on driving innovation in open-source communities such as Apache Hadoop, NiFi, and Spark, and they, along with other partners, provide expertise, training, and services that allow their customers to unlock transformational value for their organizations across any line of business. They have connected data platforms that power modern data applications and deliver actionable intelligence from all data, data in motion and data at rest, and they are powering the future of data.
Okay, and so they were founded in 2011, when 24 engineers from the original Hadoop team at Yahoo spun out to form Hortonworks. I wonder what Yahoo would have become if they had kept those guys. They're in Santa Clara, California, and their business model is open-source software, subscriptions, training, and consulting services. Their billings were 81 million. Their GAAP revenue was 52 million. They provide 24-7 global web and telephone support. They have 2,100-plus joint engineering, strategic, reseller, technology, and system integrator partners. We're one of those partners, and currently they have 1,075 employees in 17 countries, and that's pretty much it for what they do. Okay.
Their two main categories of business are data center and cloud. Inside of the data center are HDF and HDP: HDF is Hortonworks Data Flow, and HDP is Hortonworks Data Platform. So one is data in motion, and the other is data at rest, and in the middle you have the actionable intelligence that they use. It's a Venn diagram, so two circles with a smaller circle in the middle. And they call it the Hortonworks Connected Data Platforms Cloud Solution, and it delivers end-to-end capabilities for the cloud, delivering fast time to value and integrated control. By leveraging the public cloud, organizations have the capacity to leverage hosted compute and storage capacities to augment a data strategy. And through the integration of their data center solutions, organizations are able to create the right architecture to empower them.
They rely on outside developers, a community of developers, and they rely on them heavily. They're the biggest committers to the Hadoop project in Apache. Their HDP program is the industry's only truly secure, enterprise-ready, open-source Apache Hadoop distribution, based on the centralized YARN architecture. HDP addresses the complete needs of data at rest, empowers real-time customer applications, and delivers robust analytics that accelerate decision-making and innovation. And of course, they want you to start a subscription right away, but they say they're open: Hortonworks is committed to a 100% open approach to software development that spurs innovation. HDP enables enterprises to deploy, integrate, and work with unprecedented volumes of structured and unstructured data. HDP delivers enterprise-grade software that fosters innovation and prevents vendor lock-in.
Okay, they're also centralized. So HDP is based on a centralized architecture supported by YARN, which allocates resources among various applications. YARN maximizes the data investment by enabling enterprises to analyze the data to support diverse use cases, and YARN coordinates cluster-wide services for operations, data governance, and security. And it's interoperable. HDP is interoperable with a broad ecosystem of data center and cloud providers. HDP minimizes the expense and effort required to connect the customer's IT infrastructure with HDP's data processing capabilities. With HDP, customers can preserve their investment in existing IT architecture as they adopt Hadoop.
And lastly, they're enterprise-ready. So HDP provides centralized management and monitoring of clusters. And this is a really big deal, because if you have six or seven racks of really thin, high-density computers with storage attached, it can be a real bear to manage that whole thing and see exactly what's going on and keep everything in one place. With HDP, security and governance are built into the platform, and HDP ensures that enterprise security is consistently administered across all data access engines. And again, that's really important in today's enterprise environment, the security aspect of everything.
Okay, so the cornerstones of the Hortonworks Data Platform are YARN and the Hadoop Distributed File System, or HDFS, that we covered before. Among the components of the Hortonworks Data Platform, HDFS provides the scalable, fault-tolerant, cost-efficient storage for your big data, while YARN provides the centralized architecture that enables you to process multiple workloads simultaneously. YARN provides the resource management and pluggable architecture for enabling a wide variety of data access methods, and those two things are the cornerstone of everything they do.
Okay, so again, to review what HDFS, the file system, does: it's a distributed, Java-based file system for storing large volumes of data. HDFS and YARN form the data management layer of Apache Hadoop. And this is where a lot of people make money, different companies, consultants, and everything. It's right there with HDFS and YARN, and Hortonworks is the leading producer of this.
So YARN is the architectural center of Hadoop, the resource management framework that enables enterprises to process data in multiple ways simultaneously, for batch, interactive, and real-time data workloads on one shared data set. YARN provides the resource management, and HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data.
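The division of labor just described, YARN handing out cluster resources while HDFS holds the data, can be sketched as a toy resource manager. This is a hedged illustration with invented names and numbers, not YARN's actual API:

```python
# A toy sketch (invented names and numbers, not YARN's real API) of the idea
# just described: a resource manager owns the cluster's capacity and hands
# out "containers" to competing applications, so batch, interactive, and
# real-time workloads can share one cluster.

class ResourceManager:
    def __init__(self, total_memory_gb: int):
        self.total = total_memory_gb
        self.allocations = {}          # application name -> memory granted

    def request_container(self, app: str, memory_gb: int) -> bool:
        """Grant the request only if free capacity remains, else reject it."""
        used = sum(self.allocations.values())
        if used + memory_gb > self.total:
            return False               # scheduler says: wait for resources
        self.allocations[app] = self.allocations.get(app, 0) + memory_gb
        return True

rm = ResourceManager(total_memory_gb=16)
print(rm.request_container("batch-etl", 8))        # True
print(rm.request_container("interactive-sql", 6))  # True
print(rm.request_container("streaming", 4))        # False: only 2 GB left
```

The point of the sketch is only that one central arbiter decides who runs where, instead of each engine grabbing machines for itself.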
So HDFS, again, is a Java-based file system, and it scales really well, and it's super, super reliable, as we talked about before. It was designed to span large clusters of commodity servers, which means the cost of this is fairly low. And HDFS has demonstrated production scalability of up to 200 petabytes of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks.
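The storage model behind those numbers can be sketched in a few lines. This is a hedged, toy illustration of the idea (a file split into fixed-size blocks, each block replicated across several DataNodes), not HDFS code; the block size and node names are invented:

```python
# A minimal sketch (not HDFS code) of the storage model just described:
# a file is split into fixed-size blocks, and each block is replicated
# across distinct DataNodes so the cluster tolerates failures. The block
# size here is tiny for demonstration; HDFS defaults to 128 MB blocks.

from itertools import cycle

BLOCK_SIZE = 8       # bytes, illustrative only
REPLICATION = 3      # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as the NameNode plans."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin style."""
    placement = {}
    node_cycle = cycle(datanodes)
    for block_id, _ in enumerate(blocks):
        targets = []
        while len(targets) < replication:
            node = next(node_cycle)
            if node not in targets:
                targets.append(node)
        placement[block_id] = targets
    return placement

data = b"a" * 20                       # a 20-byte "file" -> 3 blocks
blocks = split_into_blocks(data)
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(len(blocks), placement[0])       # 3 ['node1', 'node2', 'node3']
```

Real HDFS placement is smarter than round-robin (it is rack-aware, as mentioned later), but the shape of the bookkeeping is the same: the NameNode tracks which nodes hold which blocks.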
And when that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms. So if you have a question, like, I've got, I don't know, 500 million Facebook users in North America, and I want to know how many like mustard on their hot dog, I can go through there and figure that out with this.
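That kind of question boils down to a filter-and-count over user records. As a hedged, plain-Python stand-in for what a Hive query or Spark job would do over data in HDFS (the records and field names here are invented):

```python
# A plain-Python stand-in (hypothetical data, not a real Hadoop job) for
# the kind of query described: out of many user records, how many in North
# America like mustard on their hot dog? At scale this would be a Hive
# query or Spark job over data in HDFS; the filtering logic is the same.

users = [
    {"id": 1, "region": "NA", "likes_mustard": True},
    {"id": 2, "region": "NA", "likes_mustard": False},
    {"id": 3, "region": "EU", "likes_mustard": True},
    {"id": 4, "region": "NA", "likes_mustard": True},
]

# Equivalent of: SELECT COUNT(*) FROM users WHERE region='NA' AND likes_mustard
count = sum(1 for u in users if u["region"] == "NA" and u["likes_mustard"])
print(count)  # 2
```

What Hadoop adds is not the logic but the scale: the same filter runs in parallel on every node that holds a piece of the data.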
And so that's what's really interesting about this. And again, it's scalable, fault-tolerant, distributed storage that works closely with a wide variety of concurrent data access applications, coordinated by YARN, and HDFS will just work under a variety of physical and systemic circumstances by distributing storage and computation across many servers. The combined storage resource can grow linearly with demand, while remaining very economical at every amount of storage. And we talked about it yesterday a little bit, and the key features are: it's rack-aware, so it's not going to leave the rack unless you make it, unless you tell it to. It's got a minimal amount of data motion, which is so critical, especially if you're using anything less than 10G networking. Utilities dynamically diagnose the health of the file system and rebalance the data on different nodes. It has a rollback function that allows operators to bring back the previous version of HDFS after an upgrade, in case of human or systemic errors. And the standby NameNode provides redundancy and supports high availability. And for operability, HDFS requires minimal operator intervention, allowing a single operator to maintain clusters of thousands of nodes. So you've got one guy. You're paying one guy to manage this entire stack of stuff. So let's talk about
YARN. And so YARN is the part that I didn't know about before studying today. According to Hortonworks, it's really the architectural center of enterprise Hadoop, and it's part of the Hadoop project. YARN is the thing that allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform, unlocking an entirely new approach to analytics. So YARN is the foundation of a new generation of Hadoop, and it is enabling organizations everywhere to realize the modern data architecture.
Okay, and how does it really do that? So what does YARN do? YARN is the prerequisite for enterprise Hadoop, according to Hortonworks, providing the resource management and central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. YARN also extends the power of Hadoop to incumbent and new technologies found within a data center, so that they can take advantage of cost-effective, linearly scalable storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop. Okay, so if you picture sort of
a two-layer diagram: HDFS, the Hadoop Distributed File System, is down at the bottom, and then you draw a line, and then you have YARN, your data operating system slash cluster resource management software. And out of YARN, YARN is like piano keys. So you have Script, which they would call Pig, and that sits on top of a thing called Tez. And then you have SQL, which is the Apache project Hive, and then you have Java and Scala, which is Cascading, also on this Tez piece. And then there's a NoSQL piece, HBase or Accumulo, two separate products, and the piece that hosts them on YARN is called Slider. And then you have Stream, and its Apache project name is Storm, and it's also part of the Slider piece. And if you need to do in-memory, if you need some super, super fast stuff, you use Apache Spark, and if you need to search all that, you use Solr. And most importantly, if you have a special need and you have people that can program it, you can plug your own ISV engines into that and make it go. And then on the very top is a thin green band that says batch, interactive, real-time data access. So that's really how it works. YARN is sort of the in-between that holds these piano keys that come up out of the system, to enable Hadoop to work in a modern enterprise architecture. So let's talk about this architectural center a little bit, and what YARN
enables a Hadoop cluster to do. So, multi-tenancy: YARN allows multiple access engines, either open source or proprietary. Now this is really important, because big companies like SAP have proprietary things that hook into Hadoop, SAP's Vora application for instance, and this allows these applications to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data set. So multi-tenant data processing improves an enterprise's return on its Hadoop investment. Then YARN specializes in a thing called cluster utilization: YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules. We talked about MapReduce and how that was the original thing, but YARN is the next thing after MapReduce.
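The MapReduce model mentioned above can be imitated in-process in a few lines: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This sketch only illustrates the model; real MapReduce runs these phases across many machines:

```python
# A minimal in-process imitation of the MapReduce model mentioned above:
# map emits (key, value) pairs, shuffle groups the values by key, and
# reduce aggregates each group. Real MapReduce distributes these phases
# across a cluster, with the data living in HDFS.

from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1                      # emit (word, 1) per occurrence

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:                   # group values by key
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

lines = ["hadoop yarn hadoop", "yarn hdfs"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'yarn': 2, 'hdfs': 1}
```

Under YARN, this rigid map-then-reduce pipeline becomes just one engine among many sharing the cluster, which is the shift the episode is describing.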
And those static rules were used in earlier versions of Hadoop, too. And then YARN helps with scalability. Data center processing power continues to rapidly expand, and YARN's ResourceManager focuses exclusively on scheduling, and keeps pace as clusters expand to thousands of nodes managing petabytes of data. And lastly, YARN is really, really good with compatibility. Existing MapReduce applications developed for Hadoop 1 can run in YARN without any disruption
to existing processes that already work. And so I think the way it works is that you get a subscription from Hortonworks, and they'll help you with your Hadoop and your YARN and get it all worked out. And there are several other companies that do it; Cloudera is doing it. But I wanted to look at Hortonworks' take on it, because they seem to be the largest committers to the project. Sort of like who commits the most to Linux: you go and see which distributions commit the most, and I think Red Hat did the most commits. So it's always interesting to see how Red Hat is doing and what their direction is in the enterprise space. All right. Well, this pretty much concludes the talk today. I hope you all had a fine day and I didn't bore you too much, but I really wanted to understand Hadoop a little better, because I wanted to understand SAP Vora a lot. Okay, and you all have a nice day, and I'll talk to you next time. Thank you.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.