Episode: 2370
Title: HPR2370: Who is HortonWorks?
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2370/hpr2370.mp3
Transcribed: 2025-10-19 01:48:10

---

This is HPR episode 2,370 entitled "Who is HortonWorks?". It is hosted by JWP, is about 19 minutes long, and carries a clean flag. The summary is: who Hortonworks is and what they do with Hadoop.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair. It's AnHonestHost.com.

Good day everyone, my name is JWP, and I'm continuing my story about Hadoop. I'm now getting into what Hortonworks is, and I'll cover the quick facts and what they do in this podcast.

Okay, so Hortonworks, Incorporated, ticker symbol HDP, is a leading innovator in the industry of creating, distributing, and supporting enterprise-ready open data platforms and modern applications. Their mission is to manage the world's data. They have a single-minded focus on driving innovation in open-source communities such as Apache Hadoop, NiFi, and Spark, and they, along with other partners, provide expertise, training, and services that allow their customers to unlock transformational value for their organizations across any line of business. They have connected data platforms that power modern data applications and deliver actionable intelligence from all data, data in motion and data at rest, and they are powering the future of data.

Okay, so they were founded in 2011, when 24 engineers from the original Hadoop team at Yahoo spun out to form Hortonworks. I wonder what Yahoo would have become if they had kept those guys. They're in Santa Clara, California, and their business model is open-source software, subscriptions, training, and consulting services. Their billings were 81 million, and their GAAP revenue was 52 million. They provide 24/7 global web and telephone support. They have 2,100-plus joint engineering, strategic, reseller, technology, and system integrator partners; we're one of those partners. Currently they have 1,075 employees in 17 countries, and that's pretty much it for what they do.

Okay. Their two main categories of business are data center and cloud. Inside the data center are HDF and HDP: HDF is Hortonworks DataFlow, and HDP is Hortonworks Data Platform. So one is data in motion and the other is data at rest, and in the middle you have the actionable intelligence that they use. It's drawn as a Venn-style diagram, two circles with a smaller circle in the middle. They call it the Hortonworks Connected Data Platforms cloud solution, and it delivers end-to-end capabilities for the cloud, delivering fast time to value and integrated control. By leveraging public cloud, organizations have the capacity to leverage hosted compute and storage capacities to augment a data strategy, and through the integration of their data center solutions, organizations are able to create the right architecture to empower them.

They rely on outside developers, a community of developers, and they rely on them heavily; they're the biggest committers to the Hadoop project in Apache. Their HDP program is the industry's only truly secure, enterprise-ready, open-source Apache Hadoop distribution based on the centralized YARN architecture.
HDP addresses the complete needs of data at rest, empowers real-time customer applications, and delivers robust analytics that accelerate decision making and innovation. And of course, they want you to start a subscription right away, but they say they're open: Hortonworks is committed to a 100% open approach to software development that spurs innovation. HDP enables enterprises to deploy, integrate, and work with unprecedented volumes of structured and unstructured data, and it delivers enterprise-grade software that fosters innovation and prevents vendor lock-in.

Okay, they're also centralized. HDP is based on a centralized architecture supported by YARN, which allocates resources among various applications. YARN maximizes the value of ingested data by enabling enterprises to analyze that data to support diverse use cases, and YARN coordinates cluster-wide services for operations, data governance, and security.

And it's interoperable. HDP interoperates with a broad ecosystem of data center and cloud providers, and it minimizes the expense and effort required to connect the customer's IT infrastructure with HDP's data processing capabilities. With HDP, customers can preserve their investment in existing IT architecture as they adopt Hadoop.

And lastly, they're enterprise-ready. HDP provides centralized management and monitoring of clusters. This is a really big deal, because if you have six or seven racks of really thin, high-density computers with storage attached, it can be a real bear to manage that whole thing, see exactly what's going on, and keep everything running. With HDP, security and governance are built into the platform, and HDP ensures that enterprise security is consistently administered across all data access engines. And again, that's really important in today's enterprise environment, the security aspect of everything.

Okay, so the cornerstones of the Hortonworks Data Platform are YARN and the Hadoop Distributed File System, or HDFS, which we covered before. Of the components of the Hortonworks Data Platform, HDFS provides the scalable, fault-tolerant, cost-efficient storage for your big data, while YARN provides the centralized architecture that enables you to process multiple workloads simultaneously. YARN provides the resource management and pluggable infrastructure for enabling a wide variety of data access methods, and those two things are the cornerstone of everything they do.

Okay, so again, to review what HDFS, the file system, does: it's a distributed Java-based file system for storing large volumes of data. HDFS and YARN form the data management layer of Apache Hadoop. And this is where a lot of people make money, different companies, consultants, and everything: it's right there with HDFS and YARN, and Hortonworks is the leading producer of this. YARN is the architectural center of Hadoop, the resource management framework that enables enterprises to process data in multiple ways simultaneously, for batch, interactive, and real-time workloads on one shared data set. YARN provides the resource management, and HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data. So HDFS, again, is a Java-based file system, and it scales really well. It's super, super reliable, as we talked about before, and it was designed to span large clusters of commodity servers, which means the cost of this is fairly low.
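To make that concrete, here is a minimal sketch of talking to HDFS through its standard Java API. The NameNode address and the file path are hypothetical, and you'd need the hadoop-client libraries on the classpath; the point is just that HDFS looks like an ordinary file system to Java code, while it replicates blocks across DataNodes behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address; normally this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/hpr-demo.txt");

            // Write a small file; HDFS replicates its blocks across DataNodes for us.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from HPR".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same FileSystem abstraction.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }

The same code works whether the cluster is three machines or three thousand, which is the point of the abstraction.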
HDFS has demonstrated production scalability of up to 200 petabytes of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks. And when that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms. So if you have a question like: I've got, I don't know, 500 million Facebook users in North America, and I want to know how many like mustard on their hot dog? I can go through there and figure that out with this. And so that's what's really interesting about this.

And again, it's scalable, fault-tolerant, distributed storage that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will just work under a variety of physical and systemic circumstances by distributing storage and computation across many servers, and the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

We talked about it yesterday a little bit, and here are the key features. It has rack awareness, so data is not going to leave the rack unless you tell it to. It's got a minimal amount of data motion, which is so critical, especially if you're using anything less than 10G networking. Utilities dynamically diagnose the health of the file system and rebalance the data across different nodes. It has a rollback function that allows operators to bring back the previous version of HDFS after an upgrade, in case of human or systematic errors. The standby NameNode provides redundancy and supports high availability. And on operability, HDFS requires minimal operator intervention, allowing a single operator to maintain clusters of thousands of nodes. So you've got one guy: you're paying one guy to manage this entire stack of stuff.

So let's talk about YARN. YARN is the part that I didn't know about before studying today. According to Hortonworks, it's the architectural center of enterprise Hadoop, and it's part of the Hadoop project. YARN is the thing that allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform, unlocking an entirely new approach to analytics. So YARN is the foundation of a new generation of Hadoop and is enabling organizations everywhere to realize the modern data architecture.

Okay, and how does it really do that? What does YARN do? YARN is the prerequisite for enterprise Hadoop, according to Hortonworks, providing the resource management and central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. YARN also extends the power of Hadoop to incumbent and new technologies found within a data center, so that they can take advantage of cost-effective, linearly scaling storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop.
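As a small illustration of YARN as that central platform, here is a sketch that lists the applications running on a cluster through the public YarnClient Java API. It assumes a running ResourceManager and a yarn-site.xml on the classpath; every engine on the cluster shows up here as a YARN application with its own type and state.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnClient client = YarnClient.createYarnClient();
            client.init(new YarnConfiguration());
            client.start();

            // Every engine running on the cluster (MapReduce, Tez, Spark, and so on)
            // appears as a YARN application with its own resource usage and state.
            for (ApplicationReport app : client.getApplications()) {
                System.out.printf("%s  %s  %s%n",
                        app.getApplicationId(),
                        app.getApplicationType(),
                        app.getYarnApplicationState());
            }
            client.stop();
        }
    }

Whether a job came from Hive, Spark, or a custom ISV engine, it's all the same resource-managed application as far as YARN is concerned.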
Okay, and picture it sort of like two layers: HDFS, the Hadoop Distributed File System, is down at the bottom. Then you draw a line, and above it you have YARN, your data operating system slash cluster resource management software. And coming up out of YARN, it's like piano keys. You have Script, which they would call Pig, that sits on top of a thing called Tez. Then you have SQL, which is the Apache project Hive. Then you have Java/Scala, which is Cascading, also sitting on that Tez piece. Then there's a NoSQL piece, which is HBase or Accumulo, two separate products, and the piece they sit on is called Slider. Then you have Stream, whose Apache project name is Storm, and it's also part of that Slider piece. And if you need to do in-memory, so you need some super, super fast stuff, you use Apache Spark. And if you need to search all that, you use Solr. And most importantly, if you have a special need and people who can program it, you can plug your own ISV engines into that and make it go. Then on the very top is a thin green band that says batch, interactive, and real-time data access. So that's really how it works: YARN is sort of the in-between that holds these piano keys coming up out of the system, to enable Hadoop to work in a modern enterprise architecture.

So let's talk about this architectural center a little bit, and what YARN enables a Hadoop cluster to do. First, multi-tenancy. YARN allows multiple access engines, either open source or proprietary. Now this is really important, because big companies like SAP have proprietary things that hook into Hadoop, SAP's Vora application for instance. And this allows those applications to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data set. So multi-tenant data processing improves an enterprise's return on its Hadoop investments.

Then YARN specializes in a thing called cluster utilization. YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules. We talked about MapReduce and how that was the original thing; YARN is the next step after MapReduce, which was used in earlier versions of Hadoop.

Then YARN helps with scalability. Data center processing power continues to rapidly expand, and YARN's ResourceManager focuses exclusively on scheduling, keeping pace as clusters expand to thousands of nodes managing petabytes of data.

And lastly, YARN is really, really good with compatibility. Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
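For reference, this is roughly what such a MapReduce application looks like. The sketch below is essentially the canonical WordCount from the Hadoop documentation; a job written this way for Hadoop 1 submits to a YARN cluster unchanged.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    ctx.write(word, ONE); // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // total count per word
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You'd package that into a jar and submit it with something like "hadoop jar wordcount.jar WordCount /input/path /output/path", and YARN handles scheduling the map and reduce tasks across the cluster.

And so I think the way it works is that you get a subscription from Hortonworks, and they'll help you with your Hadoop and your YARN and get it all worked out. There are several other companies that do it; Cloudera is doing it. But I wanted to look at Hortonworks' take on it, because they seem to be the largest contributor to the project. It's sort of like seeing who commits the most to Linux: you go look at which companies commit the most, and I think Red Hat did the most commits. So it's always interesting to see how Red Hat is doing and what their direction is in the enterprise space.

All right, well, this pretty much concludes the talk today. I hope you all had a fine day and I didn't bore you too much, but I really wanted to understand Hadoop a little better, because I wanted to understand SAP Vora a lot. Okay, you all have a nice day, and I'll talk to you next time. Thank you.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday Monday through Friday.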
Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.