Episode: 2370
Title: HPR2370: Who is HortonWorks?
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2370/hpr2370.mp3
Transcribed: 2025-10-19 01:48:10

---

This is HPR episode 2,370 entitled "Who is HortonWorks?". It is hosted by JWP, is about 19 minutes long, and carries a clean flag. The summary is: who Hortonworks is and what they do with Hadoop.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair. It's AnHonestHost.com.

Good day everyone, my name is JWP, and I'm continuing my story about Hadoop. I'm now getting into what Hortonworks is, and I'll cover the quick facts and what they do in this podcast.

Okay, so Hortonworks, Incorporated, ticker symbol HDP, is a leading innovator in the industry of creating, distributing, and supporting enterprise-ready open data platforms and modern applications. Their mission is to manage the world's data. They have a single-minded focus on driving innovation in open-source communities such as Apache Hadoop, NiFi, and Spark, and they, along with other partners, provide expertise, training, and services that allow their customers to unlock transformational value for their organizations across any line of business. They have connected data platforms that power modern data applications and deliver actionable intelligence from all data, data in motion and data at rest, and they are powering the future of data.

Okay, so they were founded in 2011, when 24 engineers from the original Hadoop team at Yahoo spun out to form Hortonworks. I wonder what Yahoo would have become if they had kept those guys. They're in Santa Clara, California, and their business model is open-source software, subscriptions, training, and consulting services. Their billings were 81 million, and their GAAP revenue was 52 million. They provide 24/7 global web and telephone support. They have 2,100-plus joint engineering, strategic, reseller, technology, and system integrator partners; we're one of those partners. Currently they have 1,075 employees in 17 countries, and that's pretty much it for what they do.

Okay. Their two main categories of business are data center and cloud. Inside the data center are HDF and HDP: HDF is Hortonworks DataFlow, and HDP is Hortonworks Data Platform. So one is data in motion and the other is data at rest, and in the middle you have the actionable intelligence that they use. It's drawn as a Venn-style diagram, two circles with a smaller circle in the middle. They call it the Hortonworks Connected Data Platforms cloud solution, and it delivers end-to-end capabilities for the cloud, delivering fast time to value and integrated control. By leveraging public cloud, organizations have the capacity to leverage hosted compute and storage capacities to augment a data strategy, and through the integration of their data center solutions, organizations are able to create the right architecture to empower them.

They rely on outside developers, a community of developers, and they rely on them heavily; they're the biggest committers to the Hadoop project in Apache. Their HDP program is the industry's only truly secure, enterprise-ready, open-source Apache Hadoop distribution based on the centralized YARN architecture.
HDP addresses the complete needs of data at rest, empowers real-time customer applications, and delivers robust analytics that accelerate decision making and innovation. And of course, they want you to start a subscription right away, but they say they're open: Hortonworks is committed to a 100% open approach to software development that spurs innovation. HDP enables enterprises to deploy, integrate, and work with unprecedented volumes of structured and unstructured data, and it delivers enterprise-grade software that fosters innovation and prevents vendor lock-in.

Okay, they're also centralized. HDP is based on a centralized architecture supported by YARN, which allocates resources among various applications. YARN maximizes the value of ingested data by enabling enterprises to analyze that data to support diverse use cases, and YARN coordinates cluster-wide services for operations, data governance, and security.

And it's interoperable. HDP interoperates with a broad ecosystem of data center and cloud providers, and it minimizes the expense and effort required to connect the customer's IT infrastructure with HDP's data processing capabilities. With HDP, customers can preserve their investment in existing IT architecture as they adopt Hadoop.

And lastly, they're enterprise-ready. HDP provides centralized management and monitoring of clusters. This is a really big deal, because if you have six or seven racks of really thin, high-density computers with storage attached, it can be a real bear to manage that whole thing, see exactly what's going on, and keep everything running. With HDP, security and governance are built into the platform, and HDP ensures that enterprise security is consistently administered across all data access engines. And again, that's really important in today's enterprise environment, the security aspect of everything.

Okay, so the cornerstones of the Hortonworks Data Platform are YARN and the Hadoop Distributed File System, or HDFS, which we covered before. Of the components of the Hortonworks Data Platform, HDFS provides the scalable, fault-tolerant, cost-efficient storage for your big data, while YARN provides the centralized architecture that enables you to process multiple workloads simultaneously. YARN provides the resource management and pluggable infrastructure for enabling a wide variety of data access methods, and those two things are the cornerstone of everything they do.

Okay, so again, to review what HDFS, the file system, does: it's a distributed Java-based file system for storing large volumes of data. HDFS and YARN form the data management layer of Apache Hadoop. And this is where a lot of people make money, different companies, consultants, and everything: it's right there with HDFS and YARN, and Hortonworks is the leading producer of this. YARN is the architectural center of Hadoop, the resource management framework that enables enterprises to process data in multiple ways simultaneously, for batch, interactive, and real-time workloads on one shared data set. YARN provides the resource management, and HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data. So HDFS, again, is a Java-based file system, and it scales really well. It's super, super reliable, as we talked about before, and it was designed to span large clusters of commodity servers, which means the cost of this is fairly low.
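To make that concrete, here is a minimal sketch of talking to HDFS through its standard Java API. The NameNode address and the file path are hypothetical, and you'd need the hadoop-client libraries on the classpath; the point is just that HDFS looks like an ordinary file system to Java code, while it replicates blocks across DataNodes behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address; normally this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/hpr-demo.txt");

            // Write a small file; HDFS replicates its blocks across DataNodes for us.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from HPR".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same FileSystem abstraction.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }

The same code works whether the cluster is three machines or three thousand, which is the point of the abstraction.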
HDFS has demonstrated production scalability of up to 200 petabytes of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks. And when that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms. So if you have a question like: I've got, I don't know, 500 million Facebook users in North America, and I want to know how many like mustard on their hot dog? I can go through there and figure that out with this. And so that's what's really interesting about this.

And again, it's scalable, fault-tolerant, distributed storage that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will just work under a variety of physical and systemic circumstances by distributing storage and computation across many servers, and the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

We talked about it yesterday a little bit, and here are the key features. It has rack awareness, so data is not going to leave the rack unless you tell it to. It's got a minimal amount of data motion, which is so critical, especially if you're using anything less than 10G networking. Utilities dynamically diagnose the health of the file system and rebalance the data across different nodes. It has a rollback function that allows operators to bring back the previous version of HDFS after an upgrade, in case of human or systematic errors. The standby NameNode provides redundancy and supports high availability. And on operability, HDFS requires minimal operator intervention, allowing a single operator to maintain clusters of thousands of nodes. So you've got one guy: you're paying one guy to manage this entire stack of stuff.

So let's talk about YARN. YARN is the part that I didn't know about before studying today. According to Hortonworks, it's the architectural center of enterprise Hadoop, and it's part of the Hadoop project. YARN is the thing that allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform, unlocking an entirely new approach to analytics. So YARN is the foundation of a new generation of Hadoop and is enabling organizations everywhere to realize the modern data architecture.

Okay, and how does it really do that? What does YARN do? YARN is the prerequisite for enterprise Hadoop, according to Hortonworks, providing the resource management and central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. YARN also extends the power of Hadoop to incumbent and new technologies found within a data center, so that they can take advantage of cost-effective, linearly scaling storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop.
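As a small illustration of YARN as that central platform, here is a sketch that lists the applications running on a cluster through the public YarnClient Java API. It assumes a running ResourceManager and a yarn-site.xml on the classpath; every engine on the cluster shows up here as a YARN application with its own type and state.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnClient client = YarnClient.createYarnClient();
            client.init(new YarnConfiguration());
            client.start();

            // Every engine running on the cluster (MapReduce, Tez, Spark, and so on)
            // appears as a YARN application with its own resource usage and state.
            for (ApplicationReport app : client.getApplications()) {
                System.out.printf("%s  %s  %s%n",
                        app.getApplicationId(),
                        app.getApplicationType(),
                        app.getYarnApplicationState());
            }
            client.stop();
        }
    }

Whether a job came from Hive, Spark, or a custom ISV engine, it's all the same resource-managed application as far as YARN is concerned.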
Okay, and picture it sort of like two layers: HDFS, the Hadoop Distributed File System, is down at the bottom. Then you draw a line, and above it you have YARN, your data operating system slash cluster resource management software. And coming up out of YARN, it's like piano keys. You have Script, which they would call Pig, that sits on top of a thing called Tez. Then you have SQL, which is the Apache project Hive. Then you have Java/Scala, which is Cascading, also sitting on that Tez piece. Then there's a NoSQL piece, which is HBase or Accumulo, two separate products, and the piece they sit on is called Slider. Then you have Stream, whose Apache project name is Storm, and it's also part of that Slider piece. And if you need to do in-memory, so you need some super, super fast stuff, you use Apache Spark. And if you need to search all that, you use Solr. And most importantly, if you have a special need and people who can program it, you can plug your own ISV engines into that and make it go. Then on the very top is a thin green band that says batch, interactive, and real-time data access. So that's really how it works: YARN is sort of the in-between that holds these piano keys coming up out of the system, to enable Hadoop to work in a modern enterprise architecture.

So let's talk about this architectural center a little bit, and what YARN enables a Hadoop cluster to do. First, multi-tenancy. YARN allows multiple access engines, either open source or proprietary. Now this is really important, because big companies like SAP have proprietary things that hook into Hadoop, SAP's Vora application for instance. And this allows those applications to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data set. So multi-tenant data processing improves an enterprise's return on its Hadoop investments.

Then YARN specializes in a thing called cluster utilization. YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules. We talked about MapReduce and how that was the original thing; YARN is the next step after MapReduce, which was used in earlier versions of Hadoop.

Then YARN helps with scalability. Data center processing power continues to rapidly expand, and YARN's ResourceManager focuses exclusively on scheduling, keeping pace as clusters expand to thousands of nodes managing petabytes of data.

And lastly, YARN is really, really good with compatibility. Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work.
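For reference, this is roughly what such a MapReduce application looks like. The sketch below is essentially the canonical WordCount from the Hadoop documentation; a job written this way for Hadoop 1 submits to a YARN cluster unchanged.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    ctx.write(word, ONE); // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // total count per word
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You'd package that into a jar and submit it with something like "hadoop jar wordcount.jar WordCount /input/path /output/path", and YARN handles scheduling the map and reduce tasks across the cluster.

And so I think the way it works is that you get a subscription from Hortonworks, and they'll help you with your Hadoop and your YARN and get it all worked out. There are several other companies that do it; Cloudera is doing it. But I wanted to look at Hortonworks' take on it, because they seem to be the largest contributor to the project. It's sort of like seeing who commits the most to Linux: you go look at which companies commit the most, and I think Red Hat did the most commits. So it's always interesting to see how Red Hat is doing and what their direction is in the enterprise space.

All right, well, this pretty much concludes the talk today. I hope you all had a fine day and I didn't bore you too much, but I really wanted to understand Hadoop a little better, because I wanted to understand SAP Vora a lot. Okay, you all have a nice day, and I'll talk to you next time. Thank you.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday Monday through Friday.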
Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.