Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr2347.txt (new file)
@@ -0,0 +1,246 @@
Episode: 2347
Title: HPR2347: An Intro to Apache Hadoop
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2347/hpr2347.mp3
Transcribed: 2025-10-19 01:33:31

---
This is HPR episode 2,347 entitled "An Intro to Apache Hadoop."
It is hosted by JWP, is about 37 minutes long, and carries an explicit flag.
The summary is: just a pretty boring summary of what Hadoop is and how it works.
This episode of HPR is brought to you by AnHonestHost.com.
Get a 15% discount on all shared hosting with the offer code HPR15. That's HPR15.
Better web hosting that's honest and fair, at AnHonestHost.com.
Good day, my name is JWP, and today I want to talk to you about Apache Hadoop.
Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.
It is part of the Apache project and is sponsored by the Apache Software Foundation.
As a sidebar, the Apache Foundation is a host for a lot of things now.
They don't do just the web server anymore.
It's a lot of things.
They host a lot of projects.
OpenOffice, for instance, is one of their projects, in addition to the web server.
Hadoop is just one of the many things that Apache takes care of.
As I said before, it's an open source, Java-based framework, although other languages go along with it.
It's used for big data.
A lot of people don't understand what big data is.
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.
Challenges include capture, storage, analytics, information privacy, and things like this.
It's just a really, really big data set that you would have.
To work on these data sets, Hadoop uses the MapReduce programming model.
It consists of computer clusters built from commodity hardware.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
So what is this MapReduce thing?
It's a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
It's composed of a map procedure that performs filtering and sorting, and a reduce procedure that performs a summary operation after the map is done.
MapReduce is a really, really large topic.
MapReduce originally referred to a proprietary Google technology, but that has been let go, and as of 2014 Google is no longer using MapReduce as their primary big data processing model.
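To make the map and reduce idea concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (Mapper, Reducer, Text, IntWritable). The word-count job itself is just an illustrative example, not something from this episode, and in a real project each class would live in its own file.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: emit (word, 1) for every word in a line of input.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }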
So that brings us to: what is a clustered file system?
A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers.
So it's just not like your Pi down in the basement with your blog.
When I sell a Hadoop system, it literally has hundreds of nodes, usually.
There are several approaches to clustering. Most of them do not employ a clustered file system, only direct-attached storage for each node.
Clustered file systems can provide features like location-independent addressing and redundancy, which improve availability or reduce the complexity of other parts of the cluster.
Parallel file systems are a type of clustered file system that spread data across multiple nodes, usually for redundancy or performance.
So the core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System, or HDFS, and a processing part, which is the MapReduce programming model.
Hadoop splits files into large blocks and distributes them across the nodes in a cluster, and then transfers packaged code to the nodes to process the data in parallel.
This approach takes advantage of data locality, where nodes manipulate the data they have access to.
This allows the data set to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.
The base Apache Hadoop framework is composed of the following modules.
Hadoop Common, which contains the libraries and utilities needed by the other Hadoop modules.
Hadoop Distributed File System, HDFS, a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN, which is a resource management platform responsible for managing computing resources in the cluster and using them for the scheduling of users' applications.
And Hadoop MapReduce, an implementation of the MapReduce programming model for large-scale data processing.
The term Hadoop has come to refer not just to the base modules, but also to the ecosystem, or a collection of additional software packages that can be installed on top of or alongside Hadoop,
such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
Apache's MapReduce and HDFS modules were inspired by Google's papers on their MapReduce and the Google File System.
As I said before, the Hadoop framework itself is written in the Java programming language, with some native code in C, and the command line utilities written as shell scripts.
Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user's program.
Other projects in the Hadoop ecosystem expose richer user interfaces.
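Since MapReduce Java code is the common case, here is a minimal driver sketch that would tie the hypothetical WordCountMapper and WordCountReducer from the earlier example into a job and submit it; the input and output paths are whatever you pass on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver that configures and submits the word-count job.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would typically package this into a JAR and run it with the hadoop jar command.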
So how did Hadoop come about?
Well, the fundamental idea, or the genesis, of Hadoop came from the Google File System paper.
That paper was published in 2003, and it spawned another research paper from Google called "MapReduce: Simplified Data Processing on Large Clusters."
Development started in the Apache Nutch project, but was moved to the new Hadoop sub-project in January 2006 by a guy named Doug Cutting, who was working at Yahoo at the time.
It's named after his son's toy elephant, and the initial code was factored out of Nutch, that's N-U-T-C-H, and consisted of about 5,000 lines of code for HDFS and 6,000 lines of code for MapReduce.
The first committer added to the Hadoop project was a guy named Owen O'Malley in 2006, and Hadoop 0.1.0 was released in April 2006; it continues to evolve through the work of many contributors to the Hadoop project.
So it started in 2003 and became a thing in 2006,
and the first one did 1.8 terabytes on 188 nodes, and it took 47.9 hours. Trust me, they're much, much bigger now.
I've seen whole racks of these things, and it's really, really good.
The first milestone was in 2008, when Yahoo ran a 10,000-core Hadoop cluster, and that's really small by today's standards.
Hadoop does these summits every year, and they seem to grow: in 2013 they had 2,700 attendees, and in 2014 they had 3,200. The wiki doesn't say how big it is now,
but my company sent me to an open source Hadoop high performance computing event, and they talked about this for days, really.
Alright, so let's talk a little bit about the architecture.
Hadoop consists of the Hadoop Common package,
which provides the file system and OS-level abstractions,
a MapReduce engine, either MapReduce/MR1 or YARN/MR2,
and the Hadoop Distributed File System, so HDFS.
The Hadoop Common package contains the Java Archive (JAR) files and the scripts needed to start Hadoop.
Okay, and then, for effective scheduling of work,
every Hadoop-compatible file system should provide location awareness:
the name of the rack, more precisely of the network switch, where a worker node is. Hadoop applications can use this information to execute code on the node
where the data is, and, failing that, on the same switch or rack, to reduce backbone traffic.
HDFS uses this method when replicating data for redundancy across multiple racks.
This approach reduces the impact of a rack power outage or switch failure.
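As a rough illustration of what that location awareness looks like from the client side, the HDFS Java API can report which hosts hold each block of a file; this is the same information the scheduler uses to place work near the data. The path in this sketch is made up for the example.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Ask the name node which hosts hold each block of a file.
    public class ShowBlockLocations {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.txt");  // hypothetical path, just for illustration
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("block at offset " + block.getOffset()
                        + " lives on " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }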
If one of these failures occurs, the data will remain available.
Now, that's something that, you know, me as an SAP person or a mission-critical person always likes:
that the data is always available.
So if a node breaks, if a PDU breaks on the back of the rack, it's always there with Hadoop.
Okay, a small Hadoop cluster includes a single master and multiple worker nodes.
The master node consists of a job tracker, task tracker, name node, and data node.
A slave or worker node acts as both a data node and a task tracker,
though it's possible to have data-only worker nodes and compute-only worker nodes.
These are normally used only in non-standard applications.
Hadoop requires a Java Runtime Environment, JRE 1.6 or higher,
and the standard startup and shutdown scripts require Secure Shell, SSH, to be set up between the nodes in the cluster.
In a larger cluster, HDFS nodes are managed through a dedicated name node server that hosts the file system index,
and a secondary name node server that generates snapshots of the name node's memory structures,
thereby preventing file system corruption and loss of data.
Similarly, a standalone job tracker server can manage job scheduling across nodes.
When Hadoop MapReduce is used with an alternative file system, the name node, secondary name node, and data node architecture of HDFS is replaced with the equivalents for that file system.
And while that's a lot of words, if you look at the wiki,
the MapReduce layer is drawn with a dotted line; it has a task tracker, a job tracker, and another task tracker,
and underneath, the HDFS layer has the name node, a data node, and two more data nodes.
So this is a multi-node cluster.
So on the MapReduce layer, the master has the task tracker and job tracker, and the slave has a task tracker.
And on the HDFS layer, the master has the name node and a data node, and the slaves have the following data nodes.
Okay, so let's talk a little bit more about the Hadoop Distributed File System, or HDFS.
Okay, it's a distributed, scalable, and portable file system written in Java for the Hadoop framework.
Some consider HDFS to instead be a data store, due to its lack of POSIX compliance and its inability to be mounted.
But it does provide shell commands and Java API methods that are similar to other file systems.
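To show what those Java API methods look like, here is a minimal round-trip sketch that writes a small file into HDFS and reads it back through the standard FileSystem API; the path is hypothetical and the cluster address is whatever fs.defaultFS points at in your core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Write a small file into HDFS and read it back through the Java API.
    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();              // picks up fs.defaultFS from core-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/hpr-demo.txt");             // hypothetical path for the example

            try (FSDataOutputStream out = fs.create(path, true)) { // overwrite if it already exists
                out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }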
So, you know, I think it really is a file system, even though it's not POSIX.
You know, we have to live in a new world.
Okay, I took a break. I had to get some water.
I don't normally talk on these podcasts for this long.
So, again, a Hadoop cluster nominally has a single name node plus a cluster of data nodes,
and although redundancy options are available for the name node due to its critical nature, each data node serves blocks of data over the network using a block protocol specific to HDFS.
The file system uses TCP/IP sockets for communication, and clients use remote procedure calls, RPC, to communicate with each other.
HDFS stores large files, typically in the range of gigabytes to terabytes, across multiple machines.
Again, it achieves reliability by replicating the data across multiple hosts, and hence in theory does not require RAID storage on the hosts.
But to increase performance, some RAID configurations might still be useful.
With the default replication value of three, data is stored on three nodes: two on the same rack and one on a different rack.
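If you want an individual file replicated more or less aggressively than the cluster default, the same Java API lets you request that per file. The factor of three below just mirrors the default described above, and the path is again hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Override the replication factor for one file (the cluster-wide default is dfs.replication, normally 3).
    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/tmp/hpr-demo.txt");               // hypothetical path
            boolean accepted = fs.setReplication(path, (short) 3);   // ask for three copies of each block
            System.out.println("replication request accepted: " + accepted);
            fs.close();
        }
    }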
The data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
HDFS is not fully POSIX compliant, because the requirements for a POSIX file system differ from the target goals of a Hadoop application.
The tradeoff of not having a fully POSIX compliant file system is increased performance for data throughput and support for non-POSIX operations such as append.
So, because it's open source, things change, and in May 2012 HDFS added high availability capabilities, letting the main metadata server, the name node, fail over manually to a backup.
They also started, in 2012, a project dedicated to automatic failover.
So in 2012 it was a manual process, a manual process.
The HDFS file system includes a so-called secondary name node, a misleading name that some people incorrectly interpret as a backup name node for when the primary name node goes offline.
In fact, the secondary name node regularly connects with the primary name node and builds snapshots of the primary name node's directory information, which the system then saves to local or remote directories.
These checkpointed images can be used to restart a failed primary name node without having to replay the entire journal of file system actions, and then the edit log is applied to create an up-to-date directory structure.
Because the name node is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files.
HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate name nodes.
Moreover, there are some issues in HDFS, namely the small file issue, the scalability problem, the single point of failure and bottleneck, and huge metadata requests.
An advantage of using HDFS is the data awareness between the job tracker and the task trackers.
The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location.
For example, if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker schedules node B to perform map or reduce tasks on (a, b, c), and node A would be scheduled to perform map or reduce tasks on (x, y, z).
This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer.
When Hadoop is used with other file systems, this advantage is not always available.
This can have a significant impact on job completion times, which has been demonstrated when running data-intensive jobs.
Okay, and this is the reason that HDFS isn't a good fit for something like an Oracle database or an MS SQL database or MariaDB: HDFS was designed for mostly immutable files.
It may not be suitable for systems requiring concurrent write operations.
Still, HDFS can be mounted directly with Filesystem in Userspace, or what we would call a FUSE virtual file system, on Linux and some other Unix systems.
File access can also be achieved through the native Java application programming interface,
the Thrift API, which generates a client in the language of the user's choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, or OCaml),
the command line interface, or the HDFS-UI web application over HTTP, or via third-party client libraries.
Okay, so HDFS is designed for portability across various hardware platforms and for compatibility with a variety of underlying operating systems.
The HDFS design introduces portability limitations that result in some performance bottlenecks, since the Java implementation can't use features that are exclusive to the platform on which HDFS is running.
Due to its widespread integration into enterprise-level infrastructure, monitoring HDFS performance at scale has become an increasingly important issue.
Monitoring end-to-end performance requires tracking metrics from data nodes, name nodes, and the underlying operating system.
There are currently several monitoring platforms to track HDFS performance, including Hortonworks, Cloudera, and Datadog.
Now, all of these companies were at this conference that I went to, and each of them is pretty interesting, and I'll probably do a separate talk on them, probably shorter than the 20 minutes I'm talking now.
And it's really boring, but what I found was that your data never dies.
You can, for instance, do this project on six Raspberry Pi things, and you have data that never dies. It just goes on forever.
It's easy to replicate across a lot of geographies, and while latency is an issue, it can deal with latency problems. It checks itself.
So, as we said before, Hadoop works directly with other distributed file systems, pretty much any system that you can think of.
To be precise, it can work with any distributed file system that can be mounted by the underlying operating system, simply by using a file:// URL.
However, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data, and that is information that the Hadoop-specific file system bridges can provide.
And so if you do try to use another distributed file system, you lose that locality. So if you have a far-ranging Hadoop cluster, it will leave the rack. It'll go down the street. It'll do whatever it wants to do.
Which sort of defeats the purpose. In May 2011, the list of supported file systems bundled with Hadoop included HDFS, Hadoop's own rack-aware file system. So what was key to me is that it's rack aware. It stays inside the rack.
You don't have to worry about network interfaces, how thick your cable is, how far this is. It knows to stay in the rack.
And this was designed to scale to petabytes of storage and runs on top of the file systems of the underlying operating systems.
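As a concrete illustration of that file:// escape hatch, the same FileSystem API can be pointed at the local file system through a file:// URI instead of HDFS; convenient, but as just described, Hadoop loses the rack and locality information when it isn't talking to its own HDFS bridge. The directory below is arbitrary.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Point the generic FileSystem API at the local file system via a file:// URI.
    public class LocalFsListing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);
            for (FileStatus status : local.listStatus(new Path("/tmp"))) {  // any local directory
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            local.close();
        }
    }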
Okay, and of course there's an FTP file system; this stores all of its data on remotely accessible FTP servers. It also works with Amazon S3.
That targets clusters hosted on the Amazon Elastic Compute Cloud, server-on-demand infrastructure. Okay, now there's no rack awareness with this, so that's key: it's all remote, and so it's not rack aware there.
Okay, it also works on Windows Azure Storage Blobs, or WASB. There are a number of third-party file system bridges that have also been written, none of which are currently in Hadoop distributions. However, some commercial distributions of Hadoop ship with an alternative file system as the default, specifically IBM and MapR.
And so IBM, as everyone knows, has the proprietary GPFS, the IBM General Parallel File System, and ParaScale published code to run Hadoop with its file system in 2010.
Another one is Appistry. In 2010, HP also discussed their IBRIX Fusion file system driver. And in May 2011, MapR Technologies announced their MapR FS, which is relatively popular in the ecosystem.
Okay, so we've talked about the bottom half of what Hadoop is: the file system and the different options. Now let's talk about what goes above, and that's the MapReduce engine.
Okay, and as I said, it comes above the file system. It consists of a job tracker, to which client applications submit MapReduce jobs. The job tracker pushes the work out to available task tracker nodes in the cluster, striving to keep the work as close as possible to the data.
Okay, so like, if you have, for instance, an HP DL380 or an HP Apollo 4000 system, it tries to keep the work in that node, so it doesn't have to leave the node.
Okay, the job tracker knows which nodes contain the data and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack.
This reduces network traffic on the main network backbone, and this is key; this is the main point of it. If a task tracker fails or times out, that part of the job is rescheduled.
The task tracker on each node spawns a separate Java virtual machine process to prevent the task tracker itself from failing if a running job crashes its JVM.
A heartbeat is sent from the task tracker to the job tracker every few minutes to check its status. The job tracker and task tracker status and information is exposed by Jetty and can be viewed from a web browser.
And it's very simple. I did it, and I didn't have any hands-on experience before.
There are some known limitations to this approach. The first is that the allocation of work to task trackers is very simple. Every task tracker has a number of available slots, such as four slots, and every active map or reduce task takes up one slot.
The job tracker allocates the work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
If one task tracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting on the slowest task.
With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
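Speculative execution can be toggled per job from the Java side. Here is a minimal sketch, assuming the newer mapreduce.* property names used by Hadoop 2.x; older MR1 clusters spelled these properties differently.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Turn speculative execution on for map tasks (and off for reduce tasks) before submitting a job.
    public class SpeculativeExecutionExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("mapreduce.map.speculative", true);     // re-run straggling map tasks elsewhere
            conf.setBoolean("mapreduce.reduce.speculative", false); // leave reduce tasks alone in this sketch
            Job job = Job.getInstance(conf, "speculative demo");
            // ... set mapper, reducer, input and output paths as in the word-count driver above ...
            System.out.println("configured job: " + job.getJobName());
        }
    }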
Okay, let's talk about scheduling. By default, Hadoop uses FIFO scheduling, and optionally five scheduling priorities, to schedule jobs from a work queue. In version 0.19, the job scheduler was refactored out of the job tracker, while adding the ability to use an alternate scheduler, such as the Fair Scheduler or the Capacity Scheduler,
which I'll talk about later. Okay, as an example of the greatness of open source, there is this thing called the Fair Scheduler, which can be used by Hadoop.
The Fair Scheduler was developed by, of all people, Facebook, and they shared it. So it just goes to show you the greatness of open source.
The goal of the Fair Scheduler is to provide fast response times for small jobs and QoS for production jobs. The Fair Scheduler has three basic concepts:
Jobs are grouped into pools. Each pool is assigned a guaranteed minimum share, and excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
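For anyone who wants to experiment, switching the classic (pre-YARN) job tracker over to the Fair Scheduler comes down to a couple of configuration properties that normally live in mapred-site.xml on the job tracker; the property and class names below are from the old MR1 fair scheduler documentation and should be treated as assumptions to verify against your Hadoop version, and the allocation file path is made up.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: MR1-era properties that select the Fair Scheduler instead of the default FIFO scheduler.
    // Shown in Java only to illustrate the names; in practice they go into mapred-site.xml.
    public class FairSchedulerConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("mapred.jobtracker.taskScheduler",
                     "org.apache.hadoop.mapred.FairScheduler");       // assumed MR1 scheduler class
            conf.set("mapred.fairscheduler.allocation.file",
                     "/etc/hadoop/conf/fair-scheduler.xml");          // hypothetical pools definition file
            System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
        }
    }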
Okay, the next one is called the Capacity Scheduler, and it was developed by Yahoo. The Capacity Scheduler supports several features that are similar to the Fair Scheduler:
Queues are allocated a fraction of the total resource capacity. Free resources are allocated to queues beyond their total capacity. Within a queue, a job with a high level of priority has access to the queue's resources.
There is no preemption once a job is running. Okay, so let's talk about other applications. Okay, because it's all open source, lots of people have come up with other things for this HDFS file system.
And it's not just for MapReduce jobs. It can be used with other things which are under development at Apache. This includes the HBase database,
the Apache Mahout machine learning system, and the Apache Hive data warehouse system. Now, if you really want to scare some people when talking about big data, and take money away from huge companies like SAP, start talking about the Hive data warehouse that's free from Apache.
Hadoop, in theory, can be used for any sort of work that is batch-oriented rather than real-time, is very data intensive, and benefits from parallel processing of data.
It can also be used to complement real-time systems, such as the lambda architecture, Apache Storm, Flink, and Spark Streaming. As of October 2009, commercial applications of Hadoop included log and/or clickstream analysis of various kinds, marketing analytics, machine learning and/or sophisticated data mining, image processing,
processing of XML messages, web crawling and text processing in general, and archiving of relational or tabular data, for example for compliance.
Okay, so probably the most prominent use case: in 2008, Yahoo launched what it claimed was the largest Hadoop production application. The Yahoo Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produced data that was used in every Yahoo web search query.
There are multiple Hadoop clusters at Yahoo, and no HDFS file system or MapReduce job is split across multiple data centers. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. The work that the clusters perform is known to include the index calculations for the Yahoo search engine.
In June 2009, Yahoo made the source code of the Hadoop version it runs available to the open source community. In 2010, Facebook claimed that it had the largest Hadoop cluster in the world, with 21 petabytes of storage.
In June 2012, they announced the data had grown to 100 petabytes. Later they announced they were growing by roughly half a petabyte a day, and by 2013 Hadoop adoption had become widespread, in use at half of the Fortune 50.
I'm telling you, a lot of people use it now.
For Hadoop hosting in the cloud, you can do it with Azure, with something called Azure HDInsight. HDInsight uses Hortonworks HDP, which will be covered in a separate show.
You can use Amazon EC2 or S3 services. You can use Amazon Elastic MapReduce as well.
Something called CenturyLink Cloud offers Hadoop via managed and unmanaged models via their Hadoop offering. You can use it at Google, too.
So you can use the Google Cloud Platform. There are plenty of third parties which use the Google Cloud Platform.
Hortonworks and MapR are used there, and you can use something called Google Cloud Dataproc if you want.
There's a whole plethora of information on the wiki about different sources and stuff, but this is the way that it's going.
You've got a company like Facebook that's got 100 petabytes or more of data, and they have to figure out a way to store it and such things.
This is the first of an installment about Hadoop and where it's going, and eventually I think I'll move it over to a product called SAP Vora, which I actually can make money with.
I hope that you can also try Hadoop on your Pis or your Zeros.
This is open source, and you can play.
I know several people that listen in the Hacker Public Radio community that have a plethora of Raspberry Pi devices.
So just go ahead and make your own Hadoop cluster with the Pi.
Alright, this is JWP. You all have a great day. Thank you. Bye.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our Contributing link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.
Thank you.