Episode: 1786
Title: HPR1786: What is MapReduce?
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1786/hpr1786.mp3
Transcribed: 2025-10-18 09:17:59
---
This is HPR episode 1,786, entitled What is MapReduce?
It is hosted by Charles in NJ and is about 36 minutes long.
The summary is: Charles in NJ returns to the outdoor studio to explain a Big Data concept.
This episode of HPR is brought to you by AnHonestHost.com.
Get a 15% discount on all shared hosting with the offer code HPR15. That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello and welcome to Hacker Public Radio.
This is Charles in New Jersey, coming to you from our studios, our outdoor studios, in
Hamilton, New Jersey.
You may hear the sound of wind coming across the mic.
You'll hear leaves rustling, birds chirping, occasional dog barking, and probably a train
or two going by before we're done.
But it's 73 degrees sunny and beautiful out, and HPR needs shows, so here we go.
The topic for today is: what is MapReduce?
In the business world, I hear a lot about big data and projects like Hadoop and other
approaches to playing around with processing very large data sets, distributed or single
source.
And a lot of them refer to a subclass of jobs called Map Reduce.
Occasionally, I hear people actually equating Hadoop or other systems with Map Reduce, and
that's just wrong.
So I'm going to back up a step and try to lay down a little groundwork by explaining
just what this thing called Map Reduce really is in essence.
And I'm going to do it at a level that's fairly abstract, because you don't really even
need a computer to get the concept, and I'll get to that in just a moment.
So Map Reduce, it's inspired by three approaches from functional programming for applying a function
to each item in a collection of data, or actually a collection of collections if you want.
And those approaches are: Map, clearly, Reduce, and we should probably
include Filter.
And Filter just means that I may not want to apply that function to every single item,
but only those that interest me.
So I don't have to do all the work of doing some kind of expensive calculation on, let's
say, the four out of five items in a list of millions of items that don't interest me.
That would be a lot of work and a lot of money.
And who has time for any of that?
These are just words right now, and they're fairly abstract concepts.
I'm going to try to bring some of these ideas down to Earth.
For the most part, I'm going to use a data structure called a list.
And that list could be items written down on a piece of paper, one item per line using a pencil or pen.
That's the level of generality we're talking about here, or it could be a sequence of items or widgets on a computer.
But a computer is not strictly necessary for what I'm going to talk about.
It's a lot more convenient, but it's not necessary.
OK, now I'm going to use list, but this can really apply equally well to any kind of data source,
whether it's multiple streams from the internet or a number of internal data stores from a bunch of different sites in your company or organization.
I suppose you could even apply the same concept to user keystrokes and mouse moves.
Anything that can be processed either sequentially or in parallel qualifies: if you've got a bunch of things you have to do,
or items that you want to do them to, this set of concepts is for you.
OK, now let's define some terms.
I've thrown around the words map, filter, and reduce.
Let's look at map.
Now, when I talk about using an expression like map, applying some function f to a list of data, let's say 1, 2, 3, 4, 5, and 6, the first six counting numbers,
what I'm saying is that I want to apply that function, which I've given the very descriptive name f, to each element in that data.
In this case, we have a list of numbers, but the data could be names, employee records, URLs for internet documents that you want to parse or extract data from.
I'm going to start fairly simply and call f the operation of multiplying x by x.
I'm going to call that function something like square because x times x is x squared.
And the list that I'm going to apply the function to is the sequence of numbers from 1 to 6.
So if I have my expression map, with the function square of x, over the collection 1, 2, 3, 4, 5, and 6, then the result will be another list: the square of 1, the square of 2, the square of 3, and so on up to the square of 6.
And if you want to do this for yourself, here's a pause for you to do it yourself.
The values would be the list, 1, 4, 9, 16, 25, and 36.
Pretty simple.
Map takes two arguments or two parameters.
One is a function, the other is a bunch of data, and I'm asking you, the computer, or you, the person, to apply that function to each item in that collection.
We'll see more of this, so don't worry about getting lost.
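A minimal Python sketch of that map-with-square example (hypothetical, not something from the original show notes) might look like this:

    def square(x):
        # f(x) = x * x, the "very descriptive" function f from above
        return x * x

    data = [1, 2, 3, 4, 5, 6]
    squares = list(map(square, data))   # apply square to each item
    print(squares)                      # [1, 4, 9, 16, 25, 36]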
Now, the next concept we have is filter.
Filtering the data is, I guess you could think of it as a variation on map.
Here it's probably easiest to think of it in two stages.
You've got a filtering function or a test function that you apply to each item that maps it to either true or false.
You could think of it as being in or out.
The second step is to use the results of that map operation to drop any item that fails the test.
So anything that's out is out.
Anything that's in is in.
Now, I've said it as if you could take your list, your original list, apply a filter to drop out the items you don't want and then put that into a map.
But in the implementation, you could just do the filtering and the map at the same time: if an item passes the filter test, apply the map function.
If it doesn't, then just drop it and move on.
That could be a more efficient way to implement it.
But I'm not actually going to get into implementation very much, because the whole point here is to define this as: I want to do this operation on this set of items, and I want to filter out items in that original list that don't interest me.
Let's say I only wanted the square of the even numbers.
I could drop the odd ones.
But I haven't really told you how I want it done.
I haven't put those kinds of constraints on you.
And that lets you declare what you want to do; in other words, you can say what you want done without specifying how you want it done, without imposing a particular implementation.
And depending on how you are actually doing the operations, you could save yourself a lot of work.
In some cases, it might be easier to apply the filter up front and then just do the map on what's left or you could combine the two.
As long as that operation gets done, I don't care.
I just don't care.
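As a hypothetical Python sketch, assuming a test function called is_even, the two-stage version and the combined one-pass version could look like this:

    def square(n):
        return n * n

    def is_even(n):
        # the test function: maps each item to True (keep) or False (drop)
        return n % 2 == 0

    data = [1, 2, 3, 4, 5, 6]

    # Two-stage version: filter first, then map over whatever survives.
    even_squares = list(map(square, filter(is_even, data)))

    # One-pass version: apply the test and the transformation in one sweep.
    even_squares_again = [square(n) for n in data if is_even(n)]

    print(even_squares)        # [4, 16, 36]
    print(even_squares_again)  # [4, 16, 36]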
This is the essential concept behind what I'll call and what is called declarative programming or declarative anything.
I'm declaring what I want.
How you do it, I don't care.
Not going to micromanage you.
So you can pick the most efficient way that you can find to do it as long as everybody that I want processed gets processed.
And nobody that I don't want gets into the mix.
And that's pretty much it for map and filter.
I'm just telling you that I want to do something to a certain group of elements that satisfy this test.
I don't want the others.
Have fun.
Come back to me when you're done.
Now, in functional programming tutorials, I guess you would usually see them illustrating a filter by selecting the prime numbers from a list of integers,
or selecting only those numbers in a list that are not multiples of three.
That's fine if you are processing lists of numbers.
I think it might be more familiar if you think of a search engine, where your filter would be applied to documents in a repository, or let's say to the open internet.
Your search engine has crawled some number of billions of sites on the internet, and you want to return just those that pass the relevance test:
those that include the keywords you asked for, where some sort of relevance index is computed from how close those words are together and how important the search engine thinks some of the keywords are. So you could return some documents that don't satisfy the entire search expression
but come so close that maybe you won't want to eliminate things that miss on one or two of your words.
It could be a fuzzy criterion. Whatever it is, suppose the filter is: if the relevance score is greater than 50%, keep this thing.
Then I can look at what I get. If I want it to be tighter, I'll increase the relevance index that I'm looking for; if I want it looser, I'll use a lower minimum relevance score for my filter.
But the end result is I'm going to get some documents that I accept and then I have a lot that I'm going to skip.
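A toy Python sketch of that kind of fuzzy, threshold-based filter, with made-up documents and relevance scores, might look like this:

    # Hypothetical search results as (document, relevance score) pairs.
    results = [
        ("doc-a", 0.92),
        ("doc-b", 0.48),
        ("doc-c", 0.71),
        ("doc-d", 0.12),
    ]

    threshold = 0.5   # raise it for tighter results, lower it for looser ones

    def relevant(item):
        _, score = item
        return score > threshold

    kept = list(filter(relevant, results))
    print(kept)   # [('doc-a', 0.92), ('doc-c', 0.71)]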
Now in a filter operation that applies to a very very large number of data items it might be smart to drop those items as early as possible.
And there's no law, unless you make one, that requires you to make those decisions in advance when you want a map or filter operation.
It is probably best, from the standpoint of the user, to just let them tell you what they want; then you can find the most efficient way to implement it, or the only way you know how, which might not be efficient, but you'll get the job done.
Maybe.
So in a MapReduce context, sometimes, well, it's certainly in the name, map and filter will be lumped together as people are describing what's going on.
And that's fine, I guess, because you really don't want to waste processing time performing a potentially expensive transformation on data or documents
that you were going to rule out anyway, especially if the filter function is a lot cheaper to run. Then you can just immediately say: no, not that one, not that one, not that one, but these four, yeah, I want those.
So, let's see, I guess that's all I want to say at first about map and filter. Let's go on to reduce.
That's a little bit harder to think of at first, except that a reduce function, or reduce operation I should say, on a collection of data
is an approach that applies an aggregate operation or function whose job it is to boil down all the detail items into one, or maybe a few, summary metrics that are computed on the filtered data.
I guess the canonical examples here of a reduce function would be a sum, or maybe a count of things that match a filter.
But there are other possibilities. Formally, I guess reduce would be defined as an operation on, let's see how many things it would be, I think it would be about three things.
The first would be, again, a function, one that takes an accumulator and a data item and returns a new accumulator value. In other words, the sum of three items is built up: the first item, then the sum of the first item plus the second item,
and then the sum of all three, which is the sum of the first two plus the third. So it would be that kind of thing, somewhat recursive. It doesn't have to be defined that way, but it's probably easiest to think of it that way if you're in a functional context; you could define the thing in an iterative way if you like.
The second thing that you want for reduce, after the accumulating aggregate function, is an initial value: it might be zero if you're doing a sum, one if you're doing a product, zero if you're doing a count, that kind of thing. And then the third item is, well, whatever you want to apply it to, but usually that's the data itself.
So if you see the reduce operation defined more recursively, I guess it would look like this, and I've got this in a set of show notes which I have mercifully written up for you.
A reduce operation based on a function F, an initial value, and a data set composed of the first item plus a collection of all the other items is defined as equal to another reduce: the same function F, with a new initial value
equal to F evaluated at the original initial value and the first item, and where the data for that right-hand-side reduce is the list of all the other items.
So you're always reducing the list by one element, and eventually you work through the whole list, which is fine for lists, but you can also reduce something that's spread across servers around the world.
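Sticking with plain lists for a moment, a small Python sketch of that recursive definition (an illustrative stand-in for the show notes, shown next to the built-in functools.reduce) could look like this:

    from functools import reduce

    def my_reduce(f, init, data):
        # reduce of an empty list is just the initial value; otherwise fold
        # the first item into the accumulator and recurse on all the others.
        if not data:
            return init
        first, rest = data[0], data[1:]
        return my_reduce(f, f(init, first), rest)

    def add(accumulator, item):
        return accumulator + item

    print(my_reduce(add, 0, [1, 4, 9, 16, 25, 36]))   # 91
    print(reduce(add, [1, 4, 9, 16, 25, 36], 0))      # 91, the built-in version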
You can reduce using the subtotals from other servers, so they can do the local reducing. Let's say you're summing something up: you could have each regional server compute its own sum and then pass that answer back, so that you never transmit all of the data over the internet, or over an expensive communications channel, to get it to the original
server to do the work. You just have all the work done locally and send the answer to that sub-problem back to be merged in.
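A toy Python sketch of that subtotal idea, with the regional servers faked as plain in-memory lists, might look like this; only the per-region sums ever cross the "network":

    # Pretend each list lives on a different regional server.
    region_data = {
        "east":  [1, 2, 3],
        "west":  [10, 20, 30],
        "south": [100, 200, 300],
    }

    # Each region reduces locally and sends back only its subtotal...
    subtotals = {region: sum(values) for region, values in region_data.items()}

    # ...and the central server merges those small answers, never the raw data.
    grand_total = sum(subtotals.values())
    print(subtotals)     # {'east': 6, 'west': 60, 'south': 600}
    print(grand_total)   # 666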
Systems like Hadoop can handle all of that aggregation and even ordering and so forth if things have to be placed in a certain order.
That's why you want systems like those, but it's not an essential part of MapReduce, or map-filter-reduce, operations; that's an interesting implementation detail.
So when it comes to reduce and this recursive definition, you're going to want to look at the show notes, I think. So do yourself a favor: get the show notes, get them in front of you, and start looking at them.
Go over what I've said, and then say to yourself: well, Charles, you could have done this in a more straightforward way, but I guess you never do things in a straightforward way.
So, okay, we've got map, filter, and reduce. Why do we think this is some kind of technological and conceptual advance?
Well, if you look at this characterization of map and reduce, you'll see that the operations, as I've talked about them, are pretty abstract.
The declarations, the statements of what I want you (typically you the server) to do, just say what needs to be done and leave open the implementation steps that tell how it is to be done.
So it's not an imperative, micromanaging process; it's more of a delegation of a processing job that needs to be done.
Now, when you're doing operations on data items that are fairly independent of each other, where, when you're processing the first item that comes to you, you don't need to know what order it goes in, or what came before, or what's going to come after, you just need to process what's in front of you.
You might remember the end result of what you did before, if you're aggregating, but you don't need to know in detail what the other elements were. Then you can really spread out the work and do parallel computing, because you can have a bunch of different servers all taking on a part of the task.
So you've got a map that's applying a function to a lot of things that are spread out all over the world, like my seashell collection; you might have seen it on beaches.
I keep it in storage throughout the world because I don't have room for it all.
You can just process all of that stuff locally and then send just the summarized results upstream. Or, if you're doing just a map without a reduce operation, you can apply the function and leave the things on the server where you found them, if you don't need them in a centralized location and you don't need to apply a sort.
You can just do the operations right there in the same data center, store the results locally, and there's no latency from having to transmit the answer back to the server that made the request in the first place.
You can see how this could save you time and money and transmission costs and bandwidth and so forth.
It depends on what you need to get done and what you need to keep centralized; if all you need is summary results back at the home office, you can have all of the subtotals done at the branches and then transmit just a much smaller package of answers.
I think I've beaten that horse to death, but if you think about it, you can see that this could save you time if you don't need to transmit the entire problem, and all the information to be processed, back to a central server, have the operations applied there, and then transmit everything back.
That would be a very nasty situation and the costliest imaginable way to do things. Sometimes you can't avoid it, but when you can, you should.
Okay, so you can do things concurrently. That's a good thing.
Now, the basic ground rules for this simplest case, I'll go back over them. There are always exceptions and other constraints that you're going to run into in the real world, but here are the two things that let you do the most in terms of what I'll call data parallelism. The first is that the computations for each data item don't really depend on the values of the other data items, which means it's a no-communication, no-shared-memory
situation where the workers don't need to coordinate.
The second is that the order of the computations doesn't matter. So I could have five numbers that I want to square.
I could give one of those to Ken, two of those to Kevin, and the remaining values to Keith. They could do the squaring, and if I don't need those back, we're done.
And they could do the calculations however they like: one of them could do it on a computer,
another could use a calculator, and the third could use an abacus. As long as those values are found and stored, or the operations are done and the side effects that you wanted are achieved, I don't care how they did it.
I don't even care where the answers are, as long as the operation gets done. That's a very primitive, non-computer description of parallel processing in a MapReduce job. Actually, that was just a map.
If I needed to sum them, I would need each of my three employees to transmit the sum of the results of the operations they were assigned, and then I would add those up back at headquarters.
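In Python, one hypothetical way to sketch that division of labour is with a small process pool; the chunking below, and treating three worker processes as stand-ins for Ken, Kevin, and Keith, is purely illustrative:

    from concurrent.futures import ProcessPoolExecutor

    def sum_of_squares(chunk):
        # each worker squares its own numbers and reports only its subtotal
        return sum(n * n for n in chunk)

    if __name__ == "__main__":
        chunks = [[1], [2, 3], [4, 5]]   # one piece each for Ken, Kevin, and Keith

        with ProcessPoolExecutor(max_workers=3) as pool:
            subtotals = list(pool.map(sum_of_squares, chunks))

        print(subtotals)        # [1, 13, 41]
        print(sum(subtotals))   # 55, added up back at headquarters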
So I guess, under these conditions, your map and reduce operations could be outsourced from some kind of formal MapReduce server installation to a fleet of worker computers.
As long as those workers are powerful enough to take on the pieces of the overall process that they've been assigned, then when they're done they can either store the results and go home, or send the results back to an aggregating server, what I like to call the boss machine (or person), and then they're done.
Now, doing this on a fleet of machines could give you a really big speedup over running on some kind of specialized cluster, and it could also give you a huge cost advantage: you have machines that you can afford running the pieces, as opposed to buying a really expensive, hard-to-maintain computing cluster or mainframe.
So there are speed advantages you can get from pure MapReduce jobs, and there could actually be cost savings, which is very cool.
I will add that sometimes the only way to do it may be to split up the work and do it locally, depending on what your communications and other constraints are.
So, with the right kind of infrastructure, you can relax these constraints and still get some of the same benefits even when you don't have the luxury of being able to think of the data items as independent and unordered.
You can still do some of the processing work off-site and then send the data back either to the boss machine itself or, if it's a really, really big job, to intermediate machines that do the aggregation and ordering of the data, to make sure that the final merge doesn't overwhelm your final server and your communications channels.
Okay, another advantage of the boss-and-workers paradigm for MapReduce and other massive data operations, and this one is maybe less obvious, is fault tolerance.
You can give a computer a task, and most of the time it will probably do it, but sometimes computers can't finish their task.
The thing that comes to mind first is that you can lose network connections, or you can have a power outage at one of your data centers, so that the workers in that center can't respond.
In a boss-worker setup, a worker could send a status report back to the boss machine, or to one of the intermediate supervisors, since even the boss role could be shared in a multi-layer setup. That report contains either a success status flag and the results, or a failed flag, meaning "I wasn't able to finish."
Now, if that boss or supervisor gets a failed message, it could then reassign that piece of the overall computation to another worker, or set of workers, or another data center.
In the case of a network outage, you may not get that status message that says "I failed," and so you would never know, if you were just waiting for that message to come in.
So you could actually set up a timer and automatically fail that worker, or set of workers, with a timeout event.
Then you could just flush the assignment from that worker and reassign any unfinished tasks to any resources that are now open, giving them a new unique ID so that they can do the job that the other workers were not able to do.
Now, what if the worker finishes the job but turns it in late? Well, in a system with timeouts, you have to have a rule that any homework turned in after the timeout event has to be ignored. I don't want to sound like this is the only way to do it; this is just one way to build parallelism and fault tolerance into a process.
There are other approaches, which is the beauty of it: you can choose the implementation approach that makes sense to you.
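Here is one crude, hypothetical Python sketch of the timeout-and-reassign idea; the deliberately flaky worker, the two-second timeout, and the retry limit are all invented for illustration:

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def flaky_worker(chunk):
        # pretend some workers stall (lost network, power outage, ...)
        if random.random() < 0.3:
            time.sleep(10)
        return sum(n * n for n in chunk)

    def run_with_retry(pool, chunk, timeout=2.0, attempts=3):
        for _ in range(attempts):
            future = pool.submit(flaky_worker, chunk)
            try:
                return future.result(timeout=timeout)   # homework turned in on time
            except TimeoutError:
                future.cancel()   # give up on that assignment and hand it out again
        raise RuntimeError("chunk still unfinished after retries")

    chunks = [[1], [2, 3], [4, 5]]
    with ThreadPoolExecutor(max_workers=6) as pool:
        subtotals = [run_with_retry(pool, c) for c in chunks]
    print(sum(subtotals))   # 55, assuming every chunk eventually succeeds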
Now, another advantage of this kind of vague definition that I'm giving to MapReduce,
and I guess I've kind of gone over this already, is the ability to work with distributed data, and in particular to use local processing to the extent that you can.
You could set up a central server or hub that would force remote sites to transmit all the original data to headquarters, then wait for the hub, the central processor or mainframe, to do the processing, and then maybe even have to transfer those finished results back from headquarters to the remote data repository, or even back to the sites where the data came from.
That's a lot of network traffic, and part of it could be lost, so you might have to retransmit, perhaps repeatedly.
It could be corrupted, so you would have to use some kind of error detection and correction routine to make sure that everything came through all right; and worse yet, it could be intercepted by third parties if it goes over the open internet.
So if you can do the processing locally and reduce what you have to transmit, you can cut your costs in a lot of different ways and reduce your risk of disclosure of the data that you are processing.
Now, in a reduce operation, where you're boiling everything down to some small set of summary measures, the local site can obviously do a great deal of the processing work and transmit only the job header and the intermediate results back to its supervisor, or to the boss back at the hub, or the big boss site.
And that would then be integrated according to the aggregation rules to get the final totals over all the workers.
So I guess I'll sum up here. MapReduce is an approach that uses three abstract concepts: map, which takes a function and applies it to every item in a set of data, or objects, or widgets, or things; filter, which defines a
test function that lets you return the items from a subset of the original data, so you don't have to do a map on everything; and reduce, which takes a set of data and returns one or more aggregated summary results from the detailed data.
These are very loosely defined; they might be combined in your implementation code, or in your implementation workflow if it's manual, because it doesn't have to be on a computer. But these are deliberately loosey-goosey definitions, because you want to leave the implementation details out. By making it declarative, you can implement it in the way that works best for your situation.
So it gives you flexibility and it gives you a way to optimize the way the work is done in a way that works the best for you.
You can optimize to save time, even if that means spending more on hardware and communications, or you can design your implementation to save money by using local processing servers that are easier to replace,
using cheaper equipment, and having some fault tolerance built in at a very crude level. Or you could use very sophisticated algorithms, and even hardware solutions, to handle the fault tolerance.
Do what you need to do. That's the idea behind MapReduce and declarative approaches.
Now, the show notes will also talk about Hadoop in very general terms, but I'm not going to go into it here, because it's a particular way of addressing jobs that involve a lot of data that's either stored in one place or not.
How's that for a nutshell, extremely vague, almost declarative definition of Hadoop? But Hadoop and its descendants, and it's in the process of being superseded by other, better approaches now, are just, you know, an approach to doing data jobs when it comes to big data.
Do what you can to get your jobs done with the equipment that you can afford, and then be happy. That's about all I have to say on MapReduce. I hope it's cleared up any confusion; it's probably caused some additional confusion, but I wanted to cut through some of the jargon that people use and sort of reduce this, so to speak,
to the least restrictive definitions that get the idea across. And I think that's all I have to say. I want to thank the herons and the frogs and the grackles and the orioles that have helped me with today's episode. So thanks a lot, and enjoy the good weather if you have it.
Enjoy the end of fall, or perhaps the approach of winter if you don't have nice weather, and we'll see you next time on Hacker Public Radio. Good night.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.