MapReduce enables programmers to process huge amounts of data in parallel across distributed processors. It handles details such as parallelization, fault tolerance, data distribution and load
balancing so programmers do not have to worry about them.
MapReduce provides a clear abstraction for programmers. They only have to use two functions, map and reduce:
- Data are fed into the map function as key-value pairs to produce intermediate key/value pairs
- Once the mapping is done, intermediate results from various nodes are reduced to create the final output
The number of mappers is aligned with the number of blocks that the input data takes up. The number of reducers, set by the application developer.
Simple MapReduce Algorithm Example
In our example, we want to review a stack of quarters and count only the quarters that were minted in even years. We could easily just count out this stack by hand.
But what happens when we have to do the same task at scale?
We might want to call on our friends to help process this huge pile. So in this scenario, we give each of our friends part of the pile and have them count them – then we will reduce those subtotals and combine them to find the final tally.
Word Count Process:
The Mapper reads data in the form of key/value pairs (KVPs). It outputs zero or more KVPs. The Mapper may use or completely ignore the input key. For example, a standard pattern is to read a line of a file at a time:
- The key is the byte offset at the start of the line in file
- The value is the contents of the line itself
- Typically the key is irrelevant to this pattern
If the Mapper writes anything out, it must in the form of KVPs. This “intermediate data” is NOT stored in HDFS (local storage only without replication).
After the Map phase is over, all intermediate values of the intermediate key are combined together into a list. This list is given to a single or multiple Reducers.
- All values associated with a particular intermediate key are guaranteed to go to the same Reducer
- The intermediate keys and their value lists are passed in sorted order
The Reducer outputs zero or more KVPs which are written to HDFS. In practice, the Reducer often emits a single KVP for each input key.
MapReduce Word Count Example
To count the number of occurrences of each word in a large amount of data.
The Map Phase:
Each mapper takes a line as input and breaks it into words. It then outputs a key/value pair of the word and 1.
The Reduce Phase:
Each reducer sums the counts for each word and outputs a single key/value of the word and sum.
- MapReduce is the foundational framework for processing data at scale because of its ability to break a large problem into any smaller ones
- Mappers read data in the form of key/value pairs (KVPs) and each call to a Mapper is for a single KVP; it can return 0..m KVPs
- The framework shuffles and sorts the Mappers’ output KVPs with the guarantee that only one Reducer will be asked to process a given Key’s data
- Reducers are given a list of Values for a specific Key; they can return 0..m KVPs