I’m learning Hadoop using the book Hadoop in Practice
, and while reading chapter 1 i came across this diagram:
From the Hadoop docs:(http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html)
1.Shuffle
Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
2.Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
While i understand that shuffle
and sorting
happens at the same time, it’s not clear to me how the framework decides which reducer
receives which mapper
output. From the docs, it seems that each reducer
has a way to know which map
output to collect, but i can’t understand how.
So my question is, given the mappers output above, the final result is always the same for each reducer? If so, what are the steps to achieve this result?
Thanks for any clarifications!
Advertisement
Answer
It is the Partitioner that decides how to distribute the output of mappers to different reducers.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.