
Map Reduce flow in Hadoop

I’m learning Hadoop using the book Hadoop in Practice, and while reading chapter 1 I came across this diagram:

[MapReduce flow diagram from the book, not reproduced here]

From the Hadoop docs (http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html):

1. Shuffle

Reducer is input the grouped output of a Mapper. In this phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

2. Sort

The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
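
To make the grouping concrete, here is a minimal reducer sketch using the newer org.apache.hadoop.mapreduce API (the SumReducer name and the Text/IntWritable types are just example choices, not something from the book or the docs). By the time reduce() runs, the shuffle and sort have already gathered every value emitted for a given key, no matter which mapper produced it:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: reduce() is invoked once per key, with an Iterable holding
// every value emitted for that key by any mapper (example types assumed).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // all occurrences of this key arrive grouped here
        }
        context.write(key, new IntWritable(sum));
    }
}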

While I understand that shuffle and sorting happen at the same time, it’s not clear to me how the framework decides which reducer receives which mapper output. From the docs, it seems that each reducer has a way to know which map output to collect, but I can’t understand how.

So my question is: given the mapper outputs above, is the final result always the same for each reducer? If so, what are the steps to achieve this result?

Thanks for any clarifications!


Answer

It is the Partitioner that decides how to distribute the output of mappers to different reducers.

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
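
For illustration, here is a minimal sketch of a partitioner that reproduces the default HashPartitioner behaviour (the WordPartitioner class name and the Text/IntWritable types are example choices; the newer org.apache.hadoop.mapreduce API is assumed):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default hash-based partitioning: the key's hashCode is
// masked to a non-negative value and taken modulo the number of reduce
// tasks, so every record with the same key goes to the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Because the partition number depends only on the key’s hash and the number of reduce tasks, the assignment of keys to reducers is deterministic for a fixed number of reducers, which is why the final result for each reducer always comes out the same. If you need different routing, you can plug in your own implementation with job.setPartitionerClass(WordPartitioner.class).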
