How Does Hadoop Automatically Sort Mapper Output Before Reducers?

What Happens to Mapper Outputs by Default in Hadoop MapReduce?

By default, Hadoop sorts each mapper’s output by key as the in-memory buffer spills to disk, before the shuffle, so every reducer receives its data already grouped and ordered by key.

Question

What is the default behavior of Hadoop when handling mapper outputs?

A. It compresses outputs into SequenceFiles
B. It discards duplicate keys immediately
C. It sorts mapper outputs by key before reducers
D. It assigns outputs to reducers in round-robin order

Answer

C. It sorts mapper outputs by key before reducers

Explanation

By default, Hadoop sorts mapper output by key before it reaches the reducers. Each map task writes its output into an in-memory buffer. When the buffer reaches its spill threshold (80% full by default, controlled by mapreduce.map.sort.spill.percent), the buffered key-value pairs are sorted by partition and then by key, and written to local disk as a spill file. When the map task finishes, its spill files are merged into a single partitioned, sorted output. During the shuffle, each reducer fetches its own partition from every mapper and merges those already-sorted segments, so the reduce function sees keys in order, with all values for a key presented sequentially. The reduce task only merges sorted runs; it never has to sort unordered data from scratch.

The other options do not describe default behavior. Outputs are not compressed into SequenceFiles by default (compression of intermediate output is configurable). Duplicate keys are not discarded; their values are grouped at the reducer. Outputs are assigned to reducers by hash partitioning on the key, not in round-robin order.
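The spill, merge, and shuffle steps described above can be sketched as a small simulation. This is a minimal illustration in plain Python, not Hadoop code: the names `partition`, `map_task`, and `reduce_input` are hypothetical, the record-count `SPILL_THRESHOLD` stands in for the byte-based 80% buffer threshold, and the byte-sum partitioner is a toy replacement for Hadoop’s hash partitioning.

```python
import heapq
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2     # assumed number of reduce tasks for this sketch
SPILL_THRESHOLD = 4  # records per spill; stands in for the 80% buffer threshold


def partition(key, num_reducers=NUM_REDUCERS):
    """Toy deterministic partitioner: the same key always maps to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def map_task(records):
    """Buffer (key, value) records, spill sorted runs, then merge the runs."""
    by_partition_and_key = itemgetter(0, 1)
    spills, buffer = [], []
    for key, value in records:
        buffer.append((partition(key), key, value))
        if len(buffer) >= SPILL_THRESHOLD:
            # Each spill is sorted by partition, then by key (values are not sorted).
            spills.append(sorted(buffer, key=by_partition_and_key))
            buffer = []
    if buffer:
        spills.append(sorted(buffer, key=by_partition_and_key))
    # Merge all sorted spills into one sorted, partitioned map output.
    return list(heapq.merge(*spills, key=by_partition_and_key))


def reduce_input(map_outputs, reducer_id):
    """Fetch one partition from every map output and merge the sorted segments."""
    segments = [
        [(k, v) for p, k, v in output if p == reducer_id]
        for output in map_outputs
    ]
    merged = heapq.merge(*segments, key=itemgetter(0))
    # Group consecutive equal keys; duplicates are kept, not discarded.
    return [(key, [v for _, v in group])
            for key, group in groupby(merged, key=itemgetter(0))]


map1 = map_task([("pear", 1), ("apple", 1), ("lime", 1), ("apple", 1), ("banana", 1)])
map2 = map_task([("lime", 1), ("apple", 1), ("pear", 1)])
for reducer_id in range(NUM_REDUCERS):
    print(reducer_id, reduce_input([map1, map2], reducer_id))
```

Each reducer in this sketch receives its keys in sorted order with every value retained, and a given key always lands on the same reducer. The reduce side only merges runs that are already sorted, which mirrors why no full sort is needed there.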