Why Does Hadoop Replicate Data Across Multiple Nodes in a Cluster?
Learn the exact purpose of Hadoop data replication for your Big Data certification. Understand how HDFS ensures fault tolerance and high availability by storing multiple copies of data blocks across different DataNodes and racks.
Question
Why does Hadoop replicate data across nodes?
A. To ensure fault tolerance and high availability
B. To increase processing speed
C. To run SQL queries faster
D. To reduce storage costs
Answer
A. To ensure fault tolerance and high availability
Explanation
Hadoop replicates data across multiple nodes primarily to ensure fault tolerance and high availability. By default, the Hadoop Distributed File System (HDFS) keeps three copies of every data block and, using its rack-aware placement policy, spreads them across different DataNodes and separate server racks. As a result, if a disk fails, a single node goes down, or even an entire rack loses power, the data stays intact and can still be read from a surviving replica, and the NameNode schedules re-replication to restore the target copy count. Replication does not reduce storage costs (at the default factor of 3 it roughly triples them), nor is it designed to run SQL queries faster or to speed up processing in itself, although data locality is a useful side effect: with replicas spread across the cluster, tasks can often be scheduled on a node that already holds a copy of the data.
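For readers who want to see how the replication factor is controlled in practice, the sketch below uses the standard Hadoop Java API. The cluster-wide default is normally set with the dfs.replication property in hdfs-site.xml; the file path used here is purely hypothetical and only illustrates raising the replication of a single hot file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();

        // dfs.replication is the default number of copies HDFS keeps per block (3 unless overridden)
        System.out.println("Default replication factor: " + conf.getInt("dfs.replication", 3));

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical example: raise the replication factor of one frequently read file to 5
        Path hotFile = new Path("/data/important/events.log");
        fs.setReplication(hotFile, (short) 5);

        fs.close();
    }
}
```

Changing the factor per file like this trades extra storage for better availability and more scheduling options; the cluster default of 3 is the usual balance between durability and cost.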