Making HDFS balancing faster

By default, HDFS likes to stash the first copy of a block on the machine that wrote it (assuming that machine is also a datanode). This is nice in that you avoid a network copy, but not so great when all of your data originates from a single server: that one node fills up and the cluster ends up wildly unbalanced.

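Thankfully, Hadoop ships with a balancing mechanism, although, mysteriously, it is not enabled by default. You can start it running in the background with the command below. The 5 is the threshold: the balancer keeps moving blocks until every datanode's disk utilization is within 5 percentage points of the cluster-wide average.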

/path/to/hadoop/dir/bin/start-balancer.sh -threshold 5

An important detail that I missed originally was the dfs.balance.bandwidthPerSec property. By default, it limits the bandwidth each datanode uses for balancing to 1MB/s! No wonder my balancing was so slow. The value is given in bytes per second, so setting it to a more aggressive 100MB/s:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>100000000</value>
</property>

greatly reduces the amount of time it takes to balance.
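
Since the property takes a raw bytes-per-second figure, it is easy to miscount zeros. Below is a minimal sketch of the arithmetic in shell. The helper names (mb_to_bytes, balance_seconds) and the 500 GB figure are made up for illustration, and the time estimate is only a rough lower bound that assumes a single datanode moving data at exactly the cap, with decimal units throughout.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Hadoop.

# Convert a rate in decimal MB/s into the bytes-per-second value
# that dfs.balance.bandwidthPerSec expects.
mb_to_bytes() {
  echo $(( $1 * 1000 * 1000 ))
}

# Rough lower bound, in whole seconds, on how long one datanode needs
# to move $1 GB when capped at $2 MB/s. Ignores every other overhead.
balance_seconds() {
  gb=$1
  mb_per_sec=$2
  echo $(( gb * 1000 / mb_per_sec ))
}

mb_to_bytes 100          # value to put in the <value> element
balance_seconds 500 1    # assumed 500 GB at roughly the default cap
balance_seconds 500 100  # the same 500 GB at the raised cap
```

Under those assumptions, the 500 GB example works out to 500000 seconds (close to six days) at about 1MB/s, versus 5000 seconds (under an hour and a half) at 100MB/s, which matches how slow the default felt.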
