By default, HDFS likes to stash one copy of each block on the machine it originated from. This is nice in that you avoid a network copy, but not so great when all of your data originates from a single server: that server's disks fill up much faster than everyone else's, and you end up wildly unbalanced.
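To get a feel for how lopsided this gets, here is a rough sketch of the placement behavior described above. It is a simplified model, not HDFS's actual placement code: it assumes a hypothetical 5-node cluster, a replication factor of 3, and picks the non-local replicas uniformly at random (ignoring rack awareness). The function name and all parameters are made up for illustration.

```python
import random
from collections import Counter


def simulate_placement(num_nodes=5, num_blocks=1000, replication=3, writer=0, seed=42):
    """Count replicas per node when every block is written from a single node.

    Simplified model (an assumption, not HDFS's real policy): the first
    replica stays on the writer, the rest go to distinct random other nodes.
    """
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in range(num_nodes)})
    others = [node for node in range(num_nodes) if node != writer]
    for _ in range(num_blocks):
        counts[writer] += 1  # first replica lands on the originating machine
        for node in rng.sample(others, replication - 1):
            counts[node] += 1  # remaining replicas are spread over the other nodes
    return counts


counts = simulate_placement()
for node in sorted(counts):
    print(f"node {node}: {counts[node]} replicas")
```

In this toy model the writing node holds a replica of every single block, while each of the other nodes holds only about half of them.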
Thankfully, Hadoop ships with a balancing mechanism, though mysteriously it is not enabled by default. You can start it running in the background via:
An important detail that I originally missed was the dfs.balance.bandwidthPerSec setting. By default it limits the bandwidth each datanode may use for balancing to 1MB/s! No wonder my balancing went so slowly. Setting this to a more aggressive 100MB/s (the value is given in bytes per second):
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>100000000</value>
</property>
Greatly reduces the amount of time it takes to balance.
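To put numbers on that, here is a back-of-the-envelope sketch. It assumes a hypothetical 500 GB that has to leave the overloaded node, and it treats the bandwidth cap as the only bottleneck, which is optimistic (disk speed, network contention, and the balancer's own scheduling will also matter). The function name and the 500 GB figure are invented for illustration; the two bandwidth values are the 1MB/s default (1048576 bytes per second) and the value from the config above.

```python
def balance_hours(bytes_to_move, bandwidth_bytes_per_sec):
    """Idealized time, in hours, to move the data at the capped rate."""
    return bytes_to_move / bandwidth_bytes_per_sec / 3600


DATA_TO_MOVE = 500 * 10**9     # hypothetical: 500 GB sitting on the overloaded node
DEFAULT_BANDWIDTH = 1048576    # default cap: 1 MB/s, expressed in bytes per second
TUNED_BANDWIDTH = 100000000    # the value set in the config above

for label, bandwidth in [("default", DEFAULT_BANDWIDTH), ("tuned", TUNED_BANDWIDTH)]:
    hours = balance_hours(DATA_TO_MOVE, bandwidth)
    print(f"{label}: {hours:.1f} hours")
```

Under these assumptions the default cap works out to roughly 132 hours (over five days), against about 1.4 hours with the higher limit.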