Multi-node Hadoop cluster on a single host

If you are running Hadoop for experimental or other purposes, you might face the need to quickly spawn a ‘poor man’s Hadoop’: a cluster with multiple nodes within the same physical or virtual box. A typical use case would be working on your laptop without access to the company’s data center; another is running low on the credit card, so you can’t pay for some EC2 instances.

Stop right here if you are well-versed in the Hadoop development environment, tarballs, Maven, and all those shenanigans. Otherwise, keep on reading…

I will be describing a Hadoop cluster installation using standard Unix packaging like .deb or .rpm, produced by the great Hadoop stack platform called Bigtop. If you aren’t familiar with Bigtop yet, read about its history and conceptual ideas.

Let’s assume you installed the Bigtop 0.5.0 release (or a part of it). Or you might go ahead – shameless plug warning – and use a free offspring of Bigtop just introduced by WANdisco. Either way you’ll end up with the standard packaged layout: configuration under /etc/hadoop/conf, init scripts like /etc/init.d/hadoop-hdfs-* for each daemon, and runtime directories under /var/run and /var/log.
Your mileage might vary if you install more components besides Hadoop. The normal bootstrap process will start a Namenode, a Datanode, perhaps a SecondaryNamenode, and some YARN jazz like the resource manager, node manager, etc. My example will cover only HDFS specifics, because cloning a YARN node manager would be a copy-cat of the same procedure, and I leave it as an exercise to the reader.
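For reference, the bootstrap above boils down to starting each daemon’s init script. A minimal sketch, assuming the usual Bigtop package service names (it prints the commands rather than running them, so you can check against what `ls /etc/init.d/hadoop-*` shows on your box):

```shell
#!/bin/sh
# Print the start commands a stock packaged bootstrap would use.
# Service names below are the usual Bigtop Hadoop package names;
# adjust to whatever your install actually shipped.
started=""
for svc in hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode hadoop-hdfs-datanode; do
  echo "service ${svc} start"
  started="${started}${svc} "
done
```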

Now, the trick is to add more Datanodes. With a dev setup using tarballs and such, you would just clone the install, change some configuration parameters, and then run a bunch of Java processes with something like: hadoop-daemon.sh --config <cloned conf dir> start datanode

This won’t work in the case of a packaged installation, because of the higher level of complexity involved. This is what needs to be done:

  1. Clone the config directory: cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2
  2. In the cloned copy of hdfs-site.xml, change or add new values for the datanode’s data directory and ports – in stock Hadoop 2 that means dfs.datanode.data.dir, dfs.datanode.address, dfs.datanode.http.address, and dfs.datanode.ipc.address.

     An easy way to mod the port numbers is to add 1000 * <node number> to the default value. So, for node 2, port 50020 will become 52020, etc.

  3. Go to /etc/init.d and clone hadoop-hdfs-datanode to hadoop-hdfs-datanode.dn2
  4. In the cloned init script add the following

       export HADOOP_PID_DIR="/var/run/hadoop-hdfs.dn2"

     and modify it to use the cloned config directory /etc/hadoop/conf.dn2

  5. Create the new PID and data directories and make hdfs:hdfs their owner
  6. Run /etc/init.d/hadoop-hdfs-datanode.dn2 start to fire up the second datanode
  7. Repeat steps 1 through 6 if you need more nodes running.
  8. If you need to do this on a regular basis – spare yourself the carpal tunnel and learn Puppet.
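The steps above can be sketched as a small dry-run generator: it prints the commands for cloning datanode number N instead of executing them, so you can review before running anything as root. Paths follow the packaged layout used in this article; the hdfs-site.xml edits still have to be done by hand.

```shell
#!/bin/sh
# Dry-run generator: print, rather than execute, the commands to clone
# a packaged datanode as node N. Paths assume a Bigtop-style install.
N="${1:-2}"                 # which clone this is
SUF="dn${N}"                # suffix for the conf dir, init script, pid dir
SHIFT=$((1000 * N))         # port shift rule: add 1000 * <node number>
IPC_PORT=$((50020 + SHIFT)) # e.g. default IPC port 50020 -> 52020 for N=2

cat <<EOF
# 1. clone the config directory
cp -r /etc/hadoop/conf /etc/hadoop/conf.${SUF}
# 2. edit /etc/hadoop/conf.${SUF}/hdfs-site.xml by hand: point the data
#    dir at a fresh location and shift the datanode ports by ${SHIFT}
#    (e.g. dfs.datanode.ipc.address -> 0.0.0.0:${IPC_PORT})
# 3./4. clone the init script, then edit the clone to
#    export HADOOP_PID_DIR="/var/run/hadoop-hdfs.${SUF}"
#    and use /etc/hadoop/conf.${SUF}
cp /etc/init.d/hadoop-hdfs-datanode /etc/init.d/hadoop-hdfs-datanode.${SUF}
# 5. create the new pid dir (and data dir) owned by hdfs:hdfs
mkdir -p /var/run/hadoop-hdfs.${SUF}
chown hdfs:hdfs /var/run/hadoop-hdfs.${SUF}
# 6. fire it up
/etc/init.d/hadoop-hdfs-datanode.${SUF} start
EOF
```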

Check the logs, the HDFS UI, and the running Java processes to make sure that you have achieved what you needed. Don’t try this unless your box has a sufficient amount of memory and CPU power. Enjoy!
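Concretely, the checks might look like the following – a hedged checklist (printed rather than executed, since jps and hdfs may not be on the current user’s PATH); 50070 is the default NameNode web UI port and /var/log/hadoop-hdfs the usual packaged log location:

```shell
#!/bin/sh
# Print a verification checklist for the extra datanode(s).
UI_PORT=50070   # default NameNode web UI port
cat <<EOF
jps                                 # expect one DataNode JVM per clone
sudo -u hdfs hdfs dfsadmin -report  # live datanode count and details
http://localhost:${UI_PORT}/        # NameNode UI -> Live Nodes
tail -f /var/log/hadoop-hdfs/*.log  # watch for registration errors
EOF
```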


Author: DrCos

Dao-Clinicist, Groovy mon, Sprechstallmeister / Concerns separator / “The Tao that can be told is not the eternal Tao” / Disclaimer: all posts are my personal opinion and aren’t those of my affiliations

