If you running Hadoop for experimental or else purposes you might face a need to quickly spawn a ‘poor man hadoop’: a cluster with multiple nodes within the same physical or virtual box. A typical use case would look like working on your laptop without access to the company’s data center; another one is running low on the credit card, so you can’t pay for some EC2 instances.
Stop right here, if you are well-versed in Hadoop development environment, tar balls, maven and all that shenanigans. Otherwise, keep on reading…
I will be describing Hadoop cluster installation using standard Unix packaging like .deb or .rpm, produced by the great stack Hadoop platform called Bigtop. If aren’t familiar with Bigtop yet – read about its history and conceptual ideas.
Let’s assume you installed Bigtop 0.5.0 release (or a part of it). Or you might go ahead – shameless plug warning – and use a free off-spring of the Bigtop just introduced by WANdisco. Either way you’ll end up having the following structure:
your mileage might vary if you install more components besides Hadoop. Normal bootstrap process will start a Namenode, Datanode, perhaps SecondaryNamenode, and some YARN jazz like resource manager, node manager, etc. My example will cover only HDFS specifics, because YARN’s namenode would be a copy-cat and I leave it as exercise to the readers.
Now, the trick is to add more Datanodes. With a dev. setup using tarballs and such you would just clone and change some configuration parameters, and then run a bunch of java processes like:
hadoop-daemon.sh --config start datanode
This won’t work in the case of packaged installation, because of higher level of complexity involved. This is what needs to be done:
- Clone the config directory
cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2
- In the cloned copy of
hdfs-site.xml, change or add new values for:
- Go to
- In the clone init script add the following
hdfs:hdfsto be the owner of
/etc/init.d/hadoop-hdfs-datanode.dn2 startto fire up the second namenode
- Repeat steps 1 through 6 if you need more nodes running.
- If you need to do this on a regular basis – spare yourself a carpal tunnel and learn Puppet.
(An easy way to mod the port numbers is to add 1000*)to the default value. So, port 50020 will become 52020, etc.
Check the logs/HDFS UI/running java processes to make sure that you have achieved what you needed. Don’t try to do it unless you box has sufficient amount of memory and CPU power. Enjoy!