In the first part of this article I looked into the development that brought us Hadoop 2. Let’s now try to analyze whether Hadoop 2 is ready for general consumption, or if it’s all just a business hype at this point. Are you better off sticking to the old, not-that-energetic grandpa who, nonetheless, delivers every time or going riding with the younger fella who might be a bit “unstable”?
New features
Hadoop 2 introduces a few very important features such as
- HDFS High Availability (HA) with . This is what it does:
…In order for the Standby node to keep its state synchronized with the Active node in this implementation, both nodes communicate with a group of separate daemons called JournalNodes…In the event of a fail-over, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a fail-over occurs.
There’s an alternative approach to HDFS HA that requires an external filer (an NAS or NFS server to store a copy of the HDFS edit logs). In the case of failure of the primary NameNode, a new one can be brought over and the network-stored copy of the logs can be used to serve the clients. This is essentially a less optimal approach than QJM, as it involves more moving parts and requires more complex dev.ops.
- An HDFS federation that essentially allows to combine multiple namespaces/namenodes to a single logical filesystem. This allows for better utilization of the higher-density storage.
- YARN essentially implements the concept of Infrastructure-As-A-Service. You can deploy your non-MR applications to cluster nodes using YARN resource management and scheduling.
Another advantage is the split of the old JobTracker into two independent services: resource management and job scheduling. It gives a certain advantage in the case of a fail-over and in general is a much cleaner approach to MapReduce framework implementation. YARN is API-compatible with MRv1, hence you don’t need to do anything about your MR applications, just perhaps recompile the code. Just run them on YARN.
Improvements
The majority of the optimizations were made on the HDFS side. Just a few examples:
- overall file system read/write improvements: I’ve seen reports of >30% performance increase from 1.x to 2.x with the same workload
- read improvements for DN and client collocation HDFS-347 (yet to be added to the 2.0.5 release)
Good overall observation on the HDFS road map can be found here
Vendors
Here’s how the bets are spread among commercial vendors, with respect to supported production-ready versions:
Hadoop 1.x | Hadoop 2.x | |
Cloudera | x[1] | x |
Hortonworks | x | – |
Intel | – | x |
MapR | x[1] | x |
Pivotal | – | x |
Yahoo! | – | x[2] |
WANdisco | – | x |
The worldview of software stacks
In any platform ecosystem there are always a few layers: they are like onions; onions have layers 😉
- in the center there’s a core, e.g. OS kernel
- there are few inner layers: the system software, drivers, etc.
- and the external layers of the onion… err, the platform — the user space applications: your web browser and email client and such
The Hadoop ecosystem isn’t that much different from Linux. There’s
- the core: Hadoop
- system software: Hbase, Zookeeper, Spring Batch
- user space applications: Pig, Hive, users’ analytics applications, ETL, BI tools, etc.
The responsibility of bringing all the pieces of the Linux onion together lies on Linux distribution vendors: Canonical, Redhat, SUSE, etc. They pull certain versions of the kernel, libraries, system and user-space software into place and release these collections to the users. But first they make sure everything fits nicely and add some of their secret sauce on top (think Ubuntu Unity, for example). Kernel maintenance is not a part of daily distribution vendors’ business. Yet they are submitting patches and new features. A set of kernel maintainers is then responsible to bring changes to the kernel mainline. Kernel advancements are happening under very strict guidelines. Breaking compatibility with user-space is rewarded by placing a guilty person straight into the 8th circle of Inferno.
Hadoop practices a somewhat different philosophy than Linux, though. Hadoop 1.x is considered stable, and only critical bug fixes are getting incorporated into it (Table2). Whereas Hadoop 2.x is moving forward at a higher pace and most improvements are going there. That comes with at a cost to user-space applications. The situation is supposedly addressed by labeling Hadoop 2 as ‘alpha’ for about a year now. On the other hand, such tagging arguably prevents user feedback from flowing into the development community. Why? Because users and application developers alike are generally scared away by the “alpha” label: they’d rather sit and wait until the magic of stabilization happens. In the meanwhile, they might use Hadoop 1.x.
And, unlike the Canonical or Fedora project, there’s no open-source integration place for the Hadoop ecosystem. Or is there?
Integration
There are 12+ different components in the Hadoop stack (as represented by the BigTop project). All these are moving at their own pace and, more often than not, support both versions of Hadoop. This complicates the development and testing. It creates a large amount of issues for the integration of these projects. Just think about the variety of library dependencies and such that might all of a sudden be at conflict or have bugs (HADOOP-9407 comes to mind). Every component also comes with its own configuration, adding insult to injury for all the tweaks in Hadoop.
All this brings a lot of issues to the DevOps who need to install, maintain, and upgrade your average Hadoop cluster. In many cases, DevOps simply don’t have the capacity or knowledge to build and test a new component of the stack (or a newer version of it) before bringing it to the production environment. Most of the smaller companies and application developers don’t have the expertise to build and install multiple versions from the release source tarballs, configure and performance tune of the installation.
That’s where software integration projects like BigTop come into the spotlight. BigTop was started by Roman Shaposhnik (ASF Bigtop, Chair PMC) and Konstantin Boudnik (ASF Bigtop, PMC) at the Yahoo! Hadoop team back in 2009-2010. It was a continuation of earlier work based on expertise in software integration and OS distributions. BigTop provides a versatile tool for creating software stacks with predefined properties, validates the compatibility of integral parts, and creates native Linux packaging to ease the installation experience.
BigTop includes a set of Puppet recipes — an industry standard configuration management system — that allows to spin up a Hadoop cluster in about 10 minutes. The cluster can be configured for Kerber’ized or non-secure environments. A typical release of BigTop looks like a stack’s bill-of-materials and source code. It lets anyone quickly build and test a packaged Hadoop cluster with a number of typical system and user-space components in it. Most of the modern Hadoop distributions are using BigTop openly or under the hood, making BigTop a de facto integration spot for all upstream projects
Conclusions
Here’s Milind Bhandarkar (Chief Architect at Pivotal):
As part of HAWQ stress and longevity testing, we tested HDFS 2.0 extensively, and subjected it to the loads it had never seen before. It passed with flying colors. Of course, we have been testing the new features in HDFS since 0.22! EBay was the first to test new features in HDFS 2.0, and I had joined Konstantin Schvachko to declare Hadoop 0.22 stable, when the rest of the community called it crazy. Now they are realizing that we were right.
YARN is known for very high stability. Arun Murthy – RM of all of 2.0.x-alpha releases and one of the YARN authors – in the 2.0.3-alpha release email:
# Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see here)
And there’s this view that I guess is shared by a number of application developers and users sitting on the sidewalks:
I would expect to have a non-alpha semi-stable release of 2.0 by late June or early July. I am not an expert on this and there are lots of things that could show up and cause those dates to slip.
In the meanwhile, six out of seven vendors are using and selling Hadoop 2.x-based versions of storage and data analytics solutions, system software, and service. Who is right? Why is the “alpha” tag kept on for so long? Hopefully, now you can make your own informed decision.
References:
[1]: EOLed or effectively getting phased out
[2]: Yahoo! is using Hadoop 0.23.x in production, which essentially is very close to the Hadoop 2.x source base