Hadoop 2: "alpha" elephant or not?

Today I will look into the state of Hadoop 2.x and try to understand what has kept it in the alpha state to date. Is it really an “alpha” elephant? This question keeps popping up on the Internet, in conversations with customers and business partners. Let’s start with some facts first.

The first anniversary of the Hadoop 2.0.0-alpha release is around the corner. SHA 989861cd24cf94ca4335ab0f97fd2d699ca18102 was made on May 8th, 2012, marking the first-ever release branch of the Hadoop 2 line (in the interest of full disclosure: the actual release didn’t happen until a few days later, May 23rd).[1]

It was a long-awaited event. And sure enough, the market accepted it enthusiastically.  The commercial vendor Cloudera announced its first Hadoop 2.x-based CDH4.0 at the end of June 2012, according to this statement from Cloudera’s VPoP — just a month after 2.0.0-alpha went live! So, was it solid, fact-based trust of the quality of the code base, or something else?  An interesting nuance: MapReduce v1 (MRv1) was brought back despite the presence of YARN (a new resource scheduler and a replacement for the old MapReduce). One of those things that make you go, “Huh…?”

We’ve just seen the 2.0.4-alpha RC vote getting closed: the fifth release in a row in just under one year. Many great features went in: YARN; HDFS HA; HDFS performance optimizations, to name a few. An incredible amount of stabilization has been done lately, especially in 2.0.4-alpha. Let’s consider some numbers:

Table1: JIRAs committed to Hadoop between 2.0.0-alpha and 2.0.4-alpha releases

HADOOP 383
HDFS 801
MAPREDUCE 219
YARN 138

That’s about 1,500 fixes and features since the beginning. Which was to be expected, considering the scope of implemented changes and the need for smoothing things out.

Let’s for a moment look into Hadoop 1.x — essentially the same old Hadoop 0.20.2xx — per latest genealogy of elephants — a well-respected and stable patriarchy. Hadoop 1.x had 8 releases altogether in 14 months:

  • 1.0.0 released on Dec 12, 2011
  • 1.1.2 released on Feb 15, 2013

Table2: JIRAs committed to Hadoop between 1.0.0 and 1.1.2 releases

HADOOP 110
HDFS 111
MAPREDUCE 84

That’s about five times fewer fixes and improvements than what went into Hadoop 1.x over roughly the same time. If frequency of change is any indication of stability, then perhaps we are onto something.

For the sake of full disclosure here’s the similar statistics for Hadoop 0.23.x. There was 8 “dot”-releases between 01 Nov 2011 (0.23.0) and 16 Apr 2013 (0.23.7).
Table3: JIRAs committed to Hadoop between 0.23.0 and .0.23.7 releases

HADOOP 514
HDFS 687
MAPREDUCE 1240
YARN 92[2]

“Wow,” one might say, “no wonder the ‘alpha’ tag has been so sticky!” Users definitely want to know if the core platform is turbulent and unstable. But wait… wasn’t there that commercial release that happened a month after the first OSS alpha? If it was more stable than the official public alpha, then why did it take the latter another five releases and 1,500 commits to get where it is today? Why wasn’t the stabilization simply contributed back to the community? Or, if both were of the same high quality to begin with, then why is the public Hadoop 2.x still wearing the “alpha” tag one year later?

Before moving any further: all 13 releases — for 1.x and 2.x —  were managed by engineers from Hortonworks. Tipping my hat to those guys and all contributors to the code!

So, is Hadoop 2 that unstable after all? In the second part of this article I will dig into the technical merits of the new development line so we can decide for ourselves. To be continued

References:
[1] All release info is available from official ASF Hadoop release page
[2] First appeared in release 0.23.3 

Advertisements

Author: DrCos

Dao-Clinicist, Groovy mon, Sprechstallmeister / Concerns separator / 道可道 非常道 / Disclaimer: all posts are my personal opinion and aren't of my affiliations

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s