What you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.

Hadoop is taking center stage in discussions about processing large amounts of unstructured data.

As the popularity of the system rises, I find that people are genuinely puzzled by the multiplicity of Hadoop versions; by the small yet annoying differences introduced by different vendors; by the frustration of vendors trying to lock in their customers using readily available open-source data analytics components on top of Hadoop; and so on.

So, after explaining who was born from whom for the third time (and I tell you, drawing neat pictures on a napkin in a coffee shop isn’t my favorite activity), I put together the little diagram below. Click on it to inspect it in greater detail. A warning: the diagram only includes the more or less significant releases of Hadoop and Hadoop-derived systems available today. I don’t want to waste any time on obscure releases or branches that never gained any significant adoption. The only exception is 0.21, which was a natural continuation of 0.20 and the predecessor of the recently released 0.22.

Some explanations for the diagram:

  • Green rectangles designate official Apache Hadoop releases, openly available to anyone in the world for free
  • Black ovals show Hadoop branches that have not yet been officially released by Apache Hadoop (and might never be). However, they are usually available in the form of source code or tarball artifacts
  • Red ovals are commercial Hadoop derivatives, which might be based on Hadoop directly or use Hadoop as part of a custom system (as in the case of MapR). These derivatives may or may not be compatible with Hadoop and the Hadoop data processing stack.

Once you’re presented with a view like this, it becomes clear that there are two centers of gravity in today’s universe of elephants: 0.20.2-based releases and derivatives, and 0.22-based branches, future releases, and derivatives. It also becomes quite clear which ones are likely to be sucked into a black hole.

The transition from 0.20+ to 0.2[1,2] was truly critical because it introduced true HDFS append, fault injection, and code injection for system testing. The fact that 0.21 wasn’t released for a long time created a vacuum in a high-demand environment, and even after it did come out, it didn’t get any traction in the community. Meanwhile, HDFS append was critical for HBase to move forward, so 0.20.2-append was created to support that effort. A quite similar story happened with 0.22: two different release managers tried to get it out; the first gave up, but the second actually succeeded in rallying part of the community behind it.

As you can see, HDFS append wasn’t available in an official Apache Hadoop release for some time (except in 0.21, with the earlier disclaimer). Eventually it was merged into 0.20.205 (recently dubbed Hadoop 1.0), which allows HBase to integrate nicely with official Apache Hadoop without any custom patching.
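For the curious, here is roughly what that append capability looks like from a client’s point of view. This is only a minimal sketch in Java, assuming a 0.20.205/1.0-era cluster with dfs.support.append enabled; the WAL path is made up for illustration, since HBase manages its own write-ahead log files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendSketch {
        public static void main(String[] args) throws Exception {
            // Assumption: the cluster is configured with dfs.support.append=true.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; in reality HBase creates and rolls its own WAL files.
            Path wal = new Path("/hbase/wal/region-0001");

            // Re-open an existing file and continue writing at its end:
            // the capability HBase needs for a durable write-ahead log.
            FSDataOutputStream out = fs.append(wal);
            out.writeBytes("put row-42\n");
            out.sync();   // push the edit to the DataNodes (hflush() in newer APIs)
            out.close();
        }
    }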

The release of 0.20.203 was quite significant because it provided a heavily tested Hadoop security implementation, developed by the Yahoo! Hadoop development team (known as Hortonworks nowadays). Bits and pieces of 0.20.203, even before the official release, were absorbed by at least one commercial vendor to add enterprise-grade Kerberos security to their Hadoop derivatives (as in the case of Cloudera CDH3).
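To give a flavor of what that security work means for users, here is a minimal sketch in Java. The two property names are the standard Hadoop security switches (normally set in core-site.xml rather than in code); the principal and keytab path below are purely hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standard security switches; "simple" (no authentication) is the default.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.setBoolean("hadoop.security.authorization", true);

            // Log in from a keytab; the principal and path are made up for illustration.
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                    "analyst/gateway.example.com@EXAMPLE.COM",
                    "/etc/security/keytabs/analyst.keytab");

            System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
        }
    }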

The diagram above clearly shows a few important gaps in the rest of the commercial offerings:

  1. none of them supports Kerberos security (EMC, IBM, and MapR)
  2. HBase is unavailable because their systems lack HDFS append (EMC, IBM). In the case of MapR you end up using a custom HBase distributed by MapR; I don’t want to speculate about the latter in this article.

Apparently, the vacuum of significant releases between 0.20 and 0.22 turned out to be a major spur for the Hadoop PMC, and now, just days after the release of 1.0, 0.22 is out, with 0.23 already going through the release process, championed by the Hortonworks team. That release brings some interesting innovations such as HDFS federation and MapReduce 2.0.

Once the current alpha of 0.23 (which might become Hadoop 2.0 or even Hadoop 3.0) is ready for its final release, I would expect new versions of the commercial distributions to spring to life, as has been the case before. At that point I will update the diagram 🙂

If you imagine the variety of other animals, such as Pig and Hive, piling on top of Hadoop, you will be astonished by the complexity of the inter-component relations and, more importantly, by the intricacies of building a stable data processing stack. This is why another Apache project, Bigtop, has been so important and popular ever since it sprang to life last year. You can read more about Bigtop here or here.

Can SEI really teach you how to be a Hadoop contributor?

Or a contributor to anything else, for that matter?

I kid you not… I just got this email from SEI. In the interest of full disclosure, here it is:

To the attention of:

The Software Engineering Institute (SEI) has been asked to conduct a sample survey of committers to the Hadoop Distributed File System. The results will be used to supplement existing documentation that can be used in providing guidance to HDFS contributors as well as support committers in preparing their own HDFS contributions.

You are part of a carefully chosen sample of HDFS committers for the survey. So your participation is necessary for the results to be accurate and useful. Answering all of the questions should take about 15 or 20 minutes. Any information that could identify you or your organization will be held in strict confidence by the SEI under promise of non disclosure.

You will find your personalized form on the World Wide Web at https://feedback.sei.cmu.edu/Hadoop_HDFS_2.asp?id=C8288. Please be sure to complete it at your earliest convenience — right now if you can make the time. You may save your work at any time, and you may return to complete your form over more than one session if necessary for any reason. Everything will be encrypted for secure transfer and storage.

Now, let’s follow the link and dig out some of the pearls that, I am sure, are to be found in the work of such a venerable organization. What exactly are they covering?

  • Reducing unnecessary dependencies and propagation, e.g., identifying cyclic dependencies between classes in the source code 
  • Difficulty in managing data
  • Difficulty in managing namespaces
  • Identifying location of bugs
  • difficulty finding test suites
  • Communication between application
  • Reducing unnecessary dependencies and propagation
  • yada-yada-yada

Ah, I think I get the picture… boring… the 1534th study in a row on how to write effective code. Some things I like in particular:

  • “You are part of a carefully chosen sample of HDFS committers” – no shit, there’s plenty to select from, of course.
  • “Are you familiar with the (HDFS) Architectural Documentation at http://kazman.shidler.hawaii.edu/ArchDoc.html” – what? hawaii.edu? Are you kidding me? How did the architectural docs for an ASF project end up there? Did the design come from Hawaii? Or could you not find it where the project belongs, on the Apache site?

Here’s the news, my dear doctors from SEI: just try to sit down and write the code, learn from others, and grok the best gems written by bright practitioners. That’s pretty much what it takes; one doesn’t need anything like CMMI in order to create great software. I will allow myself an even stronger assertion: you need processes in place to make a bunch of ineffective and inexperienced folks produce something useless that can later be sold to an idiot customer with a lifetime of support fees attached.

Meanwhile, the reality is that today US universities graduate three software “managers” for every decent developer who doesn’t need help on day one to find his own butt with both hands, a GPS navigator, and a flashlight.

The main reason open source software is thriving today and constantly kicking the asses of companies with established processes is that people aren’t afraid to fail or to experiment on their own dime and time. In other words, they don’t give a shit about CMU teaching them how to write great code; they just learn it in the field and do what it takes by learning from others. You clearly don’t need formal training for that. Perhaps Khan Academy is all that is really needed.

You know that old saying: “If you can’t do a job, go into management; if you can’t manage, then teach.” I would amend it with “…; and if you can’t teach, go into software process research.”

Although I won’t be totally surprised to see some fat-ass book on how to contribute to Hadoop coming out of CMU very soon. It might even become a best seller on Amazon or something. But I know for sure that by that time the OSS community will be far along, making the next great thing!

And one of these days I shall tell the story of that grad student from Berkeley who was all set to write the greatest benchmarking “solution” for Hadoop; that deserves a separate post, because the guy was learning from CMMI, I guess.

Am I too acidic today? Must be this damn sunny California weather or something.