On the coming fragmentation of the Hadoop platform

I just read this interview with the CEO of Hortonworks, in which he expresses a fear of Hadoop fragmentation. He calls attention to a valid issue in the Hadoop ecosystem: forking is getting to the point where the product space is likely to become fragmented.

So why should the BigTop community bother? Well, for one, Hadoop is the core upstream component of the BigTop stack. By filling this unique position, it has a profound effect on downstream consumers such as HBase, Oozie, and others. Although projects like Hive and Pig can partially avoid potential harm by statically linking against Hadoop binaries, this isn't a solution for any sane integration approach. As a side note: I am especially thrilled by Hive's way of working around multiple incompatibilities in the MR job submission protocol. The protocol has been evolving naturally for quite some time, and nobody could guarantee compatibility even in versions like 0.19 or 0.20. Anyway, Hive solved the problem by simply generating a job jar, constructing a launch string and then (you've guessed it already, right?) Runtime.exec()'ing the whole thing. In a separate JVM, that is! Don't believe me? Go check the source code yourself.
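
If you're curious what that trick looks like in practice, here is a minimal sketch of the pattern in plain Java. To be clear: this is not Hive's actual code; the jar path, main class, and arguments are hypothetical placeholders. The point is the shape of the workaround: build a launch string, spawn a fresh JVM, and let the bundled client speak whatever protocol it happens to speak.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // A sketch of the "shell out to a fresh JVM" workaround described above.
    // NOT Hive's actual code: the jar path, main class, and arguments are
    // hypothetical placeholders.
    public class ExecJobSubmitter {

        public static int submit(String jobJar, String mainClass, String... args)
                throws IOException, InterruptedException {
            // Construct the launch string: hadoop jar <job.jar> <MainClass> <args...>
            List<String> cmd = new ArrayList<>();
            Collections.addAll(cmd, "hadoop", "jar", jobJar, mainClass);
            Collections.addAll(cmd, args);

            ProcessBuilder pb = new ProcessBuilder(cmd);
            pb.inheritIO();              // let the child JVM's stdout/stderr flow through
            Process child = pb.start();  // a brand-new JVM, protocol quirks and all
            return child.waitFor();      // block until the submission completes
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical example: the job jar is built against whatever
            // Hadoop version happens to be on the PATH.
            System.exit(submit("/tmp/my-job.jar", "com.example.WordCount",
                               "/input", "/output"));
        }
    }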

Anecdotal evidence aside, there’s a real threat of fracturing the platform. And there’s no good reason for doing so even if you’re incredibly selfish, or stupid, or want to monopolize the market. Which, by the way, doesn’t work for objective reasons even with so-called “IP protection” laws in place. But that’s a topic for another day.

So, what’s Hortonworks’ answer to the problem? Here it comes:

Amid current Hadoop developments—is there any company NOT launching a distribution with some value added software?—Hortonworks stands out. Why? Hortonworks turns over its entire distribution to the Apache open source project.

While collaboration is absolutely necessary for any human endeavor to succeed, the open-source niche can be a tricky one. There are no real incentives for all players to play by the book, and there’s always that one very bold guy who might say, “Screw you guys, I’m going home,” because he is just… you know…

Where could these incentives come from? How can we be sure that every new release is fit for everyone’s consumption? How do we guarantee that HBase’s St.Ack and friends won’t spend their next weekend fixing HBase after it loses its marbles over some tricky change in Hadoop’s behavior?

And here comes a hint of an answer:

We’re building directly in the core trunk, productizing the package, doing QA and releasing.

I have a couple of issues with this statement. But first, a spoiler alert: I am not going to attack either Hortonworks or their CEO. I don’t have a chip on my shoulder, not even an ARM one. I am simply trying to demonstrate the fallacy in the logic and show what doesn’t work and why. And now, here’s the laundry list:

  • “building directly in the core trunk”: Hadoop isn’t released from the trunk. This is a headache, and it is one of the issues the BigTop community faced during the most recent stabilization exercise for the Hadoop 2.0.4-alpha release. Why is that a problem? Well, for one, there’s a policy that “everything should go through the trunk”. In the context of Hadoop’s current state, it means that you have to commit to the trunk first, then back-port to branch-2, which is supposed to be the landing ground for all Hadoop 2.x releases, just as branch-1 is the landing ground for all Hadoop 1.x releases. If there happens to be an active release (or several) in flight at the moment, one also needs to back-port the commit to the corresponding release branch(es), such as 2.0.4-alpha in this particular example. Unsurprisingly, some changes make it only about two-thirds of the way down that chain. In the best-case scenario, that is. This approach also gives fertile ground to all “proponents” of open-source Hadoop, because once their patches are committed to the trunk, they are as open-source as the next guy’s. They might get released in a couple of years, but hey, what’s a few months between friends, right?
  • “productizing the package”: is Mr. Bearden aware of development artifacts for an ongoing Hadoop release ever being published in the open? Because I don’t know of any such publication to date. Neither does Google, by the way. Even the official source tarballs weren’t available until about three weeks ago. Why does that constitute a problem? How do you expect to perform any reasonable integration validation if you don’t have an official snapshot of the platform? Once your platform package is “productized”, it is a day too late to pull your hair out. If you happen to find some issues, come back later. At the next release, perhaps?
  • “doing QA and releasing”: we are trying to build an open-source community here, right? Meaning that the code, the tests and their results, the bug reports, and the discussions should all be in the open. The only place where the Hadoop ecosystem is being tested at any reasonable length and depth is BigTop. Read here for yourself. And feel free to check the regular builds and test runs for _all_ the components that BigTop releases, in both secured and non-secured configurations. What are you testing with and how, Mr. Bearden? (For a taste of what such testing looks like, see the sketch right after this list.)
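
For the record, here’s the flavor of check I’m talking about: a minimal sketch of an HDFS round-trip smoke test, in the spirit of (but not copied from) BigTop’s test suites. The fs.defaultFS address and the probe path are made up for illustration; a real run would pick the address up from core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // A sketch of an integration smoke test, in the spirit of (but not
    // copied from) BigTop's suites. Cluster address and path are made up.
    public class HdfsSmokeTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address; normally read from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://localhost:8020");

            FileSystem fs = FileSystem.get(conf);
            Path probe = new Path("/tmp/bigtop-smoke-probe");

            // Write a marker through the client the stack was built against...
            try (FSDataOutputStream out = fs.create(probe, true)) {
                out.writeUTF("smoke");
            }

            // ...and read it back. If wire or API compatibility broke between
            // the client and the cluster, this is where it shows up.
            try (FSDataInputStream in = fs.open(probe)) {
                if (!"smoke".equals(in.readUTF())) {
                    throw new AssertionError("HDFS round-trip returned garbage");
                }
            }
            fs.delete(probe, false);
            System.out.println("HDFS smoke test passed");
        }
    }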

So, what was the solution? Did I miss it in the article? I don’t think so. Because a single player, even one as respected as Hortonworks, can’t solve the issue in question without ensuring that anything produced by the Hadoop project’s developers is always in line with the expectations of downstream players.

That’s how you prevent fracturing: by putting into the open a solid, well-integrated reference implementation of the stack, one that anyone can install using open-standard packaging and load with third-party applications without tweaking them every time you move from Cloudera’s cluster to MapR’s, or between any other pair of vendors’. Does it sound like I am against making money in open-source software? Not at all: most people in the OSS community do this work on the dime of their employers or as part of their own business.

You can consider BigTop’s role in the Hadoop-centric environment to be similar to that of Debian in the Linux kernel/distribution ecosystem. By helping to close the gap between the applications and the fast-moving core of the stack, BigTop essentially brings reassurance of the Hadoop 2.x line’s stability to user space and the community. BigTop helps to make sure that vendor products are compatible with each other and with the rest of the world, to avoid vendor lock-in, and to guarantee that the recent Microsoft stories will not be replayed all over again.

Are there means to achieve the goal of keeping the core contained? Certainly! BigTop does just that. Recent announcements from Intel, Pivotal, and WANdisco are living proof of it: they are all using BigTop as the integration framework and consolidation point. Can these vendors still deviate, even under such a top-level integration system? Sure. But it will be immensely harder to do.