Warning [Rant]: YAML is an incredible piece of turd

I spent, nay wasted, an hour of my time today trying to figure out the reason for the following error message from Puppet Hiera:

vmhost05-hbase3: Error: syntax error on line 30, col -1: `’ at /root/bigtop/bigtop-deploy/puppet/manifests/site.pp:17 on node ….
The relevant part of the Hiera site.yaml file is

bigtop::bigtop_yumrepo_uri:  "http://archive.hostname.com/redhat/6/x86_64/7.3.0/"
bigtop::jdk_package_name: 'jdk-1.7.0_55'

Firstly, as a former compiler developer it hurts every bit of my brain when I see an error message like the one above. Huge “compliment” to the Hiera developers – learn how to write code, dammit.

Secondly, after investigating this for literally an hour I figured out that the separator in uri:  “http was a TAB (ASCII 9) instead of spaces.

Seriously dudes – it’s the 21st century. What’s the reason to use formats and parsers that fail so badly on separator terminals? Just imagine if the Java or Groovy compiler were so picky about tabs vs. spaces. I guarantee – half of the development community would be screaming bloody murder right there. Yet – with frigging YAML POS it is just ok ;(
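
For what it’s worth, a few lines of script can point at a stray TAB in seconds. Below is a minimal sketch in Python; the function name find_tabs and the two-line sample are made up for illustration and are not part of Hiera or Puppet. It only reports where TAB characters sit and makes no attempt to parse YAML.

```python
def find_tabs(text):
    """Return (line, column) pairs, both 1-based, for every TAB in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((line_no, col_no))
    return hits


# Hypothetical sample: the separator on the first line is a TAB.
sample = (
    'bigtop::bigtop_yumrepo_uri:\t"http://archive.hostname.com/redhat/6/x86_64/7.3.0/"\n'
    "bigtop::jdk_package_name: 'jdk-1.7.0_55'\n"
)

for line_no, col_no in find_tabs(sample):
    print(f"TAB at line {line_no}, column {col_no}")
```

To check a real file, read it with open(path).read() and pass the text to find_tabs.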

[/Rant]

Annual review of Bigdata software; what’s in store for 2014

In the couple of days left before the year end I wanted to look back and reflect on what has happened so far in the IT bubble 2.0 commonly referred to as “BigData”. Here are some of my musings.

Let’s start with this simple statement: BigData is a misnomer. Most likely it has been put forward by some PR or MBA schmuck with no imagination whatsoever, who thought that a terabyte consists of 1000 megabytes 😉 The word has been picked up by pointy-haired bosses all around the world as they need buzzwords to justify their existence to the people around them. But I digress…

So what has happened in the last 12 months in this segment of software development? Well, surprisingly, you can count the really interesting events on one hand. To name a few:

  • Fault tolerance in distributed systems reached a new level with NonStop Hadoop, introduced by WANdisco earlier this year. The idea of avoiding complex screw-ups by agreeing on the operations up-front is leaving things like Linux HA, Hadoop QJM, and NFS-based solutions rolling in the dust in the rear-view mirror.
  • Hadoop HDFS is clearly here to stay: you can see customers shifting from platforms like Teradata towards the cheaper and widely supported HDFS network storage, with EMC (VMware, Greenplum, etc.) offering it as the storage layer under Greenplum’s proprietary PostgreSQL cluster, and many others.
  • While enjoying a huge head start, HDFS has a strong, if not very obvious, competitor – CEPH. As some know, there’s a patch that provides CEPH as a drop-in replacement for HDFS. But where it gets really interesting is how systems like Spark (see the next bullet) can work directly on top of the CEPH file-system with relatively small changes in the code. Just picture it:

    distributed Linux file-system + high-speed data analytics

    Drawing conclusions is left as an exercise to the readers.

  • With the recent advent and fast rise of a new in-memory analytics platform – Apache Spark (incubating) – the traditional, two-bit MapReduce paradigm is losing its grip very quickly. The gap is getting wider with a new generation of task and resource schedulers gaining momentum by the day: Mesos, the Spark standalone scheduler, Sparrow. The latter is especially interesting with its 5ms scheduling guarantees. That leaves the latest reincarnation of MR in a predicament.
  • Shark – SQL layer on top of Spark – is winning the day in the BI world, as you can see it gaining more popularity. It seems to have nowhere to go but up, as things like Impala, Tez, ASF Drill are still very far away from being accepted in the data-centers.
  • With all of the above it is very exciting to see my good friends from AMPlab spinning up a new company that will be focusing on the core platform of Spark, Shark and all things related. All best wishes to Databricks in the coming year!
  • Speaking of BI, it is interesting to see that Bigdata BI and BA companies are still trying to prove their business model and make it self-sustainable. Case in point: Datameer with its recent $19M D-round, Platfora with last year’s $20M B-round, etc. I reckon we’ll see more fund-raisers in the 10^7 or perhaps 10^8 dollar range in the coming year among the application companies and the platform ones. Also, new letters will be added to the mix: F-rounds, G-rounds, etc., as cheap currency keeps finding its way from the Fed through the financial sector to the pockets of VCs and further down to high-risk sectors like IT and software development. This will lead to an over-heated job market in Silicon Valley and elsewhere, followed by a blow-up similar to but bigger than 2000-2001. It will be particularly fascinating to watch big companies scavenging the pieces after the explosion. So duck to avoid shrapnel.
  • Stack integration and validation has become a pain-point for many. And I see the effects of it in the sharp uptake of interest in, and growth of, the Apache Bigtop community. Which is no surprise, considering that all commercial distributions of Hadoop today are based on, or directly use, Bigtop as the stack-producing framework.

While I don’t have a crystal ball (would be handy sometimes) I think a couple of very strong trends are emerging in this segment of the technology:

  • HDFS availability – and software stack availability in general – is a big deal: with more and more companies adding an HDFS layer into their storage stack, stricter SLAs will emerge. And I am not talking about 5 nines – an equivalent of 5 minutes downtime per year – but rather about 6 and 7 nines. I think Zookeeper-based solutions are in for a rough ride.
  • Machine Learning has huge momentum. Spark Summit was one big piece of evidence for it. With this comes the need for incredibly fast scheduling and hardware utilization. Hence things like Mesos, Spark standalone and Sparrow are going to keep gaining momentum.
  • Seasonal lemming-like migration to the cloud will continue, I am afraid. The security will become a red-hot issue and an investment opportunity. However, anyone who values their data is unlikely to move to the public cloud, hence – private platforms like OpenStack might be on the rise (if the providers can deal with “design by committee” issues of course).
  • Storage and analytic stack deployment and orchestration will be more pressing than ever (no, I am talking about real orchestration, not cluster management software). That’s why I am looking very closely at what companies like Reactor8 are doing in this space.

So, last year brought a lot of excitement and interesting challenges. 2014, I am sure, will be even more fun. However, “living in interesting times” might be a curse and a blessing. Stay safe, my friends!

High Availability is the past; Continuous Availability is the future

Do you know what the SiliconAngle and Wikibon projects are? If not – check them out soon. These guys have a vision for next-generation media coverage. I would call it ‘#1 no-BS Silicon Valley media channel’. These guys are running professional video journalism with a very smart technical setup. And they aren’t your typical loudmouths from the TV: they use and grok the technologies they are covering. Say, they run Apache Solr in house for real-time trends processing and searches. Amazing. And they don’t have teleprompters. Nor screenplay writers. How cool is that?

At any rate, I was invited on their show, theCube, last week on the last day of Hadoop Summit. I was talking about High Availability issues in Hadoop. Yup, High Availability has issues, you heard me right. The issue is the less-than-100% uptime. Basically, even if someone claims to provide 5-9s (that is 99.999% uptime) you are still looking at a bit over 5 minutes a year of downtime for mission-critical infrastructure.
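
The arithmetic behind the “nines” is easy to check. Here is a small sketch in Python; the function name is made up for this post, and it assumes an average year of 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap days included


def downtime_minutes_per_year(nines):
    """Yearly downtime allowed by an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for nines in (3, 4, 5, 6, 7):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines} nines: {minutes:.3f} minutes/year ({minutes * 60:.1f} seconds)")
```

Each extra nine cuts the allowed downtime by a factor of ten, so a finite number of nines never gets you to zero.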

If you need 100% uptime for your Hadoop, then you should be looking for Continuous Availability. Curiously enough, the solution is found in the past (isn’t that always the case?) in the so-called Paxos algorithm that was published by Leslie Lamport back in 1989. However, the original Paxos algorithm has some performance issues, has never been fully embraced by the industry, and is rarely used outside of just a few tech-savvy companies. One of them – WANdisco – has applied it first to Subversion replication and now to the Hadoop HDFS SPOF problem, and made it generally available as a commercial product.
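
To give a feel for what “agreeing first” means, here is a toy sketch of textbook single-decree Paxos in Python. It is not WANdisco’s implementation and says nothing about how their product works: everything runs in one process, there is no network, no failure handling and no leader election, and the names Acceptor and propose are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    promised: int = -1                    # highest proposal number promised so far
    accepted_n: int = -1                  # number of the proposal accepted, if any
    accepted_value: Optional[str] = None  # value of that proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, -1, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Try to get `value` chosen with proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in replies if ok]
    if len(granted) < majority:
        return None  # stale proposal number, retry with a higher one
    # If any acceptor already accepted something, adopt the highest-numbered value.
    prior_n, prior_value = max(granted, key=lambda reply: reply[0])
    if prior_n >= 0:
        value = prior_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "mkdir /a"))   # first proposal gets chosen
print(propose(acceptors, 2, "delete /a"))  # a later proposer learns the chosen value
```

The point of the second call is that once a majority has accepted a value, a later proposer cannot overwrite it; it can only learn it. A replicated service would run one such round per operation, which is the part that gets expensive and that real implementations optimize.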

And just think what can be done if the same technology is applied to mission critical analytical platforms such as AMPlab Spark? Anyway, watch the recording of my interview on theCube and learn more.

YDN has posted the video from my Aug’12 talk about Hadoop distros

As a follow-up to my post from last year, I just found that the video of the talk has been posted on the YDN website. I apologize for the audio quality – echo and all – but you should still be able to make it out with a higher volume.

And in a bit you should be able to see another talk from May’13 about Hadoop stabilization.

We just invented a new game: "Whack a Hadoop namenode"

I just came back from Strata 2013 BigData conference. A pretty interesting event, considering that Hadoop wars are apparently over. It doesn’t mean that the battlefield is calm. On the contrary!

But this year’s war banner is different. Now it seems to be about Hadoop stack distributions. If only I had artistic talent, the famous

would be saying something like “Check out how big is my Hadoop distro!”

But judge for yourself: WANdisco announced their WDD about 4 weeks ago, followed yesterday by Intel and Greenplum press releases. WDD has some uniquely cool stuff in it, like the non-stop namenode, which is the only ‘active-active’ technology for Namenode metadata replication on the market based on a full implementation of the Paxos algorithm.

And I was having fun during the conference too: we were playing the game ‘whack-a-namenode’. The setup includes a rack of Supermicro blade servers, running a WDD cluster with three active namenodes.
While running a stock TeraSort load, one of the namenodes is killed dead with SIGKILL. Amazingly, TeraSort couldn’t care less and just keeps going without a wince. We played about 100 rounds of this “game” over the course of two days using a live product, with people dropping by all the time to watch.

Looks like it isn’t easy to whack an HDFS cluster anymore.

And the nice folks from SiliconAngle and WikiBon stopped at our booth to do an interview with me and my colleagues. Enjoy 😉

Multi-nodes Hadoop cluster on a single host

If you are running Hadoop for experimental or other purposes you might face a need to quickly spawn a ‘poor man’s Hadoop’: a cluster with multiple nodes within the same physical or virtual box. A typical use case would be working on your laptop without access to the company’s data center; another one is running low on the credit card, so you can’t pay for some EC2 instances.

Stop right here if you are well-versed in the Hadoop development environment, tarballs, maven and all those shenanigans. Otherwise, keep on reading…

I will be describing Hadoop cluster installation using standard Unix packaging like .deb or .rpm, produced by the great Hadoop stack platform called Bigtop. If you aren’t familiar with Bigtop yet – read about its history and conceptual ideas.

Let’s assume you installed the Bigtop 0.5.0 release (or a part of it). Or you might go ahead – shameless plug warning – and use a free offspring of Bigtop just introduced by WANdisco. Either way you’ll end up having the following structure:

/etc/hadoop/conf
/etc/init.d/hadoop*
/usr/lib/hadoop
/usr/lib/hadoop-hdfs
/usr/lib/hadoop-yarn

Your mileage might vary if you install more components besides Hadoop. The normal bootstrap process will start a Namenode, a Datanode, perhaps a SecondaryNamenode, and some YARN jazz like the resource manager, node manager, etc. My example will cover only HDFS specifics, because YARN’s node manager would be a copycat and I leave it as an exercise to the readers.

Now, the trick is to add more Datanodes. With a dev. setup using tarballs and such you would just clone and change some configuration parameters, and then run a bunch of java processes like:
  hadoop-daemon.sh --config <conf-dir> start datanode

This won’t work in the case of a packaged installation, because of the higher level of complexity involved. This is what needs to be done:

  1. Clone the config directory: cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2
  2. In the cloned copy of hdfs-site.xml, change or add new values for:

    dfs.datanode.data.dir
    dfs.datanode.address
    dfs.datanode.http.address
    dfs.datanode.ipc.address

    (An easy way to mod the port numbers is to add 1000*N to the default value, where N is the number of the datanode. So, for the second datanode, port 50020 will become 52020, etc.)

  3. Go to /etc/init.d and clone hadoop-hdfs-datanode to hadoop-hdfs-datanode.dn2
  4. In the cloned init script add the following:

    export HADOOP_PID_DIR="/var/run/hadoop-hdfs.dn2"

    and modify

    CONF_DIR="/etc/hadoop/conf.dn2"
    PIDFILE="/var/run/hadoop-hdfs.dn2/hadoop-hdfs-datanode.pid"
    LOCKFILE="$LOCKDIR/hadoop-datanode.dn2"
  5. Create the directory named in dfs.datanode.data.dir and make hdfs:hdfs its owner
  6. Run /etc/init.d/hadoop-hdfs-datanode.dn2 start to fire up the second datanode
  7. Repeat steps 1 through 6 if you need more nodes running.
  8. If you need to do this on a regular basis – spare yourself a carpal tunnel and learn Puppet.

Check the logs/HDFS UI/running java processes to make sure that you have achieved what you needed. Don’t try to do it unless your box has a sufficient amount of memory and CPU power. Enjoy!
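
If you would rather not do the port arithmetic for hdfs-site.xml by hand, it can be sketched in a few lines of Python. This is only an illustration: the function name and the /data/hdfs location are hypothetical, and the default ports are the stock Hadoop 2.x datanode ports, so check them against your release. The offset is 1000 times the datanode number, which is how 50020 becomes 52020 for the second datanode.

```python
# Stock datanode ports in Hadoop 2.x; adjust if your release differs.
DEFAULT_PORTS = {
    "dfs.datanode.address": 50010,
    "dfs.datanode.http.address": 50075,
    "dfs.datanode.ipc.address": 50020,
}


def datanode_overrides(node_number, data_root="/data/hdfs"):
    """hdfs-site.xml overrides for datanode number `node_number` (2, 3, ...)."""
    overrides = {
        # data_root is a hypothetical location; use any directory owned by hdfs:hdfs.
        "dfs.datanode.data.dir": f"{data_root}/dn{node_number}",
    }
    for name, port in DEFAULT_PORTS.items():
        overrides[name] = f"0.0.0.0:{port + 1000 * node_number}"
    return overrides


for name, value in datanode_overrides(2).items():
    print(f"<property><name>{name}</name><value>{value}</value></property>")
```

With these defaults the scheme runs out of valid port numbers after datanode 15, which is far more than a single box should be hosting anyway.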

HortonWorks is using BigTop: no more secrets!

As my former colleague John Kreisa nicely put it in the HortonWorks 1.0 release announcement here (my warmest regards and best wishes to you guys!):

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

And indeed. I asked this question about HortonWorks using BigTop to power their platform offering some time ago, and later pretty much repeated it in the form of a comment on Shaun Connolly’s blog. To his credit, my question has been answered directly:

As far as BigTop goes, we at Hortonworks are using parts of BigTop for the HDP platform builds, so thanks for the efforts there!

I met the gentleman in person at the recent Hadoop Summit and we had a short yet nice chat about enterprise stacks and the role open-source technology plays there.

So, it is time to put my initial question to rest as fully answered.

P.S. On a separate note: I have left a slightly different comment on Cloudera’s blog. Somehow, the comment doesn’t appear to be visible (at least I don’t see anything but the “2 comments” line), nor has it been answered publicly (again, perhaps it has been, but I don’t see it on the page). In Cloudera’s defense I have to say that I got an answering email from one of their execs, which I can’t publish for it was a private message.

Conception and validation of Hadoop BigData stack: putting the record straight.

With more and more people jumping on the bandwagon of big data it is very settling to see that Hadoop is gaining momentum by the day.

Even more fascinating is to see how the idea of putting together a bunch of service components on top of Hadoop proper is getting more and more momentum. IT and software development professionals are getting a better understanding of the benefits that a flexible set of loosely coupled yet compatible components provides when one needs to customize a data processing solution at scale.

The biggest problem for most businesses trying to add Hadoop infrastructure into their existing IT is a lack of knowledge, professional support, and/or clear understanding of what’s out there on the market to help you. Essentially, Hadoop exists in one incarnation – this is the open-source project under the umbrella of Apache Software Foundation (ASF). This is where all the innovations in Hadoop are coming from. And essentially this is a source of profit for a few commercial offerings today.

What’s wrong with the picture, you might ask? Well, the main issue with most of these “commercial offerings” is twofold. They are either immature and based on sometimes unfinished or unreleased Hadoop code, or provide no significant value-add compared to Hadoop proper, available in source form from hadoop.apache.org. And no matter if any of the above (or both of them together) apply to a commercial solution based on Hadoop, you can be sure of one thing: these solutions will cost you literally tons of money – as much as $1k/node/year in some cases – for what is essentially available for free.

“What about neat packages I can get from a commercial provider and perhaps some training too?” one might ask. Well, yeah, if you are willing to pay top dollar per node to get, say, something like this fixed, or to learn how to install packages on a virtual machine – go ahead by all means.

However, keep in mind that you can always get a set of packages for Hadoop produced by another open source project called Bigtop, hosted by Apache. What you essentially get are packages for your Linux distro, which can be easily installed on your cluster’s nodes. A great benefit is that you can easily trim your Hadoop stack to only include what you need: Hadoop + Hive, or perhaps Hadoop + HBase (which will automatically pick up ZooKeeper for you).

At any rate, the best part of the story isn’t a set of packages that can be installed: after all, this is what packages are usually created for, right? The problem with packages or other forms of component distribution is that you don’t know in advance if A-package will nicely work with B-package v.1.2 unless someone has tested this assumption before. Even then, the testing environment might be significantly different from your production environment, and then all bets are off. Unless – again – you’re willing to pay through the nose to someone who’s willing to get it for you. And that’s where the true miracle of something like BigTop comes to the rescue.

Before I explain more, I wanna step back a bit and take a look at some recent history. A couple of years ago Yahoo’s Hadoop development team had to address the issue of putting together a working and well-validated Hadoop stack, including a number of components developed by different engineering organizations with their own development schedules and integration criteria. The main integration point of all of the pieces was the operations team, which was in charge of a large number of cluster deployments, provisioning and support. Without their own QA staff they were oftentimes at the mercy of assumed code or configuration quality coming from all corners of the company. Worse yet, even if all these components happened to be of high quality, there were no guarantees that they would work as expected once put together on the cluster. And indeed, integration problems were many.

That’s where a small team of engineers, including yours truly, put together a prototype of a system called FIT (Final Integration Testing). The system essentially allowed you to pick up a packaged component you want to validate against your cluster environment and perform the deployment, configuration, and testing with integration scenarios provided by either the component’s owner or your own team.

The approach was so effective that the project was continued and funded further in the form of HIT (Hadoop Integration Testing). At which point two of us left for what seemed like a greener pasture back then 😦

We thought the idea was really promising, so we continued on the path of developing a less custom and more adoptable technology based on open standards such as Maven and Groovy. Here you can find slides from the talk we gave at eBay about a year ago. The presentation puts the concept of the Hadoop data stack in open writing for the first time, as well as the stack customization and validation technology. When this presentation was given we already had a well-working mechanism for creating, deploying, and validating both packaged and non-packaged Hadoop components.

BigTop – open-sourced for the second time just a few months ago and based on our project above – has added a package creation layer on top of the stack validation product. This, of course, makes your life even easier. And even more so with a number of Puppet recipes allowing you to deploy and configure your cluster in a highly efficient and automatic manner. I encourage you to check it out.

BigTop has been successfully used for validating the release of Apache Hadoop 0.20.205, which has become the foundation of the coming Hadoop 1.0.0. Another release of Hadoop – 0.22 – was using BigTop for release candidate validation, and so on.