I spent, nay wasted, an hour of my time today trying to figure out the reason for the following error message from Puppet Hiera:
vmhost05-hbase3: Error: syntax error on line 30, col -1: `’ at /root/bigtop/bigtop-deploy/puppet/manifests/site.pp:17 on node ….
The relevant part of the Hiera site.yaml file is
Firstly, as a former compiler developer, it hurts every bit of my brain when I see an error message like the one above. Huge “compliment” to the Hiera developers – learn how to write code, dammit.
Secondly, after investigating this for literally an hour, I figured out that the separator in uri: “http was a TAB (ASCII 9) instead of a space.
Seriously, dudes – it’s the 21st century. What’s the reason to use formats and parsers that fail so badly on separator terminals? Just imagine if the Java or Groovy compiler were so picky about tabs vs. spaces. I guarantee that half of the development community would be screaming bloody murder right there. Yet with frigging YAML POS it is just ok ;(
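For anyone chasing a similar error, here is a minimal sketch of how such a stray TAB could be located before the parser ever sees the file. The function name `find_tabs` and the sample fragment (including the `example.invalid` URL) are made up for illustration; they are not the actual contents of my site.yaml, and how a given YAML parser treats tabs may differ from one implementation to another.

```python
def find_tabs(text):
    """Return (line, column) pairs, both 1-based, for every TAB in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((line_no, col))
    return hits


# Hypothetical fragment: the second line has a TAB right after "uri:".
sample = 'repo:\n  uri:\t"http://example.invalid/repo"\n'

for line_no, col in find_tabs(sample):
    print(f"TAB at line {line_no}, column {col}")
```

To check a real file, the same function can be fed the file contents, for example `find_tabs(open("site.yaml").read())`.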
I have decided to simplify the elephant genealogy tree by separating the pre-Hadoop 2.x part out of it. The newly supported version will reflect only Hadoop 2.x. The last updated full version of the diagram is available to anyone from my GitHub workspace under the tag WDD4.
In the couple of days left before the year end I wanted to look back and reflect on what has happened so far in the IT bubble 2.0 commonly referred to as “BigData”. Here are some of my musings.
Let’s start with this simple statement: BigData is a misnomer. Most likely it was put forward by some PR or MBA schmuck with no imagination whatsoever, who thought that a terabyte consists of 1000 megabytes 😉 The word has been picked up by pointy-haired bosses all around the world, as they need buzzwords to justify their existence to the people around them. But I digress…
So what has happened in the last 12 months in this segment of software development? Well, surprisingly, you can count the really interesting events on one hand. To name a few:
While I don’t have a crystal ball (would be handy sometimes) I think a couple of very strong trends are emerging in this segment of the technology:
- HDFS availability – and software stack availability in general – is a big deal: with more and more companies adding an HDFS layer to their storage stack, stricter SLAs will emerge. And I am not talking about 5 nines – an equivalent of 5 minutes of downtime per year – but rather about 6 and 7 nines. I think Zookeeper based solutions are in for a rough ride.
- Machine Learning has huge momentum. Spark Summit was big evidence of it. With this comes the need for incredibly fast scheduling and hardware utilization. Hence things like Mesos, Spark standalone and Sparrow are going to keep gaining momentum.
- The seasonal lemming-like migration to the cloud will continue, I am afraid. Security will become a red-hot issue and an investment opportunity. However, anyone who values their data is unlikely to move to the public cloud; hence private platforms like OpenStack might be on the rise (if the providers can deal with “design by committee” issues, of course).
- Storage and analytic stack deployment and orchestration will be more pressing than ever (no, I am talking about real orchestration, not cluster management software). That’s why I am looking very closely at what companies like Reactor8 are doing in this space.
So, last year brought a lot of excitement and interesting challenges. 2014, I am sure, will be even more fun. However, “living in interesting times” might be a curse and a blessing. Stay safe, my friends!
Do you know what the SiliconAngle and Wikibon projects are? If not – check them out soon. These guys have a vision for next generation media coverage. I would call it ‘#1 no-BS Silicon Valley media channel’. They are running professional video journalism with a very smart technical setup. And they aren’t your typical loudmouths from the TV: they use and grok the technologies they are covering. Say, they run Apache Solr in house for real-time trends processing and searches. Amazing. And they don’t have teleprompters. Nor screenplay writers. How cool is that?
At any rate, I was invited on their show, theCube, last week on the last day of Hadoop Summit. I was talking about High Availability issues in Hadoop. Yup, High Availability has issues, you’ve heard me right. The issue is the less than 100% uptime. Basically, even if someone claims to provide 5-9s (that is 99.999% uptime), you are still looking at more than 5 minutes a year of downtime for mission critical infrastructure.
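The arithmetic behind the "nines" is easy to check. Below is a small sketch, assuming a 365-day year and the hypothetical helper name `downtime_minutes_per_year`; it converts an availability of N nines into the downtime it still permits per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(nines):
    """Downtime allowed per year by an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (5, 6, 7):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: {minutes:.3f} minutes/year ({minutes * 60:.1f} seconds)")
```

Each extra nine cuts the allowance by a factor of ten, so the gap between 5 nines and 7 nines is the gap between minutes and seconds of downtime per year.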
If you need 100% uptime for your Hadoop, then you should be looking for Continuous Availability. Curiously enough, the solution is found in the past (isn’t that always the case?) in the so-called Paxos algorithm that was published by Leslie Lamport back in 1989. However, the original Paxos algorithm has some performance issues and has never been fully embraced by the industry; it is rarely used outside of a few tech-savvy companies. One of them – WANdisco – applied it first to Subversion replication and now to the Hadoop HDFS SPOF problem, and made it generally available as a commercial product.
And just think what could be done if the same technology were applied to mission critical analytical platforms such as AMPlab Spark. Anyway, watch the recording of my interview on theCube and learn more.
As a follow-up to my post from last year, I just found that the video of the talk has been posted on the YDN website. I apologize for the audio quality – echo and all – but you should still be able to make it out at a higher volume.
And in a bit you should be able to see another talk from May’13 about Hadoop stabilization.
I have just posted this article on the ASF blog roller elaborating on why BigTop is becoming a centerpiece of integration focused on the Hadoop-based data analytics stack. Enjoy.
I just came back from the Strata 2013 BigData conference. A pretty interesting event, considering that the Hadoop wars are apparently over. It doesn’t mean that the battlefield is calm. On the contrary!
But this year’s war banner is different. Now it seems to be about Hadoop stack distributions. If I only had an artistic talent, the famous
would be saying something like “Check out how big is my Hadoop distro!”
But judge for yourself: WANdisco announced their WDD about 4 weeks ago, followed yesterday by Intel and Greenplum press releases. WDD has some uniquely cool stuff in it like non-stop namenode, which is the only ‘active-active’ technology for Namenode metadata replication on the market based on a full implementation of the Paxos algorithm.
And I was having fun during the conference too: we were playing the game ‘whack-a-namenode’. The setup includes a rack of Supermicro blade servers running a WDD cluster with three active namenodes.
While running a stock TeraSort load, one of the namenodes is killed dead with SIGKILL. Amazingly, TeraSort couldn’t care less and just keeps going without a wince. We played about 100 rounds of this “game” over the course of two days using the live product, with people dropping by all the time to watch.
Looks like it isn’t easy to whack an HDFS cluster anymore.
And the nice folks from SiliconAngle and WikiBon stopped by our booth to do an interview with me and my colleagues. Enjoy 😉