Apache Ignite vs Alluxio (formerly Tachyon)

Well, when the company behind the read-only open-source project called Tachyon decided to change the name of the project, I was puzzled. If you build something successful, you want its name to be recognized, right? In marketing, this is called “brand recognition”.

Why would Coca-Cola rename their product to SludgeWaters? Indeed, it doesn’t make much sense! The most infamous brand-recognition screw-up was when SUNW (Sun Microsystems) got renamed to JAVA on the NASDAQ. And _that_ ended well, for sure. The brilliant idea belonged to the Silicon Valley class-clown with the pony-tail. I am sure you know to whom I refer.

At any rate, why would an allegedly successful software project change its name in the middle of its rise? I have a hypothesis: any time one searched for Tachyon on Google (or elsewhere), the first link to pop up would be my blog post from last year, and a close second would point to the story of how the Tachyon BDFL decided to remove my benign answer from their public mailing list.

So, in the interest of history preservation, I am putting up a new post, correcting the name to reflect the new reality of the Alluxio project. The technical findings stand as they were, so just go and read the year-old blog post to figure out where the old application with the new name falls short.

Last but not least: since the time of the original write-up, Apache Ignite has graduated to an Apache top-level project (TLP), which is why the “(incubating)” suffix is dropped as well 😉


Let’s speed up Apache Hive with Apache Ignite & Apache Bigtop

Today we will be looking into how to speed up Hive using Apache Ignite. For this particular exercise I will be using the Apache Bigtop v1.0 stack, because I don’t care to waste my time on manual cluster setup; nor do I want to use any of the overly complex stuff like Cloudera Manager or Ambari. I am a Unix command-line guy, and the CLI leaves all these fancy yet semi-baked contraptions biting the dust. Let’s start.

For simplicity I’d suggest using Docker. If you don’t know how to use Docker you can do the same on your own system and clean up the mess later. Or better yet – learn how to use Docker (if you’re on a Mac – you’re on your own!). Despite all the hype around it, it is still a useful tool in some cases. I’ll be starting a container from the official Bigtop Ubuntu-14.04 image:

% sudo docker run -t -i -h 'bigtop1.docker' bigtop/slaves:ubuntu-14.04 /bin/bash
% git clone https://git-wip-us.apache.org/repos/asf/bigtop.git
% cd bigtop
Now you can follow bigtop-deploy/puppet/README.md on how to deploy your cluster. Make sure you have selected hadoop, yarn, ignite-hadoop, and hive while editing /etc/puppet/hieradata/site.yaml (as specified in the README.md). Once the puppet apply command has finished you should have a nice single-node cluster running HDFS, YARN, and ignite-hadoop with IGFS. Hive should be configured and ready to run. Let’s do a couple more steps to get the data in place and ready for the experiments:

# http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
% wget http://seanlahman.com/files/database/lahman591-csv.zip
% unzip -x lahman591-csv.zip
# First, put the data into HDFS
% hadoop fs -copyFromLocal Batting.csv /user/hive
% hadoop fs -copyFromLocal Master.csv /user/hive
and now on to Hive. Make sure it is executed with the proper configuration to take advantage of the in-memory data fabric provided by Apache Ignite. Let’s start the Hive CLI to work with the Ignite cluster, set up the tables, and run some queries:

% HADOOP_CONF_DIR=/etc/hadoop/ignite.client.conf hive cli
create table temp_batting (col_value STRING);
create table batting (player_id STRING, year INT, runs INT);
LOAD DATA INPATH '/user/hive/Batting.csv' OVERWRITE INTO TABLE temp_batting;
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
SELECT COUNT(*) FROM batting WHERE year > 1909 AND year <= 1969;
-- let's do something more real
SELECT a.year, a.player_id, a.runs from batting a
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b
    ON (a.year = b.year AND a.runs = b.runs) ;
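If the repeated-group regex in those regexp_extract calls looks cryptic, here is a quick way to sanity-check what it pulls out, sketched in Python (the sample row below is made up for illustration):

```python
import re

def nth_field(line, n):
    # Same idea as Hive's regexp_extract(col_value, '^(?:([^,]*),?){n}', 1):
    # group 1 is re-captured on every repetition of the outer group, so after
    # {n} repetitions it holds the n-th comma-separated field of the line.
    m = re.match(r'^(?:([^,]*),?){%d}' % n, line)
    return m.group(1) if m else None

row = "aardsda01,2012,1,NYA,AL"   # made-up CSV line: id,year,stint,team,league
print(nth_field(row, 1))          # aardsda01 -> player_id
print(nth_field(row, 2))          # 2012      -> year
```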

Notice the times of both queries.
Quit the Hive session and restart it with the standard configuration to run on top of YARN:

% hive cli

-- All the tables are still in place, so let's just repeat the queries:
SELECT COUNT(*) FROM batting WHERE year > 1909 AND year <= 1969;
SELECT a.year, a.player_id, a.runs from batting a
    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b
    ON (a.year = b.year AND a.runs = b.runs) ;

Once again: notice the execution times and appreciate the difference! Enjoy!

30+ times faster Hadoop MapReduce applications with Bigtop and Ignite

Did you ever wonder how you can deploy a Hadoop stack quickly? Or what can be done to speed up that slow MapReduce job? Look no further – with Apache Bigtop you can get a Hadoop cluster stack deployed in a matter of a few minutes with no hassle and no sweat. And how do you run your old MapReduce applications very fast? Apache Ignite (incubating) gives you that option out of the box with its Hadoop Accelerator.

The stack being deployed in the following demo is from the Apache Bigtop 1.0 RC (Hadoop 2.6, Ignite 1.0, etc.). Enjoy!

Apache Ignite vs Apache Spark

Complementary to my earlier post on Apache Ignite’s in-memory file-system and caching capabilities, I would like to cover the main points of differentiation between Ignite and Spark. I see questions like this come up repeatedly, and it is easier to have them answered in one place so you don’t need to fish around the Net for the answers.

 – The main difference is, of course, that Ignite is an in-memory computing system, i.e. one that treats RAM as the primary storage facility, whereas others – Spark included – use RAM only for processing. The former, memory-first approach is faster because the system can do better indexing, reduce fetch time, avoid (de)serialization, and so on.

 – Ignite’s MapReduce is fully compatible with the Hadoop MR APIs, which lets anyone simply reuse existing legacy MR code, yet run it with a >30x performance improvement. Check this short video demoing an Apache Bigtop in-memory stack speeding up legacy MapReduce code.

 – Also, unlike Spark’s, streaming in Ignite isn’t quantized by the size of an RDD. In other words, you don’t need to form an RDD before processing it; you can do real streaming. This means there are no delays in processing stream content with Ignite.

 – Spill-overs are a common issue for in-memory computing systems: after all, memory is limited. In Spark, where RDDs are immutable, if an RDD was created with a size greater than half a node’s RAM, then a transformation generating the consequent RDD’ will likely fill all of the node’s memory and cause a spill-over (unless the new RDD is created on a different node). Tachyon was essentially an attempt to address this, using old RAM-drive tech with all its limitations.
Ignite doesn’t have this issue with data spill-overs, as its caches can be updated in an atomic or transactional manner. However, spill-overs are still possible; the strategies to deal with them are explained here.

 – As one of its components, Ignite provides a first-class file-system caching layer. Note that I have already addressed the differences between Tachyon and Ignite, but for some reason my post got deleted from their user list. I wonder why? 😉

 – Ignite uses off-heap memory to avoid GC pauses, etc., and does it highly efficiently.

 – Ignite guarantees strong consistency

 – Ignite supports full SQL99 as one of the ways to process the data, with full support for ACID transactions.

 – Ignite supports in-memory SQL index functionality, which lets you avoid full scans of data sets, leading directly to very significant performance improvements (also see the first point).

 – With Ignite, a Java programmer doesn’t have to learn the new ropes of Scala. The programming model also encourages the use of Groovy, and I will withhold my professional opinion about the latter in order to keep this post focused and civilized 😉
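The earlier point about SQL indexes avoiding full scans is easy to feel even in a toy example. The sketch below is plain Python, nothing Ignite-specific, and the data sizes are made up; it just counts how many comparisons a linear scan needs versus a binary search over a sorted in-memory “index”:

```python
import bisect
import math

rows = list(range(1_000_000))   # pretend this is a sorted in-memory column
target = 987_654

# A full scan touches rows one by one until it hits the target value.
scan_comparisons = rows.index(target) + 1

# An index lookup is a binary search: about log2(n) comparisons.
index_comparisons = math.ceil(math.log2(len(rows)))

# Both find the same row; the index just gets there vastly faster.
assert rows[bisect.bisect_left(rows, target)] == target
print(scan_comparisons, "comparisons vs ~", index_comparisons)
```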

I can keep on rambling for a long time, but you might consider reading this and that, where Nikita Ivanov – one of the founders of the project – gives a good reflection on other key differences. Also, if you like what you read, consider joining the Apache Ignite (incubating) community and start contributing!

Apache Ignite (incubating) vs Tachyon

The post has been updated to use the WaybackMachine instead of Twitter, as I’ve closed my Twitter account.

After discovering that my explanation of the differences between Apache Ignite (incubating) and the Tachyon caching project had been purged, I found out that my attempt to clarify the situation was removed as well.
Around the same time I got a private email from the tachyon-users Google group explaining to me that my message “was deleted because it was a marketing message”.

So it looks like any message even slightly critical of the Tachyon project will be deleted as a ‘marketing msg’, in true FOSS spirit! Looks like the community building got off on the wrong foot on that one. So I have decided to post the original message, which of course was sent back to me via email the moment it got posted in the original thread.

Judge for yourself:

Date: Fri, Apr 10, 2015 at 11:46 PM
Subject: Re: Apache Ignite vs Tachyon
To: tachyon-users@googlegroups.com

You’re just partially correct, actually.

Apache Ignite (incubating) is a fully developed In-Memory Computing (IMC) platform, aka a data fabric. “Support for the Hadoop ecosystem” is just one component of that fabric, and it has two parts:
– file-system caching: a fully transparent cache that gives a significant performance boost to HDFS I/O. In a way it’s similar to what Tachyon tries to achieve; unlike Tachyon, though, the cached data is an integral part of the bigger data fabric and can be used by any Ignite services.
– an MR accelerator that allows running “classic” MR jobs on the Ignite in-memory engine. Basically, Ignite MR (much like its SQL and other computation components) is just a way to work with data stored in the cluster memory. Shall I mention that Ignite MR is about 30 times – that’s 3000% – faster than Hadoop MR? No code changes are needed, BTW 😉

When you say that “Tachyon… supports the big data stack natively”, you should keep in mind that Ignite’s Hadoop acceleration is just as native: you can run MR, Hive, HBase, Spark, etc. on top of IgniteFS without changing anything.

And here’s the catch, BTW: file-system caching in Ignite is part of its ‘data fabric’ paradigm, like the services, advanced clustering, distributed messaging, ACID real-time transactions, etc. Adding the HDFS and MR acceleration layer was pretty straightforward, as it was built on the advanced Ignite core, which has been in real-world production for 5+ years. However, it is very hard to achieve the same level of enterprise computing when you start from an in-memory file system like Tachyon. Not bashing anything – just saying.

I would encourage you to check ignite.incubator.apache.org: read the docs, try version 1.0 from https://dist.apache.org/repos/dist/release/incubator/ignite/1.0.0/ (setup is a breeze), and join our Apache community. If you are interested in using Ignite with Hadoop, Apache Bigtop offers this integration, including seamless cluster deployment, which lets you get started with a fully functional cluster in a few minutes.

In the spirit of full disclosure: I am an Apache Incubator mentor for the Ignite project.

With best regards,
Konstantin Boudnik

On Thursday, April 9, 2015 at 7:39:00 PM UTC-7, Pengfei Xuan wrote:
> To my understanding, Apache Ignite (GridGain) grows up from traditional

Schema-on-read or schema-on-write?

I was recently asked whether schema-on-read is superior to schema-on-write, and how that relates to “traditional” storage systems like EMC, NetApp, and Teradata seemingly losing ground to commodity-based storage systems. Here are some semi-random thoughts.

First of all, I think schema-on-read/schema-on-write is a fancy way of saying whether the data was stored in an unstructured or structured way. It all boils down to whether there’s a need to store unaltered data or not. If statistics teaches us anything at all, it is that by creatively selecting a subset of data, or making changes in a data sample or in the model itself, you can prove a correlation between anything imaginable 😉 Hence, there are clear benefits to keeping data ‘as-is’: without any pre-processing, cleaning, dedup’ing, and so on. It allows you to run different models or apply alternative approximations.
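A toy illustration of the two approaches (the field names and records below are made up):

```python
import json

# Raw records, kept "as-is"; the second one lacks a field the first one has.
raw = ['{"player": "ruth", "runs": "177", "note": "double-check"}',
       '{"player": "cobb", "runs": "147"}']

# Schema-on-write: impose the schema once, at load time. Anything outside
# the schema (like "note") is dropped for good during ingestion.
def load_structured(lines):
    return [{"player": json.loads(l)["player"],
             "runs": int(json.loads(l)["runs"])} for l in lines]

# Schema-on-read: keep the raw records and impose a schema per query, so a
# later query can still reach fields the first schema ignored.
def query(lines, field, cast=str):
    return [cast(json.loads(l)[field]) for l in lines]

print(max(r["runs"] for r in load_structured(raw)))   # same answer either way
print(query(raw, "runs", int))                        # but the raw extras survive
```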

It might appear that the schema-on-read approach is always superior, but there are plenty of cases where it isn’t so. All sorts of scientific, engineering, financial, medical, and accounting systems will still enjoy the benefits of data structuring for years to come. And of course, there are good cases for the opposite, unstructured way of storing things: marketing, social studies, economic modeling (which is utter nonsense, of course, but people still believe in it for some reason), and so on.

I don’t think that “schema-on-write” technology is inherently so much more expensive. In the overall order of things one might save some money at the pre-processing stage by using commodity hardware and open software, but will have to pay more in direct and indirect costs related to more expensive and slower BA & BI solutions.

For all I know, we might be witnessing the endgame for EMC/NetApp & co., but not because of the way they pre-process data before storing it. Their real challenge is the huge change in the software development landscape that has happened over the last 20+ years with GNU, Linux, the ASF, and other free and open software models. No doubt these companies have well-developed sales channels and established brands, but it is almost impossible to out-sell something that anyone can download from the net for the cost of the bandwidth and get up and running in a matter of hours or even faster. And there’s a whole spectrum of such open systems, so you don’t have to lock yourself in to any one of them. Now, go and compete with that!

Annual review of Bigdata software; what’s in store for 2014

In the couple of days left before the year’s end I wanted to look back and reflect on what has happened so far in the IT bubble 2.0, commonly referred to as “BigData”. Here are some of my musings.

Let’s start with this simple statement: BigData is a misnomer. Most likely it was put forward by some PR or MBA schmuck with no imagination whatsoever, who thought that a terabyte consists of 1000 megabytes 😉 The word has been picked up by pointy-haired bosses all around the world, as they need buzzwords to justify their existence to the people around them. But I digress…

So what has happened in the last 12 months in this segment of software development? Well, surprisingly, you can count the really interesting events on one hand. To name a few:

  • Fault tolerance in distributed systems reached a new level with NonStop Hadoop, introduced by WANdisco earlier this year. The idea of avoiding complex screw-ups by agreeing on operations up-front leaves things like Linux-HA, Hadoop QJM, and NFS-based solutions rolling in the dust in the rear-view mirror.
  • Hadoop HDFS is clearly here to stay: you can see customers shifting from platforms like Teradata towards the cheaper and widely supported HDFS network storage, with EMC (VMware, Greenplum, etc.) offering it as the storage layer under Greenplum’s proprietary PostgreSQL-based cluster, and many others.
  • While enjoying a huge head start, HDFS has a strong, if not very obvious, competitor: Ceph. As some know, there’s a patch that provides a Ceph drop-in replacement for HDFS. But where it gets really interesting is how systems like Spark (see the next paragraph) can work directly on top of the Ceph file system with relatively small changes in the code. Just picture it: a distributed Linux file system with high-speed data analytics on top. Drawing conclusions is left as an exercise for the reader.

  • With the recent advent and fast rise of the new in-memory analytics platform – Apache Spark (incubating) – the traditional, two-bit MapReduce paradigm is losing its grasp very quickly. The gap is getting wider with the new generation of task and resource schedulers gaining momentum by the day: Mesos, the Spark standalone scheduler, Sparrow. The latter is especially interesting with its 5 ms scheduling guarantees. That leaves the latest reincarnation of MR in a predicament.
  • Shark – the SQL layer on top of Spark – is winning the day in the BI world, and you can see it gaining more popularity. It seems to have nowhere to go but up, as things like Impala, Tez, and ASF Drill are still very far away from being accepted in the data centers.
  • With all of the above, it is very exciting to see my good friends from AMPLab spinning up a new company that will focus on the core platform of Spark, Shark, and all things related. Best wishes to Databricks in the coming year!
  • Speaking of BI, it is interesting to see that BigData BI and BA companies are still trying to prove their business model and make it self-sustainable. Cases in point: Datameer with its recent $19M D-round, Platfora’s $20M B-round last year, etc. I reckon we’ll see more fund-raisers in the 10^7 or perhaps 10^8 dollar range in the coming year, among both the application companies and the platform ones. New letters will also be added to the mix – F-rounds, G-rounds, etc. – as cheap currency keeps finding its way from the Fed through the financial sector to the pockets of VCs, and further down to high-risk sectors like IT and software development. This will lead to an over-heated job market in Silicon Valley and elsewhere, followed by a blow-up similar to, but bigger than, 2000–2001. It will be particularly fascinating to watch big companies scavenging the pieces after the explosion. So duck to avoid the shrapnel.
  • Stack integration and validation has become a pain point for many. I see the effects of it in the sharp uptake of interest in, and growth of, the Apache Bigtop community. Which is no surprise, considering that all commercial distributions of Hadoop today are based on, or directly using, Bigtop as the stack-producing framework.

While I don’t have a crystal ball (would be handy sometimes) I think a couple of very strong trends are emerging in this segment of the technology:

  • HDFS availability – and software stack availability in general – is a big deal: with more and more companies adding an HDFS layer to their storage stack, stricter SLAs will emerge. And I am not talking about 5 nines – an equivalent of about 5 minutes of downtime per year – but rather about 6 and 7 nines. I think ZooKeeper-based solutions are in for a rough ride.
  • Machine learning has huge momentum; the Spark Summit was one big piece of evidence for it. With this comes the need for incredibly fast scheduling and hardware utilization. Hence, things like Mesos, Spark standalone, and Sparrow are going to keep gaining momentum.
  • The seasonal lemming-like migration to the cloud will continue, I am afraid. Security will become a red-hot issue and an investment opportunity. However, anyone who values their data is unlikely to move to the public cloud; hence private platforms like OpenStack might be on the rise (if the providers can deal with the “design by committee” issues, of course).
  • Storage and analytics stack deployment and orchestration will be more pressing than ever (and no, I am talking about real orchestration, not cluster-management software). That’s why I am looking very closely at what companies like Reactor8 are doing in this space.

So, the last year brought a lot of excitement and interesting challenges. 2014, I am sure, will be even more fun. However, “living in interesting times” might be both a curse and a blessing. Stay safe, my friends!

High Availability is the past; Continuous Availability is the future

Do you know what SiliconAngle and the Wikibon project are? If not, check them out soon. These guys have a vision of next-generation media coverage; I would call it the ‘#1 no-BS Silicon Valley media channel’. They run professional video journalism with a very smart technical setup, and they aren’t your typical loudmouths from TV: they use and grok the technologies they cover. Say, they run Apache Solr in-house for real-time trend processing and searches. Amazing. And they don’t have teleprompters. Nor screenplay writers. How cool is that?

At any rate, I was invited onto their show, theCube, last week on the last day of Hadoop Summit. I was talking about High Availability issues in Hadoop. Yup, High Availability has issues, you heard me right. The issue is uptime of less than 100%. Basically, even if someone claims to provide 5-9s (that is, 99.999% uptime), you are still looking at about 5 minutes a year of downtime for your mission-critical infrastructure.
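The arithmetic behind the “nines” is easy to check with a back-of-the-envelope sketch:

```python
# Downtime budget implied by each availability level ("nines").
minutes_per_year = 365.25 * 24 * 60

for nines in range(3, 8):
    unavailability = 10 ** -nines   # e.g. 5 nines -> 0.00001 of the year down
    print(f"{nines} nines: {unavailability * minutes_per_year:8.2f} min/year of downtime")
```

Five nines comes out to roughly 5.26 minutes a year; even seven nines still leaves a few seconds of downtime.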

If you need 100% uptime for your Hadoop cluster, then you should be looking at Continuous Availability. Curiously enough, the solution is found in the past (isn’t that always the case?) in the so-called Paxos algorithm, devised by Leslie Lamport back in 1989. However, the original Paxos algorithm has some performance issues, has generally never been fully embraced by the industry, and is rarely used outside of a few tech-savvy companies. One of them – WANdisco – applied it first to Subversion replication and now to the Hadoop HDFS SPOF problem, and made it generally available as a commercial product.
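For the curious, the core of single-decree Paxos fits in a few dozen lines. The sketch below is purely illustrative (no networking, no failures, no multi-Paxos; real deployments such as WANdisco’s run heavily engineered variants), but it shows the prepare/accept two-phase dance and why a chosen value stays chosen:

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (number, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n,
        # reporting any previously accepted proposal back to the proposer.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered prepare was seen meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """Phases 1a/2a: run one proposal round against a set of acceptors."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in replies if ok]
    if len(granted) < quorum:
        return None                       # couldn't get a promise quorum
    # If any acceptor already accepted a value, we must re-propose the
    # value of the highest-numbered accepted proposal - this is what
    # makes consensus stable once reached.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= quorum else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "cfg-A"))   # chooses cfg-A
print(propose(acceptors, 2, "cfg-B"))   # still cfg-A: the decision sticks
```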

And just think what could be done if the same technology were applied to mission-critical analytical platforms such as AMPLab’s Spark. Anyway, watch the recording of my interview on theCube and learn more.