30+ times faster Hadoop MapReduce applications with Bigtop and Ignite

Did you ever wonder how you can deploy a Hadoop stack quickly? Or what can be done to speed up that slow MapReduce job? Look no further: with Apache Bigtop you can get a Hadoop cluster stack deployed in a matter of minutes, with no hassle and no sweat. And how do you run your old MapReduce applications very fast? Apache Ignite (incubating) gives you that option out of the box with its Hadoop Accelerator.

The stack being deployed in the following demo is from Apache Bigtop 1.0 RC (Hadoop 2.6, Ignite 1.0, etc.). Enjoy!


Apache Ignite vs Apache Spark

Complementary to my earlier post on Apache Ignite's in-memory file-system and caching capabilities, I would like to cover the main points of differentiation between Ignite and Spark. I see questions like this coming up repeatedly. It is easier to have them answered in one place, so you don’t need to fish around the Net for the answers.

 – The main difference is, of course, that Ignite is an in-memory computing system, i.e. one that treats RAM as the primary storage facility, whereas others, Spark included, only use RAM for processing. The former, memory-first approach is faster because the system can do better indexing, reduce fetch time, avoid (de)serialization, etc.

 – Ignite’s MapReduce is fully compatible with the Hadoop MR APIs, which lets everyone simply reuse existing legacy MR code, yet run it with a >30x performance improvement. Check this short video demoing an Apache Bigtop in-memory stack speeding up legacy MapReduce code.

 – Also, unlike Spark’s, the streaming in Ignite isn’t quantized by the size of an RDD. In other words, you don’t need to form an RDD first before processing it; you can actually do real streaming. This means there are no delays in processing the stream content in the case of Ignite.

 – Spill-overs are a common issue for in-memory computing systems: after all, memory is limited. In Spark, where RDDs are immutable, if an RDD is created with a size greater than 1/2 of the node’s RAM, then a transformation and the generation of the consequent RDD’ will likely fill all of the node’s memory, which will cause a spill-over unless the new RDD is created on a different node. Tachyon was essentially an attempt to address this using old RAM-drive tech, with all its limitations.
Ignite doesn’t have this issue with data spill-overs, as its caches can be updated in an atomic or transactional manner. However, spill-overs are still possible: the strategies to deal with them are explained here.

 – As one of its components, Ignite provides a first-class file-system caching layer. Note, I have already addressed the differences between Tachyon and Ignite, but for some reason my post got deleted from their user list. I wonder why? 😉

 – Ignite uses off-heap memory to avoid GC pauses, etc., and does it highly efficiently

 – Ignite guarantees strong consistency

 – Ignite supports full SQL99 as one of the ways to process the data, with full support for ACID transactions

 – Ignite supports in-memory SQL indexes, which let you avoid full scans of data sets, directly leading to very significant performance improvements (also see the first point above)

 – With Ignite, a Java programmer doesn’t have to learn the ropes of Scala. The programming model also encourages the use of Groovy. And I will withhold my professional opinion about the latter in order to keep this post focused and civilized 😉

I can keep on rambling for a long time, but you might consider reading this and that, where Nikita Ivanov, one of the founders of this project, has a good reflection on other key differences. Also, if you like what you read, consider joining the Apache Ignite (incubating) community and start contributing!

Apache Ignite (incubating) vs Tachyon

The post has been updated to use the Wayback Machine instead of Twitter, as I’ve closed my Twitter account.

After discovering that my explanation of the differences between Apache Ignite (incubating) and the Tachyon caching project had been deleted, I found out that my attempt to clarify the situation was purged as well.
At about the same time, I got a private email from the tachyon-user Google group explaining to me that my message “was deleted because it was a marketing message”.

So, it looks like any messages even slightly critical of the Tachyon project will be deleted as ‘marketing msgs’, in true FOSS spirit! Looks like the community building got off on the wrong foot on that one. So, I have decided to post the original message, which of course was sent back to me via email the moment it got posted in the original thread.

Judge for yourself:

Date: Fri, Apr 10, 2015 at 11:46 PM
Subject: Re: Apche Ignite vs Tachyon
To: tachyon-users@googlegroups.com

You’re just partially correct, actually.

Apache Ignite (incubating) is a fully developed In-Memory Computing (IMC) platform (aka data fabric). “Supporting for Hadoop ecosystem” is one of the components of the fabric. And it has two parts:
– file system caching: a fully transparent cache that gives a significant performance boost to HDFS IO. In a way it’s similar to what Tachyon tries to achieve. Unlike Tachyon, the cached data is an integral part of a bigger data fabric that can be used by any Ignite services.
– MR accelerator that allows running “classic” MR jobs on the Ignite in-memory engine. Basically, Ignite MR (much like its SQL and other computation components) is just a way to work with data stored in the cluster memory. Shall I mention that Ignite MR is about 30 times – that’s 3000% – faster than Hadoop MR? No code changes are needed, BTW 😉

When you say about “Tachyon… support big data stack natively.” you should keep in mind that Ignite Hadoop acceleration is very native as well: you can run MR, Hive, HBase, Spark, etc. on top of the IgniteFS without changing anything.

And here’s the catch BTW: file system caching in Ignite is a part of its ‘data fabric’ paradigm, like the services, advanced clustering, distributed messaging, ACID real-time transactions, etc. Adding the HDFS and MR acceleration layer was pretty straightforward as it was built on the advanced Ignite core, which has been in real-world production for 5+ years. However, it is very hard to achieve the same level of enterprise computing when you start from an in-memory file system like Tachyon. Not bashing anything – just saying.

I would encourage you to check ignite.incubator.apache.org: read the docs, try version 1.0 from https://dist.apache.org/repos/dist/release/incubator/ignite/1.0.0/ (setup is a breeze) and join our Apache community. If you are interested in using Ignite with Hadoop – Apache Bigtop offers this integration, including seamless cluster deployment, which lets you get started with a fully functional cluster in a few minutes.

In full disclosure: I am an Apache Incubator mentor for the Ignite project.

With best regards,
Konstantin Boudnik

On Thursday, April 9, 2015 at 7:39:00 PM UTC-7, Pengfei Xuan wrote:
> To my understanding, Apache Ignite (GridGain) grows up from traditional

Schema-on-read or schema-on-write?

I was recently asked if schema-on-read is superior to schema-on-write, and how it relates to “traditional” storage systems like EMC, NetApp, and Teradata seemingly losing ground to commodity-based storage systems. Here are some semi-random thoughts.

First of all, I think schema-on-read/schema-on-write is a fancy way of saying whether the data is stored in an unstructured or a structured way. It all boils down to whether there’s a need to store unaltered data or not. If statistics teaches us anything at all, it would be that by creatively selecting a subset of data, or by making changes in a data sample or in the model itself, you can prove a correlation between anything imaginable 😉 Hence, there are clear benefits to keeping data ‘as-is’: without any pre-processing, cleaning, dedup’ing, and so on. It will allow you to run different models or apply alternative approximations.
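To make the distinction concrete, here is a minimal sketch of the two approaches. Everything in it is made up for illustration: the event records, the field names, and the function names are assumptions, not taken from any particular storage product.

```python
import json

# Hypothetical raw events; the field names are invented for illustration.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "note": "first order"}',
    '{"user": "bob", "amount": "oops"}',
]


def store_schema_on_write(raw_lines):
    """Validate and convert at ingest time; records that don't fit are dropped."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            table.append({"user": str(record["user"]),
                          "amount": float(record["amount"])})
        except (KeyError, TypeError, ValueError):
            pass  # the original record is gone for any later analysis
    return table


def store_schema_on_read(raw_lines):
    """Keep every line untouched; interpretation is deferred to query time."""
    return list(raw_lines)


def total_amount_on_read(stored_lines):
    """One possible 'schema' applied at read time; another model could differ."""
    total = 0.0
    for line in stored_lines:
        try:
            total += float(json.loads(line).get("amount", 0))
        except (TypeError, ValueError):
            continue  # this particular model skips values it can't use
    return total
```

With schema-on-write, the malformed record and the extra "note" field never reach storage; with schema-on-read both survive, and it is up to each query to decide what to do with them.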

It might appear that the schema-on-read approach is always superior, but there are plenty of cases where it isn’t so. All sorts of scientific, engineering, financial, medical, and accounting systems will still enjoy the benefits of data structuring for years to come. And of course, there are good cases for the opposite, unstructured way of storage: marketing, social studies, economic modeling (which is utter nonsense, of course, but people still believe in it for some reason), and so on.

I don’t think that “schema-on-write” technology is inherently so much more expensive. In the overall order of things, one might save some on the pre-processing stage by using commodity hardware and open software, but will have to pay more in direct and indirect costs related to more expensive and slower BA & BI solutions.

For all I know, we might be witnessing an end game for EMC/NetApp & co., but not because of the way they pre-process the data before storing it. Their real challenge is the huge change in the software development landscape that has happened over the last 20+ years with GNU, Linux, ASF and other free and open software models. No doubt, these companies have well-developed sales channels and established brands, but it is almost impossible to outsell something that anyone can download from the net for the cost of the bandwidth, and get up and running in a matter of hours or even faster. And there’s a whole spectrum of such open systems, so you don’t have to lock yourself into any one of them. Now, go and compete with that!

Warning [Rant]: YAML is an incredible piece of turd

I spent, nay wasted, an hour of my time today trying to figure out the reason for the following error message from Puppet Hiera:

vmhost05-hbase3: Error: syntax error on line 30, col -1: `’ at /root/bigtop/bigtop-deploy/puppet/manifests/site.pp:17 on node ….
The relevant part of the Hiera site.yaml file is:

bigtop::bigtop_yumrepo_uri:  "http://archive.hostname.com/redhat/6/x86_64/7.3.0/"
bigtop::jdk_package_name: 'jdk-1.7.0_55'

Firstly, as a former compiler developer, it hurts every bit of my brain when I see an error message like the one above. Huge “compliment” to the Hiera developers – learn how to write code, dammit.

Secondly, after investigating this for literally an hour, I figured out that the separator in uri:  "http was a TAB (ASCII 9) instead of spaces.

Seriously dudes – it’s the 21st century. What’s the reason to use formats and parsers that fail so badly on separator terminals? Just imagine if the Java or Groovy compiler were so picky about tabs vs. spaces. I guarantee – half of the development community would be screaming bloody murder right there. Yet – with frigging YAML POS it is just ok ;(
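Since the parser won't say where the offending character is, a few lines of script can. The following is only a sketch: the function name is hypothetical, and the sample text simply mirrors the two site.yaml lines above with a TAB placed after uri: to reproduce the situation.

```python
def find_tabs(yaml_text):
    """Return 1-based (line, column) pairs for every TAB character in the text."""
    hits = []
    for line_no, line in enumerate(yaml_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((line_no, col))
    return hits


# Assumed sample: the snippet from this post, with a TAB as the separator.
sample = (
    'bigtop::bigtop_yumrepo_uri:\t"http://archive.hostname.com/redhat/6/x86_64/7.3.0/"\n'
    "bigtop::jdk_package_name: 'jdk-1.7.0_55'\n"
)

for line_no, col in find_tabs(sample):
    print(f"TAB at line {line_no}, column {col}")
```

To check a real file, read it first and pass the contents in, e.g. find_tabs(open("site.yaml").read()).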


How to mount a RAID1 volume on Ubuntu

If you ever need to mount an encrypted partition from a RAID1 NAS on your Ubuntu system (like a laptop or a different server), here’s a simple three-step instruction. Figure out which partition needs to be mounted (you can do it by running parted or similar to figure out what your target should be); for the sake of the example it will be /dev/sdd2. And now:

% sudo mdadm --assemble --run /dev/md0 /dev/sdd2
% sudo cryptsetup -v luksOpen /dev/md0 mapperpoint
% sudo mount /dev/mapper/mapperpoint /mnt/external/

If you need to check the state of the drive while connected via USB enclosure, run

% sudo smartctl -aH -d sat /dev/sdd

The only trick is to add the -d sat device type option.

Or to simplify the whole thing, just run Disk Utility and click “Start RAID” button 😉

Finally upgrading from Debian Lenny

If, like me, you were putting off an upgrade from Debian 5.0 Lenny, you might find yourself blocked, because ftp.us.debian.org/debian/dists/lenny doesn’t exist anymore anywhere on the US mirrors. However, I needed to do one last update before getting on with dist-upgrade.

Luckily enough, I was able to find a mirror in Germany which still has the Lenny dist around. So, if you find yourself in my shoes, edit /etc/apt/sources.list on your system and replace

then do the usual update and then an upgrade. Good luck!