I came across this post from Platfora which, among other trivialities, says:
Hadoop is irresistible for this reason, but the big question that remains is how to use the data there once you’ve stored it. The challenge is that Hadoop is a very different architecture to traditional data warehouses. It is a batch engine — a lumbering freight train that can process immense amounts of data, but takes a while to get up to speed, so even the simplest question requires minutes of processing.
How lyrical! And then we get a glimpse of The Promised Land lying ahead:
Here at Platfora we are laser focused on this next phase of Hadoop. The result won’t just match the status quo, but exceed it in flexibility and the ability to scale and adapt to changing requirements. Exciting times are ahead – stay tuned.
No, wait – not exactly a promised land: just a promise of one. I wonder if this is an attempt at damage control after yesterday’s announcement about a vendor’s support for the Spark platform, which I discussed in my last post? 🙂
Skimming through my emails today, I came across this interesting post on the general@hadoop list:
From: MTG dev
Subject: Lightning fast in-memory analytics on HDFS
Date: Mon, 24 Sep 2012 16:31:56 GMT
Because a lot of people here are using HDFS day in and day out the
following might be quite interesting for some.
Magna Tempus Group has just rolled out a readily available Spark 0.5
(www.spark-project.org) packaged for Ubuntu distribution. Spark delivers up
to 20x faster experience (sic!) using in-memory analytics and a computational
model that is different from MapReduce.
You can read the rest here. If you don’t know about Spark, you should definitely check the Spark project website and see how cool it is. If you are too lazy to dig through the information, here’s a brief summary for you (taken from the original poster’s Magna Tempus Group website). Spark:
- consists of a completely separate codebase optimized for low latency, although it can load data from any Hadoop input source, S3, etc.
- doesn’t have to use Hadoop, actually
- provides a new, highly efficient computational model, with programming interfaces in Scala and Java. We might soon start working on adding a Groovy API to the set
- offers lazy evaluation, which allows “postponed” execution of operations
- can do in-memory caching of data for later high-performance analytics. Yeah, go shopping for more RAM, gents!
- can be run locally on a multicore system or on a Mesos cluster
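To make the lazy evaluation and in-memory caching points a bit more concrete, here is a toy sketch in plain Python. It is not Spark's API: the `LazyDataset` class, its method names, and the `load` function are all hypothetical, made up purely for illustration. The idea it shows is that transformations only record what should happen, the work runs when a result is actually requested, and a cached result is reused instead of being recomputed.

```python
class LazyDataset:
    """Toy, hypothetical stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function that produces the records
        self._cache_enabled = False    # set by cache()
        self._cached = None            # filled in by the first action after cache()

    def map(self, fn):
        # Only records the transformation; nothing is computed here.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        # Same: returns a new lazy dataset that wraps this one.
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Ask for the result to be kept in memory once it has been computed.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if not self._cache_enabled:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return self._cached

    def collect(self):
        # An "action": this is where the postponed work actually runs.
        return list(self._materialize())

    def count(self):
        return len(self._materialize())


loads = []  # records every time the (pretend) expensive load runs


def load():
    loads.append(1)
    return list(range(10))


squares_of_evens = (
    LazyDataset(load)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
    .cache()
)

print(len(loads))                  # 0: nothing has been loaded yet
print(squares_of_evens.collect())  # [0, 4, 16, 36, 64]
print(squares_of_evens.count())    # 5, answered from the cached result
print(len(loads))                  # 1: the source was read only once
```

The real thing is distributed and far smarter about what it keeps in memory, but the shape of the trade-off is the same: postponing execution costs nothing up front, and caching trades RAM for not repeating the expensive part.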
Yawn, some might say. There are Apache Drill and other things that seem to be highly promising and all. Well, not so fast.
To begin with, I am not aware of any productized version of Drill (merged with Open Dremel or vice versa). Perhaps there are some other technologies around that are 20x faster than Hadoop – I just haven’t heard about them, so please feel free to correct me on this.
Also, Spark and some of its components (the Mesos resource planner and such) have been happily adopted by interesting companies such as Twitter.
What is not said outright is that the adoption of new in-memory high-performance analytics for big data by commercial vendors like Magna Tempus Group opens a completely new page in the BigData storybook.
I would “dare” to go as far as to assert that this new development means that Hadoop isn’t the smartest kid on the block anymore – there are other faster and perhaps cleverer fellas moving in.
And I can’t help but wonder if Spark has lit a fire under the yellow elephant yet?