Today we will be looking into how we can speed Hive using Apache Ignite. For this particular exercise I will be using Apache Bigtop stack v1.0 because I don’t care wasting my time with manual cluster setting; nor I do want to use any of the overly complex stuff like Cloudera Manager or Ambari. I am a Unix command-line guy, and CLI leaves all these fancy yet semi-backed contraptions biting the dust. Let’s start.
% git clone https://git-wip-us.apache.org/repos/asf/bigtop.git
% cd bigtop
;;
;; wget http://seanlahman.com/files/database/lahman591-csv.zip
;; unzip -x lahman591-csv.zip
;; First set the data
;; hadoop fs -copyFromLocal Batting.csv /user/hive
;; hadoop fs -copyFromLocal Master.csv /user/hive
create table temp_batting (col_value STRING);
create table batting (player_id STRING, year INT, runs INT)
LOAD DATA INPATH ‘/user/hive/Batting.csv’ OVERWRITE INTO TABLE temp_batting;
insert overwrite table batting
SELECT
regexp_extract(col_value, ‘^(?:([^,]*)\,?){1}’, 1) player_id,
regexp_extract(col_value, ‘^(?:([^,]*)\,?){2}’, 1) year,
regexp_extract(col_value, ‘^(?:([^,]*)\,?){9}’, 1) run
from temp_batting;
SELECT COUNT(*) FROM batting WHERE year > 1909 AND year <= 1969;
;; let’s do something more real
SELECT a.year, a.player_id, a.runs from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b.year AND a.runs = b.runs) ;
Notice the times of both queries.
Quit the hive session and restart it with standard config to run on top of YARN:
;; All the tables are still in place, so let’s just repeat the queries:
SELECT COUNT(*) FROM batting WHERE year > 1909 AND year <= 1969;
SELECT a.year, a.player_id, a.runs from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b.year AND a.runs = b.runs) ;