
Setting up Hadoop/YARN/Spark/Hive on Mac OSX El Capitan

If, like me, you love having everything you develop against working locally in a mini-integration environment, read on.

Here, we attempt to get some pretty heavyweight stuff working locally on your Mac, namely:

  1. Hadoop (Hadoop2/HDFS)
  2. YARN (So you can submit MR jobs)
  3. Spark (We will illustrate with Spark Shell, but should work on YARN mode as well)
  4. Hive (So we can create some tables and play with them)
We will use the latest stable Cloudera distribution, and work off the jars. Most of the methodology is borrowed from here; this post just links the four pieces together nicely.

Download Stuff

First of all, make sure you have Java 7 or 8 installed, with the JAVA_HOME variable set up to point to the correct location. Download the CDH tarballs for Hadoop, Zookeeper, and Hive from the tarball page (CDH 5.4.x page) and untar them under a folder (referred to as CDH_HOME going forward), with Hadoop and Zookeeper in directories named hadoop and zookeeper

$ ls $HOME/bin/cdh/5.4.7
hadoop                          hadoop-2.6.0-cdh5.4.7.tar.gz    hive-1.1.0-cdh5.4.7             hive-1.1.0-cdh5.4.7.tar.gz      zookeeper                       zookeeper-3.4.5-cdh5.4.7.tar.gz

While you are at it, also grab a version of Spark (pre-built for Hadoop 2.6.x) from here, and untar it to a directory like the one below, which we will call $SPARK_INSTALL

$ ls $HOME/bin/spark-1.5.0-bin-hadoop2.6/
CHANGES.txt LICENSE     NOTICE      R    RELEASE     bin         conf        data        ec2         examples    lib         python      sbin

You may also want to set up a few variables early on, for use later. Note that CDH must be defined before the paths that use it

export CDH="5.4.7"
export HADOOP_HOME="$HOME/bin/cdh/${CDH}/hadoop"
export ZK_HOME="$HOME/bin/cdh/${CDH}/zookeeper"
export SPARK_INSTALL="$HOME/bin/spark-1.5.0-bin-hadoop2.6"
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${ZK_HOME}/bin:${SPARK_INSTALL}/bin:${PATH}
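
Before moving on, it can help to confirm that each of these variables points at a real directory. A minimal sketch, assuming the exports above have been sourced in the current shell; `check_dir` is a hypothetical helper, not part of Hadoop, Spark, or jenv:

```shell
# check_dir is a hypothetical helper: report whether the value of a
# variable (passed as name and path) is an existing directory.
check_dir() {
  name="$1"
  path="$2"
  if [ -d "$path" ]; then
    echo "ok: $name -> $path"
  else
    echo "MISSING: $name -> $path"
  fi
}

# An unset variable expands to an empty path and is reported as MISSING.
check_dir JAVA_HOME "${JAVA_HOME:-}"
check_dir HADOOP_HOME "${HADOOP_HOME:-}"
check_dir ZK_HOME "${ZK_HOME:-}"
check_dir SPARK_INSTALL "${SPARK_INSTALL:-}"
```

Any line starting with MISSING usually means a typo in the path or a tarball that was untarred under a different name.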

Tip 1: If you are using jenv to manage your Java versions, you might need the following additional lines in your .bashrc/.bash_profile.

eval "$(jenv init -)"
export JAVA_HOME="$HOME/.jenv/versions/`jenv version-name`"
alias jenv_set_java_home='export JAVA_HOME="$HOME/.jenv/versions/`jenv version-name`"'

Tip 2: Don't accidentally name your Spark install dir variable SPARK_HOME. Hive does things with it that you may not like.

Setup Hadoop/YARN 

The page we pointed to before is already an excellent resource for this. I will just point out some additional configs I had to add as I brought in Hive, to make things easier to debug

To etc/hadoop/core-site.xml (to let Hive queries impersonate) 
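
Impersonation is typically enabled through Hadoop's proxy-user properties. A sketch under that assumption: the `hadoop.proxyuser.<user>.hosts` and `hadoop.proxyuser.<user>.groups` property names are standard Hadoop, but the wildcard values and the `proxyuser_props` helper are illustrative choices, not necessarily what this setup originally used. The snippet prints the properties for the user who runs Hive, ready to paste inside the `<configuration>` element:

```shell
# proxyuser_props is a hypothetical helper that prints Hadoop proxy-user
# properties for the given user. The "*" values (any host, any group) are
# an assumption that is only reasonable on a local, single-user machine.
proxyuser_props() {
  user="$1"
  cat <<EOF
  <property>
    <name>hadoop.proxyuser.${user}.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.${user}.groups</name>
    <value>*</value>
  </property>
EOF
}

# Print the properties for the current login user.
proxyuser_props "$(whoami)"
```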


To etc/hadoop/yarn-site.xml (to let Hive queries leave a debuggable log)

  <description>Where to aggregate logs to.</description>
  <description>Number of seconds to retain logs for</description>
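
For log aggregation, the YARN properties usually involved are `yarn.log-aggregation-enable`, `yarn.nodemanager.remote-app-log-dir`, and `yarn.log-aggregation.retain-seconds`. A sketch assuming those are the properties the two descriptions above belong to; the values (`/tmp/logs`, three days) and the helper name `yarn_log_props` are illustrative assumptions:

```shell
# yarn_log_props is a hypothetical helper that prints log-aggregation
# properties for yarn-site.xml. Which properties the original config set
# is an assumption; the names themselves are standard YARN settings.
yarn_log_props() {
  log_dir="$1"
  retain_seconds="$2"
  cat <<EOF
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>${log_dir}</value>
  </property>
  <property>
    <description>Number of seconds to retain logs for</description>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>${retain_seconds}</value>
  </property>
EOF
}

# Example values (assumed): an HDFS directory and 3 days = 259200 seconds.
yarn_log_props /tmp/logs 259200
```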

Make sure you can start HDFS & YARN locally
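
One way to check this, sketched under the assumption that `$HADOOP_HOME/sbin` and `$JAVA_HOME/bin` are on your PATH as exported earlier. `start-dfs.sh`, `start-yarn.sh`, and `jps` are the stock Hadoop and JDK tools; `start_local_cluster` is a hypothetical wrapper:

```shell
# start_local_cluster is a hypothetical wrapper: verify the stock start
# scripts are on PATH, start HDFS and YARN, then list the running JVMs.
# On a brand-new install, run "hdfs namenode -format" once beforehand.
start_local_cluster() {
  for script in start-dfs.sh start-yarn.sh jps; do
    if ! command -v "$script" >/dev/null 2>&1; then
      echo "missing: $script (check PATH)"
      return 1
    fi
  done
  start-dfs.sh && start-yarn.sh && jps
}
```

After calling `start_local_cluster`, the `jps` listing on a healthy single-node setup should include NameNode, DataNode, ResourceManager, and NodeManager.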

Setup Hive  

Go into the CDH_HOME/hive-1.1.0-cdh5.4.7 folder and follow the quickstart to build Hive. Basically, run a command like the one below

mvn clean package -Phadoop-2,dist

Once you are past the basic steps of the quickstart, make a hive-site.xml like the one below and copy it to your Hadoop install

$ cat $HADOOP_HOME/etc/hadoop/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
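
A minimal sketch of what such a file could contain, assuming the only setting needed is the metastore address used by the commands in this post (`thrift://localhost:10000`); a real file may well hold more properties. `hive.metastore.uris` is a standard Hive property, while `hive_site_xml` is a hypothetical helper:

```shell
# hive_site_xml is a hypothetical helper that prints a minimal
# hive-site.xml pointing clients at the given metastore URI.
# That this single property is sufficient is an assumption.
hive_site_xml() {
  metastore_uri="$1"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>${metastore_uri}</value>
  </property>
</configuration>
EOF
}

# Print the file; redirect to "$HADOOP_HOME/etc/hadoop/hive-site.xml" to install it.
hive_site_xml "thrift://localhost:10000"
```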

Once this is done, you should be able to start a metastore server

[apache-hive-1.1.0-cdh5.4.7-bin]$ bin/hive --service metastore -p 10000

Open up a CLI (create a table & run a small query)

[apache-hive-1.1.0-cdh5.4.7-bin]$ bin/hive --hiveconf hive.metastore.uris=thrift://localhost:10000
readlink: illegal option -- f
usage: readlink [-n] [file ...]

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> CREATE TABLE pokes (foo INT, bar STRING);
Time taken: 0.651 seconds

hive> select count(*) from pokes;
Query ID = vinoth_20160523115454_527e550c-7318-4ffc-a49f-248ca119c5a8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-23 11:54:28,958 Stage-1 map = 0%,  reduce = 0%
2016-05-23 11:54:34,119 Stage-1 map = 100%,  reduce = 0%
2016-05-23 11:54:39,249 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_1464029642280_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   HDFS Read: 6476 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
Time taken: 18.299 seconds, Fetched: 1 row(s)
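
If a query like this fails, the logs kept by the yarn-site.xml settings can be pulled with `yarn logs`. The command takes an application ID, which is the job ID reported by Hive with the `job_` prefix swapped for `application_`. A small sketch; `job_to_app_id` is a hypothetical helper, and the command is only printed here, not executed:

```shell
# job_to_app_id is a hypothetical helper: turn a MapReduce job ID
# (job_<cluster timestamp>_<sequence>) into the matching YARN application ID.
job_to_app_id() {
  echo "application_${1#job_}"
}

# Build the log-fetching command for the job ID shown in the output above.
app_id="$(job_to_app_id job_1464029642280_0001)"
echo "yarn logs -applicationId $app_id"
```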

Setup Spark

Spark is super simple: just point Spark to the Hadoop installation, which holds not only the Hadoop configs but also the Hive config (this is why we copied hive-site.xml there before)

$ export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ spark-shell --driver-class-path $HADOOP_CONF_DIR 

scala> sqlContext.sql("show tables").show()
scala> sqlContext.sql("describe pokes").show()
scala> sqlContext.sql("select count(*) from pokes").show()

Voila!! (Not really a quick thing to do, but once you have done it once, you can set up a debugger etc. and it's all golden.)

