
Posts

Setting up Hadoop/YARN/Spark/Hive on Mac OSX El Capitan

If you are like me, and love to have everything you are developing against working locally in a mini-integration environment, read on. Here, we attempt to get some pretty heavyweight stuff working locally on your Mac, namely:

- Hadoop (Hadoop2/HDFS)
- YARN (so you can submit MR jobs)
- Spark (we will illustrate with the Spark shell, but it should work in YARN mode as well)
- Hive (so we can create some tables and play with them)

We will use the latest stable Cloudera distribution and work off the jars. Most of the methodology is borrowed from here; we just link the four pieces together nicely in this blog.

Download Stuff

First of all, make sure you have Java 7/8 installed, with the JAVA_HOME variable set up to point to the correct location. You have to download the CDH tarballs for Hadoop, Zookeeper, and Hive from the tarball page (CDH 5.4.x page) and untar them under a folder (referred to as CDH_HOME going forward) as hadoop, zookeeper…

    $ ls $HOME/bin/cdh/5.4.7
    hadoop
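Once the daemons are up, a quick sanity check from Java can confirm the local HDFS is reachable. This is a minimal sketch, assuming a pseudo-distributed NameNode on the default localhost:8020 (adjust to your setup) and the Hadoop client jars on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSmokeTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(false);
            // assumes the NameNode listens on the default pseudo-distributed port
            conf.set("fs.defaultFS", "hdfs://localhost:8020");
            FileSystem fs = FileSystem.get(conf);
            // list the root directory; any output means the client can talk to HDFS
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }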
Recent posts

Learning Spark Streaming #1

I have been doing a lot of Spark in the past few months, and of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark Streaming, and other stream processing systems in general, presenting technical arguments/critiques, with micro benchmarks as needed. First, some high-level description of Spark Streaming (as of 1.4), most of which you can find in the programming guide. At a high level, Spark Streaming is simply a Spark job run on very small increments of input data (i.e. a micro batch), every 't' seconds, where t can be as low as 1 second. As with any stream processing system, there are three big aspects to the framework itself. Ingesting the data streams: this is accomplished via DStreams, which you can think of effectively as a thin wrapper around an input source such as Kafka/HDFS that knows how to read the next N entries from the input. The receiver-based approach is a little complicated…
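To make the micro-batch idea concrete, here is a minimal sketch (assuming Spark Streaming on the classpath and a text source on localhost:9999, e.g. started with nc -lk 9999) that runs a tiny job over each 1-second slice of input:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MicroBatchCount {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchCount");
            // each batch is just a small Spark job over the last 1 second of input
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));
            // print how many records arrived in each micro batch
            ssc.socketTextStream("localhost", 9999).count().print();
            ssc.start();
            ssc.awaitTermination();
        }
    }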

HDFS Client Configs for talking to HA Hadoop NameNodes

One more simple thing that has had relatively scarce documentation out on the Internet. As you might know, Hadoop NameNodes finally became HA in 2.0. The HDFS client configuration, which is already a little bit tedious, became more complicated. Traditionally, there were two ways to configure an HDFS client (let's stick to Java):

1. Copy over the entire Hadoop config directory with all the xml files, and either place it somewhere in the classpath of your app or construct a Hadoop Configuration object by manually adding in those files.
2. Simply provide the HDFS NameNode URI and let the client do the rest.

        Configuration conf = new Configuration(false);
        conf.set("fs.default.name", "hdfs://localhost:8020"); // this is deprecated now
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

Most people prefer 2, unless you need way more configs from the actual xml config files, at which point…
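For the HA case, option 2 grows into a handful of extra keys. A minimal sketch, where the nameservice id "mycluster" and the hostnames are placeholders for your own values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(false);
            // the logical nameservice replaces a single NameNode host in the URI
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            // the proxy provider is what fails over between nn1 and nn2
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.getUri());
        }
    }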

Remote Debugging A Java Process Using IntelliJ

This is a brief post on something that is rather important. Your company probably handed you a macbook or laptop, and you have a Linux VM hosted somewhere that you will do all your development on. And now the circus begins. You like to stay on your laptop, since you get all the nice IDEs and code-diffing tools and whatnot. But your code only runs on the VM, rightfully so in a highly SOA world (Service Oriented Architecture, basically meaning everything is REST and has Nagios alerts). So, here's how to get the best of both worlds.

Pre-requisite: Use a tool like unison or BitTorrent Sync, or roll your own scripts, to rsync your local repo on the laptop with a directory on your VM. The end state would be:

- You git clone or svn checkout on the VM
- Syncing gets it down to your laptop
- From there on, you make code changes in your IDE and syncing reflects them on the VM

Most of all, it means that IntelliJ can look for the source code of the classes you debug in the repo on your laptop. No…
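For reference, remote debugging boils down to starting the JVM on the VM with the standard JDWP agent flag, -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005, and pointing an IntelliJ "Remote" run configuration at that host and port. A small sketch (port and flag values are the common defaults, not anything this post mandates) to confirm a JVM was launched with the agent, by inspecting its own startup flags:

    import java.lang.management.ManagementFactory;

    public class DebugFlagCheck {
        public static void main(String[] args) {
            // a remotely debuggable JVM will list its -agentlib:jdwp=... flag here
            for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
                System.out.println(arg);
            }
        }
    }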

Setting up a Play Framework application on RedHat's Openshift

Play framework is an interesting web development option, using Netty directly as a web server and providing a basic MVC framework to build web applications on. Here's how you marry the two. The documentation out there (git quick starts) seems to be outdated, but it's really simple and truly DIY, so I thought I would let people know.

1. Install Play 2.0. Spin up the play server and make sure something renders on localhost (see the controller sketch below).

2. Open an account with openshift.com and create a DIY application. When you check out the git project, by default you get the following directory structure, with a Ruby script to serve up an index.html page:

    $ ls -a
    .  ..  .git  .openshift  README  diy  misc
    $ ls diy
    logs  index.html  testrubyserver.rb
    $ ls .openshift/action_hooks
    build  deploy  post_deploy  pre_build  start  stop

3. Package your Play app:

    $ play stage
    $ target/start

Now, copy the target folder over:

    $ cp -rf target $OPENSHIFT_PROJECT_GIT_REPO/diy/

Thro…
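For step 1, a minimal Play 2.0 Java controller is enough to verify something renders; this is an illustrative sketch (class and message names are my own), paired with a "GET / controllers.Application.index()" line in conf/routes:

    package controllers;

    import play.mvc.Controller;
    import play.mvc.Result;

    // Play 2.0 style: static action methods on a Controller subclass
    public class Application extends Controller {
        public static Result index() {
            // render a plain 200 response so you know the server is up
            return ok("It works!");
        }
    }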

Thoughts On Adding Spatial Indexing to Voldemort

This weekend, I set out to explore something that has always been a daemon running at the back of my head: what would it mean to add spatial indexing support to Voldemort, given that Voldemort supports a pluggable storage layer? Would it fit well with the existing Voldemort server architecture? Or would it create a Frankenstein freak show where two systems essentially exist side by side under one codebase? Let's explore.

Basic Idea

The 50,000 ft blueprint goes like this:

- Implement a new storage engine on top of PostgreSQL (sorry, InnoDB, you don't have true spatial indexes yet, and Postgres is kick-ass).
- Implement a new smart partitioning layer that maps a given geolocation to a subset of servers in the cluster, as sketched below. (There are a few ways to do this, but it needs to be done to get an efficient solution; I don't believe in naive spraying of requests to all servers.)
- Support "geolocation" as a new standard key serializer type in Voldemort. The values will still…
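To illustrate the second bullet, here is a hypothetical sketch of such a partitioning layer (not Voldemort's actual API; the grid resolution and hashing scheme are arbitrary choices of mine): snap each point to a coarse lat/lon cell, so nearby points land on the same server and a range query only needs to touch the servers owning the cells that overlap it.

    // Hypothetical sketch, not actual Voldemort code: map a geolocation to a
    // node by first snapping it to a coarse grid cell, so that points in the
    // same cell (i.e. near each other) land on the same server.
    public class GeoPartitioner {
        private final int numNodes;
        private final double cellDegrees; // grid resolution, e.g. 0.1 degrees

        public GeoPartitioner(int numNodes, double cellDegrees) {
            this.numNodes = numNodes;
            this.cellDegrees = cellDegrees;
        }

        public int nodeFor(double lat, double lon) {
            // shift into non-negative ranges, then bucket into grid cells
            long row = (long) Math.floor((lat + 90.0) / cellDegrees);
            long col = (long) Math.floor((lon + 180.0) / cellDegrees);
            long cellId = row * 4_000_000L + col; // unique while cols per row < 4M
            return (int) Math.floorMod(cellId, (long) numNodes);
        }
    }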

A simple non blocking event counter

I am writing to share something I implemented recently, to monitor streaming operations into the Voldemort server. What I needed was a simple average and event-rate counter. The idea is very simple: have atomic longs for the event value sum and the event count, to be updated by the application threads. What's the big deal, right? The deal (not sure how big; depends on who you are) is to have this counter reset itself every so often. This is a very common usage pattern in data systems, where we would like to monitor performance over small periods of time (so we get enough samples to smooth out the outliers) and reset every so often. Also, one should make sure the "monitoring" thread always reports statistics for a complete interval. The trick is to avoid resetting the atomic longs in place every interval, since doing so would mean missing some concurrent updates from other application threads in the meantime. Also, since we are maintaining state about the previous interval, it becomes necessary…
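A minimal sketch of that trick (illustrative names, not the actual Voldemort class): instead of zeroing the longs in place, the monitoring thread atomically swaps in a fresh pair and reports from the interval that just completed.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReference;

    public class ResettableEventCounter {
        static final class Interval {
            final AtomicLong sum = new AtomicLong();
            final AtomicLong count = new AtomicLong();
        }

        private final AtomicReference<Interval> current = new AtomicReference<>(new Interval());

        // called by application threads on every event
        public void record(long value) {
            Interval i = current.get();
            i.sum.addAndGet(value);
            i.count.incrementAndGet();
        }

        // called by the monitoring thread once per interval; returns the
        // average over the interval that just completed
        public double swapAndGetAvg() {
            Interval done = current.getAndSet(new Interval());
            long c = done.count.get();
            return c == 0 ? 0.0 : (double) done.sum.get() / c;
        }
    }

Note this simplification still has a tiny window: a thread that grabbed the old Interval reference can update it just after the swap, so a production version has to account for such stragglers when maintaining the previous interval's state.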