Posts

Showing posts from 2015

Learning Spark Streaming #1

I have been doing a lot of Spark work over the past few months and, of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover Spark Streaming in detail, along with other stream processing systems in general, presenting technical arguments and critiques, with micro-benchmarks as needed.

First, a high-level description of Spark Streaming (as of 1.4), most of which you can find in the programming guide. At a high level, Spark Streaming is simply a Spark job run on very small increments of input data (i.e. micro-batches) every 't' seconds, where t can be as low as 1 second. As with any stream processing system, there are three big aspects to the framework itself.

Ingesting the data streams: This is accomplished via DStreams, which you can think of as a thin wrapper around an input source such as Kafka or HDFS that knows how to read the next N entries from the input. The receiver based approach is a little compl
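
To make the micro-batch idea concrete, here is a minimal sketch in plain Java, with no Spark dependency. It only illustrates slicing a stream of timestamped records into fixed intervals of 't' milliseconds; it is not how Spark Streaming is implemented, and the names `MicroBatchSketch`, `Event` and `toBatches` are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java, not Spark code. All names here are hypothetical.
final class MicroBatchSketch {

    /** One timestamped input record. */
    static final class Event {
        final long timestampMillis;
        final String payload;

        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    /**
     * Groups events into consecutive intervals of batchMillis.
     * The map key is the start time of the interval; the value holds the
     * payloads that arrived in it, in input order. A streaming engine would
     * then run one small job per map entry.
     */
    static Map<Long, List<String>> toBatches(List<Event> events, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            // Round the timestamp down to the start of its interval.
            long batchStart = Math.floorDiv(e.timestampMillis, batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(100L, "a"));
        events.add(new Event(900L, "b"));
        events.add(new Event(1500L, "c"));
        // With t = 1 second, "a" and "b" share a batch and "c" falls in the next one.
        System.out.println(toBatches(events, 1000L));
    }
}
```

In this sketch the batch interval is the only tuning knob: a smaller value means lower latency per batch but more, smaller jobs to schedule.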

HDFS Client Configs for talking to HA Hadoop NameNodes

One more simple thing that had relatively scarce documentation on the Internet. As you might know, Hadoop NameNodes finally became HA in 2.0. The HDFS client configuration, which was already a little tedious, became more complicated. Traditionally, there were two ways to configure an HDFS client (let's stick to Java):

1. Copy over the entire Hadoop config directory with all the XML files and place it somewhere on the classpath of your app, or construct a Hadoop Configuration object by manually adding in those files.
2. Simply provide the HDFS NameNode URI and let the client do the rest.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        // Start from an empty Configuration (false = do not load the default resources)
        Configuration conf = new Configuration(false);
        conf.set("fs.default.name", "hdfs://localhost:8020"); // this is deprecated now
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

Most people prefer 2, unless you need way more configs from the actual xml config files, at which po
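
Since the excerpt stops before the HA-specific settings, here is a hedged sketch of the client-side keys that an HA setup typically needs, assembled as a plain Java map so that it runs without Hadoop on the classpath. The property names are the standard HDFS HA client keys; the class name `HaClientSettings`, the `build` helper, and the NameNode IDs `nn1` and `nn2` are assumptions made for this illustration and may not match what the original post goes on to show.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the client-side properties for an HA nameservice.
final class HaClientSettings {

    /**
     * nameservice is the logical cluster name (an assumption, e.g. "mycluster");
     * nn1Address and nn2Address are host:port RPC addresses of the two NameNodes.
     */
    static Map<String, String> build(String nameservice, String nn1Address, String nn2Address) {
        Map<String, String> settings = new LinkedHashMap<>();
        // Clients address the logical nameservice rather than a single host.
        settings.put("fs.defaultFS", "hdfs://" + nameservice);
        settings.put("dfs.nameservices", nameservice);
        // Logical IDs of the NameNodes in this nameservice (IDs chosen for this sketch).
        settings.put("dfs.ha.namenodes." + nameservice, "nn1,nn2");
        settings.put("dfs.namenode.rpc-address." + nameservice + ".nn1", nn1Address);
        settings.put("dfs.namenode.rpc-address." + nameservice + ".nn2", nn2Address);
        // Class the client uses to find the currently active NameNode.
        settings.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return settings;
    }

    public static void main(String[] args) {
        build("mycluster", "namenode1.example.com:8020", "namenode2.example.com:8020")
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each entry could then be applied with `conf.set(key, value)` on a Configuration object like the one in the snippet above, before calling `FileSystem.get(conf)`.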

Remote Debugging A Java Process Using IntelliJ

This is a brief post on something that is rather important. Your company probably handed you a MacBook or laptop and has a Linux VM hosted somewhere that you will do all your development on. And now the circus begins. You like to stay on your laptop, since you get all the nice IDEs, code diffing tools and whatnot. But your code only runs on the VM, rightfully so in a highly SOA (Service Oriented Architecture, basically meaning everything is REST and has Nagios alerts) environment. So, here's how to get the best of both worlds.

Pre-requisite: Use a tool like unison or BitTorrent Sync, or roll your own scripts, to rsync your local repo on the laptop with a directory on your VM. The end state would be:

- You git clone or svn checkout on the VM
- Syncing gets it down to your laptop
- From there on, you make code changes in your IDE and syncing reflects them on the VM

Most of all, it means that IntelliJ can look for the source code of the classes you debug in the repo on your laptop. No