Skip to main content

Learning Spark Streaming #1

I have been doing a lot of Spark in the past few months, and of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark streaming and even other stream processing systems in general, either presenting technical arguments/critiques, with any micro benchmarks as needed.

Some high level description of Spark Streaming (as of 1.4),  most of which you can find in the programming guide.  At a high level, Spark streaming is simply a spark job run on very small increments of input data (i.e micro batch), every 't' seconds, where t can be as low as 1 second.

As with any stream processing system, there are three big aspects to the framework itself.

  1. Ingesting the data streams : This is accomplished via DStreams, which you can think of effectively as a thin wrapper around an input source such as Kafka/HDFS which knows how to read the next N entries from the input.
    • The receiver based approach is a little complicated IMHO, and my recommendation is to avoid that route completely, since for the most part, it should be possible to refetch the data from the input source such as Kafka/HDFS

  2. Computing the result : Once new input is available, typically you execute the Spark DAG (which is basically your streaming job), computing your final result and optionally outputting that to a serving store such as a NoSQL datastore periodically. 
  3. Updating the internal state : Along the way during DAG execution, you would frequently need access to the results from previous computations, that you want to 'update'  based on new input. For e.g.: if you were tracking the page views in the last hour, then you need to increment some counters based on the last 't' second of data. Spark Streaming currently implements this via the StateDStream/StateRDD, which is basically like any other normal Spark RDD. 

Some good things about this model : (the paper is still a better source)

  1. Spark Streaming computes the entire computation once, for each micro batch. Thus, there are very clear semantics of what the result means, whereas in a per-event model (such as Storm/Samza) different input/output/intermediate streams can be in different points in time. 
  2. Once you read data off your input stream, you are no longer bound by the number of partitions in say your Kafka stream. You can repartition your data as you like, and let Spark & in-memory glory take over. 

So, what could be bad? (sticking to design level problems only)
  • Given the tight coupling between different computations in the DAG, if a single stage in the DAG were to be slow, you could end up with back pressure issues. (Good news inside bad news is, since next batches cannot be scheduled, it self throttles)
  • RDDs are built for immutability, reusing the same constructs for internal state maintenance, can be very inefficient
  • The common criticism is that its not pure streaming, but simply "micro batch" alluding to increased latency, when you compare to something like Storm. (My take on this is, if you want sub-second latency, then your service must be talking to a database i.e state store directly to begin with). But its a fair point and remains to be seen in what shape/form this truly impacts practical applications.

Let's take baby steps, towards understanding how Spark Streaming works under the hoods, by writing our very own CountingDStream, which does one simple thing, keeps counting. We will evolve this as we go to understand aspects like Checkpointing, State management, Recovery, Caching and so on.

Its actually very straightforward. First we implement the InputDStream trait/interface. The meat is in the compute() method, which is tasked to return the next set of inputs as an RDD every micro batch.

Now, the driver program that runs this and just prints out the counters. (Exercise: Increase the number of partitions to parallelize(..) from 1, to observe the effect on the counts printed out)

Next, lets add checkpoints to this DStream, so we will be able to kill and resume our counting..


  1. How does Spark Streaming know to fetch the entire record from the stream? Is it possible for micro batch to have partial record (the last record can be incomplete)?

  2. Thanks for sharing, nice post!

    Chia sẻ các bạn về bệnh thủy đậu với blog hay bà bầu uống thuốc thì nên lưu ý với blog hay để đảm bảo giấc ngủ cho trẻ em các mẹ tham khảo blog hay bệnh thủy đậu lưu ý kiêng cử gì với bài người bị bệnh thủy đậu nên ăn gì hay trẻ em nhiệt miệng kiêng ko ăn gì với bài trẻ bị nhiệt miệng nên ăn gì hay những ai làm món kim chi hàn quốc thì tham khảo bài ớt bột hàn quốc mua ở đâu giá bao nhiêu hay bột ca cao giá bao nhiêu bột ca cao có tác dụng gì, bột ca cao mua ở đâu hay bột gelatin có bán ở siêu thị không bột gelatin mua ở đâu, bột gelatin giá bao nhiêu tinh bột nghệ có bán ở siêu thị không tinh bột nghệ có bán ở hiệu thuốc không, tinh bột nghệ giá bao nhiêu hay bột trà xanh có bán ở siêu thị không bột trà xanh matcha giá bao nhiêu hay baking soda có bán ở siêu thị không baking soda có bán ở hiệu thuốc không.

  3. Nice info regarding spark streaming My sincere thanks for sharing this post Please Continue to share this post
    Hadoop Training in Chennai

  4. nice blog has been shared by you. before i read this blog i didn't have any knowledge about this but now i got some knowledge. so keep on sharing such kind of an interesting blogs.
    hadoop training in chennai

  5. Hi, I am really happy to found such a helpful and fascinating post that is written in well manner. Thanks for sharing such an informative post..Big Data Hadoop Training in Bangalore | Data Science Training in Bangalore

  6. If you want to play like me and not lose, you should try playing BGAOC. free gambling with us Playing here you not only win but also get a lot of fun.

  7. Современная диодная лента по всем стандартам отличного качества я обычно беру у компании Ekodio

  8. I am reading your post from the beginning, it was so interesting to read & I feel thanks to you for posting such a good blog, keep updates regularly.
    Java Training in Chennai
    Java Training in Coimbatore
    Java Training in Bangalore


Post a Comment

Popular posts from this blog

Thoughts On Adding Spatial Indexing to Voldemort

This weekend, I set out to explore something that has always been a daemon running at the back of my head. What would it mean to add Spatial Indexing support to Voldemort, given that Voldemort supports a pluggable storage layer.. Would it fit well with the existing Voldemort server architecture? Or would it create a frankenstein freak show where two systems essentially exist side by side under one codebase... Let's explore..

Basic Idea The 50000 ft blueprint goes like this.

Implement a new Storage Engine on top Postgres sql (Sorry innoDB, you don't have true spatial indexes yet and Postgres is kick ass)Implement a new smart partitioning layer that maps a given geolocation to a subset of servers in the cluster (There are a few ways to do this. But this needs to be done to get an efficient solution. I don't believe in naive spraying of results to all servers)Support "geolocation" as a new standard key serializer type in Voldemort. The values will still be  opaque b…

Setting up Hadoop/YARN/Spark/Hive on Mac OSX El Capitan

If you are like me, who loves to have everything you are developing against working locally in a mini-integration environment, read on

Here, we attempt to get some pretty heavy-weight stuff working locally on your mac, namely

Hadoop (Hadoop2/HDFS)YARN (So you can submit MR jobs)Spark (We will illustrate with Spark Shell, but should work on YARN mode as well)Hive (So we can create some tables and play with it) We will use the latest stable Cloudera distribution, and work off the jars. Most of the methodology is borrowed from here, we just link the four pieces together nicely in this blog. 
Download StuffFirst off all, make sure you have Java 7/8 installed, with JAVA_HOME variable setup to point to the correct location. You have to download the CDH tarballs for Hadoop, Zookeeper, Hive from the tarball page (CDH 5.4.x page) and untar them under a folder (refered to as CDH_HOME going forward) as hadoop, zookeeper

$ ls $HOME/bin/cdh/5.4.7 hadoop hadoop-2.6.0-cdh5.4.7.…