I have been doing a lot of Spark in the past few months, and of late have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark Streaming, and stream processing systems in general, presenting technical arguments and critiques, with microbenchmarks as needed.
First, some high-level description of Spark Streaming (as of 1.4), most of which you can find in the programming guide. At a high level, Spark Streaming is simply a Spark job run on very small increments of input data (i.e. micro batches), every 't' seconds, where t can be as low as 1 second.
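To make the micro-batch model concrete, here is a minimal sketch (essentially the canonical network word count from the programming guide; host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Every 1-second micro batch becomes a small Spark job over that batch's input.
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // print the first few elements of each batch's result RDD
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch's `counts` is just an ordinary RDD computed by an ordinary Spark job; the streaming machinery schedules one such job per batch interval.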
As with any stream processing system, there are three big aspects to the framework itself.
- Ingesting the data streams: this is accomplished via DStreams, which you can think of, effectively, as a thin wrapper around an input source such as Kafka/HDFS that knows how to read the next N entries from the input.
- The receiver-based approach is a little complicated IMHO, and my recommendation is to avoid that route completely, since for the most part it should be possible to re-fetch the data from the input source such as Kafka/HDFS.
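The receiver-less alternative for Kafka is the direct stream API (available since 1.3), where the DStream tracks offsets itself and simply re-reads from Kafka on failure. A sketch (broker address and topic name are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DirectKafkaIngest")
    val ssc = new StreamingContext(conf, Seconds(1))

    // No receiver: each batch's RDD partitions map 1:1 to Kafka partitions,
    // and lost data is re-fetched from Kafka using the tracked offsets.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).print() // values only
    ssc.start()
    ssc.awaitTermination()
  }
}
```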
Some good things about this model (the paper is still a better source):
- Spark Streaming runs the entire computation once, for each micro batch. Thus, there are very clear semantics of what the result means, whereas in a per-event model (such as Storm/Samza) different input/output/intermediate streams can be at different points in time.
- Once you read data off your input stream, you are no longer bound by the number of partitions in, say, your Kafka stream. You can repartition your data as you like, and let Spark & in-memory glory take over.
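The repartitioning point is a one-liner; a sketch, assuming a hypothetical `stream` DStream read from a 4-partition Kafka topic:

```scala
import org.apache.spark.streaming.dstream.DStream

// `stream` starts out with as many partitions per batch RDD as the Kafka
// topic has partitions; repartition() shuffles each batch into 32 partitions
// so downstream stages can use the full parallelism of the cluster.
def widen(stream: DStream[String]): DStream[String] =
  stream.repartition(32)
```

This trades a shuffle per batch for better parallelism downstream, which is usually a good deal when the per-partition work dominates.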
So, what could be bad? (sticking to design level problems only)
- Given the tight coupling between different computations in the DAG, if a single stage in the DAG were to be slow, you could end up with back pressure issues. (The good news inside the bad news is that, since the next batches cannot be scheduled, it self-throttles.)
- RDDs are built for immutability; reusing the same constructs for internal state maintenance can be very inefficient.
- The common criticism is that it's not pure streaming, but simply "micro batch", alluding to increased latency when you compare it to something like Storm. (My take on this is: if you want sub-second latency, then your service must be talking to a database, i.e. a state store, directly to begin with.) But it's a fair point, and it remains to be seen in what shape/form this truly impacts practical applications.
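On the back pressure point: in 1.4 there is no automatic back pressure, but you can cap ingestion with static rate limits so a slow stage does not pile up unbounded batches. A config sketch (the rates are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Cap records/sec per Kafka partition for the direct stream approach.
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // Cap records/sec per receiver for the receiver-based approach.
  .set("spark.streaming.receiver.maxRate", "10000")
```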
Let's take baby steps towards understanding how Spark Streaming works under the hood, by writing our very own CountingDStream, which does one simple thing: it keeps counting. We will evolve this as we go, to understand aspects like checkpointing, state management, recovery, caching and so on.
It's actually very straightforward. First, we implement the InputDStream trait/interface. The meat is in the compute() method, which is tasked with returning the next set of inputs as an RDD, every micro batch.
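A minimal sketch of what this looks like (the exact listing may differ in detail):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// A DStream that just keeps counting: every micro batch, compute() hands
// back a single-partition RDD containing the next counter value.
class CountingDStream(ssc_ : StreamingContext) extends InputDStream[Long](ssc_) {

  private var counter = 0L

  override def start(): Unit = {} // nothing to set up
  override def stop(): Unit = {}  // nothing to tear down

  // Called on the driver once per batch interval; the returned RDD is
  // treated as this batch's "input".
  override def compute(validTime: Time): Option[RDD[Long]] = {
    counter += 1
    Some(ssc_.sparkContext.parallelize(Seq(counter), 1))
  }
}
```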
Now, the driver program, which runs this and just prints out the counters. (Exercise: increase the number of partitions passed to parallelize(..) from 1, to observe the effect on the counts printed out.)
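A sketch of the driver (master and app name are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CountingDStream")
    val ssc = new StreamingContext(conf, Seconds(1))

    // print() shows the first few elements of each micro batch's RDD,
    // so we should see the counter ticking up once per second.
    new CountingDStream(ssc).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```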
Next, let's add checkpointing to this DStream, so that we will be able to kill and resume our counting.
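The standard pattern here is StreamingContext.getOrCreate: on a fresh start, it builds a new context; after a kill, it reconstructs the DStream graph (including our counter) from the checkpoint. A sketch, assuming a placeholder checkpoint path:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResumableCountingApp {

  val checkpointDir = "/tmp/counting-checkpoint" // placeholder path

  // Only invoked when there is no checkpoint to recover from.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CountingDStream")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)

    val counts = new CountingDStream(ssc)
    counts.checkpoint(Seconds(10)) // persist this DStream's RDDs periodically
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```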