I have been doing a lot of Spark in the past few months and, of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover a lot of details about Spark Streaming, and other stream processing systems in general, presenting technical arguments and critiques, with micro benchmarks as needed.
First, a high-level description of Spark Streaming (as of 1.4), most of which you can find in the programming guide. Spark Streaming is simply a Spark job run on very small increments of input data (i.e. micro batches) every 't' seconds, where t can be as low as 1 second.
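To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not the Spark API: `micro_batches` and `job` are hypothetical names, the events are made up, and, unlike a real streaming job, intervals with no data are simply skipped rather than run as empty batches.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of `batch_interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Return batches in time order, the way a stream would deliver them.
    return [batches[i] for i in sorted(batches)]

def job(batch):
    """Stand-in for the Spark job: here it only counts the records in the batch."""
    return len(batch)

# Hypothetical input: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]

# The same job runs once per batch, every `batch_interval` seconds.
results = [job(batch) for batch in micro_batches(events, batch_interval=1)]
print(results)
```

The point of the sketch is only that the unit of work is a batch, not an individual event.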
As with any stream processing system, there are three big aspects to the framework itself.
- Ingesting the data streams : This is accomplished via DStreams, which you can think of effectively as a thin wrapper around an input source such as Kafka/HDFS, one that knows how to read the next N entries from the input.
- The receiver-based approach is a little complicated IMHO, and my recommendation is to avoid that route completely, since for the most part it should be possible to refetch the data from an input source such as Kafka/HDFS.
- Computing the result : Once new input is available, typically you execute the Spark DAG (which is basically your streaming job), computing your final result and optionally outputting that to a serving store such as a NoSQL datastore periodically.
- Updating the internal state : During DAG execution, you frequently need access to the results of previous computations, which you want to 'update' based on new input. For example, if you were tracking page views in the last hour, you would need to increment some counters based on the last 't' seconds of data. Spark Streaming currently implements this via the StateDStream/StateRDD, which is basically like any other normal Spark RDD.
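The three aspects above can be sketched in a few lines of plain Python. This is only an analogy under stated assumptions: `update_state` is a hypothetical helper, the page names are invented, and the sketch keeps running totals rather than expiring views older than an hour.

```python
from collections import Counter

def update_state(previous_state, batch):
    """Return a NEW state: the previous page-view counts plus this batch's views."""
    new_state = Counter(previous_state)  # copy, so the previous state is left untouched
    new_state.update(batch)              # add one view per page seen in this batch
    return new_state

# Hypothetical micro batches of page views.
batches = [["/home", "/about"], ["/home"], ["/home", "/pricing"]]

state = Counter()
for batch in batches:                    # ingest: one micro batch at a time
    state = update_state(state, batch)   # compute the result and update the state

print(dict(state))
```

Each iteration produces a fresh state object from the old one, which loosely mirrors how a new state RDD is derived from the previous one in every batch.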
Some good things about this model (the paper is still a better source):
- Spark Streaming runs the entire computation once for each micro batch. Thus, there are very clear semantics of what the result means, whereas in a per-event model (such as Storm/Samza) different input/output/intermediate streams can be at different points in time.
- Once you read data off your input stream, you are no longer bound by the number of partitions in say your Kafka stream. You can repartition your data as you like, and let Spark & in-memory glory take over.
So, what could be bad? (sticking to design level problems only)
- Given the tight coupling between different computations in the DAG, if a single stage in the DAG were to be slow, you could end up with back pressure issues. (The good news inside the bad news is that, since the next batches cannot be scheduled, it self-throttles.)
- RDDs are built for immutability; reusing the same constructs for internal state maintenance can be very inefficient.
- The common criticism is that it's not pure streaming but simply "micro batch", alluding to increased latency when you compare it to something like Storm. (My take on this is that if you want sub-second latency, then your service must be talking to a database, i.e. a state store, directly to begin with.) But it's a fair point, and it remains to be seen in what shape or form this truly impacts practical applications.
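To make the immutability point concrete, here is a rough plain-Python analogy, not Spark code. `immutable_update` is a hypothetical helper and the sizes are made up; the assumption being illustrated is that producing a new state object per batch means carrying over every existing entry, even when the batch touches only a handful of keys.

```python
def immutable_update(state, batch):
    """Build a brand-new state dict instead of mutating the old one in place."""
    new_state = dict(state)              # every existing entry is carried over
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state, len(state)         # second value: number of entries carried over

# Hypothetical state with 10,000 tracked pages, each seen once so far.
state = {f"page-{i}": 1 for i in range(10_000)}
batch = ["page-1", "page-2", "page-1"]   # the new batch touches only two distinct keys

state, copied = immutable_update(state, batch)
touched = len(set(batch))
print(copied, touched)
```

The gap between `copied` and `touched` is the kind of overhead the second bullet is worried about.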
Let's take baby steps towards understanding how Spark Streaming works under the hood, by writing our very own CountingDStream, which does one simple thing: it keeps counting. We will evolve this as we go to understand aspects like Checkpointing, State management, Recovery, Caching and so on.
It's actually very straightforward. First we implement the InputDStream trait/interface. The meat is in the compute() method, which is tasked with returning the next set of inputs as an RDD every micro batch.
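The original listing is not reproduced here, so the following is a plain-Python model of the idea rather than the actual CountingDStream. It assumes nothing from Spark: `CountingStream` and `records_per_batch` are hypothetical names, a list stands in for the RDD, and `valid_time` is accepted but unused.

```python
class CountingStream:
    """Plain-Python stand-in for an input stream that just keeps counting."""

    def __init__(self, records_per_batch=3):
        self.records_per_batch = records_per_batch
        self.counter = 0  # next number to emit

    def compute(self, valid_time):
        """Return the next batch (a list standing in for an RDD) for this batch time."""
        start = self.counter
        self.counter += self.records_per_batch
        return list(range(start, self.counter))

stream = CountingStream()
first = stream.compute(valid_time=0)    # called once per micro batch
second = stream.compute(valid_time=1)
print(first, second)
```

The shape to notice is that the stream holds a little position state and hands back one batch per call, which is the role compute() plays for an input stream.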
Now for the driver program that runs this and just prints out the counters. (Exercise: increase the number of partitions passed to parallelize(..) from 1, and observe the effect on the counts printed out.)
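As a rough stand-in for such a driver, here is a plain-Python sketch. It is not the original program and does not reproduce its output: `parallelize` below is a hypothetical helper that only mimics splitting a batch into contiguous slices, and `run_driver` is an invented name for the loop that pulls batches and reports per-partition counts.

```python
def parallelize(data, num_partitions):
    """Split `data` into `num_partitions` contiguous slices, loosely like RDD partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)  # spread the remainder over the first slices
        partitions.append(data[start:end])
        start = end
    return partitions

def run_driver(num_batches, records_per_batch, num_partitions):
    """Drive a counting source for a few batches and collect per-partition record counts."""
    counter, output = 0, []
    for _ in range(num_batches):
        batch = list(range(counter, counter + records_per_batch))
        counter += records_per_batch
        output.append([len(p) for p in parallelize(batch, num_partitions)])
    return output

print(run_driver(num_batches=2, records_per_batch=5, num_partitions=1))
print(run_driver(num_batches=2, records_per_batch=5, num_partitions=2))
```

With more partitions, anything computed per partition is reported once per slice instead of once per batch, which is one reason printed counts can change when the partition count does.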
Next, let's add checkpoints to this DStream, so we will be able to kill and resume our counting.
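The kill-and-resume behaviour can be modelled without Spark. The sketch below is an assumption-laden analogy, not Spark's checkpointing mechanism: `CheckpointedCounter`, the JSON file layout and the temporary directory are all hypothetical choices made for illustration.

```python
import json
import os
import tempfile

class CheckpointedCounter:
    """Counting source that persists its position so a restart can resume."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.counter = 0
        if os.path.exists(checkpoint_path):        # recover the position after a restart
            with open(checkpoint_path) as f:
                self.counter = json.load(f)["counter"]

    def next_batch(self, size=3):
        batch = list(range(self.counter, self.counter + size))
        self.counter += size
        self._checkpoint()
        return batch

    def _checkpoint(self):
        # Write to a temp file and rename, so a crash never leaves a half-written checkpoint.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"counter": self.counter}, f)
        os.replace(tmp_path, self.checkpoint_path)

path = os.path.join(tempfile.mkdtemp(), "counter.json")
before_kill = CheckpointedCounter(path).next_batch()
after_restart = CheckpointedCounter(path).next_batch()  # a fresh object, as after a restart
print(before_kill, after_restart)
```

The second object never saw the first one's memory; it resumes purely from what was written to disk, which is the property a checkpoint is meant to give.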
How does Spark Streaming know to fetch an entire record from the stream? Is it possible for a micro batch to contain a partial record (for example, an incomplete last record)?