
Learning Spark Streaming #1



I have been doing a lot of Spark in the past few months and, of late, have taken a keen interest in Spark Streaming. In a series of posts, I intend to cover Spark Streaming, and other stream processing systems in general, in some detail, presenting technical arguments and critiques, with micro benchmarks as needed.

First, a high-level description of Spark Streaming (as of 1.4), most of which you can find in the programming guide. Spark Streaming is simply a Spark job run on very small increments of input data (i.e. micro batches) every 't' seconds, where t can be as low as 1 second.
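To make the micro batch idea concrete, here is a minimal plain-Python sketch (not Spark code). It assumes events arrive as (timestamp, payload) pairs with non-negative timestamps; the helper name `micro_batches` and the sample events are made up for illustration. The "job" here is just counting the events in each batch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def micro_batches(events: List[Tuple[float, str]], t: float) -> List[List[str]]:
    """Group (timestamp_seconds, payload) events into consecutive t-second batches."""
    if not events:
        return []
    buckets: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        buckets[int(ts // t)].append(payload)
    # Emit every interval up to the last one seen, including empty ones:
    # in this toy model the job runs on every tick, even with no new data.
    last = max(buckets)
    return [buckets.get(i, []) for i in range(last + 1)]


# Hypothetical input: four events spread over four one-second intervals.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(events, t=1.0)
results = [len(batch) for batch in batches]  # the "job": count events per batch
print(batches)
print(results)
```

The point of the sketch is only that the same job is applied to one small slice of input per tick; a real deployment distributes each slice across a cluster.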

As with any stream processing system, there are three big aspects to the framework itself.


  1. Ingesting the data streams : This is accomplished via DStreams, which you can think of as a thin wrapper around an input source such as Kafka/HDFS that knows how to read the next N entries from the input.
    • The receiver-based approach is a little complicated IMHO, and my recommendation is to avoid that route completely, since for the most part it should be possible to refetch the data from an input source such as Kafka/HDFS.

  2. Computing the result : Once new input is available, you execute the Spark DAG (which is basically your streaming job), computing your final result and optionally writing it periodically to a serving store such as a NoSQL datastore.
  3. Updating the internal state : During DAG execution, you frequently need access to the results of previous computations, which you want to 'update' based on new input. For example, if you were tracking page views in the last hour, you would need to increment some counters based on the last 't' seconds of data. Spark Streaming currently implements this via the StateDStream/StateRDD, which is basically like any other normal Spark RDD.
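The state update in step 3 can be sketched in plain Python (again, a toy model rather than the Spark API). The function name `update_state` and the page names are assumptions for illustration, and the sketch keeps a running count without expiring views older than an hour. Note that each batch produces a new state object instead of mutating the old one, which mirrors how immutable RDDs behave.

```python
from collections import Counter
from typing import Dict, List


def update_state(previous: Dict[str, int], batch: List[str]) -> Dict[str, int]:
    """Return a NEW state dict: previous counts plus the counts from this batch."""
    new_state = dict(previous)  # copy, mirroring the immutability of RDDs
    for page, views in Counter(batch).items():
        new_state[page] = new_state.get(page, 0) + views
    return new_state


# Hypothetical page-view batches, one list per micro batch.
state: Dict[str, int] = {}
for batch in [["/home", "/about"], ["/home"], []]:
    state = update_state(state, batch)
print(state)
```

Even an empty batch copies the whole state here, which hints at why reusing immutable structures for state can get expensive as the state grows.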

Some good things about this model (the paper is still a better source):

  1. Spark Streaming runs the entire computation once for each micro batch. Thus, there are very clear semantics for what the result means, whereas in a per-event model (such as Storm/Samza) different input/output/intermediate streams can be at different points in time.
  2. Once you read data off your input stream, you are no longer bound by the number of partitions in, say, your Kafka stream. You can repartition your data as you like, and let Spark & in-memory glory take over.

So, what could be bad? (sticking to design-level problems only)
  • Given the tight coupling between different computations in the DAG, if a single stage in the DAG is slow, you could end up with back pressure issues. (The good news inside the bad news is that, since the next batches cannot be scheduled, it self-throttles.)
  • RDDs are built for immutability, so reusing the same constructs for internal state maintenance can be very inefficient.
  • The common criticism is that it's not pure streaming but simply "micro batch", alluding to increased latency when you compare it to something like Storm. (My take on this is that if you want sub-second latency, then your service must be talking to a database, i.e. a state store, directly to begin with.) But it's a fair point, and it remains to be seen in what shape or form this truly impacts practical applications.
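The back pressure point can be illustrated with a small toy model in plain Python. It assumes batches are processed strictly one at a time and that a batch becomes ready at the end of its interval; the function name `scheduling_delays` and the sample processing times are made up. It computes how long each batch waits before it can start.

```python
from typing import List


def scheduling_delays(batch_interval: float, processing_times: List[float]) -> List[float]:
    """Return how long each batch waits between being ready and starting to run."""
    delays = []
    free_at = 0.0  # time at which the previous batch finishes
    for i, cost in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval  # batch i is complete at the end of its interval
        start = max(ready_at, free_at)       # cannot start until the previous batch is done
        delays.append(start - ready_at)
        free_at = start + cost
    return delays


print(scheduling_delays(1.0, [0.5, 0.5, 0.5]))  # processing keeps up with the interval
print(scheduling_delays(1.0, [1.5, 1.5, 1.5]))  # processing is slower than the interval
```

When processing time stays under the batch interval the delay stays at zero; once it exceeds the interval, each batch waits longer than the one before it.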

Let's take baby steps towards understanding how Spark Streaming works under the hood by writing our very own CountingDStream, which does one simple thing: it keeps counting. We will evolve this as we go, to understand aspects like checkpointing, state management, recovery, caching and so on.


It's actually very straightforward. First we extend InputDStream (an abstract class in Spark). The meat is in the compute() method, which is tasked with returning the next set of inputs as an RDD every micro batch.
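As a rough illustration of the shape of such a stream, here is a plain-Python stand-in (not the Spark class and not the original code of this post). The class name `CountingStream` and the `batch_size` parameter are assumptions, and an ordinary list plays the role of the RDD returned by compute().

```python
from typing import List, Optional


class CountingStream:
    """Toy stand-in for a counting input stream (plain Python, not Spark)."""

    def __init__(self, batch_size: int = 3) -> None:
        self.batch_size = batch_size  # hypothetical: numbers emitted per micro batch
        self.next_value = 0

    def compute(self, valid_time: int) -> Optional[List[int]]:
        """Return the next block of counters; a list stands in for an RDD."""
        start = self.next_value
        self.next_value += self.batch_size
        return list(range(start, self.next_value))


stream = CountingStream(batch_size=3)
for tick in range(3):
    # The framework would call compute() once per batch interval.
    print(tick, stream.compute(valid_time=tick))
```

The only state is the next number to emit, which is exactly the piece that has to survive a restart once checkpointing comes into the picture.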



Now, the driver program that runs this and just prints out the counters. (Exercise: increase the number of partitions passed to parallelize(..) from 1 and observe the effect on the counts printed out.)




Next, let's add checkpoints to this DStream, so we will be able to kill and resume our counting.
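The kill-and-resume idea can be sketched in plain Python as follows. This is a toy model under stated assumptions, not how Spark implements checkpointing: the class name `CheckpointedCounter`, the JSON file layout and the file name `counter.json` are all made up for illustration. The counter persists its position after every batch and reloads it on construction.

```python
import json
import os
import tempfile
from typing import List


class CheckpointedCounter:
    """Toy counter that saves its position to disk after every batch."""

    def __init__(self, path: str, batch_size: int = 2) -> None:
        self.path = path
        self.batch_size = batch_size
        self.next_value = 0
        if os.path.exists(path):  # resume from an earlier run, if any
            with open(path) as f:
                self.next_value = json.load(f)["next_value"]

    def compute(self) -> List[int]:
        start = self.next_value
        self.next_value += self.batch_size
        self._checkpoint()
        return list(range(start, self.next_value))

    def _checkpoint(self) -> None:
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_value": self.next_value}, f)
        os.replace(tmp, self.path)


path = os.path.join(tempfile.mkdtemp(), "counter.json")
first = CheckpointedCounter(path)
before_kill = [first.compute(), first.compute()]
resumed = CheckpointedCounter(path)  # simulates restarting after a kill
after_resume = resumed.compute()
print(before_kill, after_resume)
```

The resumed counter picks up where the first one stopped instead of starting again from zero.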









Comments

  1. How does Spark Streaming know to fetch the entire record from the stream? Is it possible for micro batch to have partial record (the last record can be incomplete)?
