{"id":47997501,"url":"https://github.com/rrohitramsen/firehose","last_synced_at":"2026-04-04T12:02:28.508Z","repository":{"id":83884894,"uuid":"99832173","full_name":"rrohitramsen/firehose","owner":"rrohitramsen","description":"Firehose - Spark streaming 2.2 + Kafka 0.8_2","archived":false,"fork":false,"pushed_at":"2017-08-09T17:12:07.000Z","size":26817,"stargazers_count":9,"open_issues_count":0,"forks_count":7,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-03-23T23:24:19.728Z","etag":null,"topics":["apache-spark","avro-kafka","cassandra","cassandra-java","firehose","java-8","java-spark","javastreamingcontext","json-schema","kafka","kafka-connect","kafka-consumer","kafka-producer","kafka-streams","scheduling-delay","spark","spark-streaming","stock-data","stock-market"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rrohitramsen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-08-09T16:54:00.000Z","updated_at":"2021-11-25T08:38:55.000Z","dependencies_parsed_at":"2023-03-05T08:30:16.476Z","dependency_job_id":null,"html_url":"https://github.com/rrohitramsen/firehose","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rrohitramsen/firehose","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rrohitramsen%2Ffirehose","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rrohitramsen%2Ffirehose/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/rrohitramsen%2Ffirehose/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rrohitramsen%2Ffirehose/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rrohitramsen","download_url":"https://codeload.github.com/rrohitramsen/firehose/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rrohitramsen%2Ffirehose/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31228492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-31T08:35:14.124Z","status":"ssl_error","status_checked_at":"2026-03-31T08:34:00.887Z","response_time":111,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","avro-kafka","cassandra","cassandra-java","firehose","java-8","java-spark","javastreamingcontext","json-schema","kafka","kafka-connect","kafka-consumer","kafka-producer","kafka-streams","scheduling-delay","spark","spark-streaming","stock-data","stock-market"],"created_at":"2026-04-04T12:02:27.879Z","updated_at":"2026-04-04T12:02:28.499Z","avatar_url":"https://github.com/rrohitramsen.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Firehose ?\n`The firehose API is a steady stream of all available data from a source in realtime –  a giant spigot that delivers data to any number of 
subscribers at a time. The stream is constant, delivering new, updated data as it happens. The amount of data in the firehose can vary with spikes and lows, but nonetheless, the data continues to flow through the firehose until it is crunched. Once crunched, that data can be visualized, published, graphed; really anything you want to do with it, all in real time.`\n\n## How can we build a firehose?\n\n```[Data] + [Queue] + [Streaming] = Firehose```\n\n1. Data\n    * Weather and temperature data\n    * Stock quote prices\n    * Public transportation time and location data\n    * RSS and blog feeds\n    * Multiplayer game player position and state\n    * Internet of Things sensor network data\n\n2. Queue server support\n   * ActiveMQ\n   * Amazon SQS\n   * Apache Kafka\n   * RabbitMQ\n\n3. Streaming server support\n    * Amazon Kinesis\n    * Apache Spark\n    * Apache Storm\n    * Google DataFlow\n\n### Our use case: process 1 million stock market records\n* [Bombay Stock Exchange historical stock price data](http://www.bseindia.com/markets/equity/EQReports/StockPrcHistori.aspx?scripcode=512289\u0026flag=sp\u0026Submit=G)\n* Apache Kafka\n* Apache Spark\n\n## Kafka Setup\n\n#### Download Kafka to your machine: [download Kafka](https://kafka.apache.org/quickstart)\n\n### Setting up multiple Kafka brokers\n\n* First, make a config file for each broker:\n\n```\n$ cp config/server.properties config/server-1.properties\n$ cp config/server.properties config/server-2.properties\n$ cp config/server.properties config/server-3.properties\n```\n\n* Now edit these new files and set the following properties:\n\n```\nconfig/server-1.properties:\n    broker.id=1\n    listeners=PLAINTEXT://:9093\n    log.dir=/tmp/kafka-logs-1\n\nconfig/server-2.properties:\n    broker.id=2\n    listeners=PLAINTEXT://:9094\n    log.dir=/tmp/kafka-logs-2\n\nconfig/server-3.properties:\n    broker.id=3\n    listeners=PLAINTEXT://:9095\n    log.dir=/tmp/kafka-logs-3\n```\n\n### Start the 
ZooKeeper\n\n```\n$ bin/zookeeper-server-start.sh config/zookeeper.properties\n```\n\n### Start the brokers\n\n```\n$ bin/kafka-server-start.sh config/server-1.properties\n\n$ bin/kafka-server-start.sh config/server-2.properties\n\n$ bin/kafka-server-start.sh config/server-3.properties\n```\n\n### Create a new topic with a replication factor of three\n\n```\n$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic test-firehose\n```\n\n### Kafka Producer Properties\n\n* Update the Kafka producer properties in [application.yml](/src/main/resources/application.yml)\n\n```\nbootstrap servers: localhost:9093,localhost:9094,localhost:9095\n```\n\n### Spark Consumer Properties\n\n* Update the Spark consumer properties in [application.yml](/src/main/resources/application.yml)\n\n```\nbootstrap servers: localhost:9093,localhost:9094,localhost:9095\nzookeeper connect: localhost:2181\n```\n\n### Stock market data location\n* Update the data location in [application.yml](/src/main/resources/application.yml)\n\n```\ndata location: /Users/rohitkumar/Work/code-brusher/firehose/src/main/resources/data\n```\n\n### How to run the firehose\n\n```\n$ mvn spring-boot:run\n```\n\n### See the Spark UI\n\n* [spark-web-ui](http://localhost:4040) for the firehose job stats.\n\n## Firehose statistics on my machine\n\n#### 1 million stock price records processed in `4 min 16 sec`.\n\n* Machine: MacBook Pro, 2.7 GHz Intel dual-core i5, 8 GB 1867 MHz DDR3 RAM, 128 GB SSD.\n\n![Spark one million](/src/main/resources/spark_stats/Spark-1.png \"Spark UI\")\n![Spark scheduling delay](/src/main/resources/spark_stats/spark-2.png \"Spark UI\")\n![Spark batch status](/src/main/resources/spark_stats/spark-3.png \"Spark UI\")\n![Complete data processed](/src/main/resources/spark_stats/spark-4.png \"Spark UI\")\n\n### Performance tuning Spark Streaming\n\n#### Batch Interval Parameter 
\n\nStart with an intuitive batch interval, say 5 or 10 seconds. If your overall processing time \u003c the batch interval, the application is stable. In my case a 15-second interval suited the processing and gave good performance.\n\n```\nJavaStreamingContext streamingContext = new JavaStreamingContext(sparkContext, new Duration(15000));\n```\n\n#### ConcurrentJobs Parameter\n\nBy default the number of concurrent jobs is 1, which means only one job is active at a time; until it finishes, other jobs are queued up even if resources are available and idle. This parameter is intentionally left undocumented in the Spark docs because it can sometimes cause odd behaviour, as Spark Streaming creator Tathagata Das discussed in this useful [thread](http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming). Tune it accordingly, keeping the side effects in mind.\n\nRunning concurrent jobs brings down the overall processing time and scheduling delay, even if a batch takes slightly longer to process than the batch interval.\n\nIn my case:\n\n```\n\"spark.streaming.concurrentJobs\",\"1\" - Scheduling Delay around 3.43 seconds and Processing Time 9.8 seconds\n\"spark.streaming.concurrentJobs\",\"2\" - Improved - Scheduling Delay 1 millisecond and Processing Time 8.8 seconds\n```\n\n#### Backpressure Parameter\n\nSpark offers a very powerful feature called backpressure. With this property enabled, Spark Streaming tells Kafka to slow down its message rate whenever processing time exceeds the batch interval and scheduling delay is growing. It is helpful when there is a sudden surge in data flow, and it is a must-have in production to avoid overburdening the application. 
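Backpressure exists precisely because scheduling delay compounds once processing time exceeds the batch interval. A minimal stdlib-Java sketch of that accumulation (a hypothetical helper for illustration, not part of this repository; the numbers mirror the ones measured above):

```java
// Hypothetical simulation of the scheduling-delay behaviour described above:
// a new batch arrives every intervalMs; if the previous batch is still being
// processed, the new batch waits, and that wait is the scheduling delay.
public class SchedulingDelaySim {
    static long finalDelayMs(long intervalMs, long processingMs, int batches) {
        long delay = 0;
        for (int i = 0; i < batches; i++) {
            // Each batch starts delay ms late and finishes processingMs later;
            // any overshoot past the interval carries over to the next batch.
            delay = Math.max(0, delay + processingMs - intervalMs);
        }
        return delay;
    }

    public static void main(String[] args) {
        // Stable: 9.8 s processing inside a 15 s interval, delay stays at zero.
        System.out.println(finalDelayMs(15_000, 9_800, 10));  // 0
        // Unstable: 16.5 s processing, delay grows by 1.5 s per batch.
        System.out.println(finalDelayMs(15_000, 16_500, 10)); // 15000
    }
}
```

The unstable case is exactly what backpressure prevents: instead of letting the delay grow without bound, Spark throttles the ingest rate from Kafka.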
However, this property should be disabled during development and staging; otherwise we cannot test the maximum load our application can and should handle.\n\n```\nset(\"spark.streaming.backpressure.enabled\",\"true\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frrohitramsen%2Ffirehose","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frrohitramsen%2Ffirehose","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frrohitramsen%2Ffirehose/lists"}