https://github.com/kenriortega/spark-streaming-course
# Repository for the Rock the JVM Spark Streaming with Scala course
### Tech used in this project
- Apache Kafka
- Apache Cassandra
- PostgreSQL
- netcat

### Spark Streaming principles
> Declarative API
- write `what` needs to be computed, let the library decide `how`
- alternative: *RAAT* (record-at-a-time)
- set of APIs to process each incoming element as it arrives
- low-level: maintaining state & resource usage is your responsibility
- hard to develop
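
As an illustration of the declarative style, here is a minimal, self-contained sketch (not taken from the course code): we only state *what* to compute, and Spark decides how to run it across micro-batches. The app name, the `local[2]` master and the netcat host/port are made-up values.

```scala
import org.apache.spark.sql.SparkSession

object DeclarativeWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Declarative Word Count") // illustrative name
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    // describe *what* to compute: read lines from a socket, split into words, count them
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost") // assumes e.g. `nc -lk 12345` running locally
      .option("port", 12345)
      .load()

    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // the library decides *how*: scheduling, state handling, fault tolerance
    wordCounts.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
```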

> Event time vs Processing time API
- event time = when the event was produced
- processing time = when the event arrives
- event time is critical: allows detection of late data points
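
To make the distinction concrete, a hedged sketch of event-time processing: windows are computed from the timestamp carried inside each record (when it was produced), and a watermark tells Spark how long to keep accepting late records. `events`, the column name and the intervals are assumptions for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// `events` is an assumed streaming DataFrame with a "timestamp" column (event time)
def lateTolerantCounts(events: DataFrame): DataFrame =
  events
    .withWatermark("timestamp", "10 minutes")       // tolerate records arriving up to 10 minutes late
    .groupBy(window(col("timestamp"), "5 minutes")) // window by when the event was produced, not when it arrived
    .count()
```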

> Continuous vs micro-batch execution
- continuous = include each data point as it arrives `lower latency`
- micro-batch = wait for a few data points, process them all in the new result `higher throughput`

> Low-level (`DStreams`) vs High-level API (`Structured Streaming`)
Spark Streaming operates on micro-batches
`continuous execution is experimental` [2020]

### Structured Streaming Principles
> Lazy evaluation
Transformation and Action
- transformations describe how new DFs are obtained
- actions start executing/running Spark code
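
A small sketch of that split (`logsDF` and the column name are assumptions): the filter is only a description of a new DF, and Spark runs nothing until the streaming query is started.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.StreamingQuery

// transformation: describes how the new DF is obtained, nothing executes yet
def errorsOnly(logsDF: DataFrame): DataFrame =
  logsDF.filter(col("level") === "ERROR")

// action: starting the query is what actually runs Spark code
def startErrorQuery(errors: DataFrame): StreamingQuery =
  errors.writeStream
    .format("console")
    .outputMode("append")
    .start()
```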

Input sources e.g.:
- Kafka, Flume
- a distributed file system
- sockets

Output sinks e.g.:
- a distributed file system
- databases
- Kafka
- testing sinks e.g.: console, memory
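
For example, a Kafka source wired to a JSON file sink could look like the sketch below (it needs the external Kafka connector package for Structured Streaming; broker address, topic and paths are placeholders, and file sinks require a checkpoint location).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KafkaToFiles").master("local[2]").getOrCreate()

// source: a Kafka topic (broker and topic names are made up)
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// sink: JSON files on a (distributed) file system, with a mandatory checkpoint location
kafkaDF
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("json")
  .option("path", "output/events")
  .option("checkpointLocation", "checkpoints/events")
  .start()
  .awaitTermination()
```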

> Streaming I/O

Output modes
- append = only add new records
- update = modify records in place `if query has no aggregations, equivalent to append`
- complete = rewrite everything

Not all queries and sinks support all output modes
- e.g.: aggregations and append mode
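
For instance, `update` works for a streaming aggregation (only the changed counts are emitted each micro-batch), `complete` rewrites the whole result, while plain `append` is rejected for such a query unless a watermark is defined. `wordCounts` is an assumed aggregated streaming DataFrame.

```scala
// wordCounts is an assumed streaming aggregation, e.g. groupBy(...).count()
wordCounts.writeStream
  .format("console")
  .outputMode("update") // emit only the rows whose aggregate changed in this micro-batch
  .start()
  .awaitTermination()
```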

Triggers = when new data is written
- default: write as soon as the current micro-batch has been processed
- once: write a single micro-batch and stop
- processing-time: look for new data at fixed intervals
- continuous (currently experimental `2020`)
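
A sketch of how those triggers are chosen on a query (`counts` is an assumed streaming DataFrame and the intervals are illustrative; with no explicit trigger you get the default behaviour above):

```scala
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

counts.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime(2.seconds)) // processing-time: look for new data every 2 seconds
  // .trigger(Trigger.Once())                 // once: a single micro-batch, then stop
  // .trigger(Trigger.Continuous(1.second))   // experimental continuous execution
  .start()
  .awaitTermination()
```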

### Discretized Streams

Never-ending sequence of RDDs
- nodes' clocks are synchronized
- batches are triggered at the same time in the cluster
- each batch is an RDD

Essentially a distributed collection of elements of the same type
- functional operators e.g.: map, flatMap, filter, reduce
- accessors to each RDD
- more advanced operators (later)

Needs a receiver to perform computations
- one receiver per DStream
- fetches data from the source, sends it to Spark, creates blocks
- is managed by the `StreamingContext` on the driver
- occupies one core on the machine!
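
A hedged sketch of the low-level API (host, port and batch interval are illustrative); note the `local[2]` master: the receiver occupies one core on its own, so at least one more is needed for processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamsWordCount {
  def main(args: Array[String]): Unit = {
    // at least 2 local threads: the receiver takes one core for itself
    val conf = new SparkConf().setAppName("DStreams Word Count").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // a new batch (RDD) every second

    // one receiver per DStream: it fetches from the socket, sends data to Spark, creates blocks
    val lines = ssc.socketTextStream("localhost", 12345)
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()            // computations only begin here
    ssc.awaitTermination()
  }
}
```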