Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks
San Francisco Fire Calls. A Spark application on Databricks using PySpark and SQL for common data analytics patterns and operations on a San Francisco Fire Department Calls dataset.
- Host: GitHub
- URL: https://github.com/cleberzumba/data-analysis-with-apache-spark-and-databricks
- Owner: cleberzumba
- Created: 2024-06-21T16:47:35.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-21T21:22:46.000Z (8 months ago)
- Last Synced: 2024-10-19T03:06:15.929Z (4 months ago)
- Topics: databricks, pyspark, spark, sql
- Homepage:
- Size: 3.49 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# San Francisco Fire Calls ETL and Analysis
#### SUMMARY
Fire Calls-For-Service includes all fire units' responses to 911 calls from the city's Computer-Aided Dispatch (“CAD”) system. This includes responses to Medical Incidents requiring EMS staff. Each record includes the call number, incident number, address, unit identifier, call type, and disposition. All relevant time intervals are also included. Because this dataset is based on responses, and since most calls involve multiple units, there are multiple records for each call number. Addresses are associated with an intersection or call box, not a specific address.
#### HOW TO USE THIS DATASET
Because the dataset is response-based and most calls involve multiple units, each call number appears in multiple records. The most common call types are Medical Incidents, Alarms, Structure Fires, and Traffic Collisions.
![imagem](images/etl-process-image.png)
* This pipeline uses the San Francisco Fire Department's call event dataset and demonstrates:
* *An end-to-end data engineering pipeline covering the extract, transform, and load (ETL) steps for large volumes of data, using PySpark for transformations and Spark SQL for queries. Caching techniques were implemented to optimize query performance, and data analysis was conducted to gain insights.*
* *How to answer questions by analyzing the data with Spark SQL.*
#### BENEFITS OF THE TECHNIQUES USED
* *Partitioning: Improves data reading and writing by dividing data into smaller, more manageable partitions.*
* *Spark Settings: Tweaks like `spark.sql.shuffle.partitions` and `spark.sql.autoBroadcastJoinThreshold` help optimize shuffle and join operations.*
* *Parquet Format: Storing data as Parquet improves read and write performance thanks to its columnar layout and support for compression.*
* *Cache: Caching frequently used DataFrames reduces subsequent data reading time.*
* *Integrated Analysis: Analysis can be performed directly in Databricks, with integrated visualizations for easy interpretation of the results.*
* *Scalability: Using Databricks and Spark allows the pipeline to scale easily to large volumes of data.*