Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/782e616c6d/covid-d.a
Academic project, using Apache Spark for ETL and Data Studio for data analysis.
academic analytics automation cluster covid-19 data database etl python spark sql
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/782e616c6d/covid-d.a
- Owner: 782e616c6d
- License: gpl-3.0
- Created: 2022-06-21T22:17:25.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-13T02:34:07.000Z (4 months ago)
- Last Synced: 2024-09-13T15:12:38.019Z (4 months ago)
- Topics: academic, analytics, automation, cluster, covid-19, data, database, etl, python, spark, sql
- Language: Python
- Homepage:
- Size: 22.2 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Detailed Project Description.
![spark-logo-rev](https://user-images.githubusercontent.com/76137086/174940667-b6b5f635-71a4-434d-8e1b-e9c8e83acee0.svg)
1 - Install Apache Spark dependencies.
2 - Install Apache Spark. (I opted for the "Standalone Cluster" mode because it suited my needs, but feel free to check and suggest simpler or more efficient installation modes.)
3 - If, like me, you chose the "Standalone" mode, follow the steps in the documentation.
Doc. Link:
https://spark.apache.org/docs/latest/spark-standalone.html
4 - After installation, the cluster is ready to run. Start it with the "start-all.sh" script and wait for it to come up. (You can access the cluster through the web UI or the terminal, at your discretion.)
5 - With the cluster in full operation, submit your applications through "spark-submit".
6 - To shut down the entire cluster, run "stop-all.sh".
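As a rough illustration of step 5, the sketch below assembles a "spark-submit" command line from Python. It assumes a standalone master listening on the default port 7077 on localhost. The script name "etl_job.py" and the resource values are hypothetical placeholders, not files or settings taken from this repository.

```python
import shlex
import subprocess
from typing import List, Optional


def build_submit_command(
    app_path: str,
    master_url: str = "spark://localhost:7077",  # assumption: standalone master on the default port
    executor_memory: Optional[str] = None,
    total_executor_cores: Optional[int] = None,
) -> List[str]:
    """Build the argument list for a spark-submit call against a standalone master."""
    command = ["spark-submit", "--master", master_url]
    if executor_memory is not None:
        command += ["--executor-memory", executor_memory]
    if total_executor_cores is not None:
        command += ["--total-executor-cores", str(total_executor_cores)]
    command.append(app_path)
    return command


def submit(command: List[str]) -> int:
    """Run the command and return its exit code (requires spark-submit on the PATH)."""
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # "etl_job.py" is a hypothetical script name used only for illustration.
    cmd = build_submit_command("etl_job.py", executor_memory="1g", total_executor_cores=2)
    print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so paths with spaces do not need manual quoting. Calling "submit(cmd)" only works on a machine where Spark is installed and the cluster is already running.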
![0_Dnt6wUWlARdI1wim](https://user-images.githubusercontent.com/76137086/174943043-f9a2b98b-a2eb-41db-a167-9db342350dda.png)
Links to databases:
1 - Community Mobility Reports (Br).
Database: Google.
Link: https://www.google.com.br/covid19/mobility/
2 - Variation of Cases (Covid-19).
Database: Fiocruz.
Link: https://bigdata-covid19.icict.fiocruz.br/
Period: January/2020 - December/2020.
However, the period can easily be extended, since the data is updated constantly.
![0](https://user-images.githubusercontent.com/76137086/174943501-d5fd7b9d-31a0-41ba-bad4-cc47fb9299a4.png)
Data display and analysis: Data Studio.
Link: https://datastudio.google.com/reporting/a55071c6-62c4-4bd9-860d-08cb4a4116d8
![images-removebg-preview](https://user-images.githubusercontent.com/76137086/174942117-e71f2707-54ac-4c9d-996d-7fddb1b1f1c4.png)
The ETL process was done with Apache Spark, specifically PySpark, along with other Python tools.
Note:
1 - "/Brute"
Description: Location where the raw data is stored.
2 - "/Processing"
Description: Location where the data remains until processing is finished.
3 - "/Final"
Description: Location where the processed data is stored.
4 - As you can see, "Cases.csv" and "Deaths.csv" were downloaded directly into the directory where the processing takes place. Because they are isolated datasets, they do not need to pass through the initial filter that the first database (Community Mobility Reports) requires.
5 - Code Formatter: Yapf.
6 - On Linux distributions, the "cron" task scheduler can be used to automate the process.
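The directory flow described in notes 1 to 4 can be sketched with the Python standard library alone. This is not the project's PySpark job, only a minimal illustration of how a file might move from "/Brute" through "/Processing" to "/Final". The function names and the row filter are assumptions introduced for the example.

```python
import csv
import shutil
from pathlib import Path
from typing import Callable, Dict

STAGES = ("Brute", "Processing", "Final")


def ensure_stages(root: Path) -> Dict[str, Path]:
    """Create the three stage directories under root and return their paths."""
    paths = {name: root / name for name in STAGES}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def filter_raw_to_processing(
    raw_file: Path,
    processing_dir: Path,
    keep_row: Callable[[Dict[str, str]], bool],
) -> Path:
    """Copy the rows accepted by keep_row from a raw CSV into the Processing directory."""
    target = processing_dir / raw_file.name
    with raw_file.open(newline="", encoding="utf-8") as src, \
            target.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if keep_row(row):
                writer.writerow(row)
    return target


def publish(processed_file: Path, final_dir: Path) -> Path:
    """Move a fully processed file from Processing into the Final directory."""
    destination = final_dir / processed_file.name
    return Path(shutil.move(str(processed_file), str(destination)))
```

Files that need no initial filter, such as "Cases.csv" and "Deaths.csv" in note 4, would skip "filter_raw_to_processing" and start directly in the Processing directory.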
![png-transparent-ubuntu-server-edition-long-term-support-installation-linux-linux-lamp-linux-ubuntu-16-removebg-preview](https://user-images.githubusercontent.com/76137086/175204618-59d2eb0b-4973-403e-9549-2956eaeaa177.png)
Hardware Settings:
1 - 2 CPU cores.
2 - 2 GB RAM.
3 - 10 GB HD.
4 - OS: Ubuntu Server 22.04.
Note: Hyper-V was used as the hypervisor for this project to build a cluster with 2 nodes. The configuration above describes a single node.
![images-removeb2)](https://user-images.githubusercontent.com/76137086/174941919-db3bd0a0-cc4b-44d1-8f09-66e1b1d0b325.png)
"Project for academic purposes, using Google and Fiocruz databases to verify the relationship between the mobility of the Brazilian population and the variation in cases and deaths."
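One simple way to quantify such a relationship is a Pearson correlation between two series aligned by date. The sketch below is only an illustration of that idea and is not confirmed to be the method used in this project. The function names and the dictionary-based input format are assumptions.

```python
from math import sqrt
from typing import Dict, List, Tuple


def align_by_date(
    mobility: Dict[str, float], cases: Dict[str, float]
) -> Tuple[List[float], List[float]]:
    """Keep only the dates present in both series, in sorted date order."""
    shared_dates = sorted(set(mobility) & set(cases))
    return [mobility[d] for d in shared_dates], [cases[d] for d in shared_dates]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear association and a value near 0 a weak one. Correlation alone does not establish that mobility changes cause changes in cases or deaths.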
Data providers and maintainers:
https://bigdata-covid19.icict.fiocruz.br/
SIVEP-Gripe.
eSUS-VE.
Google LLC "Google COVID-19 Community Mobility Reports".
https://www.google.com/covid19/mobility/
Accessed: 12/06/2022.