https://github.com/kabeera1007/bike_data_play
End to End Data Engineering project with multiple ETL & ELT pipelines.
airflow anaconda bigquery cloud dbt docker gcs python spark terraform
- Host: GitHub
- URL: https://github.com/kabeera1007/bike_data_play
- Owner: kabeera1007
- Created: 2024-12-28T11:57:31.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-01-27T19:04:28.000Z (over 1 year ago)
- Last Synced: 2025-02-22T06:14:45.084Z (over 1 year ago)
- Topics: airflow, anaconda, bigquery, cloud, dbt, docker, gcs, python, spark, terraform
- Language: Jupyter Notebook
- Homepage:
- Size: 349 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Project Name: Bike Data Play - Divvy Bike-Sharing Analysis
## Description
This project processes and analyzes Divvy bike-sharing data from Chicago (2020–2024) using a range of tools and techniques.

## Workflow
This project integrates several tools and processes to manage the workflow:
***Tools***:
- **DBT**: Data transformation and analysis.
- **Airflow**: Task scheduling and orchestration.
- **Spark**: Data processing.
- **Docker**: Containerization.
- **Terraform**: Infrastructure management.
- **GCS**: Cloud storage (Google Cloud Storage).
***Steps***:
- **Step 1**: ETL and ELT using Spark.
- **Step 2**: ELT using dbt and GCS.
- **Step 3**: ELT using Docker, Terraform, and Airflow.
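The difference between the ETL and ELT steps above can be illustrated with a small stand-in. The sketch below uses Python's built-in `sqlite3` module in place of Spark, dbt, and BigQuery, so it is not the project's actual pipeline. The table names, function names, and sample rows are hypothetical.

```python
import sqlite3
from datetime import datetime

# Hypothetical sample rows: (ride_id, started_at, ended_at)
RAW_TRIPS = [
    ("A1", "2024-01-01 08:00:00", "2024-01-01 08:15:00"),
    ("B2", "2024-01-01 09:00:00", "2024-01-01 09:45:30"),
]

FMT = "%Y-%m-%d %H:%M:%S"


def run_etl(conn, rows):
    """ETL: transform in application code, then load the finished table."""
    transformed = []
    for ride_id, started_at, ended_at in rows:
        delta = datetime.strptime(ended_at, FMT) - datetime.strptime(started_at, FMT)
        transformed.append((ride_id, int(delta.total_seconds())))
    conn.execute("CREATE TABLE trips_etl (ride_id TEXT, duration_s INTEGER)")
    conn.executemany("INSERT INTO trips_etl VALUES (?, ?)", transformed)


def run_elt(conn, rows):
    """ELT: load the raw rows first, then transform with SQL inside the store."""
    conn.execute(
        "CREATE TABLE raw_trips (ride_id TEXT, started_at TEXT, ended_at TEXT)"
    )
    conn.executemany("INSERT INTO raw_trips VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE trips_elt AS
        SELECT ride_id,
               CAST(strftime('%s', ended_at) AS INTEGER)
               - CAST(strftime('%s', started_at) AS INTEGER) AS duration_s
        FROM raw_trips
        """
    )


def durations(conn, table):
    """Read back (ride_id, duration_s) pairs from one of the tables above."""
    query = f"SELECT ride_id, duration_s FROM {table} ORDER BY ride_id"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_TRIPS)
run_elt(conn, RAW_TRIPS)
etl_result = durations(conn, "trips_etl")
elt_result = durations(conn, "trips_elt")
```

Both paths produce the same durations. What changes is where the transformation runs: in application code before loading (ETL) or as SQL over already-loaded raw data (ELT), which is the role dbt models typically play.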
## Project Structure
The project structure is organized as follows:
- **analyses/**: Contains DBT analysis scripts.
- **dags/**: Airflow DAGs for task scheduling.
- **macros/**: Custom DBT macros.
- **models/**: DBT models for data transformation.
- **scripts/**: Project setup scripts.
- **seeds/**: Raw data for seeding DBT models.
- **snapshots/**: DBT snapshots for table versioning.
- **spark_notebooks/**: Jupyter Notebooks for Spark-based analysis.
- **terraf/**: Terraform configuration files.
- **tests/**: DBT tests for data quality.
- **.gitignore**: Files and directories excluded from version control.
- **Dockerfile**: Docker configuration for the project.
- **docker-compose.yaml**: Docker Compose configuration for container orchestration.
- **requirements.txt**: Python dependencies for the project.
## Data
The dataset contains Divvy bike-sharing trip data from 2020 to 2024.
- **Rows**: 20 million+
The columns include:
- **ride_id**: Unique ID assigned to each Divvy trip.
- **rideable_type**: Type of vehicle used (bike or scooter).
- **started_at**: Start date and time of the trip.
- **ended_at**: End date and time of the trip.
- **start_station_name**: Name of the start station.
- **start_station_id**: Unique ID of the start station.
- **end_station_name**: Name of the end station.
- **end_station_id**: Unique ID of the end station.
- **start_lat**: Latitude of the start station.
- **start_lng**: Longitude of the start station.
- **end_lat**: Latitude of the end station.
- **end_lng**: Longitude of the end station.
- **member_casual**: Whether the rider is a Divvy member or a casual user.
[Link to Dataset](https://divvy-tripdata.s3.amazonaws.com/index.html)
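As a rough illustration of how the columns above might be read into typed records, the sketch below parses CSV text with the standard library and derives a trip duration. The `Trip` class, the helper names, the sample rows, and the timestamp format are assumptions for illustration. The sketch also assumes that station names can be blank.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Trip:
    ride_id: str
    rideable_type: str
    started_at: datetime
    ended_at: datetime
    start_station_name: Optional[str]
    end_station_name: Optional[str]
    member_casual: str

    @property
    def duration_minutes(self) -> float:
        """Trip length in minutes, derived from the two timestamps."""
        return (self.ended_at - self.started_at).total_seconds() / 60


def blank_to_none(value: str) -> Optional[str]:
    """Treat empty strings as missing values (assumption about the data)."""
    return value or None


def parse_trips(csv_text: str) -> List[Trip]:
    """Parse CSV text with a header row into Trip records."""
    trips = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        trips.append(
            Trip(
                ride_id=row["ride_id"],
                rideable_type=row["rideable_type"],
                # Assumed format: "YYYY-MM-DD HH:MM:SS"
                started_at=datetime.fromisoformat(row["started_at"]),
                ended_at=datetime.fromisoformat(row["ended_at"]),
                start_station_name=blank_to_none(row["start_station_name"]),
                end_station_name=blank_to_none(row["end_station_name"]),
                member_casual=row["member_casual"],
            )
        )
    return trips


# Hypothetical sample data using the column names listed above.
SAMPLE_CSV = (
    "ride_id,rideable_type,started_at,ended_at,start_station_name,"
    "start_station_id,end_station_name,end_station_id,start_lat,start_lng,"
    "end_lat,end_lng,member_casual\n"
    "R1,electric_bike,2024-01-01 08:00:00,2024-01-01 08:12:30,Station A,SA,"
    "Station B,SB,41.88,-87.63,41.89,-87.62,member\n"
    "R2,classic_bike,2024-01-01 09:00:00,2024-01-01 09:30:00,,,"
    "Station B,SB,41.90,-87.64,41.89,-87.62,casual\n"
)

trips = parse_trips(SAMPLE_CSV)
```

At the project's scale (20 million+ rows) this kind of parsing would more plausibly be done in Spark or in the warehouse, so treat the example as a way to reason about the schema on a small sample.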
## Visualization
[Link to Visualization](https://lookerstudio.google.com/reporting/ccd00616-ec8b-443f-b6e3-c6e6446bfc8c)
## Installation
The complete project is hosted on Google Cloud.
### Prerequisites
To run this project, ensure the following tools are installed:
1. **Python** (version X.X.X)
2. **Docker** (for containerization)
3. **DBT** (for data transformation)
4. **Terraform** (for infrastructure management)
5. **Airflow** (for task scheduling)
6. **Spark** (for data processing)
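
One way to check the prerequisites listed above is to look for each tool's executable on `PATH`. The following is a minimal sketch. The executable names are assumptions that may differ by platform (for example, `python` instead of `python3` on Windows), and tools that run only inside Docker containers may not need to be installed locally.

```python
import shutil

# Assumed executable names for each prerequisite; adjust for your setup.
REQUIRED_TOOLS = {
    "Python": "python3",
    "Docker": "docker",
    "DBT": "dbt",
    "Terraform": "terraform",
    "Airflow": "airflow",
    "Spark": "spark-submit",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; by default it uses `shutil.which`.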