https://github.com/pragmaai/yelp-datapipeline
Yelp Data Pipeline & Analytics Dashboard: end-to-end data engineering pipeline processing the Yelp dataset with Rust transforms, Apache Airflow orchestration, and interactive Streamlit analytics. Features business insights, user engagement analysis, and city performance comparisons. Docker-ready • Interactive Dashboard • High-performance Rust
- Host: GitHub
- URL: https://github.com/pragmaai/yelp-datapipeline
- Owner: PragmaAI
- Created: 2025-06-28T11:38:41.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-06-28T11:43:36.000Z (3 months ago)
- Last Synced: 2025-06-28T12:38:05.171Z (3 months ago)
- Topics: airflow, data-engineering, data-pipeline, data-visualization, datafusion, docker, rust, streamlit, yelp, yelp-dataset
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Yelp Data Pipeline & Analytics Dashboard
A comprehensive data engineering pipeline that processes Yelp dataset JSON files, transforms them using Rust for high performance, orchestrates workflows with Apache Airflow, and provides interactive analytics through a Streamlit dashboard.
## Project Overview
This project demonstrates a modern data engineering stack for processing and analyzing Yelp business data:
- **Data Ingestion**: JSON to Parquet conversion for efficient storage
- **Data Transformation**: High-performance Rust-based data processing
- **Workflow Orchestration**: Apache Airflow DAGs for reliable pipeline execution
- **Data Visualization**: Interactive Streamlit dashboard for business insights
- **Containerization**: Docker Compose for easy deployment

## Architecture
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Raw JSON     │     │     Parquet     │     │    Analytics    │
│   Data Files    │────▶│   Conversion    │────▶│    Dashboard    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │    Rust Data    │
                        │    Transform    │
                        └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │     Apache      │
                        │     Airflow     │
                        └─────────────────┘
```

## Quick Start
### Prerequisites
- Docker and Docker Compose
- Python 3.8+
- Rust (for local development)
- Git

### 1. Clone the Repository
```bash
git clone https://github.com/PragmaAI/yelp-datapipeline.git
cd yelp-datapipeline
```

### 2. Prepare Your Data
Place your Yelp dataset JSON files in the `data/raw/` directory:
```
data/
├── raw/
│   ├── business.json
│   ├── review.json
│   ├── user.json
│   └── tip.json
└── processed/
    └── (will be created automatically)
```

## Running with Docker Compose
### Start Airflow
```bash
# Start Airflow services
docker-compose up -d

# Access Airflow UI
open http://localhost:8080
# Default credentials: airflow/airflow
```

### Run the Data Pipeline
1. **Navigate to Airflow UI**: http://localhost:8080
2. **Enable DAGs**: Click the toggle switch next to each DAG
3. **Trigger DAGs** in this order:
- `json_to_parquet_dag` - Converts JSON to Parquet
- `rust_transform_dag` - Runs Rust data transformations
- `yelp_rolling_etl` - Performs rolling ETL operations
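For orientation, here is a minimal sketch of what a DAG like `json_to_parquet_dag` could look like, assuming it shells out to `scripts/json_to_parquet.py`; the actual DAG definitions in `dags/` may be structured differently:

```python
# Illustrative sketch only, not the repo's actual DAG definition.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="json_to_parquet_dag",
    start_date=datetime(2025, 1, 1),
    schedule=None,       # triggered manually from the Airflow UI
    catchup=False,
    tags=["yelp", "ingestion"],
) as dag:
    # Convert the raw Yelp JSON files into Parquet under data/processed/.
    # The script path inside the container is an assumption.
    convert = BashOperator(
        task_id="json_to_parquet",
        bash_command="python /opt/airflow/scripts/json_to_parquet.py",
    )
```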

### Monitor Pipeline Execution
- **DAGs Tab**: View all available workflows
- **Graph View**: Visualize DAG dependencies
- **Logs**: Check task execution logs
- **XCom**: View data passed between tasks

## Streamlit Analytics Dashboard
### Start the Dashboard
```bash
# Navigate to streamlit app directory
cd streamlit_app

# Install dependencies
pip install -r requirements.txt

# Run the dashboard
./run_app.sh
# or manually:
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```

### Access the Dashboard
Open your browser and navigate to: **http://localhost:8501**
## Dashboard Features
### Dashboard Overview
- **Key Metrics**: Business counts, user engagement, elite users
- **Business Performance**: City-wise comparison charts
- **User Engagement**: Distribution analysis
- **Top Performers**: Best-rated businesses and active users

### Business Analytics
- **Interactive Filtering**: Filter by city, category, and rating
- **Performance Metrics**: Rating distribution, review analysis
- **Category Insights**: Business category performance
- **City Comparison**: Cross-city business analysis

### User Analytics
- **User Engagement**: Activity patterns and user categories
- **Elite Users**: Analysis of elite user characteristics
- **Sentiment Analysis**: User sentiment patterns
- **User Compliments**: Recognition and engagement metrics
- **Activity Timeline**: User activity over time
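As a rough illustration of how the interactive filtering and caching described above might be wired together in Streamlit (a sketch only; column names such as `city` and `stars` follow the Yelp business schema, and the real `app.py` may differ):

```python
# Illustrative Streamlit snippet, not the repo's actual app.py.
import pandas as pd
import streamlit as st


@st.cache_data  # cache the Parquet load so dashboard reruns stay fast
def load_businesses(path: str = "data/processed/business.parquet") -> pd.DataFrame:
    return pd.read_parquet(path)


df = load_businesses()

# Sidebar filters: city and minimum rating
city = st.sidebar.selectbox("City", sorted(df["city"].dropna().unique()))
min_stars = st.sidebar.slider("Minimum rating", 1.0, 5.0, 3.5, step=0.5)

filtered = df[(df["city"] == city) & (df["stars"] >= min_stars)]

st.metric("Businesses matching filters", len(filtered))
st.bar_chart(filtered["stars"].value_counts().sort_index())
```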

## Manual Development Setup

### Local Airflow Setup
```bash
# Install Airflow
pip install apache-airflow

# Initialize Airflow database
airflow db init

# Create admin user
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com \
--password admin

# Start Airflow webserver
airflow webserver --port 8080

# Start Airflow scheduler (in another terminal)
airflow scheduler
```

### Rust Development
```bash
# Navigate to Rust project
cd scripts/transform

# Build the project
cargo build --release

# Run tests
cargo test

# Run the transform
cargo run --release
```

### Python Dependencies
```bash
# Install Python dependencies
pip install -r requirements.txt

# For development
pip install -r requirements-dev.txt # if available
```

## Project Structure
```
yelp-datapipeline/
├── airflow/                   # Airflow Docker configuration
│   ├── Dockerfile
│   └── entrypoint.sh
├── dags/                      # Airflow DAGs
│   ├── json_to_parquet_dag.py
│   ├── rust_transform_dag.py
│   ├── yelp_rolling_etl.py
│   └── README.md
├── data/                      # Data storage
│   ├── raw/                   # Raw JSON files
│   └── processed/             # Processed Parquet files
├── notebooks/                 # Jupyter notebooks
│   └── analysis.ipynb
├── scripts/                   # Data processing scripts
│   ├── json_to_parquet.py     # Python JSON converter
│   └── transform/             # Rust data transformer
│       ├── Cargo.toml
│       └── src/main.rs
├── streamlit_app/             # Streamlit dashboard
│   ├── app.py
│   ├── requirements.txt
│   ├── run_app.sh
│   └── README.md
├── docker-compose.yml         # Docker services
├── requirements.txt           # Python dependencies
├── run_pipeline.sh            # Pipeline runner
└── start_airflow.sh           # Airflow starter
```

## Data Pipeline Flow
### 1. Data Ingestion
- **Input**: Yelp JSON files (business, review, user, tip)
- **Process**: Convert to Parquet format for efficient storage
- **Output**: Parquet files in `data/processed/`
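A minimal version of this step, assuming the Yelp files are newline-delimited JSON (the format the official dataset ships in), might look like the following; the repo's `scripts/json_to_parquet.py` may handle chunking and schemas more carefully:

```python
# Illustrative JSON -> Parquet conversion; see scripts/json_to_parquet.py for the real implementation.
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw")
OUT_DIR = Path("data/processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)

for name in ["business", "review", "user", "tip"]:
    src = RAW_DIR / f"{name}.json"
    if not src.exists():
        continue
    # Yelp dataset files are newline-delimited JSON, hence lines=True
    df = pd.read_json(src, lines=True)
    df.to_parquet(OUT_DIR / f"{name}.parquet", index=False)
    print(f"{name}: {len(df):,} rows written")
```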

### 2. Data Transformation
- **Input**: Parquet files from ingestion
- **Process**: Rust-based transformations for high performance
- **Output**: Enhanced analytics datasets

### 3. Analytics Processing
- **Input**: Transformed data
- **Process**: Generate business insights, user analytics, city comparisons
- **Output**: Analytics-ready datasets for dashboard

### 4. Visualization
- **Input**: Analytics datasets
- **Process**: Streamlit dashboard rendering
- **Output**: Interactive web interface

## Key Analytics Features
### Business Insights
- Top-performing businesses by city
- Rating distribution analysis
- Category performance comparison
- Review sentiment analysis

### User Analytics
- User engagement patterns
- Elite user characteristics
- User sentiment analysis
- Activity timeline tracking

### City Performance
- Cross-city business comparison
- Rating tier analysis
- Review volume analysis
- Business density metrics
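A sketch of the kind of aggregation behind these city comparisons, using the standard Yelp business columns (`city`, `stars`, `review_count`); the dashboard's actual queries may differ:

```python
# Illustrative city-level aggregation (column names assumed from the Yelp business data).
import pandas as pd

business = pd.read_parquet("data/processed/business.parquet")

city_stats = (
    business.groupby("city")
    .agg(
        business_count=("business_id", "count"),
        avg_rating=("stars", "mean"),
        total_reviews=("review_count", "sum"),
    )
    .sort_values("business_count", ascending=False)
    .head(10)
)
print(city_stats)
```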

## Configuration

### Environment Variables
Create a `.env` file for custom configuration:
```bash
# Airflow Configuration
AIRFLOW_UID=50000
AIRFLOW_GID=0

# Database Configuration
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Data Paths
DATA_RAW_PATH=./data/raw
DATA_PROCESSED_PATH=./data/processed
```
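Docker Compose picks up a `.env` file in the project directory automatically for variable substitution. Pipeline scripts can read the data paths the same way, falling back to the defaults when the variables are unset (a sketch; the actual scripts may read configuration differently):

```python
# Illustrative: resolve data paths from the optional .env variables.
import os
from pathlib import Path

RAW_PATH = Path(os.environ.get("DATA_RAW_PATH", "./data/raw"))
PROCESSED_PATH = Path(os.environ.get("DATA_PROCESSED_PATH", "./data/processed"))
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)
```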

### Docker Configuration
The `docker-compose.yml` includes:
- **Airflow Webserver**: Web UI for DAG management
- **Airflow Scheduler**: Executes DAGs
- **PostgreSQL**: Metadata database
- **Redis**: Celery backend (if using distributed execution)

## Troubleshooting
### Common Issues
1. **Port Conflicts**
```bash
# Check if ports are in use
lsof -i :8080 # Airflow
lsof -i :8501 # Streamlit
```

2. **Permission Issues**
```bash
# Fix file permissions
sudo chown -R $USER:$USER data/
chmod +x run_pipeline.sh start_airflow.sh
```

3. **Docker Issues**
```bash
# Clean up Docker
docker-compose down -v
docker system prune -f
```

4. **Data Loading Errors**
- Ensure JSON files are in `data/raw/`
- Check file permissions
- Verify JSON format is valid
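To confirm a raw file really is valid newline-delimited JSON before triggering the pipeline, a quick check like the following works (the file path is an example):

```python
# Illustrative check that a raw Yelp file parses as newline-delimited JSON.
import json

bad = 0
with open("data/raw/business.json", encoding="utf-8") as fh:
    for lineno, line in enumerate(fh, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad += 1
            print(f"line {lineno}: invalid JSON")
print(f"{bad} malformed lines found")
```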

### Logs and Debugging
```bash
# Airflow logs
docker-compose logs airflow-webserver
docker-compose logs airflow-scheduler

# Streamlit logs
streamlit run app.py --logger.level debug
```

## Performance Optimization
### Rust Transformations
- **Parallel Processing**: Multi-threaded data processing
- **Memory Efficiency**: Optimized for large datasets
- **Type Safety**: Compile-time error checking

### Data Storage
- **Parquet Format**: Columnar storage for fast queries
- **Compression**: Efficient data compression
- **Partitioning**: Optimized data partitioning
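For example, writing the business data partitioned by state lets per-state and per-city filters prune files instead of scanning the whole dataset (a sketch assuming a `state` column, which the Yelp business data provides, and the pyarrow engine):

```python
# Illustrative: write Parquet partitioned by state so filtered queries read fewer files.
import pandas as pd

business = pd.read_parquet("data/processed/business.parquet")

# Creates data/processed/business_by_state/state=XX/ directories (pyarrow engine required).
business.to_parquet(
    "data/processed/business_by_state",
    partition_cols=["state"],
    index=False,
)
```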

### Dashboard Performance
- **Caching**: Streamlit caching for faster loading
- **Lazy Loading**: Load data on demand
- **Optimized Queries**: Efficient data filtering

## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **Yelp Dataset**: For providing the open dataset
- **Apache Airflow**: For workflow orchestration
- **Rust**: For high-performance data processing
- **Streamlit**: For interactive data visualization

## Support
For questions and support:
- Create an issue on GitHub
- Check the documentation in each component directory
- Review the troubleshooting section above

---
**Happy Data Engineering!**