{"id":27648182,"url":"https://github.com/95xin/data-engineering-project---automatic-batch-data-processing","last_synced_at":"2026-04-28T08:01:44.484Z","repository":{"id":289141541,"uuid":"968707988","full_name":"95xin/Data-Engineering-Project---Automatic-Batch-Data-Processing","owner":"95xin","description":"Data Engineering Project - Automated Batch Data Processing","archived":false,"fork":false,"pushed_at":"2025-04-21T18:18:32.000Z","size":1020,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-24T02:40:42.048Z","etag":null,"topics":["airflow","bigquery","data-engineering","data-pipeline","data-schema","elt","postgresql-database","pyspark","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/95xin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-18T15:26:15.000Z","updated_at":"2025-04-21T18:18:36.000Z","dependencies_parsed_at":"2025-04-21T19:39:45.788Z","dependency_job_id":null,"html_url":"https://github.com/95xin/Data-Engineering-Project---Automatic-Batch-Data-Processing","commit_stats":null,"previous_names":["95xin/data-engineering-project---automatic-batch-data-processing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/95xin/Data-Engineering-Project---Automatic-Batch-Data-Processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/95xin%2FData-Engineering-Project---Automatic-Batch-Data-Processing","tags_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/95xin%2FData-Engineering-Project---Automatic-Batch-Data-Processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/95xin%2FData-Engineering-Project---Automatic-Batch-Data-Processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/95xin%2FData-Engineering-Project---Automatic-Batch-Data-Processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/95xin","download_url":"https://codeload.github.com/95xin/Data-Engineering-Project---Automatic-Batch-Data-Processing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/95xin%2FData-Engineering-Project---Automatic-Batch-Data-Processing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32371672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T20:07:02.737Z","status":"online","status_checked_at":"2026-04-28T02:00:07.250Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","bigquery","data-engineering","data-pipeline","data-schema","elt","postgresql-database","pyspark","python"],"created_at":"2025-04-24T02:34:13.819Z","updated_at":"2026-04-28T08:01:44.478Z","avatar_url":"https://github.com/95xin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineering Project - 
Automated Batch Data Processing\n\nThis project implements a complete end-to-end data processing pipeline for New York taxi data. The pipeline uses Airflow for orchestration, Spark for large-scale data processing, PostgreSQL as an intermediate storage layer, and BigQuery as the final data warehouse.\n\n## Project Architecture\n\nThe entire data pipeline follows this workflow:\n\n```\nData Ingestion → Spark Preprocessing → PostgreSQL Storage → Data Modeling → BigQuery Sync → Visualization\n```\n\nThe following image shows the Airflow DAG execution graph for the NYC taxi data pipeline:\n\n![NYC Taxi Pipeline DAG](images/nyc_taxi_pipeline_dag.png)\nThe image should show the workflow with tasks including:\n- download_data\n- spark_clean_data\n- load_to_postgres\n- check_data_quality\n- transform_in_postgres\n- upload_to_bq_from_spark\n- upload_to_bq_from_postgres \n\n## Technology Stack\n\n- **Apache Airflow**: Workflow orchestration tool\n- **Apache Spark**: Distributed data processing engine\n- **PostgreSQL**: Relational database used as intermediate storage\n- **Google BigQuery**: Cloud data warehouse\n- **Docker**: Containerized deployment\n- **Python**: Primary programming language\n- **Grafana/Looker Studio**: Data visualization tools (optional)\n\n## Project Structure\n\n```\n.\n├── config/             # Configuration files (cloud credentials, database connection info)\n├── dags/               # Airflow DAG definitions (NYC taxi data processing workflow)\n├── data/               # Data storage directory (raw and processed data)\n├── plugins/            # Airflow plugins (custom Operators, etc.)\n├── spark-apps/         # Spark applications (data cleaning and transformation logic)\n├── spark-data/         # Spark data files (processing results and temporary data)\n├── lab/                # Experimental code (proof of concepts and tests)\n├── docker-compose.yaml # Docker configuration (service definitions and dependencies)\n├── Dockerfile          # Docker image 
definition (environment setup)\n├── .env                # Environment variables (configuration parameters)\n└── README.md           # Project documentation\n```\n\n## Installation and Setup\n\n### Prerequisites\n\n- Docker and Docker Compose\n- Python 3.8+\n- Configured Google Cloud account (for BigQuery access)\n- PostgreSQL database (can be deployed via Docker)\n\n### Quick Start\n\n1. Clone the repository\n   ```bash\n   git clone https://github.com/95xin/Data-Engineering-Project---Automatic-Batch-Data-Processing.git\n   cd Data-Engineering-Project---Automatic-Batch-Data-Processing\n   ```\n\n2. Configure environment variables\n   ```bash\n   cp .env.example .env\n   # Edit the .env file with your configuration information\n   ```\n\n3. Add the Google Cloud credential file to the config directory\n   - Download the service account key JSON file from Google Cloud Console\n   - Place the file in the config/ directory and update the path in the .env file\n\n4. Start the environment\n   ```bash\n   docker-compose up -d\n   ```\n\n5. Access the Airflow web interface\n   ```\n   http://localhost:8080\n   ```\n   The default username and password can be found in the docker-compose.yaml file\n\n## Detailed Pipeline Stages\n\n### 1. Data Ingestion Stage\n\nThis stage uses Python's requests library to download the raw Parquet data and store it at a predefined path. The task ensures that fresh monthly data is available for further processing.\n\nExample code:\n```python\nimport requests\n\ndef download_data():\n    url = \"https://data-source.com/path/to/taxi_data.parquet\"\n    response = requests.get(url)\n    response.raise_for_status()  # Fail the task on HTTP errors instead of saving an error page\n    with open(\"/path/to/data/raw_taxi_data.parquet\", \"wb\") as f:\n        f.write(response.content)\n```\n\n### 2. 
Preprocessing with Spark\n\nThis stage uses SparkSubmitOperator to submit a PySpark job that cleans the data:\n- Remove rows with null values\n- Standardize datetime formats\n- Deduplicate records\n- Partition by month or date to allow multiple workers to process independently\n- Finally, use coalesce(1) to ensure downstream tasks handle only one file\n\nExample code:\n```python\n# Clean once and reuse the result for both outputs\ncleaned = df.dropna().dropDuplicates()\n\ncleaned.repartition(\"year\", \"month\") \\\n  .write.parquet(\"/path/to/partitioned/data/\")\n\n# Final output as a single CSV part file (Spark writes a directory at this path)\ncleaned.coalesce(1).write.csv(\"/path/to/output.csv\", header=True)\n```\n\n### 3. PostgreSQL Storage Stage (Staging Layer)\n\nThis stage loads the cleaned data into PostgreSQL, which acts as a controlled environment for further SQL transformations before the data is sent to BigQuery.\n\nThe load processes the CSV in chunks to avoid memory overload:\n```python\nimport pandas as pd\n\n# csv_file, chunksize, and engine (a SQLAlchemy engine) are defined elsewhere\ndf_iter = pd.read_csv(csv_file, chunksize=chunksize)\nfor chunk in df_iter:\n    # Clean column names: strip surrounding whitespace\n    chunk.columns = [col.strip() for col in chunk.columns]\n    # Convert strings to datetime; unparseable values become NaT\n    chunk[\"tpep_pickup_datetime\"] = pd.to_datetime(chunk[\"tpep_pickup_datetime\"], errors=\"coerce\")\n    chunk[\"tpep_dropoff_datetime\"] = pd.to_datetime(chunk[\"tpep_dropoff_datetime\"], errors=\"coerce\")\n    # Append the chunk to the target table\n    chunk.to_sql(\"table_name\", engine, if_exists=\"append\", index=False, method=\"multi\")\nprint(\"All chunks processed. PostgreSQL loading complete!\")\n```\n\nAfter loading large datasets, create indexes (B-Tree by default) on commonly filtered or joined columns:\n```sql\nCREATE INDEX idx_pickup_datetime ON taxi_data(tpep_pickup_datetime);\n```\n\n### 4. 
Data Modeling Phase\n\nIn this stage, we perform data modeling and enrichment:\n- Create a main table structure, specifying pickup_date as the partition key\n- Create sub-tables\n- Use SQL statements with CASE WHEN logic to enrich the dataset\n\n```sql\n-- Add time bucket column\nALTER TABLE taxi_data ADD COLUMN time_bucket VARCHAR(20);\n\n-- Populate values using CASE WHEN\nUPDATE taxi_data\nSET time_bucket = CASE\n    WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 7 AND 10 THEN 'Morning Rush'\n    WHEN EXTRACT(HOUR FROM tpep_pickup_datetime) BETWEEN 16 AND 19 THEN 'Evening Rush'\n    ELSE 'Other'\nEND;\n```\n\n### 5. BigQuery Synchronization Stage (Warehouse Layer)\n\nThe load job uses LoadJobConfig to enable schema autodetection and skip the header row during ingestion:\n\n```python\nfrom google.cloud import bigquery\n\nclient = bigquery.Client()\njob_config = bigquery.LoadJobConfig(\n    source_format=bigquery.SourceFormat.CSV,\n    skip_leading_rows=1,\n    autodetect=True,\n)\n\nwith open(csv_file, \"rb\") as source_file:\n    load_job = client.load_table_from_file(\n        source_file,\n        \"project.dataset.table\",\n        job_config=job_config\n    )\n\nload_job.result()  # Wait for the job to complete\n```\n\n### 6. 
Visualization Stage (Optional)\n\nGrafana or Looker Studio can connect to the BigQuery datasets to build dashboards for insights such as trip volume and average fare by time bucket.\n\nWhile this part is optional, it helps validate the entire pipeline by visually checking the processed results.\n\n## Monitoring and Maintenance\n\nThe project includes a monitoring layer to track pipeline health, task failures, data anomalies, and performance metrics. Failure emails and retries are configured through the DAG's default arguments:\n\n```python\nfrom datetime import timedelta\n\ndefault_args = {\n    'owner': 'airflow',\n    'email': ['your_email@example.com'],\n    'email_on_failure': True,\n    'retries': 1,\n    'retry_delay': timedelta(minutes=5),\n}\n```\n\nAdditionally, we use:\n- Airflow's task logs to analyze execution times\n- Spark UI to check stages with high shuffle read/write times and optimize accordingly\n\n## Incremental Loading Mechanism\n\nOn subsequent scheduled runs, the pipeline uploads new data through incremental loading. Only new or changed data is processed, which improves efficiency and reduces resource consumption.\n\n## Summary\n\nThis project covers a complete batch processing workflow from ingestion to warehouse loading. It demonstrates hands-on experience with orchestration (Airflow), distributed processing (Spark), relational storage (Postgres), cloud warehousing (BigQuery), and end-to-end pipeline design.\n\n## Contributing\n\nContributions and improvement suggestions are welcome! Please follow these steps:\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the branch (`git push origin feature/AmazingFeature`)\n5. 
Open a Pull Request\n\n## License\n\n[MIT](LICENSE) \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F95xin%2Fdata-engineering-project---automatic-batch-data-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F95xin%2Fdata-engineering-project---automatic-batch-data-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F95xin%2Fdata-engineering-project---automatic-batch-data-processing/lists"}