{"id":51274783,"url":"https://github.com/reza-saeedi-coding/aws-financial-operations-data-pipeline","last_synced_at":"2026-06-29T20:30:27.774Z","repository":{"id":363205504,"uuid":"1257093991","full_name":"reza-saeedi-coding/aws-financial-operations-data-pipeline","owner":"reza-saeedi-coding","description":"End-to-end cloud data engineering pipeline for financial operations analytics using Python, Amazon S3, AWS Glue, Athena, Lambda, CloudWatch, Docker, and GitHub Actions.","archived":false,"fork":false,"pushed_at":"2026-06-07T21:34:25.000Z","size":4022,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-07T23:20:30.931Z","etag":null,"topics":["amazon-athena","amazon-s3","aws","aws-glue","aws-lambda","cloudwatch","data-engineering","data-pipeline","docker","etl","github-actions","parquet","portfolio-project","python","sql","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/reza-saeedi-coding.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-02T11:05:51.000Z","updated_at":"2026-06-07T21:54:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/reza-saeedi-coding/aws-financial-operations-data-pipeline","commit_stats":null,"previous_names":["reza-saeedi-coding/aws-financial-operations-data-pipeline"],"tags_count":null,"template":false,"template_full_na
me":null,"purl":"pkg:github/reza-saeedi-coding/aws-financial-operations-data-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reza-saeedi-coding%2Faws-financial-operations-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reza-saeedi-coding%2Faws-financial-operations-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reza-saeedi-coding%2Faws-financial-operations-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reza-saeedi-coding%2Faws-financial-operations-data-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/reza-saeedi-coding","download_url":"https://codeload.github.com/reza-saeedi-coding/aws-financial-operations-data-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reza-saeedi-coding%2Faws-financial-operations-data-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34942665,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazon-athena","amazon-s3","aws","aws-glue","aws-lambda","cloudwatch","data-engineering","data-pipeline","docker","etl","github-actions","parqu
et","portfolio-project","python","sql","streamlit"],"created_at":"2026-06-29T20:30:25.497Z","updated_at":"2026-06-29T20:30:27.766Z","avatar_url":"https://github.com/reza-saeedi-coding.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS Financial Operations Data Pipeline\n\nCloud Data Pipeline for Business Operations Analytics\n\n## Overview\n\nThis project is an end-to-end data engineering pipeline that simulates how a small business collects, validates, cleans, stores, and analyzes financial operations data.\n\nThe pipeline uses realistic synthetic data for customers, invoices, payments, and expenses. It demonstrates a data lake-style workflow with raw data, validation, quarantine handling, processed datasets, curated Parquet outputs, business metrics, SQL analytics, automated tests, Docker support, a Streamlit dashboard, and an AWS cloud analytics extension.\n\nThe project first runs as a complete local data engineering pipeline. Its output layers are then uploaded to Amazon S3 and queried through Amazon Athena using external tables registered in the AWS Glue Data Catalog.\n\nThe AWS extension was later strengthened with least-privilege IAM, Python-based S3 upload automation, an AWS Glue ETL job, AWS Lambda orchestration, S3 event triggering, CloudWatch logging, and a safe Terraform / Infrastructure as Code layer.\n\n## Pipeline Flow\n\n```text\nRaw CSV data\n-\u003e Raw data validation\n-\u003e Rejected records saved to quarantine\n-\u003e Data cleaning\n-\u003e Processed CSV layer\n-\u003e Processed data validation\n-\u003e Raw and processed quality reports\n-\u003e Pipeline run summary\n-\u003e Partitioned curated Parquet layer\n-\u003e Business metrics\n-\u003e Streamlit dashboard\n-\u003e Amazon S3 data lake upload\n-\u003e AWS Glue Data Catalog external tables\n-\u003e Amazon Athena SQL analytics\n```\n\n## AWS Phase 2 Event-Driven Flow\n\n```text\nLocal project outputs\n-\u003e Python upload_to_s3.py\n-\u003e 
Amazon S3 raw layer\n-\u003e S3 ObjectCreated event on raw/invoices.csv\n-\u003e AWS Lambda trigger\n-\u003e AWS Glue Job\n-\u003e Partitioned Parquet output in S3 curated layer\n-\u003e CloudWatch Logs monitoring\n```\n\n## Datasets\n\nThe project uses four synthetic financial operations datasets:\n\n* `customers.csv`\n* `invoices.csv`\n* `payments.csv`\n* `expenses.csv`\n\n## Data Layers\n\nThe project generates and maintains the following local data layers:\n\n```text\ndata/raw/\ndata/processed/\ndata/quarantine/\ndata/quality_reports/\ndata/reports/\ndata/curated/\n```\n\nThese local layers are mirrored in Amazon S3 as a cloud data lake layout.\n\n## Current Features\n\n* Synthetic financial data generation for customers, invoices, payments, and expenses\n* Intentional dirty data injection for testing data quality handling\n* Raw data validation with rejected records preserved in a quarantine layer\n* Data cleaning and processed CSV output generation\n* Raw and processed data quality reports\n* Pipeline run summary with run ID, timestamp, row counts, rejected row counts, quality status, and final pipeline status\n* Referential integrity checks between customers, invoices, and payments\n* Amount, status, category, date, and duplicate validation\n* Curated Parquet output layer\n* Partitioned Parquet datasets for invoices, payments, and expenses by year and month\n* Business KPI generation from curated data\n* Athena-style SQL business queries\n* Local Streamlit dashboard for financial operations analytics\n* Pytest test suite for pipeline outputs and processed data quality\n* GitHub Actions CI for automated testing\n* PowerShell task runner for common local commands\n* Docker and Docker Compose support for reproducible local execution\n* Amazon S3 data lake storage for local pipeline outputs\n* AWS Glue Data Catalog metadata registration through Athena external tables\n* Amazon Athena SQL querying over curated Parquet data in S3\n* Least-privilege IAM setup for 
project AWS access\n* Python S3 upload automation using `boto3`\n* Secure local configuration pattern using `.env.example`\n* AWS Glue job for transforming raw invoice CSV data into partitioned Parquet\n* AWS Lambda function for starting the Glue job\n* S3 event trigger for event-driven Lambda orchestration\n* CloudWatch logs for Glue and Lambda monitoring\n* AWS evidence screenshots for Athena queries, IAM setup, S3 automation, Glue execution, Lambda triggering, and CloudWatch logs\n* Terraform / Infrastructure as Code structure with provider configuration, variables, outputs, read-only Glue/Athena references, and import-ready templates\n\n## Technologies\n\n* Python\n* pandas\n* Faker\n* PyArrow / Parquet\n* Streamlit\n* pytest\n* SQL\n* PowerShell\n* Docker\n* Docker Compose\n* GitHub Actions\n* boto3\n* python-dotenv\n* Amazon S3\n* AWS IAM\n* AWS Glue Data Catalog\n* AWS Glue Jobs\n* Amazon Athena\n* AWS Lambda\n* Amazon CloudWatch Logs\n* Terraform\n\n## Important Outputs\n\nProcessed datasets:\n\n```text\ndata/processed/customers_cleaned.csv\ndata/processed/invoices_cleaned.csv\ndata/processed/payments_cleaned.csv\ndata/processed/expenses_cleaned.csv\n```\n\nQuarantine datasets:\n\n```text\ndata/quarantine/customers_rejected.csv\ndata/quarantine/invoices_rejected.csv\ndata/quarantine/payments_rejected.csv\ndata/quarantine/expenses_rejected.csv\n```\n\nQuality and observability outputs:\n\n```text\ndata/quality_reports/raw_quality_report.csv\ndata/quality_reports/processed_quality_report.csv\ndata/reports/pipeline_run_summary.json\n```\n\nCurated Parquet outputs:\n\n```text\ndata/curated/customers/customers.parquet\ndata/curated/invoices/year=*/month=*/*.parquet\ndata/curated/payments/year=*/month=*/*.parquet\ndata/curated/expenses/year=*/month=*/*.parquet\n```\n\nAWS Glue generated Parquet output:\n\n```text\ns3://aws-finops-reza-saeedi-20260603/curated/glue/invoices/\n```\n\nBusiness metrics:\n\n```text\ndata/curated/reports/business_metrics.csv\n```\n\n## 
Business Metrics\n\nThe dashboard and reporting layer include financial operations KPIs such as:\n\n* Total customers\n* Total invoices\n* Total payments\n* Total expenses\n* Total invoiced amount\n* Total collected amount\n* Open invoice amount\n* Net cash flow\n\n## SQL Analytics\n\nThe project includes SQL queries for business analysis, including:\n\n* Monthly collected revenue\n* Monthly expenses\n* Monthly net cash flow\n* Open invoice amount\n* Overdue invoice amount\n* Invoice aging buckets\n* Payment delay analysis\n* Top customers\n* Expenses by category\n* Invoice-payment reconciliation\n\nLocal SQL queries are stored in:\n\n```text\nsql/business_queries.sql\n```\n\nAWS Athena SQL files are stored in:\n\n```text\nsql/aws/create_athena_tables.sql\nsql/aws/business_queries.sql\n```\n\n## AWS Cloud Extension\n\nThis project was extended from a local data engineering pipeline into an AWS-based cloud data lake workflow.\n\nThe local pipeline generates raw, processed, quarantine, quality report, business report, and curated Parquet outputs. 
These outputs are uploaded to Amazon S3 using a structured data lake layout.\n\n### AWS Services Used\n\n* **Amazon S3**: Used as the cloud data lake storage layer.\n* **AWS IAM**: Used for least-privilege access control.\n* **AWS Glue Data Catalog**: Used as the metadata catalog for Athena external tables.\n* **AWS Glue Jobs**: Used for cloud ETL processing from raw CSV to partitioned Parquet.\n* **Amazon Athena**: Used to query curated Parquet datasets directly from S3 using SQL.\n* **AWS Lambda**: Used for lightweight orchestration of the Glue job.\n* **Amazon CloudWatch Logs**: Used for job and function execution monitoring.\n\n### S3 Data Lake Layout\n\n```text\ns3://aws-finops-reza-saeedi-20260603/raw/\ns3://aws-finops-reza-saeedi-20260603/processed/\ns3://aws-finops-reza-saeedi-20260603/quarantine/\ns3://aws-finops-reza-saeedi-20260603/quality_reports/\ns3://aws-finops-reza-saeedi-20260603/reports/\ns3://aws-finops-reza-saeedi-20260603/curated/\ns3://aws-finops-reza-saeedi-20260603/athena-results/\ns3://aws-finops-reza-saeedi-20260603/scripts/glue_jobs/\n```\n\n### Athena Database\n\nThe Athena database used for the project is:\n\n```text\nfinancial_ops_db\n```\n\nThis database stores metadata only. 
The actual data remains in Amazon S3.\n\n### Athena External Tables\n\nThe following external tables were created in Athena under the `financial_ops_db` database:\n\n* `customers`\n* `invoices`\n* `payments`\n* `expenses`\n* `business_metrics`\n\nThe `invoices`, `payments`, and `expenses` tables are partitioned by:\n\n```text\nyear\nmonth\n```\n\nPartition metadata was registered in Athena using:\n\n```sql\nMSCK REPAIR TABLE financial_ops_db.invoices;\nMSCK REPAIR TABLE financial_ops_db.payments;\nMSCK REPAIR TABLE financial_ops_db.expenses;\n```\n\n### Athena Business Queries\n\nBusiness queries were executed in Athena to analyze:\n\n* Monthly invoice totals\n* Payment status summary\n* Monthly cash flow\n* Invoice status summary\n\nEvidence screenshots are stored in:\n\n```text\ndocs/aws_evidence/monthly_invoice_totals.png\ndocs/aws_evidence/payment_status_summary.png\ndocs/aws_evidence/monthly_cash_flow.png\ndocs/aws_evidence/invoice_status_summary.png\n```\n\n### AWS Query Evidence\n\nThe AWS evidence screenshots show Athena query results, query completion status, data scanned, and the Athena/Frankfurt region context.\n\nThese screenshots demonstrate that the curated Parquet data was successfully queried from Amazon S3 through Athena.\n\n## Phase 2 - AWS Automation and Orchestration\n\nPhase 2 extends the AWS version of the project from a manually configured cloud data lake into a more automated and event-driven data engineering workflow.\n\n### Implemented AWS Phase 2 Components\n\n* Least-privilege IAM setup for project access\n* Dedicated IAM user and group for local S3 automation\n* Dedicated IAM role for AWS Glue job execution\n* Dedicated IAM role for AWS Lambda orchestration\n* Secure `.env.example` configuration pattern\n* Python-based S3 upload automation using `boto3`\n* AWS Glue job for transforming raw invoice CSV data into partitioned Parquet\n* CloudWatch Logs monitoring for Glue job execution\n* AWS Lambda function for triggering the Glue job\n* 
S3 event trigger for starting the Lambda function when `raw/invoices.csv` is uploaded or overwritten\n\n### AWS Phase 2 Flow\n\n```text\nLocal project outputs\n        |\n        v\nPython upload_to_s3.py\n        |\n        v\nAmazon S3 raw layer\n        |\n        v\nS3 ObjectCreated event\n        |\n        v\nAWS Lambda\n        |\n        v\nAWS Glue Job\n        |\n        v\nPartitioned Parquet in S3 curated layer\n        |\n        v\nCloudWatch Logs monitoring\n```\n\n### Main Scripts\n\n```text\nscripts/upload_to_s3.py\nscripts/glue_jobs/invoices_to_parquet_glue_job.py\nscripts/lambda_functions/trigger_glue_job.py\n```\n\n### Phase 2 Documentation\n\n```text\ndocs/aws_phase2_iam.md\ndocs/aws_phase2_s3_automation.md\ndocs/aws_phase2_glue_job.md\ndocs/aws_phase2_lambda_orchestration.md\n```\n\n### Phase 2 Evidence\n\nAWS evidence screenshots are stored in:\n\n```text\ndocs/aws_evidence/\n```\n\nThe evidence includes IAM setup, S3 upload automation, Glue job execution, CloudWatch logs, Lambda manual test, and S3-triggered Lambda orchestration.\n\nExample Phase 2 evidence files:\n\n```text\ndocs/aws_evidence/iam_group_policy_attached.png\ndocs/aws_evidence/iam_project_user_group_membership.png\ndocs/aws_evidence/s3_upload_automation_result.png\ndocs/aws_evidence/glue_role_s3_policy_attached.png\ndocs/aws_evidence/glue_script_uploaded_to_s3.png\ndocs/aws_evidence/glue_job_run_succeeded.png\ndocs/aws_evidence/glue_parquet_output_s3.png\ndocs/aws_evidence/cloudwatch_glue_job_logs.png\ndocs/aws_evidence/lambda_manual_test_started_glue_job.png\ndocs/aws_evidence/s3_trigger_lambda_logs.png\ndocs/aws_evidence/s3_trigger_glue_job_succeeded.png\n```\n\n### AWS Glue Job\n\nThe AWS Glue job is named:\n\n```text\nfinancial-ops-invoices-to-parquet-job\n```\n\nIt reads raw invoice data from:\n\n```text\ns3://aws-finops-reza-saeedi-20260603/raw/invoices.csv\n```\n\nIt writes partitioned Parquet output 
to:\n\n```text\ns3://aws-finops-reza-saeedi-20260603/curated/glue/invoices/\n```\n\nThe output is partitioned by:\n\n```text\nyear\nmonth\n```\n\nThe Glue job uses overwrite mode for its target path so repeated event-driven test runs do not duplicate analytical output.\n\n### AWS Lambda Orchestration\n\nThe Lambda function is named:\n\n```text\nfinancial-ops-trigger-glue-job\n```\n\nIt starts the Glue job when `raw/invoices.csv` is uploaded or overwritten in S3.\n\nThe Lambda function uses an IAM execution role and does not store AWS credentials in code.\n\n### Security Notes\n\nThis project avoids storing AWS credentials in source code.\n\nThe real `.env` file is ignored by Git and must not be committed.\n\nThe committed `.env.example` file contains only non-secret example configuration.\n\nIAM permissions are scoped to the project bucket, the project Glue job, and the required AWS services.\n\nScreenshots should not expose AWS account IDs, full ARNs, access keys, secret keys, emails, or unblurred account details.\n\n### Cost-Aware AWS Usage\n\nThis project intentionally uses low-cost, serverless or on-demand AWS services:\n\n* Amazon S3 for object storage\n* AWS Glue Data Catalog for metadata\n* Amazon Athena for serverless SQL queries\n* AWS Glue Jobs for on-demand ETL execution\n* AWS Lambda for event-driven orchestration\n* Amazon CloudWatch Logs for execution monitoring\n\nNo EC2, RDS, Redshift, EMR, or always-running compute services are required for the current AWS version.\n\nAthena queries are kept small and targeted, and curated analytical data is stored in Parquet with year/month partitioning to reduce scanned data.\n\n## Phase 3 - Terraform / Infrastructure as Code\n\nPhase 3 adds Terraform as a safe Infrastructure as Code layer for the existing AWS implementation.\n\nThe goal of this phase is not to rebuild the AWS environment from scratch. 
The core AWS resources were already created manually through the AWS Console during Phase 1 and Phase 2. Terraform is introduced gradually to document the infrastructure, reference existing resources safely, and prepare the project for future imports.\n\n### Implemented Terraform Components\n\n* Terraform project folder under `terraform/`\n* AWS provider configuration\n* Region and project variables\n* Safe outputs for project metadata\n* Read-only references to existing Glue/Athena tables\n* Import-ready documentation templates for:\n  * AWS Glue Job\n  * AWS Lambda function\n  * CloudWatch log groups\n  * S3 ObjectCreated orchestration flow\n\n### Terraform Folder Structure\n\n```text\nterraform/\n  README.md\n  versions.tf\n  providers.tf\n  variables.tf\n  outputs.tf\n  terraform.tfvars.example\n  s3.tf\n  glue_tables.tf\n  glue_job.tf\n  lambda.tf\n  cloudwatch.tf\n  orchestration.tf\n```\n\n### Validated Terraform Commands\n\nThe following Terraform commands were tested successfully:\n\n```powershell\nterraform init\nterraform fmt\nterraform validate\nterraform plan\n```\n\nThe current Terraform plan reads existing Glue/Athena metadata and shows output values without changing real AWS infrastructure.\n\n### Existing AWS Resources Referenced\n\nTerraform currently references or documents:\n\n* S3 bucket: `aws-finops-reza-saeedi-20260603`\n* Glue/Athena database: `financial_ops_db`\n* Glue/Athena tables:\n  * `customers`\n  * `invoices`\n  * `payments`\n  * `expenses`\n  * `business_metrics`\n* Glue job: `financial-ops-invoices-to-parquet-job`\n* Lambda function: `financial-ops-trigger-glue-job`\n* CloudWatch log retention: `14 days`\n* Event-driven workflow: S3 ObjectCreated trigger -\u003e Lambda -\u003e Glue Job\n\n### Terraform Safety Strategy\n\nTerraform is currently used in a non-destructive way.\n\nThe project avoids:\n\n* running `terraform apply`\n* deleting AWS resources\n* recreating existing AWS resources\n* committing Terraform state files\n* 
committing local `.tfvars` files\n* exposing AWS credentials, full ARNs, or AWS Account ID\n\nSome AWS resources already exist. Before Terraform actively manages them, they should either be imported with `terraform import` or left as documentation templates.\n\nExample future import command:\n\n```powershell\nterraform import aws_glue_job.invoices_to_parquet financial-ops-invoices-to-parquet-job\n```\n\nThis import command was documented but not executed during the current phase.\n\n### Phase 3 Documentation\n\nDetailed Phase 3 documentation is stored in:\n\n```text\ndocs/aws_phase3_terraform.md\n```\n\n## Local Execution\n\nThe project can be executed locally in multiple ways.\n\n### Run the full local pipeline with Python\n\n```powershell\npython scripts/run_local_pipeline.py\n```\n\n### Run tests locally\n\n```powershell\npytest\n```\n\n### Run the Streamlit dashboard locally\n\n```powershell\nstreamlit run dashboard/app.py\n```\n\nThen open:\n\n```text\nhttp://localhost:8501\n```\n\n## AWS S3 Upload Automation\n\nThe project includes a Python script for uploading local output folders to Amazon S3:\n\n```powershell\npython scripts/upload_to_s3.py\n```\n\nThe script reads local configuration from environment variables and a local `.env` file.\n\nThe `.env` file must not be committed to Git.\n\nThe safe template file is:\n\n```text\n.env.example\n```\n\n## PowerShell Task Runner\n\nThe project includes a PowerShell task runner for common development commands.\n\nInstall dependencies:\n\n```powershell\n.\\tasks.ps1 install\n```\n\nRun the pipeline:\n\n```powershell\n.\\tasks.ps1 pipeline\n```\n\nRun tests:\n\n```powershell\n.\\tasks.ps1 test\n```\n\nRun pipeline and tests:\n\n```powershell\n.\\tasks.ps1 check\n```\n\nRun the dashboard:\n\n```powershell\n.\\tasks.ps1 dashboard\n```\n\n## Docker Execution\n\nDocker is used to provide a reproducible local execution environment for the pipeline, tests, and dashboard.\n\nBuild the Docker image 
manually:\n\n```powershell\ndocker build -t financial-ops-pipeline .\n```\n\nRun the pipeline manually with Docker:\n\n```powershell\ndocker run --name financial-ops-run financial-ops-pipeline\n```\n\nView pipeline logs:\n\n```powershell\ndocker logs financial-ops-run\n```\n\n## Docker Compose Execution\n\nRun the pipeline with Docker Compose:\n\n```powershell\ndocker compose up pipeline\n```\n\nRun tests with Docker Compose:\n\n```powershell\ndocker compose up test\n```\n\nRun the dashboard with Docker Compose:\n\n```powershell\ndocker compose up dashboard\n```\n\nThen open:\n\n```text\nhttp://localhost:8501\n```\n\nThe `localhost` address is the correct address to use from the host machine. Other Streamlit network or external URLs printed by the container may not be reachable on Windows due to Docker networking, firewall, or router behavior.\n\n## Tests\n\nThe project includes pytest tests for pipeline output validation and processed data quality.\n\nCurrent test status:\n\n```text\n7 passed\n```\n\nTest files:\n\n```text\ntests/test_pipeline_outputs.py\ntests/test_processed_data_quality.py\n```\n\n## Project Status\n\nCurrent status:\n\n```text\nLocal data engineering pipeline complete\nDocker and Docker Compose execution complete\nAutomated tests passing\nGitHub Actions CI passing\nAWS S3 data lake structure created\nLocal pipeline outputs uploaded to S3\nAthena database and external tables created\nPartitioned Parquet tables queried with Athena\nBusiness query evidence saved\nIAM least-privilege setup completed\nPython S3 upload automation completed\nAWS Glue job created and executed successfully\nGlue output written as partitioned Parquet in S3\nCloudWatch logs verified for Glue execution\nLambda function created for Glue orchestration\nS3 trigger added for raw invoice upload events\nS3 event successfully triggered Lambda and started Glue job\nTerraform structure added for Infrastructure as Code\nTerraform init, fmt, validate, and plan tested 
successfully\nExisting Glue/Athena tables referenced safely with Terraform\n```\n\nThis project can now be described as a local data engineering pipeline extended with an AWS S3, Glue Data Catalog, Athena analytics, Glue ETL, Lambda orchestration, CloudWatch monitoring, and Terraform / Infrastructure as Code layer.\n\n## Next Steps\n\nPlanned next steps:\n\n* Optionally import selected existing AWS resources into Terraform state after careful review\n* Add deployment documentation for AWS resources and scripts\n* Add GitHub Actions deployment notes for future CI/CD extension\n* Add a final AWS cost cleanup checklist\n* Improve architecture diagrams for README and LinkedIn project presentation\n* Optionally add more Glue jobs for payments and expenses\n* Optionally add Athena queries over Glue-generated curated outputs\n\n## Project Goal\n\nThe goal of this project is to demonstrate practical junior data engineering skills, including:\n\n* Batch data ingestion\n* Data validation\n* Data cleaning\n* Quarantine handling for rejected records\n* Data quality reporting\n* Pipeline observability\n* Data lake-style layering\n* Parquet conversion\n* Partitioned analytical storage\n* SQL-based business analytics\n* Dashboard reporting\n* Automated testing\n* Dockerized local execution\n* Amazon S3 data lake storage\n* AWS IAM least-privilege access control\n* AWS Glue Data Catalog metadata management\n* Amazon Athena serverless SQL analytics\n* AWS Glue cloud ETL processing\n* AWS Lambda event-driven orchestration\n* Amazon CloudWatch logging and monitoring\n* Secure environment configuration using `.env.example`\n* Cost-aware cloud data engineering basics\n* Infrastructure as Code documentation and safe Terraform 
adoption\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freza-saeedi-coding%2Faws-financial-operations-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Freza-saeedi-coding%2Faws-financial-operations-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freza-saeedi-coding%2Faws-financial-operations-data-pipeline/lists"}