{"id":29728618,"url":"https://github.com/djirlic/fraud-detection-e2e-pipeline","last_synced_at":"2026-02-16T10:04:43.315Z","repository":{"id":305774711,"uuid":"1023898452","full_name":"Djirlic/fraud-detection-e2e-pipeline","owner":"Djirlic","description":"End-to-end analytical data pipeline for credit card fraud detection using AWS, Snowflake, dbt, Airflow, and Streamlit.","archived":false,"fork":false,"pushed_at":"2025-08-22T23:55:54.000Z","size":978,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-03T17:57:39.589Z","etag":null,"topics":["airflow","analytics","aws","data-engineering","data-pipeline","end-to-end","etl","fraud-detection","modern-data-stack","python","snowflake","streamlit"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Djirlic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-21T21:52:33.000Z","updated_at":"2025-08-22T23:55:57.000Z","dependencies_parsed_at":"2025-10-03T17:44:15.082Z","dependency_job_id":"467c8d40-acb6-489c-a238-06506cdcd964","html_url":"https://github.com/Djirlic/fraud-detection-e2e-pipeline","commit_stats":null,"previous_names":["djirlic/fraud-detection-e2e-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Djirlic/fraud-detection-e2e-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/D
jirlic%2Ffraud-detection-e2e-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Djirlic%2Ffraud-detection-e2e-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Djirlic%2Ffraud-detection-e2e-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Djirlic%2Ffraud-detection-e2e-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Djirlic","download_url":"https://codeload.github.com/Djirlic/fraud-detection-e2e-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Djirlic%2Ffraud-detection-e2e-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29505673,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-16T09:05:14.864Z","status":"ssl_error","status_checked_at":"2026-02-16T08:55:59.364Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","analytics","aws","data-engineering","data-pipeline","end-to-end","etl","fraud-detection","modern-data-stack","python","snowflake","streamlit"],"created_at":"2025-07-25T02:40:31.047Z","updated_at":"2026-02-16T10:04:43.279Z","avatar_url":"https://github.com/Djirlic.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS \u0026 Snowflake Batch Processing 
Pipeline for Credit Card Transactions\n\n# Introduction \u0026 Goals\n\nCloud-native data platforms are reshaping how companies build analytical systems. With a strong background in software engineering and a growing focus on data engineering, I created this end-to-end project to explore modern tools and best practices for building analytical data pipelines in the cloud.\n\nIn this project, I worked with a public credit card transaction dataset (sourced from Kaggle) to simulate a real-world fraud detection pipeline. The system is designed to process and transform raw transactional data using scalable, modular components - from ingestion through transformation and orchestration, all the way to a business-facing dashboard.\n\nThe goal was to build a production-style analytical pipeline, using best practices such as CI/CD, testing, and modular design and not just to move data, but to do it the right way.\n\n# Contents  \n\n- [1. Architecture Diagram \u0026 Data Flow](#1-architecture-diagram--data-flow)\n  - [Architecture Diagram](#architecture-diagram)\n  - [Data Flow](#data-flow)\n- [2. Production-grade Engineering](#2-production-grade-engineering)\n- [3. Project Repositories](#3-project-repositories)\n  - [Amazon Web Services / Ingestion](#amazon-web-services--ingestion)\n  - [Snowflake / Storage](#snowflake--storage)\n  - [dbt / Transformation](#dbt--transformation)\n  - [Airflow / Orchestration](#airflow--orchestration)\n  - [Streamlit / Visualization](#streamlit--visualization)\n- [4. Visualizations](#4-visualizations)\n- [5. How to run](#5-how-to-run)\n- [6. Future Improvements](#6-future-improvements)\n- [7. Credits \u0026 Mentorship](#7-credits--mentorship)\n- [8. About me \u0026 Contact](#8-about-me--contact)\n\n# 1. 
Architecture Diagram \u0026 Data Flow\n\n## Architecture Diagram\n\n![A diagram of the current flow of data and tools as described in the section below.](images/tools-and-data-flow.png)\n\n## Data Flow\n\nThis project simulates a near-production data pipeline with minimal manual steps. From local ingestion to analytical visualization, each component is modular and automatically triggered through well-integrated AWS and SaaS services.\n\nWhile not yet packaged for infrastructure-as-code deployment, the system is built for scalability, modularity, and clarity. The Python scripts work out of the box, and each step is documented and testable.\n\nThe whole pipeline requires a single command to run from start to finish.\n\n### Dataset\n\nThe pipeline starts with a publicly available credit card fraud detection dataset from [Kaggle](https://www.kaggle.com/datasets/kartik2112/fraud-detection/data).\n\nTo simulate ongoing ingestion (e.g., daily or weekly loads), I created a Python CLI that splits large CSVs into one file per day based on a timestamp column.\n\n![A diagram of the step from Kaggle data to a split-up dataset.](images/split-dataset.png)\n\nAdditionally, I incorporated US geo-location data from the [U.S. Census Bureau](https://www.census.gov/programs-surveys/geography/data.html) to enrich transactions with state-level context.\n\n👉 https://github.com/Djirlic/raw-to-daily-splitter\n\n### Ingestion of the data\n\nThe ingestion into AWS S3 (Simple Storage Service) can be done manually. However, I preferred a more automated approach. There are several ways to do that with AWS. 
But as you can see in the Python script I've created, uploading files with presigned URLs requires no credentials in the script itself and can handle up to 5TB of data, which makes it a great alternative to API Gateway (max object size: 10MB).\n\n👉 https://github.com/Djirlic/s3-file-uploader-cli\n\n![A diagram of the step where the data gets validated and moved to a second (refined) bucket.](images/schema-validation.png)\n\n#### First schema validation\n\nOnce a file has been successfully uploaded to S3 (raw bucket), an AWS Lambda function is triggered automatically. This function uses [Polars](https://github.com/pola-rs/polars) to check the schema and decide how to proceed with the file.\n\nPolars is a great alternative to the well-known Pandas package because of its better performance.\n\nIf the file is validated successfully, it is moved to another S3 bucket (refined) and processed further.\n\nIf validation fails, the file is still moved into the refined bucket, but into the quarantine directory. \n\nIn both cases, logs are written to record both failures and successfully processed files.\n\n👉 https://github.com/Djirlic/raw-transactions-handler\n\n![A diagram of the step where the data is uploaded to an AWS S3 bucket via a presigned URL.](images/upload-via-presigned-url.png)\n\n#### Upload to Snowflake via Snowpipe\n\nTo further process the data, Snowpipe automatically detects new files in the success directory of the refined S3 bucket. The new data is then loaded into Snowflake.\n\nSnowflake serves as the central data warehouse, receiving data via Snowpipe and storing all processed layers (bronze, silver, gold) for analytical access.\n\nThe detection follows the path recommended by Snowflake, which uses AWS EventBridge, AWS SNS (Simple Notification Service), and AWS SQS (Simple Queue Service). 
Additionally, I decided to protect the messages sent through an SNS topic by encrypting them with KMS (Key Management Service).\n\n![A diagram of the step where the data moves from the refined AWS S3 bucket to Snowflake via Snowpipe. File uploads are recognized by EventBridge and sent as a message of an SNS topic through an SQS queue.](images/s3-to-snowflake.png)\n\n### Orchestration\n\nThe Airflow orchestration currently consists of a single DAG (Directed Acyclic Graph) with three tasks. Every 10 minutes, Airflow checks a Snowflake Stream (append-only) on the credit card transactions source table for new entries.\n\nIf the Sensor detects new data, the second task starts and triggers a dbt Cloud job ([see Transformation](#transformation)).\n\nOnce the second task completes successfully, a final task advances the Stream. Without this step, the Stream would continue reporting the same unconsumed rows, and Airflow would re-run dbt unnecessarily. \n\nAirflow runs on the Astronomer platform.\n\n👉 https://github.com/Djirlic/airflow-credit-card-transactions\n\n![A diagram of the steps where the data gets transformed from the bronze layer to silver and gold layers with dbt. This step is orchestrated with Airflow.](images/airflow-dbt-medallion.png)\n\n### Transformation\n\nThe transformation is done in dbt, which runs on the dbt Cloud platform. The data passes through several steps to be refined further and made ready for analytical use cases. \n\nThe medallion architecture includes these layers:\n\n- Bronze: Raw credit card data from Snowflake \u0026 geodata for US states.\n- Silver: Enriched data (e.g. information about merchant locations, age of the cardholder at the time of the transaction, etc.).\n- Gold: Final data that can be used for visualization use cases (e.g. fraud during day vs. night, fraud by age group, etc.).\n\nThe data is tested (data tests) at each step, and full data lineage from source to gold layer is supported. 
I also documented all models and fields.\n\n👉 https://github.com/Djirlic/cc-transactions-transformer\n\n### Visualization\n\nWhile Streamlit in Snowflake has some limitations, it enables fast, embedded visualization without external tooling.\n\nAn example of the visualization can be seen below. For more examples look into [4. Visualizations](#4-visualizations).\n\n![A simple visualization via a bar chart with Streamlit of the states with the most transactions](images/steamlit-simple.png)\n\n👉 https://github.com/Djirlic/cc-transactions-streamlit\n\n![A diagram of the the step where the data gets consumed and visualized by Streamlit from the gold layer tables.](images/gold-to-streamlit.png)\n\n# 2. Production grade engineering\n\nThroughout the pipeline, I applied production-grade engineering practices to ensure security, maintainability, and reliability:\n\n- **Least-privilege access control** was enforced across all components — each AWS, Snowflake, Streamlit, dbt, and Airflow interaction was handled via a dedicated user or role with only the necessary permissions.\n- **Secrets and credentials** were securely managed using GitHub environment secrets and never hardcoded or shared within the codebase.\n- All **Python code is fully linted, formatted, and tested** via CI/CD pipelines using tools like [black](https://github.com/psf/black), [flake8](https://github.com/pycqa/flake8), [mypy](https://github.com/python/mypy), [pytest](https://github.com/pytest-dev/pytest), and later [uv](https://github.com/astral-sh/uv).\n- **Project structure** was modular and reproducible, following clear separation of concerns between ingestion, transformation, orchestration, and visualization.\n- **Open-source tooling** was used where possible (e.g., Polars, dbt, Streamlit), ensuring transparency and extensibility.\n- **Schema validation and logging** were built in from the start, including structured handling of failed ingestion attempts through Dead Letter Queues (DLQs) and quarantine 
directories.\n- **Documentation** was written for every component to make onboarding, understanding, and extension easier — with a global README tying it all together.\n- **High test coverage**, with most components reaching close to 100%. This ensures confidence in core logic and easier future refactoring.\n- **Fault isolation through quarantine directories**, allowing malformed or invalid data to be flagged, investigated, and optionally reprocessed manually.\n- **CI/CD pipelines for multiple components**, enforcing code quality through automated linting, formatting, and testing. For the AWS Lambda function, deployment is fully automated via Docker—including all third-party dependencies bundled into a Lambda Layer to avoid duplicated dependency definitions.\n\n# 3. Project Repositories\n\n## Amazon Web Services / Ingestion\n- [Raw-to-daily-splitter](https://github.com/Djirlic/raw-to-daily-splitter): A Python CLI tool to split a raw CSV file into daily CSV files, one per date by a given date column.\n- [s3-file-uploader-cli](https://github.com/Djirlic/s3-file-uploader-cli): A Python CLI tool to upload files to an AWS S3 bucket using a presigned URL. Designed for data engineering workflows, automation pipelines, and robust CLI-based file ingestion.\n- [Raw Transaction Handler](https://github.com/Djirlic/raw-transactions-handler): A serverless data ingestion and transformation pipeline using AWS Lambda, designed to validate, transform, and route raw financial transaction data to a refined data bucket. 
The pipeline also maintains structured logs for successful and failed processing attempts.\n\n## dbt / Transformation\n\n- [Creditcard Transaction Transformer](https://github.com/Djirlic/cc-transactions-transformer): Transforms raw data from a connected warehouse (in my case Snowflake) into bronze, silver, and gold layers (medallion architecture).\n\n## Airflow / Orchestration\n\n- [Orchestration with a DAG in Airflow](https://github.com/Djirlic/airflow-credit-card-transactions).\n\n## Streamlit / Visualization\n\n- [Streamlit Visualization](https://github.com/Djirlic/cc-transactions-streamlit).\n\n# 4. Visualizations\n\nThe final pipeline output is a simple Streamlit dashboard to help answer key analytical questions, such as:\n\n- Which age group has the highest transaction volume?\n- Which age group is most likely to experience fraudulent transactions?\n- On which day of the week does fraud most frequently occur?\n- Are night-time transactions more likely to be fraudulent?\n- Which U.S. states report the highest number of fraud cases?\n\nAll this information can help the business decide how to adapt the algorithm that labels transactions as fraud.\n\nBelow are example screenshots of the Streamlit dashboard visualizing answers to these questions:\n\n\u003e [!NOTE]  \n\u003e The following screenshots were taken after uploading data for a single day. 
Final visualizations based on the full dataset may show different patterns and results.\n\n\n![The entry of the Streamlit dashboard.](images/dashboard-fraud-analysis.png)\n_Entry point of the Streamlit dashboard._\n\n![Total transaction volume by age group.](images/fraud-by-age-group.png)\n_Total transaction volume by age group._\n\n![Total transaction volume by merchant location (state).](images/fraud-by-merchant-location.png)\n_Total transaction volume by merchant location (state)._\n\n![A map of the United States with fraud cases highlighted.](images/fraud-merchant-location-map.png)\n_A map of the states with fraud cases highlighted._\n\n# 5. How to Run\n\nThis project is not designed to be run with a single command, and currently does not include infrastructure as code for full environment setup. \n\nHowever, the Python scripts are ready to use with minimal setup (mainly an AWS account and S3 bucket).\n\nPlease refer to each sub-repository for detailed setup and usage information.\n\n# 6. 
Future Improvements\n\nWhile this project successfully meets its initial scope and delivers a complete end-to-end pipeline, there are several areas worth exploring to deepen my expertise and bring the system closer to a real-world production setup:\n\n- **Extract the visualization from Streamlit in Snowflake (SiS)**\u003c/br\u003e\nExtracting visualizations from SiS into a standalone Streamlit instance or another visualization tool would open up options, such as heatmaps for merchant locations, that were not possible due to limitations in SiS.\n\n- **Natural Language Querying via AWS Bedrock (GenAI)**\u003c/br\u003e\nIntegrating a GenAI interface to allow stakeholders to query data using natural language prompts would significantly increase accessibility and open up self-service analytics.\n\n- **Enhanced Observability \u0026 Monitoring**\u003c/br\u003e\nWhile dead-letter queues (DLQs) and logging are already in place, adding alerting, metrics dashboards, and structured traceability would make the system easier to debug and maintain.\n\n- **Incremental Models in dbt**\u003c/br\u003e \nSwitching from full refreshes to incremental model builds in dbt would reduce compute cost and improve processing efficiency, especially at scale.\n\n- **Infrastructure as Code with AWS CDK**\u003c/br\u003e\nDefining the infrastructure using the AWS Cloud Development Kit (CDK) would ensure reproducibility and make environment setup more robust and automated.\n\n- **CI/CD for dbt and Airflow**\u003c/br\u003e\nAdd GitHub Actions to validate and deploy dbt models and DAGs, bringing full automation to the workflow.\n\n- **Data Quality Monitoring (e.g., Great Expectations or Soda)**\u003c/br\u003e\nIntegrate a framework for validating data contracts and schema changes in production.\n\n- **Cost Monitoring (Snowflake)**\u003c/br\u003e\nAdd cost tracking and warehouse usage reporting to gain better visibility into pipeline efficiency.\n\n# 7. 
Credits \u0026 Mentorship\n\nThis project was created as part of my mentorship through [LearnDataEngineering](https://learndataengineering.com).\n\nA special thanks to [Andreas Kretz](https://github.com/team-data-science) and [Arockia Nirmal Amala Doss](https://github.com/arockianirmal26) for their valuable guidance and recommendations throughout this project. Their real-world experience and feedback played a key role in shaping both the technical and architectural decisions.\n\n# 8. About Me \u0026 Contact\n\nI'm a freelance data engineer based in Germany, experienced in building modern data platforms with:\n\n- Databricks\n- Microsoft Fabric\n- Azure\n- Clickhouse\n- AWS\n- Snowflake\n- dbt\n- Airflow\n- Streamlit\n\nWith over 10 years of experience as a software engineer, I've developed production-grade applications for major German companies such as [Fressnapf](https://www.fressnapf.de/), leading social lotteries, automotive manufacturers, and industrial clients.\n\nIf you are looking for someone who can drive innovation, build your data platform from the ground up, or take your existing stack to the next level, feel free to get in touch:\n\n📧 [Email](freelance(at)djirlic.com)\u003c/br\u003e\n🔗 [LinkedIn](https://www.linkedin.com/in/djirlic/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjirlic%2Ffraud-detection-e2e-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdjirlic%2Ffraud-detection-e2e-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjirlic%2Ffraud-detection-e2e-pipeline/lists"}