# 🛠️ GitHub Anomaly Detection Pipeline
## 💡 Motivation & Use Case
GitHub hosts an enormous amount of user activity, including pull requests, issues, forks, and stars. Monitoring this activity in real time is essential for identifying unusual or malicious behavior, such as bots, misuse, or suspicious spikes in contributions.
This project aims to build a **production-grade anomaly detection system** to:
- Detect abnormal GitHub user behavior (e.g., excessive PRs, bot-like stars)
- Alert maintainers and admins in real time via Slack or email
- Serve anomaly scores via API and support continuous retraining
- Visualize trends, drift, and recent activity using an interactive dashboard

---
A production-grade anomaly detection system for GitHub user behavior using:
- **Apache Airflow** for orchestration
- **Pandas + Scikit-learn (Isolation Forest)** for modeling and anomaly detection
- **Email & Slack** alerts for anomaly spikes and data drift
- **FastAPI** for real-time inference
- **Pytest, Black, Flake8** for testing and linting
- **Pre-commit + GitHub Actions** for CI/CD and code quality
- **Streamlit UI** for visualization
- **Terraform** for infrastructure-as-code provisioning (MLflow)
- **AWS S3** for optional cloud-based storage of features, models, and predictions

The full architecture of this GitHub anomaly detection pipeline is illustrated in the diagram below.

---
## 💤 Too lazy for copy-pasting commands?
If you're like me and hate typing out commands... good news!
Just use the **Makefile** to do all the boring stuff for you:

```bash
make help
```

See full Makefile usage [here](#makefile-usage), from setup to linting, testing, API, Airflow, and Terraform infra!
## 📦 Project Structure
```text
.
├── dags/ → Airflow DAGs for data pipeline and retraining
├── data/ → Input datasets (raw, features, processed)
├── models/ → Trained ML models (e.g., Isolation Forest)
├── mlruns/ → MLflow experiment tracking artifacts
├── infra/ → Terraform IaC for provisioning MLflow container
├── github_pipeline/ → Feature engineering, inference, monitoring scripts
├── tests/ → Pytest-based unit/integration tests
├── reports/ → Data drift reports (JSON/HTML) from Evidently
├── alerts/ → Alert log dumps (e.g., triggered drift/anomaly alerts)
├── notebooks/ → Jupyter notebooks for exploration & experimentation
├── assets/ → Images and architecture diagrams for README
├── .github/workflows/ → GitHub Actions CI/CD pipelines
├── streamlit_app.py → Realtime dashboard for monitoring
├── serve_model.py → FastAPI inference service
├── Dockerfile.* → Dockerfiles for API and Streamlit services
├── docker-compose.yaml → Compose file to run Airflow and supporting services
├── Makefile → Task automation: setup, test, Airflow, Terraform, etc.
├── requirements.txt → Python dependencies for Airflow containers
├── Pipfile / Pipfile.lock → Python project environment (via Pipenv)
├── .env → Environment variables (Slack, Email, Airflow UID, S3 support flag)
└── README.md → 📍 You are here
```

---
## ⚙️ Setup Instructions
### 1. Clone and install dependencies
```bash
git clone https://github.com/rajat116/github-anomaly-project.git
cd github-anomaly-project
pipenv install --dev
pipenv shell
```
### Or install using pip:

```bash
pip install -r requirements.txt
```

### 🔐 .env Configuration (Required)
Before running Airflow, you must create a `.env` file in the project root with at least the following content:
```env
AIRFLOW_UID=50000
USE_S3=false
```

This is required for Docker to set correct permissions inside the Airflow containers.
#### 🌐 USE_S3 Flag
Set this flag to control where your pipeline reads/writes files:
- USE_S3=false: All files will be stored locally (default, for development and testing)
- USE_S3=true: Files will be written to and read from AWS S3

**⚠️ Required when USE_S3=true**
If you enable S3 support, also provide your AWS credentials in the .env:
```env
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_REGION=us-east-1
S3_BUCKET_NAME=github-anomaly-logs
```

**💡 Tip for Contributors**
If you're testing locally or don't have AWS credentials, just keep:
```env
USE_S3=false
```
This disables all cloud storage usage and lets you run the full pipeline locally.
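As a rough illustration (not the project's actual code), here is a minimal sketch of how such a flag might be honored when reading or writing files; the `resolve_path` helper, the default bucket, and the file name are assumptions:

```python
# Hypothetical sketch: switch between local paths and s3:// URIs via USE_S3.
# resolve_path, the default bucket, and the file name are illustrative only.
import os

import pandas as pd

USE_S3 = os.getenv("USE_S3", "false").lower() == "true"
S3_BUCKET = os.getenv("S3_BUCKET_NAME", "github-anomaly-logs")


def resolve_path(relative_path: str) -> str:
    """Return an s3:// URI when USE_S3 is enabled, else the local path."""
    return f"s3://{S3_BUCKET}/{relative_path}" if USE_S3 else relative_path


# pandas reads and writes s3:// URIs transparently when s3fs is installed.
df = pd.read_parquet(resolve_path("data/features/latest_features.parquet"))
```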
#### Optional (For Email & Slack Alerts)
If you'd like to enable alerts, you can also include the following variables:
```env
# Slack Alerts
SLACK_API_TOKEN=xoxb-...
SLACK_CHANNEL=#your-channel

# Email Alerts
EMAIL_SENDER=your_email@example.com
EMAIL_PASSWORD=your_email_app_password
EMAIL_RECEIVER=receiver@example.com
EMAIL_SMTP=smtp.gmail.com
EMAIL_PORT=587
```
---

### 2. ⚙️ Airflow + 📊 MLflow Integration
This project uses Apache Airflow to orchestrate a real-time ML pipeline and MLflow to track model training, metrics, and artifacts.
#### 🚀 1. Start Airflow & MLflow via Docker
🛠️ Build & Launch
```bash
docker compose build airflow
docker compose up airflow
```

Once up, access:
- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000

#### ⏱️ 2. Airflow DAGs Overview
- daily_github_inference: Download → Feature Engineering → Inference (sketched below)
- daily_monitoring_dag: Drift checks, cleanup, alerting
- retraining_dag: Triggers model training weekly and logs it to MLflow
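For orientation, here is a minimal TaskFlow-style sketch of what daily_github_inference might look like; the task names and bodies are placeholders, not the repository's actual DAG code, and the `schedule` argument assumes Airflow 2.4+:

```python
# Hypothetical sketch of daily_github_inference using Airflow's TaskFlow API;
# task bodies are placeholders, not the project's real implementation.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 7, 1), catchup=False)
def daily_github_inference():
    @task
    def download_events():
        """Pull the latest GitHub event dump (placeholder)."""

    @task
    def build_features():
        """Aggregate raw events into actor-wise features (placeholder)."""

    @task
    def run_inference():
        """Score actors with the trained Isolation Forest (placeholder)."""

    download_events() >> build_features() >> run_inference()


daily_github_inference()
```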
#### 📊 3. MLflow Experiment Tracking
Model training is handled by:
```bash
github_pipeline/train_model.py
```

Each run logs the following:
✅ Parameters:
- timestamp → Training batch timestamp
- model_type → Algorithm used (IsolationForest)
- n_estimators → Number of trees

📈 Metrics
- mean_anomaly_score
- num_anomalies
- num_total
- anomaly_rate

📦 Artifacts
- isolation_forest.pkl → Trained model
- actor_predictions_.parquet
- MLflow Model Registry entry

All experiments are stored in the mlruns/ volume:
```yaml
volumes:
  - ./mlruns:/opt/airflow/mlruns
```
You can explore experiment runs and models in the MLflow UI.
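For a sense of what produces those entries, here is a minimal, hypothetical sketch of a run that logs the parameters, metrics, and model listed above; the feature path and hyperparameters are assumptions, not the exact contents of train_model.py:

```python
# Hypothetical sketch of a training run logging the params/metrics above;
# the feature path and hyperparameters are assumptions, not train_model.py.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.read_parquet("data/features/latest_features.parquet")  # assumed path

with mlflow.start_run():
    model = IsolationForest(n_estimators=100, random_state=42)
    preds = model.fit_predict(X)         # -1 = anomaly, 1 = normal
    scores = model.decision_function(X)  # lower = more anomalous

    mlflow.log_param("model_type", "IsolationForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mean_anomaly_score", float(scores.mean()))
    mlflow.log_metric("num_anomalies", int((preds == -1).sum()))
    mlflow.log_metric("num_total", int(len(preds)))
    mlflow.log_metric("anomaly_rate", float((preds == -1).mean()))
    mlflow.sklearn.log_model(model, "isolation_forest")
```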
### 3. 🧠 Model Training
The model (Isolation Forest) is trained on actor-wise event features:
```bash
python github_pipeline/train_model.py
```
The latest parquet file is used automatically. The model and scaler are saved to models/.

### 4. 🚀 FastAPI Inference
#### Build & Run
```bash
docker build -t github-anomaly-inference -f Dockerfile.inference .
docker run -p 8000:8000 github-anomaly-inference
```

#### Test the API
```bash
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"features": [12, 0, 1, 0, 4]}'
```
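For reference, a minimal sketch of what the /predict endpoint behind that curl call might look like; the response keys, feature order, and model path are assumptions, not the actual serve_model.py:

```python
# Hypothetical sketch of serve_model.py's /predict endpoint; the response
# keys and model artifact path are assumptions, not the actual service code.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/isolation_forest.pkl")  # assumed artifact path


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest):
    score = model.decision_function([req.features])[0]
    label = model.predict([req.features])[0]  # -1 = anomaly, 1 = normal
    return {"anomaly_score": float(score), "is_anomaly": bool(label == -1)}
```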
### 5. 📣 Alerts: Email & Slack
This project includes automated alerting mechanisms for anomaly spikes and data drift, integrated into the daily_monitoring_dag DAG.
#### ✅ Triggers for Alerts
- 🔺 Anomaly Rate Alert: If the anomaly rate exceeds a threshold (e.g. >10% of actors).
- 📉 Drift Detection Alert: If feature distributions change significantly over time (a minimal drift check is sketched below).
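As an illustration, a drift check along these lines could use Evidently's Report API (versions before 0.7); the reference/current paths are assumptions, not the project's actual monitor.py:

```python
# Hypothetical drift check with Evidently's (pre-0.7) Report API;
# file paths are illustrative, not the project's actual monitoring code.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.read_parquet("data/features/reference_features.parquet")
current = pd.read_parquet("data/features/latest_features.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/drift_report.html")

# The preset's first metric exposes an overall dataset_drift boolean.
drift_detected = report.as_dict()["metrics"][0]["result"]["dataset_drift"]
```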
#### 🔔 Notification Channels
- Email alerts (via smtplib)
- Slack alerts (via Slack Incoming Webhooks)

#### 🔧 Configuration
Set the following environment variables in your Airflow setup:
```bash
# .env or Airflow environment
ALERT_EMAIL_FROM=your_email@example.com
ALERT_EMAIL_TO=recipient@example.com
ALERT_EMAIL_PASSWORD=your_email_app_password
ALERT_EMAIL_SMTP=smtp.gmail.com
ALERT_EMAIL_PORT=587

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
```
🛡️ Email app passwords are recommended over actual passwords for Gmail or Outlook.

#### 📜 Alert Script
Logic is handled inside:
```bash
github_pipeline/monitor.py
alerts/alerting.py
```

These scripts generate alert messages and send them through email and Slack when thresholds are breached.
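A minimal sketch of that sending logic, assuming the environment variables from the configuration block above (not the project's actual alerting.py):

```python
# Hypothetical alert senders built on the env vars configured above;
# a sketch of the idea, not the project's actual alerting.py.
import os
import smtplib
from email.message import EmailMessage

import requests


def send_slack_alert(text: str) -> None:
    """Post an alert to a Slack incoming webhook."""
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)


def send_email_alert(subject: str, body: str) -> None:
    """Send an alert email over SMTP with STARTTLS."""
    msg = EmailMessage()
    msg["From"] = os.environ["ALERT_EMAIL_FROM"]
    msg["To"] = os.environ["ALERT_EMAIL_TO"]
    msg["Subject"] = subject
    msg.set_content(body)
    host = os.environ["ALERT_EMAIL_SMTP"]
    port = int(os.environ["ALERT_EMAIL_PORT"])
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(os.environ["ALERT_EMAIL_FROM"], os.environ["ALERT_EMAIL_PASSWORD"])
        smtp.send_message(msg)


anomaly_rate = 0.12  # example value from a monitoring run
if anomaly_rate > 0.10:  # the >10% threshold described above
    send_slack_alert(f"GitHub anomaly rate {anomaly_rate:.1%} exceeds 10%")
```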
### 6. ✅ CI/CD with GitHub Actions
The .github/workflows/ci.yml file runs on push:
- ✅ black --check
- ✅ flake8 (E501, W503 ignored)
- ✅ pytest
- ✅ (optional) Docker build

### 7. 📏 Code Quality
Pre-commit hooks ensure style and linting:
```bash
pre-commit install
pre-commit run --all-files
```

Configured via:
- .pre-commit-config.yaml
- .flake8 (ignore = E501)

### 8. 🧪 Testing
Run all tests:
```bash
PYTHONPATH=. pytest
```

Tests are in tests/ and cover:
- Inference API (serve_model.py)
- Feature engineering
- Model training logic
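As an example of the style, a minimal API test might look like the following; it assumes serve_model.py exposes `app` and a response shaped like the endpoint sketch in section 4:

```python
# Hypothetical test for the inference API using FastAPI's TestClient;
# the expected response key mirrors the earlier endpoint sketch (assumed).
from fastapi.testclient import TestClient

from serve_model import app

client = TestClient(app)


def test_predict_returns_score():
    resp = client.post("/predict", json={"features": [12, 0, 1, 0, 4]})
    assert resp.status_code == 200
    assert "anomaly_score" in resp.json()
```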
### 9. 📊 Streamlit Dashboard
The project includes an optional interactive Streamlit dashboard to visualize:
- ✅ Latest anomaly predictions
- 📉 Data drift metrics from the Evidently report
- 🧑‍💻 Top actors based on GitHub activity
- ⏱️ Activity summary over the last 48 hours

#### 🔧 How to Run Locally
Make sure you have installed all dependencies via Pipenv, then launch the Streamlit app:
```bash
streamlit run streamlit_app.py
```

Once it starts, open the dashboard in your browser at:
```bash
http://localhost:8501
```

The app will automatically load:
- The latest prediction file from data/features/
- The latest drift report from reports/

Note: If these files do not exist, the dashboard will show a warning or empty state. You can generate them by running the Airflow pipeline or the monitoring scripts manually.
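For a sense of how that loading might work, here is a minimal, hypothetical sketch; the glob pattern and layout are assumptions, not the actual streamlit_app.py:

```python
# Hypothetical sketch of loading the newest predictions into the dashboard;
# the file pattern and layout are assumptions, not streamlit_app.py itself.
from pathlib import Path

import pandas as pd
import streamlit as st

st.title("GitHub Anomaly Dashboard")

pred_files = sorted(Path("data/features").glob("*.parquet"))
if pred_files:
    latest = pred_files[-1]  # newest file, assuming sortable timestamped names
    st.subheader(f"Latest predictions: {latest.name}")
    st.dataframe(pd.read_parquet(latest).head(50))
else:
    st.warning("No prediction files found. Run the pipeline first.")
```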
#### 🐳 Optional: Run via Docker
You can also build and run the dashboard as a container (if desired):
Build the image:
```bash
docker build -t github-anomaly-dashboard -f Dockerfile.streamlit .
```

Run the container:
```bash
docker run -p 8501:8501 \
-v $(pwd)/data:/app/data \
-v $(pwd)/reports:/app/reports \
github-anomaly-dashboard
```

Then open your browser at http://localhost:8501.
### 10. ☁️ Infrastructure as Code (IaC): MLflow Server with Terraform
This Terraform module provisions a **Docker-based MLflow tracking server**, matching the setup used in `docker-compose.yaml`, but on a **different port (5050)** to avoid conflicts.
---
#### 📁 Directory Structure
- infra/main.tf # Terraform configuration
- README.md # This file

#### ⚙️ Requirements
- [Terraform](https://developer.hashicorp.com/terraform/downloads)
- [Docker](https://docs.docker.com/get-docker/)

#### 🚀 How to Use
##### 1. Navigate to the `infra/` folder
```bash
cd infra
```

##### 2. Initialize Terraform
```bash
terraform init
```

##### 3. Apply the infrastructure
```bash
terraform apply # Confirm with yes when prompted.
```

##### 4. 🔍 Verify
MLflow server will be available at:
```bash
http://localhost:5050
```

All artifacts will be stored in your project's mlruns/ directory.
##### 5. ❌ To Clean Up
```bash
terraform destroy
```

This removes the MLflow container provisioned by Terraform.
### 11. 🧹 Clean Code
All code follows:
- PEP8 formatting via Black
- Linting with Flake8 + Bugbear
- Pre-commit hook enforcement
### 12. 🛠️ Makefile Usage
This project includes a Makefile that simplifies formatting, testing, building Docker containers, and running Airflow or the FastAPI inference app.
You can run all commands with or without activating the Pipenv shell. For example:
```bash
make lint
```

#### 🔧 Setup Commands
```bash
make install # Install all dependencies via Pipenv (both runtime and dev)
make create-env # Create .env file with AIRFLOW_UID, alert placeholders, and S3 support flag
make clean # Remove all __pycache__ folders and .pyc files
```

#### 🧪 Code Quality & Testing
```bash
make format # Format code using Black
make lint # Lint code using Flake8
make test # Run tests using Pytest
make check # Run all of the above together
```

#### 📊 Streamlit Dashboard
```bash
make streamlit # Launch the Streamlit dashboard at http://localhost:8501
```

#### 🐳 FastAPI Inference App
```bash
make docker-build # Build the Docker image for FastAPI app
make docker-run # Run the Docker container on port 8000
make api-test # Send a test prediction request using curl
```

After running `make docker-run`, open another terminal and run `make api-test`.
#### ⏱️ Airflow Pipeline
```bash
make airflow-up # Start Airflow services (scheduler, UI, etc.)
make airflow-down   # Stop all Airflow containers
```

Once up, access:
- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000

#### MLflow Server with Terraform
```bash
make install-terraform # Install Terraform CLI if not present
make terraform-init # Initialize Terraform config
make terraform-apply # Provision MLflow container (port 5050)
make terraform-destroy # Tear down MLflow container
make terraform-status # Show current infra state
```

#### 📋 View All Commands
```bash
make help # Prints a summary of all available targets and their descriptions.
```

### 13. 🙌 Credits
Built by Rajat Gupta as part of an MLOps portfolio.
Inspired by real-time event pipelines and anomaly detection architectures used in production.

### 14. 📄 License