
![Banner](assets/banner.png)

# πŸ› οΈ GitHub Anomaly Detection Pipeline

## πŸ’‘ Motivation & Use Case

GitHub hosts an enormous amount of user activity, including pull requests, issues, forks, and stars. Monitoring this activity in real-time is essential for identifying unusual or malicious behavior β€” such as bots, misuse, or suspicious spikes in contributions.

This project aims to build a **production-grade anomaly detection system** to:

- Detect abnormal GitHub user behavior (e.g., excessive PRs, bot-like stars)
- Alert maintainers and admins in real time via Slack or email
- Serve anomaly scores via API and support continuous retraining
- Visualize trends, drift, and recent activity using an interactive dashboard

---

A production-grade anomaly detection system for GitHub user behavior, built with:

- **Apache Airflow** for orchestration
- **Pandas + Scikit-learn (Isolation Forest)** for modeling and anomaly detection
- **Email & Slack alerts** for anomaly spikes and data drift
- **FastAPI** for real-time inference
- **Pytest, Black, Flake8** for testing and linting
- **Pre-commit + GitHub Actions** for CI/CD and code quality
- **Streamlit UI** for visualization
- **Terraform** for infrastructure-as-code provisioning (MLflow)
- **AWS S3** for optional cloud-based storage of features, models, and predictions

The full architecture of this GitHub anomaly detection pipeline is illustrated in the diagram below.

![Architecture](assets/architecture.png)

---

## πŸ€– Too lazy for copy-pasting commands?

If you're like me and hate typing out commands... good news!
Just use the **Makefile** to do all the boring stuff for you:

```bash
make help
```

See full Makefile usage [here](#makefile-usage) β€” from setup to linting, testing, API, Airflow, and Terraform infra!

## πŸ“¦ Project Structure

```text
.
β”œβ”€β”€ dags/ ← Airflow DAGs for data pipeline and retraining
β”œβ”€β”€ data/ ← Input datasets (raw, features, processed)
β”œβ”€β”€ models/ ← Trained ML models (e.g., Isolation Forest)
β”œβ”€β”€ mlruns/ ← MLflow experiment tracking artifacts
β”œβ”€β”€ infra/ ← Terraform IaC for provisioning MLflow container
β”œβ”€β”€ github_pipeline/ ← Feature engineering, inference, monitoring scripts
β”œβ”€β”€ tests/ ← Pytest-based unit/integration tests
β”œβ”€β”€ reports/ ← Data drift reports (JSON/HTML) from Evidently
β”œβ”€β”€ alerts/ ← Alert log dumps (e.g., triggered drift/anomaly alerts)
β”œβ”€β”€ notebooks/ ← Jupyter notebooks for exploration & experimentation
β”œβ”€β”€ assets/ ← Images and architecture diagrams for README
β”œβ”€β”€ .github/workflows/ ← GitHub Actions CI/CD pipelines
β”œβ”€β”€ streamlit_app.py ← Realtime dashboard for monitoring
β”œβ”€β”€ serve_model.py ← FastAPI inference service
β”œβ”€β”€ Dockerfile.* ← Dockerfiles for API and Streamlit services
β”œβ”€β”€ docker-compose.yaml ← Compose file to run Airflow and supporting services
β”œβ”€β”€ Makefile ← Task automation: setup, test, Airflow, Terraform, etc.
β”œβ”€β”€ requirements.txt ← Python dependencies for Airflow containers
β”œβ”€β”€ Pipfile / Pipfile.lock ← Python project environment (via Pipenv)
β”œβ”€β”€ .env ← Environment variables (Slack, Email, Airflow UID, S3 support flag)
└── README.md ← πŸ“˜ You are here
```

---

## βš™οΈ Setup Instructions

### 1. Clone and install dependencies

```bash
git clone https://github.com/rajat116/github-anomaly-project.git
cd github-anomaly-project
pipenv install --dev
pipenv shell
```

Or install using pip:

```bash
pip install -r requirements.txt
```

### πŸ“„ .env Configuration (Required)

Before running Airflow, you must create a `.env` file in the project root with at least the following content:

```env
AIRFLOW_UID=50000
USE_S3=false
```

This is required so Docker can set the correct permissions inside the Airflow containers.

#### πŸ”„ USE_S3 Flag

Set this flag to control where your pipeline reads/writes files:

- USE_S3=false: All files will be stored locally (default, for development and testing)
- USE_S3=true: Files will be written to and read from AWS S3

**βœ… Required when `USE_S3=true`**

If you enable S3 support, also provide your AWS credentials in the .env:

```env
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_REGION=us-east-1
S3_BUCKET_NAME=github-anomaly-logs
```

**πŸ’‘ Tip for Contributors**

If you're testing locally or don't have AWS credentials, just keep:

```env
USE_S3=false
```
This will disable all cloud storage usage and allow you to run the full pipeline locally.
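
As a rough illustration of how such a flag is typically consumed (the helper name and paths below are hypothetical, not the project's exact code):

```python
# Hypothetical sketch of honoring the USE_S3 flag; the project's actual logic
# lives in the github_pipeline/ scripts and may differ.
import os

USE_S3 = os.getenv("USE_S3", "false").lower() == "true"
S3_BUCKET = os.getenv("S3_BUCKET_NAME", "github-anomaly-logs")


def resolve_path(relative_path: str) -> str:
    """Return an S3 URI when USE_S3=true, otherwise the local path."""
    return f"s3://{S3_BUCKET}/{relative_path}" if USE_S3 else relative_path


# pandas can read/write either location transparently (s3fs is required for S3):
# df.to_parquet(resolve_path("data/features/actor_features.parquet"))
```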

#### Optional (For Email & Slack Alerts)

If you'd like to enable alerts, you can also include the following variables:

```env
# Slack Alerts
SLACK_API_TOKEN=xoxb-...
SLACK_CHANNEL=#your-channel

# Email Alerts
EMAIL_SENDER=your_email@example.com
EMAIL_PASSWORD=your_email_app_password
EMAIL_RECEIVER=receiver@example.com
EMAIL_SMTP=smtp.gmail.com
EMAIL_PORT=587
```
---

### 2. βš™οΈ Airflow + πŸ“ˆ MLflow Integration

This project uses Apache Airflow to orchestrate a real-time ML pipeline and MLflow to track model training, metrics, and artifacts.

#### πŸš€ 1. Start Airflow & MLflow via Docker

πŸ› οΈ Build & Launch

```bash
docker compose build airflow
docker compose up airflow
```

Once up, access:

- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000

#### ⏱️ 2. Airflow DAGs Overview

- daily_github_inference: Download β†’ Feature Engineering β†’ Inference
- daily_monitoring_dag: Drift checks, cleanup, alerting
- retraining_dag: Triggers model training weekly and logs it to MLflow
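
For orientation, here is a minimal sketch of what the daily inference DAG might look like. The task callables and schedule are placeholders; the real definitions live in `dags/`.

```python
# Illustrative sketch only -- the actual DAGs live in dags/ and may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_events():
    """Placeholder for the GitHub event download step."""


def build_features():
    """Placeholder for actor-level feature engineering."""


def run_inference():
    """Placeholder for scoring actors with the trained Isolation Forest."""


with DAG(
    dag_id="daily_github_inference",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ style scheduling argument
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download", python_callable=download_events)
    features = PythonOperator(task_id="feature_engineering", python_callable=build_features)
    inference = PythonOperator(task_id="inference", python_callable=run_inference)

    # Download -> Feature Engineering -> Inference
    download >> features >> inference
```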

#### πŸ“ˆ 3. MLflow Experiment Tracking

Model training is handled by:

```bash
github_pipeline/train_model.py
```

Each run logs the following:

βœ… Parameters:

- timestamp β€” Training batch timestamp
- model_type β€” Algorithm used (IsolationForest)
- n_estimators β€” Number of trees

πŸ“Š Metrics

- mean_anomaly_score
- num_anomalies
- num_total
- anomaly_rate

πŸ“¦ Artifacts

- isolation_forest.pkl β€” Trained model
- actor_predictions_.parquet
- MLflow Model Registry entry
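
A minimal sketch of how these values might be logged with the MLflow client (parameter and metric names mirror the lists above; the actual calls live in `github_pipeline/train_model.py`):

```python
# Illustrative MLflow logging sketch; values are placeholders.
import mlflow

with mlflow.start_run(run_name="isolation_forest_training"):
    # Parameters
    mlflow.log_param("timestamp", "2024-01-01T00:00:00")
    mlflow.log_param("model_type", "IsolationForest")
    mlflow.log_param("n_estimators", 100)

    # Metrics
    mlflow.log_metric("mean_anomaly_score", -0.12)
    mlflow.log_metric("num_anomalies", 42)
    mlflow.log_metric("num_total", 1000)
    mlflow.log_metric("anomaly_rate", 0.042)

    # Artifacts (the file must exist locally before logging)
    mlflow.log_artifact("models/isolation_forest.pkl")
```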

All experiments are stored in the mlruns/ volume:

```yaml
volumes:
  - ./mlruns:/opt/airflow/mlruns
```
You can explore experiment runs and models in the MLflow UI.

### 3. 🧠 Model Training

The model (Isolation Forest) is trained on actor-wise event features:

```bash
python github_pipeline/train_model.py
```
The latest parquet file is used automatically. Model and scaler are saved to models/.
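
Under the hood this roughly amounts to the sketch below. Feature columns, hyperparameters, and file names are illustrative; see `github_pipeline/train_model.py` for the actual implementation.

```python
# Illustrative training sketch -- the real pipeline is github_pipeline/train_model.py.
import joblib
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load the latest actor-wise feature file (path is illustrative).
features = pd.read_parquet("data/features/actor_features.parquet")
X = features.select_dtypes("number")

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(X_scaled)

# Persist both the model and the scaler under models/, as the pipeline does.
joblib.dump(model, "models/isolation_forest.pkl")
joblib.dump(scaler, "models/scaler.pkl")
```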

### 4. πŸš€ FastAPI Inference

#### Build & Run

```bash
docker build -t github-anomaly-inference -f Dockerfile.inference .
docker run -p 8000:8000 github-anomaly-inference
```

#### Test the API

```bash
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"features": [12, 0, 1, 0, 4]}'
```
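
For reference, the `/predict` endpoint in `serve_model.py` follows roughly this shape (a hedged sketch; the response fields and model-loading details in the actual service may differ):

```python
# Illustrative sketch of the inference endpoint; see serve_model.py for the real service.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/isolation_forest.pkl")  # trained Isolation Forest


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictRequest):
    score = model.decision_function([request.features])[0]
    label = model.predict([request.features])[0]  # -1 = anomaly, 1 = normal
    return {"anomaly_score": float(score), "is_anomaly": bool(label == -1)}
```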

### 5. πŸ“£ Alerts: Email & Slack

This project includes automated alerting mechanisms for anomaly spikes and data drift, integrated into the daily_monitoring_dag DAG.

#### βœ… Triggers for Alerts

- πŸ”Ί Anomaly Rate Alert: If anomaly rate exceeds a threshold (e.g. >10% of actors).
- πŸ” Drift Detection Alert: If feature distributions change significantly over time.

#### πŸ”” Notification Channels

- Email alerts (via smtplib)
- Slack alerts (via Slack Incoming Webhooks)

#### πŸ”§ Configuration

Set the following environment variables in your Airflow setup:

```bash
# .env or Airflow environment
ALERT_EMAIL_FROM=your_email@example.com
ALERT_EMAIL_TO=recipient@example.com
ALERT_EMAIL_PASSWORD=your_email_app_password
ALERT_EMAIL_SMTP=smtp.gmail.com
ALERT_EMAIL_PORT=587

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
```
πŸ›‘οΈ Email app passwords are recommended over actual passwords for Gmail or Outlook.

#### πŸ“ Alert Script

Logic is handled inside:

```bash
github_pipeline/monitor.py
alerts/alerting.py
```

These generate alert messages and send them through email and Slack if thresholds are breached.
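
As a minimal sketch of the Slack side (the helper name is hypothetical; the project's own logic lives in `alerts/alerting.py`):

```python
# Hypothetical Slack alert helper; the project's implementation is in alerts/alerting.py.
import os

import requests


def send_slack_alert(message: str) -> None:
    """Post an alert message to the channel behind SLACK_WEBHOOK_URL."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    response = requests.post(webhook_url, json={"text": message}, timeout=10)
    response.raise_for_status()


# Example: called when the anomaly rate crosses the configured threshold.
# send_slack_alert("🚨 Anomaly rate exceeded 10% of actors in the latest batch.")
```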

### 6. βœ… CI/CD with GitHub Actions

The .github/workflows/ci.yml file runs on push:

- βœ… black --check
- βœ… flake8 (E501,W503 ignored)
- βœ… pytest
- βœ… (optional) Docker build

### 7. πŸ” Code Quality

Pre-commit hooks ensure style and linting:

```bash
pre-commit install
pre-commit run --all-files
```

Configured via:

- .pre-commit-config.yaml
- .flake8 (ignore = E501)

### 8. πŸ§ͺ Testing

Run all tests:

```bash
PYTHONPATH=. pytest
```

Tests are in tests/ and cover:

- Inference API (serve_model.py)
- Feature engineering
- Model training logic
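
As an example of the style, an API test might look like the sketch below, using FastAPI's TestClient. The `app` import and response shape are assumptions; the actual tests under `tests/` may differ.

```python
# Illustrative test sketch; the project's real tests live in tests/.
from fastapi.testclient import TestClient

from serve_model import app  # assumes the FastAPI instance is named `app`

client = TestClient(app)


def test_predict_returns_a_response():
    response = client.post("/predict", json={"features": [12, 0, 1, 0, 4]})
    assert response.status_code == 200
    assert response.json()  # exact response fields depend on serve_model.py
```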

### 9. πŸ“Š Streamlit Dashboard

The project includes an optional interactive Streamlit dashboard to visualize:

- βœ… Latest anomaly predictions
- πŸ“ˆ Data drift metrics from the Evidently report
- πŸ§‘β€πŸ’» Top actors based on GitHub activity
- ⏱️ Activity summary over the last 48 hours

#### πŸ”§ How to Run Locally

Make sure you have installed all dependencies via Pipenv, then launch the Streamlit app:

```bash
streamlit run streamlit_app.py
```

Once it starts, open the dashboard in your browser at:

```bash
http://localhost:8501
```

The app will automatically load:

- The latest prediction file from data/features/
- The latest drift report from reports/

Note: If these files do not exist, the dashboard will show a warning or empty state. You can generate them by running the Airflow pipeline or the monitoring scripts manually.
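
One common way to implement that "latest file" lookup is sketched below; the helper and directory layout are illustrative, and the dashboard's own loading code is in `streamlit_app.py`.

```python
# Illustrative helper for picking the newest prediction file; see streamlit_app.py.
from pathlib import Path

import pandas as pd
import streamlit as st


def load_latest_parquet(directory: str) -> pd.DataFrame | None:
    """Return the most recently modified parquet file in a directory, if any."""
    files = sorted(Path(directory).glob("*.parquet"), key=lambda p: p.stat().st_mtime)
    return pd.read_parquet(files[-1]) if files else None


predictions = load_latest_parquet("data/features")
if predictions is None:
    st.warning("No prediction files found yet -- run the Airflow pipeline first.")
else:
    st.dataframe(predictions)
```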

#### 🐳 Optional: Run via Docker

You can also build and run the dashboard as a container (if desired):

Build the image:

```bash
docker build -t github-anomaly-dashboard -f Dockerfile.streamlit .
```

Run the container:

```bash
docker run -p 8501:8501 \
-v $(pwd)/data:/app/data \
-v $(pwd)/reports:/app/reports \
github-anomaly-dashboard
```

Then open your browser at http://localhost:8501.

### 10. ☁️ Infrastructure as Code (IaC): MLflow Server with Terraform

This Terraform module provisions a **Docker-based MLflow tracking server**, matching the setup used in `docker-compose.yaml`, but on a **different port (5050)** to avoid conflicts.

---

#### πŸ“ Directory Structure

- infra/main.tf # Terraform configuration
- infra/README.md # Documentation for this module

#### βš™οΈ Requirements

- [Terraform](https://developer.hashicorp.com/terraform/downloads)
- [Docker](https://docs.docker.com/get-docker/)

#### πŸš€ How to Use

##### 1. Navigate to the `infra/` folder

```bash
cd infra
```

##### 2. Initialize Terraform

```bash
terraform init
```

##### 3. Apply the infrastructure

```bash
terraform apply # Confirm with yes when prompted.
```

##### 4. πŸ”Ž Verify

MLflow server will be available at:

```bash
http://localhost:5050
```

All artifacts will be stored in your project’s mlruns/ directory.

##### 5. ❌ To Clean Up

```bash
terraform destroy
```

This removes the MLflow container provisioned by Terraform.

### 11. 🧹 Clean Code

All code follows:

- PEP8 formatting via Black
- Linting with Flake8 + Bugbear
- Pre-commit hook enforcement


### 12. πŸ› οΈ Makefile Usage

This project includes a Makefile that simplifies formatting, testing, building Docker containers, and running Airflow or the FastAPI inference app.

You can run all commands with or without activating the Pipenv shell. For example:

```bash
make lint
```

#### πŸ”§ Setup Commands

```bash
make install # Install all dependencies via Pipenv (both runtime and dev)
make create-env # Create .env file with AIRFLOW_UID, alert placeholders, and S3 support flag
make clean # Remove all __pycache__ folders and .pyc files
```

#### πŸ§ͺ Code Quality & Testing

```bash
make format # Format code using Black
make lint # Lint code using Flake8
make test # Run tests using Pytest
make check # Run all of the above together
```

#### πŸ“Š Streamlit Dashboard

```bash
make streamlit # Launch the Streamlit dashboard at http://localhost:8501
```

#### 🐳 FastAPI Inference App

```bash
make docker-build # Build the Docker image for FastAPI app
make docker-run # Run the Docker container on port 8000
make api-test # Send a test prediction request using curl
```

After running make docker-run, open another terminal and run make api-test.

#### ⏱️ Airflow Pipeline

```bash
make airflow-up # Start Airflow services (scheduler, UI, etc.)
make airflow-down # Stop all Airflow containers
```

Once up, access:

- Airflow UI: http://localhost:8080 (Login: airflow / airflow)
- MLflow UI: http://localhost:5000

#### ☁️ MLflow Server with Terraform

```bash
make install-terraform # Install Terraform CLI if not present
make terraform-init # Initialize Terraform config
make terraform-apply # Provision MLflow container (port 5050)
make terraform-destroy # Tear down MLflow container
make terraform-status # Show current infra state
```

#### πŸ“‹ View All Commands

```bash
make help # Prints a summary of all available targets and their descriptions.
```

### 13. πŸ™Œ Credits

Built by Rajat Gupta as part of an MLOps portfolio.
Inspired by real-time event pipelines and anomaly detection architectures used in production.

### 14. πŸ“ License