![Banner](assets/banner.png)

# 🛠️ GitHub Anomaly Detection Pipeline

## 💡 Motivation & Use Case

GitHub hosts an enormous amount of user activity, including pull requests, issues, forks, and stars. Monitoring this activity in real time is essential for identifying unusual or malicious behavior, such as bots, misuse, or suspicious spikes in contributions.

This project builds a **production-grade anomaly detection system** to:

- Detect abnormal GitHub user behavior (e.g., excessive PRs, bot-like stars)
- Alert maintainers and admins in real time via Slack or email
- Serve anomaly scores via an API and support continuous retraining
- Visualize trends, drift, and recent activity in an interactive dashboard

---

The system is built with:

- **Apache Airflow** for orchestration
- **Pandas + Scikit-learn (Isolation Forest)** for modeling and anomaly detection
- **Email & Slack** alerting for anomaly spikes and data drift
- **FastAPI** for real-time inference
- **Pytest, Black, Flake8** for testing and linting
- **Pre-commit + GitHub Actions** for CI/CD and code quality
- **Streamlit UI** for visualization
- **Terraform** for infrastructure-as-code provisioning (MLflow)
- **AWS S3** for optional cloud-based storage of features, models, and predictions

The full architecture of this GitHub anomaly detection pipeline is illustrated in the diagram below.

![Architecture](assets/architecture.png)

---

## 🤖 Too lazy for copy-pasting commands?

If you're like me and hate typing out commands... good news! Just use the **Makefile** to do all the boring stuff for you:

```bash
make help
```

See full Makefile usage [here](#makefile-usage): everything from setup to linting, testing, the API, Airflow, and Terraform infra!

## 📦 Project Structure

```text
.
├── dags/                    ← Airflow DAGs for data pipeline and retraining
├── data/                    ← Input datasets (raw, features, processed)
├── models/                  ← Trained ML models (e.g., Isolation Forest)
├── mlruns/                  ← MLflow experiment tracking artifacts
├── infra/                   ← Terraform IaC for provisioning the MLflow container
├── github_pipeline/         ← Feature engineering, inference, monitoring scripts
├── tests/                   ← Pytest-based unit/integration tests
├── reports/                 ← Data drift reports (JSON/HTML) from Evidently
├── alerts/                  ← Alert log dumps (e.g., triggered drift/anomaly alerts)
├── notebooks/               ← Jupyter notebooks for exploration & experimentation
├── assets/                  ← Images and architecture diagrams for README
├── .github/workflows/       ← GitHub Actions CI/CD pipelines
├── streamlit_app.py         ← Real-time dashboard for monitoring
├── serve_model.py           ← FastAPI inference service
├── Dockerfile.*             ← Dockerfiles for the API and Streamlit services
├── docker-compose.yaml      ← Compose file to run Airflow and supporting services
├── Makefile                 ← Task automation: setup, test, Airflow, Terraform, etc.
├── requirements.txt         ← Python dependencies for Airflow containers
├── Pipfile / Pipfile.lock   ← Python project environment (via Pipenv)
├── .env                     ← Environment variables (Slack, email, Airflow UID, S3 support flag)
└── README.md                ← 📘 You are here
```

---

## ⚙️ Setup Instructions

### 1. Clone and install dependencies

```bash
git clone https://github.com/rajat116/github-anomaly-project.git
cd github-anomaly-project
pipenv install --dev
pipenv shell
```

Or install using pip:

```bash
pip install -r requirements.txt
```

### 📄 .env Configuration (Required)

Before running Airflow, you must create a `.env` file in the project root with at least the following content:

```env
AIRFLOW_UID=50000
USE_S3=false
```

This is required for Docker to set the correct permissions inside the Airflow containers.

#### 🔄 USE_S3 Flag

Set this flag to control where your pipeline reads and writes files:

- `USE_S3=false`: all files are stored locally (the default, for development and testing)
- `USE_S3=true`: files are written to and read from AWS S3

✅ Required when `USE_S3=true`

If you enable S3 support, also provide your AWS credentials in the `.env`:

```env
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_REGION=us-east-1
S3_BUCKET_NAME=github-anomaly-logs
```

💡 Tip for contributors: if you're testing locally or don't have AWS credentials, just keep:

```env
USE_S3=false
```

This disables all cloud storage usage and lets you run the full pipeline locally.

#### Optional (for Email & Slack alerts)

If you'd like to enable alerts, also include the following variables:

```env
# Slack Alerts
SLACK_API_TOKEN=xoxb-...
SLACK_CHANNEL=#your-channel

# Email Alerts
EMAIL_SENDER=your_email@example.com
EMAIL_PASSWORD=your_email_app_password
EMAIL_RECEIVER=receiver@example.com
EMAIL_SMTP=smtp.gmail.com
EMAIL_PORT=587
```

---

### 2. ⚙️ Airflow + 📈 MLflow Integration

This project uses Apache Airflow to orchestrate a real-time ML pipeline and MLflow to track model training, metrics, and artifacts.
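As a mental model, each daily run is a linear chain of steps that Airflow schedules and retries. A dependency-free sketch of that flow, with hypothetical function and path names (the real task definitions live in `dags/` and differ in detail):

```python
# Sketch of the daily flow that the Airflow DAG orchestrates.
# Function names and paths are illustrative, not the project's actual task IDs.

def download_events(day: str) -> str:
    """Fetch the raw GitHub event dump for one day; return its local path."""
    return f"data/raw/{day}/events.parquet"

def build_features(raw_path: str) -> str:
    """Aggregate raw events into actor-wise feature rows."""
    return raw_path.replace("raw", "features")

def run_inference(features_path: str) -> str:
    """Score each actor's features with the trained model."""
    return features_path.replace("features", "predictions")

def daily_github_inference(day: str) -> str:
    # In Airflow, each step becomes a task and this chain becomes
    # task dependencies: download >> features >> inference
    return run_inference(build_features(download_events(day)))

print(daily_github_inference("2025-07-01"))
# -> data/predictions/2025-07-01/events.parquet
```

In the real DAGs, Airflow adds what this sketch omits: scheduling, retries, backfills, and visibility into each step in the UI.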
#### 🚀 1. Start Airflow & MLflow via Docker

🛠️ Build & launch:

```bash
docker compose build airflow
docker compose up airflow
```

Once up, access:

- Airflow UI: http://localhost:8080 (login: airflow / airflow)
- MLflow UI: http://localhost:5000

#### ⏱️ 2. Airflow DAGs Overview

- `daily_github_inference`: download → feature engineering → inference
- `daily_monitoring_dag`: drift checks, cleanup, alerting
- `retraining_dag`: triggers model training weekly and logs it to MLflow

#### 📈 3. MLflow Experiment Tracking

Model training is handled by:

```bash
github_pipeline/train_model.py
```

Each run logs the following:

✅ Parameters:

- `timestamp` — training batch timestamp
- `model_type` — algorithm used (IsolationForest)
- `n_estimators` — number of trees

📊 Metrics:

- `mean_anomaly_score`
- `num_anomalies`
- `num_total`
- `anomaly_rate`

📦 Artifacts:

- `isolation_forest.pkl` — the trained model
- `actor_predictions_<timestamp>.parquet`
- MLflow Model Registry entry

All experiments are stored in the `mlruns/` volume:

```yaml
volumes:
  - ./mlruns:/opt/airflow/mlruns
```

You can explore experiment runs and models in the MLflow UI.

### 3. 🧠 Model Training

The model (Isolation Forest) is trained on actor-wise event features:

```bash
python github_pipeline/train_model.py
```

The latest parquet file is used automatically. The model and scaler are saved to `models/`.

### 4. 🚀 FastAPI Inference

#### Build & run

```bash
docker build -t github-anomaly-inference -f Dockerfile.inference .
docker run -p 8000:8000 github-anomaly-inference
```

#### Test the API

```bash
curl -X POST http://localhost:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [12, 0, 1, 0, 4]}'
```
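The endpoint takes five numeric feature values for an actor. To illustrate the kind of scoring an Isolation Forest performs behind such an endpoint, here is a self-contained sketch on synthetic data; the feature semantics, counts, and seed below are made up and do not reflect the project's trained model:

```python
# Sketch: flag anomalous actor feature vectors with an Isolation Forest.
# Synthetic data and invented feature columns; the real pipeline trains on
# actor-wise event features and persists the fitted model to models/.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 500 "normal" actors: modest per-day counts of PRs, issues, pushes, stars, forks
normal = rng.poisson(lam=[3, 1, 5, 2, 1], size=(500, 5)).astype(float)

model = IsolationForest(n_estimators=100, random_state=42).fit(normal)

candidates = np.array([
    [2, 1, 4, 3, 0],      # looks like a typical actor
    [120, 0, 1, 500, 0],  # bot-like spike in PRs and stars
], dtype=float)

labels = model.predict(candidates)             # +1 = normal, -1 = anomaly
scores = model.decision_function(candidates)   # lower = more anomalous
print(labels, scores)  # the spiky actor is flagged with label -1
```

The FastAPI service essentially wraps this `predict` step: it loads the persisted model once at startup and scores each incoming feature vector per request.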
### 5. 📣 Alerts: Email & Slack

This project includes automated alerting for anomaly spikes and data drift, integrated into the `daily_monitoring_dag` DAG.

#### ✅ Triggers for alerts

- 🔺 Anomaly rate alert: fires if the anomaly rate exceeds a threshold (e.g., >10% of actors).
- 🔁 Drift detection alert: fires if feature distributions change significantly over time.

#### 🔔 Notification channels

- Email alerts (via `smtplib`)
- Slack alerts (via Slack incoming webhooks)

#### 🔧 Configuration

Set the following environment variables in your Airflow setup:

```env
# .env or Airflow environment
ALERT_EMAIL_FROM=your_email@example.com
ALERT_EMAIL_TO=recipient@example.com
ALERT_EMAIL_PASSWORD=your_email_app_password
ALERT_EMAIL_SMTP=smtp.gmail.com
ALERT_EMAIL_PORT=587

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
```

🛡️ App passwords are recommended over your actual account password for Gmail or Outlook.

#### 📁 Alert scripts

The logic lives in:

```bash
github_pipeline/monitor.py
alerts/alerting.py
```

These generate alert messages and send them through email and Slack if thresholds are breached.

### 6. ✅ CI/CD with GitHub Actions

The `.github/workflows/ci.yml` workflow runs on every push:

- ✅ `black --check`
- ✅ `flake8` (E501, W503 ignored)
- ✅ `pytest`
- ✅ (optional) Docker build

### 7. 🔍 Code Quality

Pre-commit hooks ensure style and linting:

```bash
pre-commit install
pre-commit run --all-files
```

Configured via:

- `.pre-commit-config.yaml`
- `.flake8` (ignore = E501)

### 8. 🧪 Testing

Run all tests:

```bash
PYTHONPATH=. pytest
```

Tests are in `tests/` and cover:

- The inference API (`serve_model.py`)
- Feature engineering
- Model training logic
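For flavor, a unit test in this style might exercise a small feature-engineering helper. `count_events_by_actor` below is a hypothetical stand-in, not one of the project's actual functions in `github_pipeline/`:

```python
# Sketch: a pytest-style unit test for a feature-engineering helper.
from collections import Counter

def count_events_by_actor(events: list) -> dict:
    """Count how many events each actor produced (hypothetical helper)."""
    return dict(Counter(e["actor"] for e in events))

def test_count_events_by_actor():
    events = [
        {"actor": "alice", "type": "PullRequestEvent"},
        {"actor": "bob", "type": "WatchEvent"},
        {"actor": "alice", "type": "IssuesEvent"},
    ]
    assert count_events_by_actor(events) == {"alice": 2, "bob": 1}

test_count_events_by_actor()  # pytest would discover and run this automatically
```

Because the project sets `PYTHONPATH=.`, tests can import pipeline modules directly without installing the package.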
### 9. 📊 Streamlit Dashboard

The project includes an optional interactive Streamlit dashboard that visualizes:

- ✅ The latest anomaly predictions
- 📈 Data drift metrics from the Evidently report
- 🧑‍💻 Top actors by GitHub activity
- ⏱️ An activity summary over the last 48 hours

#### 🔧 How to run locally

Make sure you have installed all dependencies via Pipenv, then launch the Streamlit app:

```bash
streamlit run streamlit_app.py
```

Once it starts, open the dashboard in your browser at http://localhost:8501.

The app automatically loads:

- The latest prediction file from `data/features/`
- The latest drift report from `reports/`

Note: if these files do not exist, the dashboard shows a warning or an empty state. You can generate them by running the Airflow pipeline or the monitoring scripts manually.

#### 🐳 Optional: run via Docker

You can also build and run the dashboard as a container.

Build the image:

```bash
docker build -t github-anomaly-dashboard -f Dockerfile.streamlit .
```

Run the container:

```bash
docker run -p 8501:8501 \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/reports:/app/reports \
  github-anomaly-dashboard
```

Then open your browser at http://localhost:8501.

### 10. ☁️ Infrastructure as Code (IaC): MLflow Server with Terraform

This Terraform module provisions a **Docker-based MLflow tracking server**, matching the setup used in `docker-compose.yaml`, but on a **different port (5050)** to avoid conflicts.

#### 📁 Directory structure

- `infra/main.tf` # Terraform configuration

#### ⚙️ Requirements

- [Terraform](https://developer.hashicorp.com/terraform/downloads)
- [Docker](https://docs.docker.com/get-docker/)

#### 🚀 How to use

##### 1. Navigate to the `infra/` folder

```bash
cd infra
```

##### 2. Initialize Terraform

```bash
terraform init
```

##### 3. Apply the infrastructure

```bash
terraform apply  # confirm with "yes" when prompted
```

##### 4. 🔎 Verify

The MLflow server will be available at http://localhost:5050. All artifacts are stored in your project's `mlruns/` directory.

##### 5. ❌ Clean up

```bash
terraform destroy
```

This removes the MLflow container provisioned by Terraform.

### 11. 🧹 Clean Code

All code follows:

- PEP8 formatting via Black
- Linting with Flake8 + Bugbear
- Pre-commit hook enforcement

<span id="makefile-usage"></span>
### 12. 🛠️ Makefile Usage

This project includes a Makefile that simplifies formatting, testing, building Docker containers, and running Airflow or the FastAPI inference app.

You can run all commands with or without activating the Pipenv shell. For example:

```bash
make lint
```

#### 🔧 Setup commands

```bash
make install      # Install all dependencies via Pipenv (runtime and dev)
make create-env   # Create .env with AIRFLOW_UID, alert placeholders, and the S3 support flag
make clean        # Remove all __pycache__ folders and .pyc files
```

#### 🧪 Code quality & testing

```bash
make format  # Format code using Black
make lint    # Lint code using Flake8
make test    # Run tests using Pytest
make check   # Run all of the above together
```

#### 📊 Streamlit dashboard

```bash
make streamlit  # Launch the Streamlit dashboard at http://localhost:8501
```

#### 🐳 FastAPI inference app

```bash
make docker-build  # Build the Docker image for the FastAPI app
make docker-run    # Run the Docker container on port 8000
make api-test      # Send a test prediction request using curl
```

After running `make docker-run`, open another terminal and run `make api-test`.

#### ⏱️ Airflow pipeline

```bash
make airflow-up    # Start Airflow services (scheduler, UI, etc.)
make airflow-down  # Stop all Airflow containers
```

Once up, access:

- Airflow UI: http://localhost:8080 (login: airflow / airflow)
- MLflow UI: http://localhost:5000

#### MLflow server with Terraform

```bash
make install-terraform  # Install the Terraform CLI if not present
make terraform-init     # Initialize the Terraform config
make terraform-apply    # Provision the MLflow container (port 5050)
make terraform-destroy  # Tear down the MLflow container
make terraform-status   # Show current infra state
```

#### 📋 View all commands

```bash
make help  # Print a summary of all available targets and their descriptions
```

### 13. 🙌 Credits

Built by Rajat Gupta as part of an MLOps portfolio. Inspired by real-time event pipelines and anomaly detection architectures used in production.

### 14. 📝 License