{"id":30850139,"url":"https://github.com/robinmillford/reddit-content-classifier","last_synced_at":"2025-09-07T05:06:59.254Z","repository":{"id":306977664,"uuid":"1027882655","full_name":"RobinMillford/Reddit-content-classifier","owner":"RobinMillford","description":"This repository contains the source code for a complete, end-to-end MLOps project that automatically trains, evaluates, and deploys a machine learning model to classify Reddit content as Safe-For-Work (SFW) or Not-Safe-For-Work (NSFW).","archived":false,"fork":false,"pushed_at":"2025-09-02T05:37:28.000Z","size":137,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-02T07:21:43.392Z","etag":null,"topics":["content-classification","end-to-end-machine-learning","mlops-project","mlops-workflow","nsfw","reddit-api"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RobinMillford.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-28T17:15:40.000Z","updated_at":"2025-09-02T05:37:31.000Z","dependencies_parsed_at":"2025-07-28T20:27:09.666Z","dependency_job_id":"a38aa41e-f2c6-4423-a032-62ad7f78cb47","html_url":"https://github.com/RobinMillford/Reddit-content-classifier","commit_stats":null,"previous_names":["robinmillford/reddit-content-classifier"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RobinMillfo
rd/Reddit-content-classifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinMillford%2FReddit-content-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinMillford%2FReddit-content-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinMillford%2FReddit-content-classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinMillford%2FReddit-content-classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RobinMillford","download_url":"https://codeload.github.com/RobinMillford/Reddit-content-classifier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinMillford%2FReddit-content-classifier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273997155,"owners_count":25204479,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["content-classification","end-to-end-machine-learning","mlops-project","mlops-workflow","nsfw","reddit-api"],"created_at":"2025-09-07T05:06:57.474Z","updated_at":"2025-09-07T05:06:59.241Z","avatar_url":"https://github.com/RobinMillford.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[
],"readme":"# AI Content Classifier: Production MLOps Pipeline\n\n[![Scheduled Model Retraining](https://github.com/RobinMillford/Reddit-content-classifier/actions/workflows/main.yml/badge.svg)](https://github.com/RobinMillford/Reddit-content-classifier/actions/workflows/main.yml)\n[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/)\n[![Streamlit](https://img.shields.io/badge/streamlit-1.28+-red.svg)](https://streamlit.io/)\n[![MLOps](https://img.shields.io/badge/MLOps-automated-green.svg)](https://github.com/RobinMillford/Reddit-content-classifier)\n\nA production-ready MLOps pipeline that automatically classifies Reddit content using advanced multi-label machine learning with enterprise-level automation and continuous learning.\n\n🌐 **[Live Application](https://reddit-content-classifier.streamlit.app/)**\n\n## 🎯 Project Overview\n\n- **Multi-Label Classification**: Simultaneous analysis across 5 dimensions (Safety, Toxicity, Sentiment, Topic, Engagement)\n- **Automated MLOps**: Weekly retraining with model selection and deployment\n- **Production Scaling**: Handles 25,000+ posts per training cycle\n- **Real-time Inference**: Sub-second response times with 88%+ accuracy\n\n## 📊 Multi-Label Classification\n\n| Category       | Description                | Classifications                      |\n| -------------- | -------------------------- | ------------------------------------ |\n| **Safety**     | Content safety assessment  | Safe, NSFW                           |\n| **Toxicity**   | Harmful content detection  | Non-toxic, Toxic                     |\n| **Sentiment**  | Emotional tone analysis    | Positive, Neutral, Negative          |\n| **Topic**      | Content categorization     | Technology, Gaming, Business, Health |\n| **Engagement** | Viral potential prediction | High, Low Engagement                 |\n\n## 🏗️ Architecture\n\n![MLOps Pipeline Workflow](Workflow.png)\n\n```\nReddit API → Data Pipeline → ML 
Training → Model Deployment → Web Application\n    │              │              │               │                │\n PRAW API       GitHub Actions   Ensemble ML      Git LFS        Streamlit\n```\n\n**MLOps Pipeline**:\n\n1. **Data Collection**: Weekly automated Reddit data ingestion (25,000+ posts)\n2. **Feature Engineering**: TF-IDF vectorization (10k features, 1-2 grams)\n3. **Model Training**: Multi-algorithm competition (Logistic Regression, SVM, Neural Networks, LightGBM)\n4. **Model Selection**: Ensemble creation from top-performing models\n5. **Deployment**: Automated Git LFS versioning and cloud deployment\n\n## 🎯 Performance Metrics\n\n| Metric                  | Value   | Description                        |\n| ----------------------- | ------- | ---------------------------------- |\n| **Binary F1-Score**     | 88.3%   | SFW/NSFW classification accuracy   |\n| **Multi-Label Jaccard** | 82.7%   | Overall multi-category performance |\n| **Training Data**       | 25,000+ | Reddit posts per training cycle    |\n| **Inference Speed**     | \u003c100ms  | Real-time response capability      |\n| **Model Size**          | ~150MB  | Optimized for cloud deployment     |\n| **Automation**          | Weekly  | Continuous learning and updates    |\n\n## 🛠️ Technology Stack\n\n**Core Technologies**:\n\n- **Python 3.11**, Scikit-learn, LightGBM, Pandas, NumPy\n- **Streamlit**, Plotly (Visualization)\n- **PRAW** (Reddit API), TF-IDF (NLP)\n\n**MLOps \u0026 Infrastructure**:\n\n- **GitHub Actions** (CI/CD), **Git LFS** (Model Versioning)\n- **Docker** (Containerization), **Streamlit Cloud** (Deployment)\n\n## 🚀 Local Development Setup\n\n### Prerequisites\n\n- **Python 3.11+**\n- **Git** with **Git LFS** support\n- **Reddit API credentials** (for data collection and training)\n\n### Step 1: Clone Repository\n\n```bash\ngit clone https://github.com/RobinMillford/Reddit-content-classifier.git\ncd Reddit-content-classifier\n\n# Setup Git LFS for model files\ngit lfs 
install\ngit lfs pull\n```\n\n### Step 2: Environment Setup\n\n```bash\n# Create virtual environment\npython -m venv venv\n\n# Activate virtual environment\n# Windows:\nvenv\\Scripts\\activate\n# macOS/Linux:\nsource venv/bin/activate\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n### Step 3: Reddit API Configuration\n\n**Get Reddit API Credentials**:\n\n1. Go to [Reddit App Preferences](https://www.reddit.com/prefs/apps)\n2. Click **\"Create App\"** or **\"Create Another App\"**\n3. Fill in the form:\n   - **Name**: Your app name (e.g., \"Content Classifier\")\n   - **App type**: Select **\"script\"**\n   - **Description**: Optional\n   - **About URL**: Leave blank\n   - **Redirect URI**: `http://localhost:8080`\n4. Click **\"Create app\"**\n5. Note down the **Client ID** (under app name) and **Client Secret**\n\n**Setup Environment Variables**:\n\nCreate a `.env` file in the project root:\n\n```bash\n# Create .env file\ntouch .env  # Linux/macOS\n# or create manually on Windows\n```\n\nAdd your Reddit API credentials to `.env`:\n\n```env\nREDDIT_CLIENT_ID=your_client_id_here\nREDDIT_CLIENT_SECRET=your_client_secret_here\nREDDIT_USER_AGENT=YourAppName/1.0\n```\n\n**Alternative: Export Environment Variables**:\n\n```bash\n# Export variables (Linux/macOS)\nexport REDDIT_CLIENT_ID=\"your_client_id\"\nexport REDDIT_CLIENT_SECRET=\"your_client_secret\"\nexport REDDIT_USER_AGENT=\"YourAppName/1.0\"\n\n# Windows Command Prompt\nset REDDIT_CLIENT_ID=your_client_id\nset REDDIT_CLIENT_SECRET=your_client_secret\nset REDDIT_USER_AGENT=YourAppName/1.0\n\n# Windows PowerShell\n$env:REDDIT_CLIENT_ID=\"your_client_id\"\n$env:REDDIT_CLIENT_SECRET=\"your_client_secret\"\n$env:REDDIT_USER_AGENT=\"YourAppName/1.0\"\n```\n\n### Step 4: Run Application\n\n```bash\n# Start the web application\nstreamlit run app.py\n```\n\n🌐 **Access**: Application runs at `http://localhost:8501`\n\n### Step 5: Custom Model Training (Optional)\n\n```bash\n# Collect fresh training 
data\npython src/ingest_data.py\n\n# Train and evaluate models\npython src/train.py\n\n# Models are automatically saved and can be loaded by app.py\n```\n\n### Troubleshooting\n\n**Common Issues**:\n\n1. **Git LFS files not downloading**: Run `git lfs pull`\n2. **Reddit API errors**: Verify your `.env` credentials\n3. **Model files missing**: Ensure Git LFS is installed and configured\n4. **Import errors**: Check virtual environment activation\n\n**Verify Setup**:\n\n```bash\n# Check Git LFS status\ngit lfs ls-files\n\n# Verify exported environment variables (values set only in .env are not visible to this command)\npython -c \"import os; print(os.getenv('REDDIT_CLIENT_ID'))\"\n\n# Check that PRAW is installed (this does not contact the Reddit API)\npython -c \"import praw; reddit = praw.Reddit(client_id='test', client_secret='test', user_agent='test'); print('PRAW imported successfully')\"\n```\n\n## 📁 Project Structure\n\n```\n├── src/\n│   ├── ingest_data.py      # Reddit data collection\n│   └── train.py            # ML model training\n├── .github/workflows/      # CI/CD automation\n├── app.py                  # Streamlit web application\n├── champion_model.pkl      # Production binary model (Git LFS)\n├── multi_label_model.pkl   # Production multi-label model (Git LFS)\n├── vectorizer.joblib       # Text preprocessing pipeline (Git LFS)\n└── model_metadata.joblib   # Model performance metrics (Git LFS)\n```\n\n## 💼 Professional Impact\n\n**Business Value**: Demonstrates end-to-end ML engineering capabilities with production-ready automation and scalable infrastructure design.\n\n**Technical Expertise**: Showcases expertise in MLOps, automated pipelines, multi-label classification, and cloud deployment strategies.\n\n**Results Delivered**: 88%+ accuracy system processing 25,000+ posts weekly with zero-downtime continuous deployment.\n\n## 🤝 Contributing\n\nThis project is **open source** and welcomes contributions from the community.\n\n**How to Contribute**:\n\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature/enhancement`\n3. 
Make your changes with proper testing\n4. Submit a pull request with detailed description\n\n**Areas for Contribution**:\n\n- Model performance improvements\n- New classification categories\n- Enhanced MLOps automation\n- Documentation and testing\n\n**Development Setup**:\n\n```bash\ngit clone https://github.com/RobinMillford/Reddit-content-classifier.git\ncd Reddit-content-classifier\npip install -r requirements.txt\nstreamlit run app.py\n```\n\n**Project Repository**: [github.com/RobinMillford/Reddit-content-classifier](https://github.com/RobinMillford/Reddit-content-classifier)\n\n---\n\n_This project demonstrates production-ready MLOps implementation suitable for enterprise content moderation systems._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobinmillford%2Freddit-content-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobinmillford%2Freddit-content-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobinmillford%2Freddit-content-classifier/lists"}