{"id":29774588,"url":"https://github.com/athrvk/daily-article-scrapper","last_synced_at":"2025-07-27T08:08:44.737Z","repository":{"id":303194700,"uuid":"1014697948","full_name":"athrvk/daily-article-scrapper","owner":"athrvk","description":"A Python application that scrapes articles from various tech news sources and stores them in MongoDB. Designed to run as a scheduled job via GitHub Actions.","archived":false,"fork":false,"pushed_at":"2025-07-18T10:28:05.000Z","size":72,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-07-18T13:44:34.066Z","etag":null,"topics":["articles","latest","news","rss"],"latest_commit_sha":null,"homepage":"https://tokindle.in/discover","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/athrvk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-06T08:28:08.000Z","updated_at":"2025-07-16T18:44:53.000Z","dependencies_parsed_at":"2025-07-06T09:31:00.676Z","dependency_job_id":null,"html_url":"https://github.com/athrvk/daily-article-scrapper","commit_stats":null,"previous_names":["athrvk/daily-article-scrapper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/athrvk/daily-article-scrapper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/athrvk%2Fdaily-article-scrapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/athrvk%2Fdaily-article-scrapper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/athrvk%2Fdaily-article-scrapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/athrvk%2Fdaily-article-scrapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/athrvk","download_url":"https://codeload.github.com/athrvk/daily-article-scrapper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/athrvk%2Fdaily-article-scrapper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267327534,"owners_count":24069442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["articles","latest","news","rss"],"created_at":"2025-07-27T08:08:44.115Z","updated_at":"2025-07-27T08:08:44.732Z","avatar_url":"https://github.com/athrvk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Daily Article Scraper\n\nA Python application that scrapes articles from various tech news sources and stores them in MongoDB. 
Designed to run as a scheduled job via GitHub Actions.\n\n## Features\n\n- **Multi-source scraping**: Extracts articles from Medium, TechCrunch, HackerNews, Dev.to, BBC Tech, CNN Tech, Reuters Tech, and more\n- **MongoDB integration**: Stores articles with deduplication and indexing\n- **Automatic cleanup**: Removes articles older than 2 months (configurable)\n- **GitHub Actions workflow**: Automated daily scraping via cron jobs\n- **Professional structure**: Modular codebase with proper configuration management\n- **Comprehensive logging**: Detailed logging with file and console output\n- **Error handling**: Robust error handling and retry mechanisms\n- **Rate limiting**: Respectful scraping with configurable delays\n- **Database monitoring**: Statistics and health checking tools\n\n## Project Structure\n\n```\ndaily-article-scrapper/\n├── .github/\n│   └── workflows/\n│       └── daily-scraper.yml      # GitHub Actions workflow\n├── config/\n│   ├── __init__.py\n│   └── settings.py                # Configuration settings\n├── src/\n│   ├── __init__.py\n│   ├── database.py               # MongoDB operations\n│   └── scraper.py                # Core scraping logic\n├── tests/                        # Test files\n├── scripts/                      # Utility scripts\n│   ├── setup.sh                  # Environment setup\n│   ├── manage.sh                 # Project management\n│   ├── status_check.py           # Health monitoring\n│   └── cleanup_articles.py       # Database cleanup\n├── logs/                         # Log files (created at runtime)\n├── .env.example                  # Environment variables template\n├── .gitignore                    # Git ignore file\n├── main.py                       # Main application entry point\n├── requirements.txt              # Production dependencies\n├── requirements-dev.txt          # Development dependencies\n├── CLEANUP_GUIDE.md              # Database cleanup documentation\n└── README.md                     # This 
file\n```\n\n## Installation\n\n### 1. Clone the repository\n\n```bash\ngit clone \u003cyour-repo-url\u003e\ncd daily-article-scrapper\n```\n\n### 2. Create a virtual environment\n\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n```\n\n### 3. Install dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n### 4. Set up environment variables\n\n```bash\ncp .env.example .env\n# Edit .env with your MongoDB configuration\n```\n\n### 5. Set up MongoDB\n\nMake sure you have access to a MongoDB instance. You can use:\n- Local MongoDB installation\n- MongoDB Atlas (cloud)\n- Docker container\n\n## Configuration\n\n### Environment Variables\n\nCreate a `.env` file based on `.env.example`:\n\n```env\n# MongoDB Configuration\nMONGODB_URI=mongodb://localhost:27017/\nMONGODB_DATABASE=article_scraper\nMONGODB_COLLECTION=articles\n\n# Scraping Configuration\nTARGET_ARTICLE_COUNT=20\nRATE_LIMIT_DELAY=2\nMAX_RETRIES=3\n\n# Logging Configuration\nLOG_LEVEL=INFO\nLOG_FILE=logs/scraper.log\n\n# Cleanup Configuration\nAUTO_CLEANUP_ENABLED=true\nCLEANUP_MONTHS_OLD=2\n```\n\n### MongoDB Setup\n\nThe application will automatically:\n- Create the database and collection if they don't exist\n- Set up indexes for optimal performance\n- Handle duplicate articles based on URL\n\n## Usage\n\n### Local Development\n\n```bash\n# Run the scraper once\npython main.py\n\n# Run with custom article count\nTARGET_ARTICLE_COUNT=50 python main.py\n\n# Check database statistics\npython scripts/cleanup_articles.py --stats\n\n# Manual cleanup (dry run)\npython scripts/cleanup_articles.py --dry-run\n\n# Manual cleanup (execute)\npython scripts/cleanup_articles.py\n```\n\n### GitHub Actions Setup\n\n1. 
**Set up repository secrets**:\n   - Go to your repository Settings → Secrets and variables → Actions\n   - Add the following secrets:\n     - `MONGODB_URI`: Your MongoDB connection string\n     - `MONGODB_DATABASE`: Database name\n     - `MONGODB_COLLECTION`: Collection name\n\n2. **Configure the schedule**:\n   - Edit `.github/workflows/daily-scraper.yml`\n   - Modify the cron expression to your preferred time\n\n3. **Manual trigger**:\n   - Go to Actions tab in your repository\n   - Select \"Daily Article Scraper\"\n   - Click \"Run workflow\"\n\n## Development\n\n### Setting up development environment\n\n```bash\n# Install development dependencies\npip install -r requirements-dev.txt\n\n# Format code\nblack src/ config/ main.py\n\n# Lint code\nflake8 src/ config/ main.py\n\n# Check database statistics\nbash scripts/manage.sh stats\n\n# Manual cleanup\nbash scripts/manage.sh cleanup\n```\n\n### Adding new sources\n\n1. **RSS feeds**: Add to `config/settings.py` in the `RSS_FEEDS` dictionary\n2. **Custom scrapers**: Add methods to `src/scraper.py`\n3. **Configuration**: Update environment variables as needed\n\n## Database Cleanup\n\nThe application includes an automated cleanup system that removes articles older than 2 months by default. 
This prevents the database from growing indefinitely and ensures optimal performance.\n\n### Cleanup Features\n\n- **Automatic cleanup**: Runs before each scraping session\n- **Configurable retention**: Adjust with `CLEANUP_MONTHS_OLD` environment variable\n- **Manual control**: Can be disabled with `AUTO_CLEANUP_ENABLED=false`\n- **Safe operations**: Dry-run mode available for testing\n\n### Cleanup Commands\n\n```bash\n# View database statistics\npython scripts/cleanup_articles.py --stats\nbash scripts/manage.sh stats\n\n# Preview cleanup (dry run)\npython scripts/cleanup_articles.py --dry-run\n\n# Manual cleanup\npython scripts/cleanup_articles.py\nbash scripts/manage.sh cleanup\n\n# Custom retention period\npython scripts/cleanup_articles.py --months 3\n```\n\n### Configuration\n\nSet cleanup behavior in your `.env` file:\n\n```env\nAUTO_CLEANUP_ENABLED=true    # Enable/disable automatic cleanup\nCLEANUP_MONTHS_OLD=2         # Keep articles for 2 months\n```\n\nFor detailed cleanup documentation, see `CLEANUP_GUIDE.md`.\n\n## API Documentation\n\n### ArticleScraper Class\n\nMain scraping functionality:\n\n```python\nfrom src.scraper import ArticleScraper\n\nscraper = ArticleScraper()\narticles = scraper.scrape_daily_articles(target_count=20)\n```\n\n### DatabaseManager Class\n\nMongoDB operations:\n\n```python\nfrom src.database import DatabaseManager\n\nwith DatabaseManager() as db:\n    db.save_articles(articles)\n    recent = db.get_recent_articles(days=7)\n```\n\n## Data Structure\n\nArticles are stored with the following structure:\n\n```json\n{\n  \"_id\": \"https://example.com/article_20250706\",\n  \"title\": \"Article Title\",\n  \"url\": \"https://example.com/article\",\n  \"published\": \"2025-07-06T10:30:00Z\",\n  \"summary\": \"Article summary text\",\n  \"source\": \"techcrunch.com\",\n  \"tags\": [\"technology\", \"ai\"],\n  \"scraped_at\": \"2025-07-06T13:22:46.123Z\"\n}\n```\n\n## Monitoring and Logs\n\n- **Local logs**: Check 
`logs/scraper.log`\n- **Database stats**: Run `python scripts/cleanup_articles.py --stats`\n- **Management tools**: Use `bash scripts/manage.sh [command]`\n- **GitHub Actions**: View logs in the Actions tab\n- **MongoDB**: Query the database for article statistics\n\n## Troubleshooting\n\n### Common Issues\n\n1. **MongoDB connection failed**:\n   - Check your `MONGODB_URI` configuration\n   - Ensure MongoDB is running and accessible\n   - Verify network connectivity\n\n2. **No articles found**:\n   - Check internet connectivity\n   - Some RSS feeds might be temporarily unavailable\n   - Increase `MAX_RETRIES` in configuration\n\n3. **Rate limiting**:\n   - Increase `RATE_LIMIT_DELAY` to be more respectful to servers\n   - Some sites might block requests; consider using proxies\n\n4. **Database growing too large**:\n   - Check if cleanup is enabled: `AUTO_CLEANUP_ENABLED=true`\n   - Adjust retention period: `CLEANUP_MONTHS_OLD=2`\n   - Run manual cleanup: `bash scripts/manage.sh cleanup`\n\n5. **Old articles not being deleted**:\n   - Verify cleanup configuration in `.env`\n   - Check logs for cleanup errors\n   - Run cleanup manually to test\n\n### GitHub Actions Issues\n\n1. **Secrets not configured**:\n   - Ensure all required secrets are set in repository settings\n\n2. **Workflow not running**:\n   - Check the cron expression syntax\n   - Ensure the repository is not dormant (GitHub automatically disables scheduled workflows in public repositories after 60 days without repository activity)\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Submit a pull request\n\n## License\n\nThis project is licensed under the MIT License. See the LICENSE file for details.\n\n## Support\n\nFor issues and questions:\n1. Check the troubleshooting section\n2. Search existing GitHub issues\n3. 
Create a new issue with detailed information\n\n## Acknowledgments\n\n- Built with Python 3.11+\n- Uses feedparser for RSS parsing\n- Uses Beautiful Soup for web scraping\n- Uses MongoDB for data storage\n- Uses GitHub Actions for automation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fathrvk%2Fdaily-article-scrapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fathrvk%2Fdaily-article-scrapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fathrvk%2Fdaily-article-scrapper/lists"}