{"id":30237848,"url":"https://github.com/mooshieblob1/tag-scraper","last_synced_at":"2025-08-15T02:39:37.673Z","repository":{"id":308639738,"uuid":"1033497422","full_name":"Mooshieblob1/tag-scraper","owner":"Mooshieblob1","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-07T03:03:57.000Z","size":144,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-07T03:31:07.298Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mooshieblob1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-06T23:03:17.000Z","updated_at":"2025-08-07T03:04:01.000Z","dependencies_parsed_at":"2025-08-07T03:31:09.123Z","dependency_job_id":"27bf2aae-c6d4-4628-acd7-dc159ca4b25c","html_url":"https://github.com/Mooshieblob1/tag-scraper","commit_stats":null,"previous_names":["mooshieblob1/tag-scraper"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Mooshieblob1/tag-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mooshieblob1%2Ftag-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mooshieblob1%2Ftag-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mooshieblob1%2Ftag-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mooshieblob1%2Ftag-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/owners/Mooshieblob1","download_url":"https://codeload.github.com/Mooshieblob1/tag-scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mooshieblob1%2Ftag-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270515244,"owners_count":24598435,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-15T02:00:12.559Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-15T02:39:31.585Z","updated_at":"2025-08-15T02:39:37.658Z","avatar_url":"https://github.com/Mooshieblob1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🎨 Danbooru Artist Scraper\n\nA comprehensive web application for scraping and searching artist data from Danbooru. 
Features a modern web interface for easy searching and filtering of artists by various criteria, with advanced 429 rate limit detection and adaptive rate limiting.\n\n## ✨ Features\n\n- **Web-based Interface**: Modern, responsive UI with dark/light mode toggle\n- **Artist Scraping**: Collect artist data from Danbooru pages 1-1000 (free tier limit)\n- **Post Count Fetching**: Optional individual post count retrieval (requires API credentials)\n- **Image Previews**: Visual artwork previews directly in search results with:\n  - **Smart Blur**: Sensitive/questionable/explicit content blurred by default  \n  - **Extreme Blur Effect**: 4.3 billion pixel blur (2^32) for maximum privacy\n  - **Rating Priority**: General → Sensitive → Questionable → Explicit display order\n  - **Toggle Control**: Per-artist blur toggle in preview interface\n  - **Safety Badges**: Clear rating indicators (General/Sensitive/Questionable/Explicit)\n- **Duplicate Detection**: Automatically handles duplicate artists with database constraints\n- **Page Continuation**: Scrapes until no more pages are found (up to 3 consecutive empty pages)\n- **Advanced Search**: Filter artists by:\n  - Name (starts with, contains)\n  - Post count (minimum/maximum) - when post counts are fetched\n  - Multiple criteria combinations\n- **Real-time Progress**: Live progress tracking during scraping\n- **Enhanced 429 Detection**: Intelligent rate limiting with automatic adaptation\n- **Health Monitoring**: Real-time rate limiting status with color-coded health indicators\n- **Database Storage**: SQLite database for efficient data storage and querying\n- **Export Functionality**: Export search results as JSON and CSV\n- **Statistics Dashboard**: View database statistics and top artists\n\n## 🚫 Advanced Rate Limiting\n\nThis scraper includes sophisticated 429 (Too Many Requests) detection and adaptive rate limiting:\n\n### Adaptive Features\n- **Automatic 429 Detection**: Detects and handles rate limiting 
automatically\n- **Intelligent Backoff**: Uses exponential backoff with jitter to avoid thundering herd\n- **Health Monitoring**: Tracks rate limiting health (🟢 Healthy, 🟡 Warning, 🔴 Critical)\n- **Progressive Recovery**: Gradually returns to normal speed after successful requests\n- **Cooldown Management**: Implements temporary cooldowns during heavy rate limiting\n\n### 📊 Monitoring Dashboard\n- **Real-time Status**: Live rate limiting status updates every 10 seconds\n- **429 Error Tracking**: Total and consecutive 429 error counts\n- **Adaptive Indicators**: Visual indicators when adaptive rate limiting is active\n- **Health Status**: Color-coded health status with detailed information\n\n## 🚀 Quick Start\n\n### Prerequisites\n- Python 3.7 or higher\n- Internet connection\n\n### Installation\n\n1. **Clone or download this repository**\n\n2. **Set up API credentials (REQUIRED for post counts):**\n   ```bash\n   # Create .env file with your Danbooru credentials\n   echo \"DANBOORU_USERNAME=your_username\" \u003e .env\n   echo \"DANBOORU_API_KEY=your_api_key\" \u003e\u003e .env\n   ```\n   \n   **How to get API credentials:**\n   - Create a free account at https://danbooru.donmai.us/\n   - Go to https://danbooru.donmai.us/api_keys and generate an API key\n   - Replace `your_username` and `your_api_key` with your actual credentials\n   \n   **⚠️ IMPORTANT**: Without API credentials, post counts will always be 0!\n\n3. **Run the setup script:**\n   ```bash\n   ./setup.sh\n   ```\n   This will:\n   - Create a virtual environment\n   - Install all dependencies\n   - Set up the project structure\n\n4. **Start the application:**\n   ```bash\n   source venv/bin/activate\n   python app.py\n   ```\n\n5. 
**Open your browser to:** http://localhost:5000\n\n## 🎨 User Interface Features\n\n### Dark/Light Mode\n- **Toggle Switch**: Click the moon icon (🌙) in the header to switch between dark and light themes\n- **Automatic Adaptation**: All UI elements, text, and backgrounds smoothly transition between modes\n- **Persistent Setting**: Your theme preference is saved in browser storage\n\n### Image Content Controls\n- **Smart Blur**: Sensitive/questionable/explicit content automatically blurred for privacy\n- **Rating Priority**: Images sorted by General → Sensitive → Questionable → Explicit\n- **Blur Toggle**: Per-artist toggle to enable/disable content blurring\n- **Extreme Privacy**: 4.3 billion pixel blur effect (2^32) for maximum content protection\n\n## 📊 How to Use\n\n### 1. Data Collection\n- Navigate to the \"Data Collection\" section\n- Set your page range (start with 1-10 for testing)\n- **Choose post count mode:**\n  - ✅ **With post counts**: Accurate data but much slower (1-2 API calls per artist)\n  - ⚡ **Fast mode**: Basic artist info only, post counts will be 0\n- Click \"Start Scraping\" to begin collecting artist data\n- Monitor progress with the real-time progress bar\n- Watch the rate limiting health status for any issues\n- **Note**: Full scrape with post counts can take many hours; without post counts takes 2-4 hours\n\n### 2. Searching Artists\nUse the search interface to find artists by:\n\n**Example Searches:**\n- **Artists starting with \"A\"**: Set \"Name starts with\" to \"A\"\n- **Popular artists**: Set \"Minimum posts\" to \"300\"\n- **Specific artist**: Set \"Name contains\" to search for partial matches\n- **Range filtering**: Combine minimum and maximum post counts\n\n### 3. 
Results Management\n- View results in an organized grid layout\n- See artist names, post counts (if fetched), and alternative names\n- **Image Previews**: Click \"📸 Show Preview\" to see sample artwork from each artist\n  - Safe/Questionable/Explicit content ratings are color-coded\n  - Click any preview image to view it full-size\n  - Direct links to artist's Danbooru page\n- Export results as JSON or CSV for external use\n- Clear results and start new searches\n\n## 🔧 Technical Details\n\n### Architecture\n- **Backend**: Flask web framework\n- **Frontend**: Modern HTML/CSS/JavaScript\n- **Database**: SQLite for local data storage\n- **Scraping**: BeautifulSoup for HTML parsing\n- **HTTP Requests**: Requests library with session management\n\n### Database Schema\n```sql\nCREATE TABLE artists (\n    id INTEGER PRIMARY KEY,\n    name TEXT UNIQUE,\n    post_count INTEGER,\n    other_names TEXT,\n    group_name TEXT,\n    url_string TEXT,\n    is_active BOOLEAN,\n    created_at TEXT,\n    updated_at TEXT,\n    is_banned BOOLEAN\n)\n```\n\n### API Endpoints\n- `GET /`: Main interface\n- `POST /search`: Search artists with criteria\n- `POST /scrape`: Start scraping process\n- `GET /scrape/status`: Get scraping progress\n- `POST /scrape/stop`: Stop scraping\n- `GET /stats`: Get database statistics\n- `GET /export`: Export all data as JSON\n- `GET /export/csv`: Export all data as CSV\n\n## 🛠️ Advanced Usage\n\n### Command Line Scraping\nYou can also use the scraper directly from Python:\n\n```python\nfrom scraper import DanbooruArtistScraper\n\n# Initialize scraper\nscraper = DanbooruArtistScraper()\n\n# Scrape specific pages\nscraper.scrape_all_pages(start_page=1, end_page=50)\n\n# Search for artists\nartists = scraper.get_artists_by_criteria(\n    name_starts_with=\"A\",\n    min_post_count=100,\n    limit=50\n)\n\n# Get statistics\nstats = scraper.get_database_stats()\nprint(f\"Total artists: {stats['total_artists']}\")\n```\n\n### Rate Limiting\nThe scraper includes 
advanced rate limiting with 429 detection:\n- **Base Rate**: 6.7 requests per second (conservative)\n- **429 Detection**: Automatic detection and handling of rate limit responses\n- **Exponential Backoff**: Smart retry logic with increasing delays\n- **Retry-After Support**: Respects server-provided timing guidance\n- **Dynamic Adjustment**: Automatically reduces rate when needed\n- **Gradual Recovery**: Returns to optimal rate after successful operations\n\nSee `RATE_LIMIT_GUIDE.md` for detailed technical documentation.\n\n### Database Management\nThe SQLite database (`artists.db`) stores all scraped data:\n- **Location**: Same directory as the application\n- **Backup**: Copy the `.db` file to back up your data\n- **Reset**: Delete the `.db` file to start fresh\n\n## 🚨 Important Notes\n\n### Danbooru API Limits\n- **Free tier**: Limited to pages 1-1000\n- **Rate limiting**: Built-in delays to respect server resources\n- **Content**: Public artist information only\n\n### Performance Considerations\n- **Initial scrape**: Can take 2-4 hours for all 1000 pages\n- **Memory usage**: Efficient SQLite storage\n- **Network**: Requires a stable internet connection\n- **Updates**: Re-run the scraper to get the latest data\n\n### Legal and Ethical Use\n- Respect Danbooru's terms of service\n- Don't overload their servers\n- Use data responsibly\n- Consider supporting Danbooru if you find the service valuable\n\n## 🧪 Rate Limiting Testing\n\n### Test Tools\n\n**Rate Limit Monitor**: Comprehensive testing of 429 detection\n```bash\npython rate_limit_monitor.py\n```\nOptions:\n- **Quick Test**: Normal rate limiting behavior\n- **Aggressive Test**: Trigger 429s to test adaptive behavior\n- **Full Test Suite**: Comprehensive testing with recovery monitoring\n- **Status Monitor**: Check current rate limiting status\n\n**Enhanced 429 Test**: Simple demonstration script\n```bash\npython test_enhanced_429.py\n```\n\n### Test Reports\n\nThe rate limit monitor generates detailed reports:\n- 
`rate_limit_test_report.json`: Comprehensive test results\n- `rate_limit_monitor.log`: Detailed logging of test execution\n\n### API Endpoint\n\nCheck rate limiting status programmatically:\n```bash\ncurl http://localhost:5000/rate-limit-status\n```\n\nReturns comprehensive status including:\n- Current rate limit and health status\n- 429 error counts and patterns\n- Adaptive rate limiting status\n- Cooldown information\n\n## 🔧 Troubleshooting\n\n### Common Issues\n\n**\"Import could not be resolved\" errors:**\n- Make sure you've activated the virtual environment: `source venv/bin/activate`\n- Install dependencies: `pip install -r requirements.txt`\n\n**Scraping fails or stops:**\n- Check your internet connection\n- Monitor rate limiting health status (🟢🟡🔴)\n- If seeing 429s, wait for adaptive recovery\n- Restart the scraping from the last successful page\n- Some pages might be temporarily unavailable\n\n**Rate limiting issues:**\n- Check the rate limiting health status in web interface\n- 🔴 Critical status indicates server overload - wait before retrying\n- Use the rate limit monitor to diagnose issues: `python rate_limit_monitor.py`\n- Review logs for 429 error patterns\n\n**Search returns no results:**\n- Ensure you've scraped some data first\n- Check your search criteria aren't too restrictive\n- Verify the database contains data: check the statistics section\n\n**Web interface not loading:**\n- Ensure Flask is running: `python app.py`\n- Check the correct URL: http://localhost:5000\n- Look for error messages in the terminal\n\n### Performance Optimization\n\n**For faster searching:**\n- Use specific criteria to limit result sets\n- Consider the result limit setting\n\n**For efficient scraping:**\n- Monitor rate limiting health status\n- Start with smaller page ranges for testing\n- Let adaptive rate limiting handle server load automatically\n- Monitor system resources during large scrapes\n- Consider running overnight for full database collection\n\n### Rate 
Limiting Best Practices\n\n**Normal Operation:**\n- Monitor the web interface health indicators\n- 🟢 Healthy: Normal operation\n- 🟡 Warning: Some 429s detected, system adapting\n- 🔴 Critical: Multiple 429s, heavily rate limited\n\n**When Issues Occur:**\n- Don't restart immediately after 429 errors\n- Let the adaptive system recover naturally\n- Check server status if persistent issues occur\n- Use test tools to diagnose rate limiting problems\n\n## 📁 File Structure\n\n```\ntag scraper/\n├── app.py                      # Flask web application\n├── scraper.py                  # Core scraping functionality with enhanced 429 detection\n├── rate_limit_monitor.py       # Rate limiting test and monitoring tool\n├── test_enhanced_429.py        # Simple 429 detection test\n├── requirements.txt            # Core Python dependencies\n├── requirements-dev.txt        # Development dependencies (optional)\n├── setup.sh                   # Installation script\n├── .gitignore                 # Git ignore file\n├── README.md                  # This documentation\n├── RATE_LIMIT_GUIDE.md        # Detailed 429 detection documentation\n├── IMPLEMENTATION_SUMMARY.md  # Technical implementation details\n├── templates/\n│   └── index.html             # Web interface template with rate limiting status\n├── artists.db                 # SQLite database (created after first run, gitignored)\n├── rate_limit_test_report.json # Test reports (created by monitor, gitignored)\n└── venv/                      # Virtual environment (gitignored)\n```\n\n### Files Excluded from Git\n\nThe `.gitignore` file excludes the following:\n- **Virtual environments**: `venv/`, `env/`, `.venv`\n- **Database files**: `*.db`, `artists.db`\n- **Log files**: `*.log`, `rate_limit_monitor.log`\n- **Test reports**: `rate_limit_test_report.json`\n- **Environment files**: `.env`\n- **Python cache**: `__pycache__/`, `*.pyc`\n- **IDE files**: `.vscode/`, `.idea/`\n- **OS files**: `.DS_Store`, `Thumbs.db`\n\n## 🔧 
Development Setup\n\n### For Contributors\n\n1. **Clone the repository:**\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd \"tag scraper\"\n   ```\n\n2. **Install dependencies:**\n   ```bash\n   # Production dependencies\n   pip install -r requirements.txt\n   \n   # Or for development (includes testing tools)\n   pip install -r requirements-dev.txt\n   ```\n\n3. **Set up environment (optional):**\n   ```bash\n   # Create .env file for API credentials\n   echo \"DANBOORU_API_KEY=your_api_key_here\" \u003e .env\n   echo \"DANBOORU_USERNAME=your_username_here\" \u003e\u003e .env\n   ```\n\n### Git Best Practices\n\n- The database file (`artists.db`) is gitignored - each user maintains their own local data\n- Log files and test reports are excluded from version control\n- Virtual environments are not tracked\n- Environment files with API keys are excluded for security\n\n### Quick Git Setup\n\n```bash\n# Initialize repository\ngit init\n\n# Add all files (respects .gitignore)\ngit add .\n\n# Initial commit\ngit commit -m \"Initial commit: Danbooru Artist Scraper\"\n\n# Add remote and push\ngit remote add origin \u003cyour-repository-url\u003e\ngit push -u origin main\n```\n\nCheck what files will be tracked:\n```bash\n./git_setup_summary.sh\n```\n\n## 🤝 Contributing\n\nFeel free to improve this project by:\n- Adding new search criteria\n- Improving the UI/UX\n- Optimizing scraping performance\n- Adding data export formats\n- Enhancing error handling\n\n## 📄 License\n\nThis project is for educational and personal use. Please respect Danbooru's terms of service and API guidelines.\n\n---\n\n**Happy scraping! 🎨✨**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmooshieblob1%2Ftag-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmooshieblob1%2Ftag-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmooshieblob1%2Ftag-scraper/lists"}