{"id":31585024,"url":"https://github.com/trafexofive/simplecrawler-mk4","last_synced_at":"2026-04-08T23:34:04.966Z","repository":{"id":318091969,"uuid":"1069961176","full_name":"Trafexofive/SimpleCrawler-MK4","owner":"Trafexofive","description":"Production-ready microservices web crawling platform with FastAPI, PostgreSQL, Redis, and Docker. Transform documentation sites into LLM-friendly structured data.","archived":false,"fork":false,"pushed_at":"2025-10-05T01:43:33.000Z","size":249,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-05T03:48:06.756Z","etag":null,"topics":["async","docker","documentation","fastapi","llm","microservices","nginx","postgresql","pydantic","python","redis","rest-api","scraping","web-crawler"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Trafexofive.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-05T01:05:19.000Z","updated_at":"2025-10-05T01:43:36.000Z","dependencies_parsed_at":"2025-10-05T03:48:13.415Z","dependency_job_id":null,"html_url":"https://github.com/Trafexofive/SimpleCrawler-MK4","commit_stats":null,"previous_names":["trafexofive/simplecrawler-mk4"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Trafexofive/SimpleCrawler-MK4","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/Trafexofive%2FSimpleCrawler-MK4","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Trafexofive%2FSimpleCrawler-MK4/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Trafexofive%2FSimpleCrawler-MK4/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Trafexofive%2FSimpleCrawler-MK4/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Trafexofive","download_url":"https://codeload.github.com/Trafexofive/SimpleCrawler-MK4/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Trafexofive%2FSimpleCrawler-MK4/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278547764,"owners_count":26004772,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["async","docker","documentation","fastapi","llm","microservices","nginx","postgresql","pydantic","python","redis","rest-api","scraping","web-crawler"],"created_at":"2025-10-06T01:26:18.710Z","updated_at":"2025-10-06T01:26:28.704Z","avatar_url":"https://github.com/Trafexofive.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SimpleCrawler MK4 🚀\n\nA production-ready microservices web crawling platform built with 
FastAPI, PostgreSQL, Redis, and Docker. Transform any documentation site into structured, LLM-friendly data.\n\n## 🏗️ Architecture\n\n**Enterprise Microservices Platform:**\n- 🚀 **FastAPI API Service** - REST API with Pydantic validation\n- ⚙️ **Background Workers** - Scalable async job processing\n- 🗄️ **PostgreSQL Database** - Job persistence with ACID transactions\n- 🚀 **Redis Queue** - Job queue and caching layer\n- 🌐 **Nginx Proxy** - Load balancing with rate limiting\n\n## 🌟 Features\n\n- **Production Microservices**: Docker containers, health checks, scaling\n- **Type-Safe APIs**: FastAPI + Pydantic validation\n- **Async Processing**: Background job queue with Redis\n- **Smart Content Extraction**: Code blocks, markdown, readable formats\n- **Multiple Export Formats**: JSON, Markdown, **Human-Readable**, **Executive Summary**\n- **Enterprise Database**: PostgreSQL with connection pooling\n- **Zero Host Pollution**: Everything runs in containers\n\n## 🔥 Validated Against Real Sites\n\nSuccessfully tested against:\n\n### Python Ecosystem\n- ✅ **Python Official Docs** (docs.python.org)\n- ✅ **FastAPI** (fastapi.tiangolo.com)\n- ✅ **Django** (docs.djangoproject.com)\n- ✅ **Flask** (flask.palletsprojects.com)\n- ✅ **Requests** (requests.readthedocs.io)\n- ✅ **NumPy** (numpy.org/doc)\n\n### JavaScript Frameworks\n- ✅ **React** (react.dev)\n- ✅ **Vue.js** (vuejs.org)\n- ✅ **Svelte** (svelte.dev)\n\n### Documentation \u0026 Static Sites\n- ✅ **Bootstrap** (getbootstrap.com)\n- ✅ **Tailwind CSS** (tailwindcss.com)\n- ✅ **GitHub Docs** (docs.github.com)\n- ✅ **Rich** (rich.readthedocs.io)\n- ✅ **Pytest** (docs.pytest.org)\n\n## 🚀 Quick Start\n\n```bash\n# Start the entire platform (builds everything)\nmake quick-start\n\n# Check services health\nmake health\n\n# Monitor all services\nmake logs\n\n# Scale background workers\nmake scale-workers WORKERS=5\n\n# Access API documentation\nopen http://localhost:8000/docs\n```\n\n### Using the API\n\n```bash\n# Start a crawl 
job\ncurl -X POST \"http://localhost:8000/crawl\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"start_url\": \"https://fastapi.tiangolo.com\", \"max_pages\": 10, \"export_format\": \"readable\"}'\n\n# List jobs\ncurl \"http://localhost:8000/jobs\"\n\n# Download results\ncurl \"http://localhost:8000/download/{job_id}/{filename}\"\n```\n\n## 📁 Project Structure\n\n```\nSimpleCrawler-MK4/\n├── services/                    # Microservices\n│   ├── api/                    # FastAPI REST API\n│   │   ├── main.py            # API service\n│   │   ├── Dockerfile         # API container\n│   │   └── requirements.txt   # API dependencies\n│   └── worker/                # Background workers\n│       ├── worker.py          # Worker service\n│       ├── Dockerfile         # Worker container\n│       └── requirements.txt   # Worker dependencies\n├── app/                       # Core crawler engine\n│   ├── main.py               # Crawler implementation\n│   └── tests/                # Unit tests\n├── docs/                     # Documentation\n├── examples/                 # Usage examples\n├── docker-compose.yml        # Service orchestration\n├── nginx.conf               # Reverse proxy config\n└── Makefile                 # Operations toolkit\n```\n\n## 🛠 Installation\n\n**Prerequisites:** Docker and Docker Compose\n\n```bash\n# Clone repository\ngit clone https://github.com/Trafexofive/SimpleCrawler-MK4.git\ncd SimpleCrawler-MK4\n\n# Start all services (builds containers automatically)\nmake quick-start\n\n# Verify installation\nmake health\nmake status\n```\n\n**No Python setup required** - everything runs in containers!\n\n## 💡 Usage\n\n### Basic Crawling\n\n```bash\n# Activate virtual environment\nsource venv/bin/activate\n\n# Basic crawl\npython app/main.py https://example.com\n\n# Advanced options\npython app/main.py https://fastapi.tiangolo.com/ \\\n    --max-pages 20 \\\n    --max-depth 3 \\\n    --single-domain \\\n    --format json \\\n    
--verbose\n```\n\n### Configuration Options\n\n| Option | Description | Default |\n|--------|-------------|---------|\n| `--max-pages` | Maximum pages to crawl | 100 |\n| `--max-depth` | Maximum crawl depth | 3 |\n| `--single-domain` | Restrict to same domain | False |\n| `--delay` | Delay between requests (seconds) | 1.0 |\n| `--concurrency` | Max concurrent requests | 10 |\n| `--format` | Export format (markdown/json/csv) | markdown |\n| `--output-dir` | Output directory | crawled_pages |\n| `--verbose` | Verbose logging | False |\n\n## 🧪 Testing\n\n```bash\n# Run unit tests\nmake test-unit\n\n# Test with real documentation sites\nmake test-real-sites\n\n# Run specific tests\npython -m pytest app/tests/test_crawler.py -v\n```\n\n## 📊 Performance Results\n\nRecent test results from real documentation sites:\n\n```\n🏆 OVERALL RESULTS:\n   Tests: 14/14 successful (100.0%)\n   Pages crawled: 31\n   Total time: 28.36s\n   Average speed: 1.09 pages/second\n```\n\n### Detailed Results by Category\n\n| Category | Sites Tested | Success Rate | Avg Speed |\n|----------|-------------|-------------|-----------|\n| Python Ecosystem | 6 sites | 100% | 1.1 pages/s |\n| JS Frameworks | 3 sites | 100% | 1.2 pages/s |\n| Documentation | 5 sites | 100% | 1.0 pages/s |\n\n## 🔧 Development\n\n### Available Make Commands\n\n```bash\nmake help                 # Show all available commands\nmake setup               # Setup development environment\nmake test                # Run all tests\nmake test-real-sites     # Test against real documentation sites\nmake lint                # Run code linting\nmake format              # Format code with black\nmake clean               # Clean up generated files\nmake crawl-example       # Demo crawl of example.com\n```\n\n### Project Components\n\n- **WebCrawler**: Main crawler class with async worker pool\n- **RateLimiter**: Smart rate limiting with domain-specific delays\n- **ContentExtractor**: Advanced content extraction using multiple 
strategies\n- **RobotsCache**: Robots.txt compliance and caching\n- **PageData**: Structured data model for crawled pages\n\n## 📈 Features in Detail\n\n### Smart Rate Limiting\n- Domain-specific delays\n- Exponential backoff on errors\n- Automatic recovery after successful requests\n\n### Content Extraction\n- Primary: Trafilatura for article extraction\n- Fallback: BeautifulSoup with smart content detection\n- Metadata extraction (title, description, keywords)\n- Link and image URL extraction\n\n### Export Formats\n- **Markdown**: Clean, readable format with metadata headers\n- **JSON**: Structured data for programmatic use\n- **CSV**: Spreadsheet-compatible format\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Run tests: `make test`\n5. Format code: `make format`\n6. Submit a pull request\n\n## 📝 License\n\n[MIT License](LICENSE)\n\n## 🙏 Acknowledgments\n\nBuilt with:\n- [aiohttp](https://aiohttp.readthedocs.io/) - Async HTTP client/server\n- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - HTML parsing\n- [Rich](https://rich.readthedocs.io/) - Terminal formatting\n- [Trafilatura](https://trafilatura.readthedocs.io/) - Content extraction\n- [pytest](https://docs.pytest.org/) - Testing framework\n\n---\n\n**SimpleCrawler MK4** - *Fast, reliable, respectful web crawling* 🕷️","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrafexofive%2Fsimplecrawler-mk4","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrafexofive%2Fsimplecrawler-mk4","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrafexofive%2Fsimplecrawler-mk4/lists"}