https://github.com/trafexofive/simplecrawler-mk4
Production-ready microservices web crawling platform with FastAPI, PostgreSQL, Redis, and Docker. Transform documentation sites into LLM-friendly structured data.
async docker documentation fastapi llm microservices nginx postgresql pydantic python redis rest-api scraping web-crawler
- Host: GitHub
- URL: https://github.com/trafexofive/simplecrawler-mk4
- Owner: Trafexofive
- License: mit
- Created: 2025-10-05T01:05:19.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-10-05T01:43:33.000Z (9 months ago)
- Last Synced: 2025-10-05T03:48:06.756Z (9 months ago)
- Topics: async, docker, documentation, fastapi, llm, microservices, nginx, postgresql, pydantic, python, redis, rest-api, scraping, web-crawler
- Language: Python
- Size: 243 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Roadmap: ROADMAP.md
# SimpleCrawler MK4
A production-ready microservices web crawling platform built with FastAPI, PostgreSQL, Redis, and Docker. Transform any documentation site into structured, LLM-friendly data.
## Architecture
**Enterprise Microservices Platform:**
- **FastAPI API Service** - REST API with Pydantic validation
- **Background Workers** - Scalable async job processing
- **PostgreSQL Database** - Job persistence with ACID transactions
- **Redis Queue** - Job queue and caching layer
- **Nginx Proxy** - Load balancing with rate limiting
## Features
- **Production Microservices**: Docker containers, health checks, scaling
- **Type-Safe APIs**: FastAPI + Pydantic validation
- **Async Processing**: Background job queue with Redis
- **Smart Content Extraction**: Code blocks, markdown, readable formats
- **Multiple Export Formats**: JSON, Markdown, **Human-Readable**, **Executive Summary**
- **Enterprise Database**: PostgreSQL with connection pooling
- **Zero Host Pollution**: Everything runs in containers
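The async processing feature above can be pictured as a pool of workers draining a shared job queue. The following is a minimal sketch only: it uses an in-process `asyncio.Queue` as a stand-in for the Redis queue, and the names `worker` and `run_jobs` are hypothetical, not taken from the project.

```python
import asyncio


async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Pull jobs off the queue forever; cancelled once the queue is drained."""
    while True:
        job = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for the real crawl work
            results.append((name, job))
        finally:
            queue.task_done()  # lets queue.join() know this job is finished


async def run_jobs(jobs: list, worker_count: int = 3) -> list:
    """Process every job with a fixed-size pool of workers."""
    queue = asyncio.Queue()  # stand-in for the Redis queue (assumption)
    results = []
    for job in jobs:
        queue.put_nowait(job)
    workers = [
        asyncio.create_task(worker(f"worker-{i}", queue, results))
        for i in range(worker_count)
    ]
    await queue.join()  # wait until every job has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_jobs(["job-1", "job-2"], worker_count=2))` returns a list of `(worker_name, job)` pairs; raising `worker_count` is the in-process analogue of scaling the background workers.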
## Validated Against Real Sites
Successfully tested against:
### Python Ecosystem
- ✅ **Python Official Docs** (docs.python.org)
- ✅ **FastAPI** (fastapi.tiangolo.com)
- ✅ **Django** (docs.djangoproject.com)
- ✅ **Flask** (flask.palletsprojects.com)
- ✅ **Requests** (requests.readthedocs.io)
- ✅ **NumPy** (numpy.org/doc)
### JavaScript Frameworks
- ✅ **React** (react.dev)
- ✅ **Vue.js** (vuejs.org)
- ✅ **Svelte** (svelte.dev)
### Documentation & Static Sites
- ✅ **Bootstrap** (getbootstrap.com)
- ✅ **Tailwind CSS** (tailwindcss.com)
- ✅ **GitHub Docs** (docs.github.com)
- ✅ **Rich** (rich.readthedocs.io)
- ✅ **Pytest** (docs.pytest.org)
## Quick Start
```bash
# Start the entire platform (builds everything)
make quick-start
# Check services health
make health
# Monitor all services
make logs
# Scale background workers
make scale-workers WORKERS=5
# Access API documentation
open http://localhost:8000/docs
```
### Using the API
```bash
# Start a crawl job
curl -X POST "http://localhost:8000/crawl" \
-H "Content-Type: application/json" \
-d '{"start_url": "https://fastapi.tiangolo.com", "max_pages": 10, "export_format": "readable"}'
# List jobs
curl "http://localhost:8000/jobs"
# Download results
curl "http://localhost:8000/download/{job_id}/{filename}"
```
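The same crawl request can be issued from Python. This is a minimal sketch using only the standard library; the endpoint and the `start_url`, `max_pages`, and `export_format` fields come from the curl example above, while `build_crawl_request` and `API_BASE` are hypothetical names introduced here.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # base URL used in the curl examples


def build_crawl_request(
    start_url: str, max_pages: int = 10, export_format: str = "readable"
) -> urllib.request.Request:
    """Build (but do not send) a POST /crawl request with a JSON body."""
    payload = {
        "start_url": start_url,
        "max_pages": max_pages,
        "export_format": export_format,
    }
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the platform running, `urllib.request.urlopen(build_crawl_request("https://fastapi.tiangolo.com"))` sends the request. The response schema is not documented in this README, so inspect it at `/docs` before relying on specific fields.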
## Project Structure
```
SimpleCrawler-MK4/
├── services/                 # Microservices
│   ├── api/                  # FastAPI REST API
│   │   ├── main.py           # API service
│   │   ├── Dockerfile        # API container
│   │   └── requirements.txt  # API dependencies
│   └── worker/               # Background workers
│       ├── worker.py         # Worker service
│       ├── Dockerfile        # Worker container
│       └── requirements.txt  # Worker dependencies
├── app/                      # Core crawler engine
│   ├── main.py               # Crawler implementation
│   └── tests/                # Unit tests
├── docs/                     # Documentation
├── examples/                 # Usage examples
├── docker-compose.yml        # Service orchestration
├── nginx.conf                # Reverse proxy config
└── Makefile                  # Operations toolkit
```
## Installation
**Prerequisites:** Docker and Docker Compose
```bash
# Clone repository
git clone https://github.com/trafexofive/simplecrawler-mk4.git
cd simplecrawler-mk4
# Start all services (builds containers automatically)
make quick-start
# Verify installation
make health
make status
```
**No Python setup required** - everything runs in containers!
## Usage
### Basic Crawling
```bash
# Activate a virtual environment (for running the crawler CLI directly, outside the containers)
source venv/bin/activate
# Basic crawl
python app/main.py https://example.com
# Advanced options
python app/main.py https://fastapi.tiangolo.com/ \
--max-pages 20 \
--max-depth 3 \
--single-domain \
--format json \
--verbose
```
### Configuration Options
| Option | Description | Default |
|--------|-------------|---------|
| `--max-pages` | Maximum pages to crawl | 100 |
| `--max-depth` | Maximum crawl depth | 3 |
| `--single-domain` | Restrict to same domain | False |
| `--delay` | Delay between requests (seconds) | 1.0 |
| `--concurrency` | Max concurrent requests | 10 |
| `--format` | Export format (markdown/json/csv) | markdown |
| `--output-dir` | Output directory | crawled_pages |
| `--verbose` | Verbose logging | False |
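For reference, the options table maps naturally onto an `argparse` parser. The sketch below mirrors the flags and defaults listed above; it is an illustration, not the parser actually used in `app/main.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options and defaults (illustrative)."""
    parser = argparse.ArgumentParser(description="Crawl a site (sketch)")
    parser.add_argument("url", help="Start URL")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum pages to crawl")
    parser.add_argument("--max-depth", type=int, default=3,
                        help="Maximum crawl depth")
    parser.add_argument("--single-domain", action="store_true",
                        help="Restrict to same domain")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Delay between requests (seconds)")
    parser.add_argument("--concurrency", type=int, default=10,
                        help="Max concurrent requests")
    parser.add_argument("--format", choices=["markdown", "json", "csv"],
                        default="markdown", help="Export format")
    parser.add_argument("--output-dir", default="crawled_pages",
                        help="Output directory")
    parser.add_argument("--verbose", action="store_true",
                        help="Verbose logging")
    return parser
```

Parsing `["https://example.com"]` with this parser yields the defaults from the table; boolean flags such as `--single-domain` default to `False` and become `True` when passed.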
## Testing
```bash
# Run unit tests
make test-unit
# Test with real documentation sites
make test-real-sites
# Run specific tests
python -m pytest app/tests/test_crawler.py -v
```
## Performance Results
Recent test results from real documentation sites:
```
OVERALL RESULTS:
Tests: 14/14 successful (100.0%)
Pages crawled: 31
Total time: 28.36s
Average speed: 1.09 pages/second
```
### Detailed Results by Category
| Category | Sites Tested | Success Rate | Avg Speed |
|----------|-------------|-------------|-----------|
| Python Ecosystem | 6 sites | 100% | 1.1 pages/s |
| JS Frameworks | 3 sites | 100% | 1.2 pages/s |
| Documentation | 5 sites | 100% | 1.0 pages/s |
## Development
### Available Make Commands
```bash
make help # Show all available commands
make setup # Setup development environment
make test # Run all tests
make test-real-sites # Test against real documentation sites
make lint # Run code linting
make format # Format code with black
make clean # Clean up generated files
make crawl-example # Demo crawl of example.com
```
### Project Components
- **WebCrawler**: Main crawler class with async worker pool
- **RateLimiter**: Smart rate limiting with domain-specific delays
- **ContentExtractor**: Advanced content extraction using multiple strategies
- **RobotsCache**: Robots.txt compliance and caching
- **PageData**: Structured data model for crawled pages
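To illustrate what a component like RobotsCache does, here is a minimal sketch built on the standard library's `urllib.robotparser`. The class name `SimpleRobotsCache` and its methods are hypothetical; the project's own implementation may differ, in particular in how it fetches and expires `robots.txt` files.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class SimpleRobotsCache:
    """Caches one parsed robots.txt per domain (illustrative sketch)."""

    def __init__(self, user_agent: str = "*"):
        self.user_agent = user_agent
        self._parsers = {}  # domain -> RobotFileParser

    def add_rules(self, domain: str, robots_txt: str) -> None:
        """Parse robots.txt text that was already downloaded and cache it."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._parsers[domain] = parser

    def allowed(self, url: str) -> bool:
        """Return True if the cached rules permit fetching the URL."""
        parser = self._parsers.get(urlparse(url).netloc)
        if parser is None:
            # Assumption: with no cached rules the URL is treated as allowed;
            # a real crawler would fetch robots.txt before deciding.
            return True
        return parser.can_fetch(self.user_agent, url)
```

Keeping the parsed rules keyed by domain means each `robots.txt` is parsed once, and every later URL check for that domain is a cheap dictionary lookup.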
## Features in Detail
### Smart Rate Limiting
- Domain-specific delays
- Exponential backoff on errors
- Automatic recovery after successful requests
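The three behaviours above can be sketched as a small per-domain delay tracker. This is an assumption-laden illustration rather than the project's RateLimiter: the class name `BackoffDelays`, the doubling factor, the delay cap, and the choice to reset to the base delay on success are all hypothetical.

```python
from urllib.parse import urlparse


class BackoffDelays:
    """Tracks a per-domain delay with exponential backoff (sketch)."""

    def __init__(self, base_delay: float = 1.0, factor: float = 2.0,
                 max_delay: float = 60.0):
        self.base_delay = base_delay
        self.factor = factor
        self.max_delay = max_delay
        self._delays = {}  # domain -> current delay in seconds

    def delay_for(self, url: str) -> float:
        """Delay to wait before the next request to this URL's domain."""
        return self._delays.get(urlparse(url).netloc, self.base_delay)

    def record_error(self, url: str) -> None:
        """Multiply the domain's delay by the factor, up to max_delay."""
        domain = urlparse(url).netloc
        current = self._delays.get(domain, self.base_delay)
        self._delays[domain] = min(current * self.factor, self.max_delay)

    def record_success(self, url: str) -> None:
        """Recover: drop back to the base delay for this domain."""
        self._delays.pop(urlparse(url).netloc, None)
```

A crawler would call `delay_for(url)` before each request and sleep for that long, then report the outcome through `record_error` or `record_success`. Errors on one domain do not slow down requests to another.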
### Content Extraction
- Primary: Trafilatura for article extraction
- Fallback: BeautifulSoup with smart content detection
- Metadata extraction (title, description, keywords)
- Link and image URL extraction
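As an illustration of the metadata, link, and image extraction steps, the sketch below uses the standard library's `html.parser` in place of Trafilatura and BeautifulSoup, purely to keep the example dependency-free. The names `MetadataParser` and `extract_metadata` are hypothetical and the real ContentExtractor may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetadataParser(HTMLParser):
    """Collects title, description/keywords meta tags, links, and images."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") in ("description", "keywords"):
            self.meta[attributes["name"]] = attributes.get("content") or ""
        elif tag == "a" and attributes.get("href"):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, attributes["href"]))
        elif tag == "img" and attributes.get("src"):
            self.images.append(urljoin(self.base_url, attributes["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str, base_url: str) -> dict:
    """Parse an HTML string and return the collected fields as a dict."""
    parser = MetadataParser(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "links": parser.links,
        "images": parser.images,
    }
```

Relative `href` and `src` values are resolved against the page URL with `urljoin`, so the returned links can be queued for crawling directly.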
### Export Formats
- **Markdown**: Clean, readable format with metadata headers
- **JSON**: Structured data for programmatic use
- **CSV**: Spreadsheet-compatible format
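The three formats can be sketched with the standard library alone. The `Page` dataclass below is a hypothetical stand-in for PageData, and its field names, the metadata header layout, and the function names are assumptions made for illustration.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Page:
    """Hypothetical stand-in for PageData; the real fields may differ."""
    url: str
    title: str
    content: str


def to_markdown(page: Page) -> str:
    """Render one page as Markdown with a small metadata header."""
    header = f"---\nurl: {page.url}\ntitle: {page.title}\n---\n\n"
    return f"{header}# {page.title}\n\n{page.content}\n"


def to_json(pages: list) -> str:
    """Serialize pages as a JSON array of objects."""
    return json.dumps([asdict(page) for page in pages], indent=2)


def to_csv(pages: list) -> str:
    """Serialize pages as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "title", "content"])
    writer.writeheader()
    writer.writerows(asdict(page) for page in pages)
    return buffer.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand keeps content containing commas or quotes correctly escaped.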
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `make test`
5. Format code: `make format`
6. Submit a pull request
## License
[MIT License](LICENSE)
## Acknowledgments
Built with:
- [aiohttp](https://aiohttp.readthedocs.io/) - Async HTTP client/server
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - HTML parsing
- [Rich](https://rich.readthedocs.io/) - Terminal formatting
- [Trafilatura](https://trafilatura.readthedocs.io/) - Content extraction
- [pytest](https://docs.pytest.org/) - Testing framework
---
**SimpleCrawler MK4** - *Fast, reliable, respectful web crawling*