https://github.com/trafexofive/simplecrawler-mk4

Production-ready microservices web crawling platform with FastAPI, PostgreSQL, Redis, and Docker. Transform documentation sites into LLM-friendly structured data.

# SimpleCrawler MK4 🚀

A production-ready microservices web crawling platform built with FastAPI, PostgreSQL, Redis, and Docker. Transform any documentation site into structured, LLM-friendly data.

## 🏗️ Architecture

**Enterprise Microservices Platform:**
- 🚀 **FastAPI API Service** - REST API with Pydantic validation
- ⚙️ **Background Workers** - Scalable async job processing
- 🗄️ **PostgreSQL Database** - Job persistence with ACID transactions
- 🚀 **Redis Queue** - Job queue and caching layer
- 🌐 **Nginx Proxy** - Load balancing with rate limiting
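
The hand-off between the API, the queue, and the workers can be sketched in a single process. This is only an illustration under stated assumptions: `asyncio.Queue` stands in for the Redis queue, a plain dict stands in for the PostgreSQL job table, and the names `submit_job`, `worker`, `run_demo`, and the status strings are hypothetical rather than taken from the project's code.

```python
import asyncio
import uuid

# Stand-in for the PostgreSQL job table: job_id -> job record (assumption).
JOBS: dict[str, dict] = {}


async def submit_job(queue: asyncio.Queue, start_url: str) -> str:
    """Play the API's role: record the job, then enqueue its id."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"start_url": start_url, "status": "queued"}
    await queue.put(job_id)
    return job_id


async def worker(queue: asyncio.Queue) -> None:
    """Play a background worker's role: pull job ids and process them."""
    while True:
        job_id = await queue.get()
        try:
            JOBS[job_id]["status"] = "running"
            await asyncio.sleep(0)  # placeholder for the actual crawl
            JOBS[job_id]["status"] = "completed"
        finally:
            queue.task_done()


async def run_demo(urls: list[str], n_workers: int = 2) -> dict[str, dict]:
    """Start workers, submit one job per URL, and wait for the queue to drain."""
    queue: asyncio.Queue = asyncio.Queue()  # stand-in for the Redis queue
    workers = [asyncio.create_task(worker(queue)) for _ in range(n_workers)]
    for url in urls:
        await submit_job(queue, url)
    await queue.join()  # returns once every queued job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return JOBS
```

Raising `n_workers` in this sketch loosely mirrors scaling the worker containers: the queue decouples how fast jobs arrive from how many workers consume them.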

## 🌟 Features

- **Production Microservices**: Docker containers, health checks, scaling
- **Type-Safe APIs**: FastAPI + Pydantic validation
- **Async Processing**: Background job queue with Redis
- **Smart Content Extraction**: Code blocks, markdown, readable formats
- **Multiple Export Formats**: JSON, Markdown, **Human-Readable**, **Executive Summary**
- **Enterprise Database**: PostgreSQL with connection pooling
- **Zero Host Pollution**: Everything runs in containers

## 🔥 Validated Against Real Sites

Successfully tested against:

### Python Ecosystem
- ✅ **Python Official Docs** (docs.python.org)
- ✅ **FastAPI** (fastapi.tiangolo.com)
- ✅ **Django** (docs.djangoproject.com)
- ✅ **Flask** (flask.palletsprojects.com)
- ✅ **Requests** (requests.readthedocs.io)
- ✅ **NumPy** (numpy.org/doc)

### JavaScript Frameworks
- ✅ **React** (react.dev)
- ✅ **Vue.js** (vuejs.org)
- ✅ **Svelte** (svelte.dev)

### Documentation & Static Sites
- ✅ **Bootstrap** (getbootstrap.com)
- ✅ **Tailwind CSS** (tailwindcss.com)
- ✅ **GitHub Docs** (docs.github.com)
- ✅ **Rich** (rich.readthedocs.io)
- ✅ **Pytest** (docs.pytest.org)

## 🚀 Quick Start

```bash
# Start the entire platform (builds everything)
make quick-start

# Check services health
make health

# Monitor all services
make logs

# Scale background workers
make scale-workers WORKERS=5

# Access API documentation (`open` is macOS; use `xdg-open` on Linux)
open http://localhost:8000/docs
```

### Using the API

```bash
# Start a crawl job
curl -X POST "http://localhost:8000/crawl" \
-H "Content-Type: application/json" \
-d '{"start_url": "https://fastapi.tiangolo.com", "max_pages": 10, "export_format": "readable"}'

# List jobs
curl "http://localhost:8000/jobs"

# Download results
curl "http://localhost:8000/download/{job_id}/{filename}"
```
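
The same crawl request can be issued from Python. The sketch below uses only the standard library and takes the endpoint and payload fields from the `curl` example above; the helper names `build_crawl_request` and `start_crawl` are hypothetical, and the assumption that the service answers with a JSON body is not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_crawl_request(start_url: str, max_pages: int = 10,
                        export_format: str = "readable") -> urllib.request.Request:
    """Build the POST /crawl request without sending it."""
    payload = {
        "start_url": start_url,
        "max_pages": max_pages,
        "export_format": export_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_crawl(start_url: str, **options) -> dict:
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_crawl_request(start_url, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `start_crawl("https://fastapi.tiangolo.com", max_pages=10)` requires the platform to be running locally; the interactive docs at `/docs` are the place to check the actual response fields.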

## 📁 Project Structure

```
SimpleCrawler-MK4/
├── services/                 # Microservices
│   ├── api/                  # FastAPI REST API
│   │   ├── main.py           # API service
│   │   ├── Dockerfile        # API container
│   │   └── requirements.txt  # API dependencies
│   └── worker/               # Background workers
│       ├── worker.py         # Worker service
│       ├── Dockerfile        # Worker container
│       └── requirements.txt  # Worker dependencies
├── app/                      # Core crawler engine
│   ├── main.py               # Crawler implementation
│   └── tests/                # Unit tests
├── docs/                     # Documentation
├── examples/                 # Usage examples
├── docker-compose.yml        # Service orchestration
├── nginx.conf                # Reverse proxy config
└── Makefile                  # Operations toolkit
```

## 🛠 Installation

**Prerequisites:** Docker and Docker Compose

```bash
# Clone repository
git clone https://github.com/trafexofive/simplecrawler-mk4.git
cd simplecrawler-mk4

# Start all services (builds containers automatically)
make quick-start

# Verify installation
make health
make status
```

**No Python setup required** - everything runs in containers!

## 💡 Usage

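### Basic Crawling

The core crawler in `app/main.py` can also be run directly from the command line, outside the containers. Unlike the containerized platform, this path needs a local Python virtual environment.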

```bash
# Activate virtual environment
source venv/bin/activate

# Basic crawl
python app/main.py https://example.com

# Advanced options
python app/main.py https://fastapi.tiangolo.com/ \
--max-pages 20 \
--max-depth 3 \
--single-domain \
--format json \
--verbose
```

### Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| `--max-pages` | Maximum pages to crawl | 100 |
| `--max-depth` | Maximum crawl depth | 3 |
| `--single-domain` | Restrict to same domain | False |
| `--delay` | Delay between requests (seconds) | 1.0 |
| `--concurrency` | Max concurrent requests | 10 |
| `--format` | Export format (markdown/json/csv) | markdown |
| `--output-dir` | Output directory | crawled_pages |
| `--verbose` | Verbose logging | False |
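
The options table maps directly onto an argument parser. As a rough sketch (the project's actual parser in `app/main.py` may be organised differently, and `build_parser` is a hypothetical name), the flags and defaults listed above could be declared with `argparse` like this:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI flags and defaults from the options table."""
    parser = argparse.ArgumentParser(description="Crawl a site (illustrative sketch)")
    parser.add_argument("url", help="Start URL")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum pages to crawl")
    parser.add_argument("--max-depth", type=int, default=3,
                        help="Maximum crawl depth")
    parser.add_argument("--single-domain", action="store_true",
                        help="Restrict to same domain")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Delay between requests (seconds)")
    parser.add_argument("--concurrency", type=int, default=10,
                        help="Max concurrent requests")
    parser.add_argument("--format", choices=["markdown", "json", "csv"],
                        default="markdown", help="Export format")
    parser.add_argument("--output-dir", default="crawled_pages",
                        help="Output directory")
    parser.add_argument("--verbose", action="store_true",
                        help="Verbose logging")
    return parser
```

For example, `build_parser().parse_args(["https://example.com", "--max-pages", "20"])` yields a namespace where `max_pages` is 20 and every other option keeps its table default.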

## 🧪 Testing

```bash
# Run unit tests
make test-unit

# Test with real documentation sites
make test-real-sites

# Run specific tests
python -m pytest app/tests/test_crawler.py -v
```

## 📊 Performance Results

Recent test results from real documentation sites:

```
🏆 OVERALL RESULTS:
Tests: 14/14 successful (100.0%)
Pages crawled: 31
Total time: 28.36s
Average speed: 1.09 pages/second
```

### Detailed Results by Category

| Category | Sites Tested | Success Rate | Avg Speed |
|----------|-------------|-------------|-----------|
| Python Ecosystem | 6 sites | 100% | 1.1 pages/s |
| JS Frameworks | 3 sites | 100% | 1.2 pages/s |
| Documentation | 5 sites | 100% | 1.0 pages/s |

## 🔧 Development

### Available Make Commands

```bash
make help # Show all available commands
make setup # Setup development environment
make test # Run all tests
make test-real-sites # Test against real documentation sites
make lint # Run code linting
make format # Format code with black
make clean # Clean up generated files
make crawl-example # Demo crawl of example.com
```

### Project Components

- **WebCrawler**: Main crawler class with async worker pool
- **RateLimiter**: Smart rate limiting with domain-specific delays
- **ContentExtractor**: Advanced content extraction using multiple strategies
- **RobotsCache**: Robots.txt compliance and caching
- **PageData**: Structured data model for crawled pages

## 📈 Features in Detail

### Smart Rate Limiting
- Domain-specific delays
- Exponential backoff on errors
- Automatic recovery after successful requests
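
The three behaviours above can be sketched as a small state machine that tracks one delay per domain. This is an assumption-laden illustration, not the project's `RateLimiter`: the method names, the backoff factor of 2, the delay cap, and the choice to halve the delay after each success are all hypothetical.

```python
from collections import defaultdict


class RateLimiter:
    """Per-domain delay with exponential backoff (illustrative sketch)."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0,
                 factor: float = 2.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.factor = factor
        # Every domain starts at the base delay and is tracked independently.
        self._delays: dict[str, float] = defaultdict(lambda: base_delay)

    def delay_for(self, domain: str) -> float:
        """Seconds to wait before the next request to this domain."""
        return self._delays[domain]

    def record_error(self, domain: str) -> None:
        """Back off: multiply the delay by the factor, up to the cap."""
        self._delays[domain] = min(self._delays[domain] * self.factor,
                                   self.max_delay)

    def record_success(self, domain: str) -> None:
        """Recover: step the delay back down toward the base delay."""
        self._delays[domain] = max(self._delays[domain] / self.factor,
                                   self.base_delay)
```

A crawler loop would sleep for `delay_for(domain)` before each request and call `record_error` or `record_success` depending on the outcome, so a struggling domain slows down without affecting the others.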

### Content Extraction
- Primary: Trafilatura for article extraction
- Fallback: BeautifulSoup with smart content detection
- Metadata extraction (title, description, keywords)
- Link and image URL extraction
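
The primary/fallback idea and the metadata extraction can be illustrated without third-party packages. In the sketch below the strategies are plain callables, so the Trafilatura and BeautifulSoup steps are represented only by their position in the list, and the standard library's `html.parser` stands in for the real HTML parsing. The names `extract_text` and `extract_metadata` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class _PageParser(HTMLParser):
    """Collect title, meta tags, links, and image URLs from one document."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: dict[str, str] = {}
        self.links: list[str] = []
        self.images: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attr.get("name") in ("description", "keywords"):
            self.meta[attr["name"]] = attr.get("content") or ""
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "img" and attr.get("src"):
            self.images.append(attr["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return title, description, keywords, links, and image URLs."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "links": parser.links,
        "images": parser.images,
    }


def extract_text(html: str,
                 strategies: list[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Try each strategy in order; return the first non-empty result."""
    for strategy in strategies:
        try:
            text = strategy(html)
        except Exception:
            text = None  # a failing strategy simply hands over to the next one
        if text and text.strip():
            return text
    return None
```

With the real libraries installed, the `strategies` list would presumably hold a wrapper around `trafilatura.extract` first and a BeautifulSoup-based function second, matching the order described above.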

### Export Formats
- **Markdown**: Clean, readable format with metadata headers
- **JSON**: Structured data for programmatic use
- **CSV**: Spreadsheet-compatible format
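
As a rough illustration of how one page record could be rendered into each of these formats, the sketch below uses a simplified `PageData` with three fields. The field names, the metadata header layout, and the function names are assumptions; the project's own data model likely carries more fields.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class PageData:
    """Simplified page record (hypothetical fields)."""
    url: str
    title: str
    content: str


def to_markdown(page: PageData) -> str:
    """Render one page as Markdown with a small metadata header."""
    header = f"---\nurl: {page.url}\ntitle: {page.title}\n---\n\n"
    return f"{header}# {page.title}\n\n{page.content}\n"


def to_json(pages: list[PageData]) -> str:
    """Render pages as a JSON array of objects."""
    return json.dumps([asdict(p) for p in pages], indent=2, ensure_ascii=False)


def to_csv(pages: list[PageData]) -> str:
    """Render pages as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "title", "content"])
    writer.writeheader()
    for page in pages:
        writer.writerow(asdict(page))
    return buffer.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand means commas, quotes, and newlines inside page content are escaped correctly.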

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `make test`
5. Format code: `make format`
6. Submit a pull request

## 📝 License

[MIT License](LICENSE)

## 🙏 Acknowledgments

Built with:
- [aiohttp](https://aiohttp.readthedocs.io/) - Async HTTP client/server
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - HTML parsing
- [Rich](https://rich.readthedocs.io/) - Terminal formatting
- [Trafilatura](https://trafilatura.readthedocs.io/) - Content extraction
- [pytest](https://docs.pytest.org/) - Testing framework

---

**SimpleCrawler MK4** - *Fast, reliable, respectful web crawling* 🕷️