{"id":30522980,"url":"https://github.com/dylan-gluck/freecrawl-mcp","last_synced_at":"2026-05-07T10:33:59.896Z","repository":{"id":310746932,"uuid":"1041097588","full_name":"dylan-gluck/freecrawl-mcp","owner":"dylan-gluck","description":"A production-ready mcp server for web scraping and document processing. Drop-in replacement for FireCrawl.","archived":false,"fork":false,"pushed_at":"2025-08-20T03:34:45.000Z","size":180,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-25T23:22:41.728Z","etag":null,"topics":["claude-code","firecrawl","freecrawl","mcp","mcp-scraper","mcp-server","scrape","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dylan-gluck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-20T01:28:36.000Z","updated_at":"2026-02-25T06:15:59.000Z","dependencies_parsed_at":"2025-08-20T02:18:09.390Z","dependency_job_id":"e31bf686-4801-44b1-9449-6907a084043d","html_url":"https://github.com/dylan-gluck/freecrawl-mcp","commit_stats":null,"previous_names":["dylan-gluck/freecrawl-mcp"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/dylan-gluck/freecrawl-mcp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-gluck%2Ffreecrawl-mcp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-gluck%2Ffreecrawl-mcp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dy
lan-gluck%2Ffreecrawl-mcp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-gluck%2Ffreecrawl-mcp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dylan-gluck","download_url":"https://codeload.github.com/dylan-gluck/freecrawl-mcp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dylan-gluck%2Ffreecrawl-mcp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32733643,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-07T02:14:30.463Z","status":"ssl_error","status_checked_at":"2026-05-07T02:14:29.405Z","response_time":62,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["claude-code","firecrawl","freecrawl","mcp","mcp-scraper","mcp-server","scrape","scraper"],"created_at":"2025-08-26T19:44:02.342Z","updated_at":"2026-05-07T10:33:59.890Z","avatar_url":"https://github.com/dylan-gluck.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FreeCrawl MCP Server\n\nA production-ready Model Context Protocol (MCP) server for web scraping and document processing, designed as a self-hosted replacement for Firecrawl.\n\n## 🚀 Features\n\n- **JavaScript-enabled web scraping** with Playwright and anti-detection measures\n- **Document processing** with 
fallback support for various formats\n- **Concurrent batch processing** with configurable limits\n- **Intelligent caching** with SQLite backend\n- **Rate limiting** per domain\n- **Comprehensive error handling** with retry logic\n- **Easy installation** via `uvx` or local development setup\n- **Health monitoring** and metrics collection\n\n## MCP Config (using `uvx`)\n\n```json\n{\n  \"mcpServers\": {\n    \"freecrawl\": {\n      \"command\": \"uvx\",\n      \"args\": [\"freecrawl-mcp\"]\n    }\n  }\n}\n```\n\n## 📦 Installation \u0026 Usage\n\n### Quick Start with uvx (Recommended)\n\nThe easiest way to use FreeCrawl is with `uvx`, which automatically manages dependencies:\n\n```bash\n# Install browsers on first run\nuvx freecrawl-mcp --install-browsers\n\n# Test functionality\nuvx freecrawl-mcp --test\n```\n\n### Local Development Setup\n\nFor local development or customization:\n\n1. **Clone from GitHub:**\n   ```bash\n   git clone https://github.com/dylan-gluck/freecrawl-mcp.git\n   cd freecrawl-mcp\n   ```\n\n2. **Set up environment:**\n   ```bash\n   # Sync dependencies\n   uv sync\n\n   # Install browser dependencies\n   uv run freecrawl-mcp --install-browsers\n\n   # Run tests\n   uv run freecrawl-mcp --test\n   ```\n\n3. 
**Run the server:**\n   ```bash\n   uv run freecrawl-mcp\n   ```\n\n## 🛠 Configuration\n\nConfigure FreeCrawl using environment variables:\n\n### Basic Configuration\n```bash\n# Transport (stdio for MCP, http for REST API)\nexport FREECRAWL_TRANSPORT=stdio\n\n# Browser pool settings\nexport FREECRAWL_MAX_BROWSERS=3\nexport FREECRAWL_HEADLESS=true\n\n# Concurrency limits\nexport FREECRAWL_MAX_CONCURRENT=10\nexport FREECRAWL_MAX_PER_DOMAIN=3\n\n# Cache settings\nexport FREECRAWL_CACHE=true\nexport FREECRAWL_CACHE_DIR=/tmp/freecrawl_cache\nexport FREECRAWL_CACHE_TTL=3600\nexport FREECRAWL_CACHE_SIZE=536870912  # 512MB\n\n# Rate limiting\nexport FREECRAWL_RATE_LIMIT=60  # requests per minute\n\n# Logging\nexport FREECRAWL_LOG_LEVEL=INFO\n```\n\n### Security Settings\n```bash\n# API authentication (optional)\nexport FREECRAWL_REQUIRE_API_KEY=false\nexport FREECRAWL_API_KEYS=key1,key2,key3\n\n# Domain blocking\nexport FREECRAWL_BLOCKED_DOMAINS=localhost,127.0.0.1\n\n# Anti-detection\nexport FREECRAWL_ANTI_DETECT=true\nexport FREECRAWL_ROTATE_UA=true\n```\n\n## 🔧 MCP Tools\n\nFreeCrawl provides the following MCP tools:\n\n### `freecrawl_scrape`\nScrape content from a single URL with advanced options.\n\n**Parameters:**\n- `url` (string): URL to scrape\n- `formats` (array): Output formats - `[\"markdown\", \"html\", \"text\", \"screenshot\", \"structured\"]`\n- `javascript` (boolean): Enable JavaScript execution (default: true)\n- `wait_for` (string, optional): CSS selector or time (ms) to wait\n- `anti_bot` (boolean): Enable anti-detection measures (default: true)\n- `headers` (object, optional): Custom HTTP headers\n- `cookies` (object, optional): Custom cookies\n- `cache` (boolean): Use cached results if available (default: true)\n- `timeout` (number): Total timeout in milliseconds (default: 30000)\n\n**Example:**\n```json\n{\n  \"name\": \"freecrawl_scrape\",\n  \"arguments\": {\n    \"url\": \"https://example.com\",\n    \"formats\": [\"markdown\", \"screenshot\"],\n  
  \"javascript\": true,\n    \"wait_for\": \"2000\"\n  }\n}\n```\n\n### `freecrawl_batch_scrape`\nScrape multiple URLs concurrently.\n\n**Parameters:**\n- `urls` (array): List of URLs to scrape (max 100)\n- `concurrency` (number): Maximum concurrent requests (default: 5)\n- `formats` (array): Output formats (default: `[\"markdown\"]`)\n- `common_options` (object, optional): Options applied to all URLs\n- `continue_on_error` (boolean): Continue if individual URLs fail (default: true)\n\n**Example:**\n```json\n{\n  \"name\": \"freecrawl_batch_scrape\",\n  \"arguments\": {\n    \"urls\": [\n      \"https://example.com/page1\",\n      \"https://example.com/page2\"\n    ],\n    \"concurrency\": 3,\n    \"formats\": [\"markdown\", \"text\"]\n  }\n}\n```\n\n### `freecrawl_extract`\nExtract structured data using schema-driven approach.\n\n**Parameters:**\n- `url` (string): URL to extract data from\n- `schema` (object): JSON Schema or Pydantic model definition\n- `prompt` (string, optional): Custom extraction instructions\n- `validation` (boolean): Validate against schema (default: true)\n- `multiple` (boolean): Extract multiple matching items (default: false)\n\n**Example:**\n```json\n{\n  \"name\": \"freecrawl_extract\",\n  \"arguments\": {\n    \"url\": \"https://example.com/product\",\n    \"schema\": {\n      \"type\": \"object\",\n      \"properties\": {\n        \"title\": {\"type\": \"string\"},\n        \"price\": {\"type\": \"number\"}\n      }\n    }\n  }\n}\n```\n\n### `freecrawl_process_document`\nProcess documents (PDF, DOCX, etc.) 
with OCR support.\n\n**Parameters:**\n- `file_path` (string, optional): Path to document file\n- `url` (string, optional): URL to download document from\n- `strategy` (string): Processing strategy - `\"fast\"`, `\"hi_res\"`, `\"ocr_only\"` (default: \"hi_res\")\n- `formats` (array): Output formats - `[\"markdown\", \"structured\", \"text\"]`\n- `languages` (array, optional): OCR languages (e.g., `[\"eng\", \"fra\"]`)\n- `extract_images` (boolean): Extract embedded images (default: false)\n- `extract_tables` (boolean): Extract and structure tables (default: true)\n\n**Example:**\n```json\n{\n  \"name\": \"freecrawl_process_document\",\n  \"arguments\": {\n    \"url\": \"https://example.com/document.pdf\",\n    \"strategy\": \"hi_res\",\n    \"formats\": [\"markdown\", \"structured\"]\n  }\n}\n```\n\n### `freecrawl_health_check`\nGet server health status and metrics.\n\n**Example:**\n```json\n{\n  \"name\": \"freecrawl_health_check\",\n  \"arguments\": {}\n}\n```\n\n## 🔄 Integration with Claude Code\n\n### MCP Configuration\n\nAdd FreeCrawl to your MCP configuration:\n\n**Using uvx (Recommended):**\n```json\n{\n  \"mcpServers\": {\n    \"freecrawl\": {\n      \"command\": \"uvx\",\n      \"args\": [\"freecrawl-mcp\"]\n    }\n  }\n}\n```\n\n**Using local development setup:**\n```json\n{\n  \"mcpServers\": {\n    \"freecrawl\": {\n      \"command\": \"uv\",\n      \"args\": [\"run\", \"freecrawl-mcp\"],\n      \"cwd\": \"/path/to/freecrawl-mcp\"\n    }\n  }\n}\n```\n\n### Usage in Prompts\n\n```\nPlease scrape the content from https://example.com and extract the main article text in markdown format.\n```\n\nClaude Code will automatically use the `freecrawl_scrape` tool to fetch and process the content.\n\n## 🚀 Performance \u0026 Scalability\n\n### Resource Usage\n- **Memory**: ~100MB base + ~50MB per browser instance\n- **CPU**: Moderate usage during active scraping\n- **Storage**: Cache grows based on configured limits\n\n### Throughput\n- **Single requests**: 2-5 
seconds typical response time\n- **Batch processing**: 10-50 concurrent requests depending on configuration\n- **Cache hit ratio**: 30%+ for repeated content\n\n### Optimization Tips\n1. **Enable caching** for frequently accessed content\n2. **Adjust concurrency** based on target site rate limits\n3. **Use appropriate formats** - markdown is faster than screenshots\n4. **Configure rate limiting** to avoid being blocked\n\n## 🛡 Security Considerations\n\n### Anti-Detection\n- Rotating user agents\n- Realistic browser fingerprints\n- Request timing randomization\n- JavaScript execution in sandboxed environment\n\n### Input Validation\n- URL format validation\n- Private IP blocking\n- Domain blocklist support\n- Request size limits\n\n### Resource Protection\n- Memory usage monitoring\n- Browser pool size limits\n- Request timeout enforcement\n- Rate limiting per domain\n\n## 🔧 Troubleshooting\n\n### Common Issues\n\n| Issue | Possible Cause | Solution |\n|-------|----------------|----------|\n| High memory usage | Too many browser instances | Reduce `FREECRAWL_MAX_BROWSERS` |\n| Slow responses | JavaScript-heavy sites | Increase timeout or disable JS |\n| Bot detection | Missing anti-detection | Ensure `FREECRAWL_ANTI_DETECT=true` |\n| Cache misses | TTL too short | Increase `FREECRAWL_CACHE_TTL` |\n| Import errors | Missing dependencies | Run `uvx freecrawl-mcp --test` |\n\n### Debug Mode\n\n**With uvx:**\n```bash\nexport FREECRAWL_LOG_LEVEL=DEBUG\nuvx freecrawl-mcp --test\n```\n\n**Local development:**\n```bash\nexport FREECRAWL_LOG_LEVEL=DEBUG\nuv run freecrawl-mcp --test\n```\n\n## 📈 Monitoring \u0026 Observability\n\n### Health Metrics\n- Browser pool status\n- Memory and CPU usage\n- Cache hit rates\n- Request success rates\n- Response times\n\n### Logging\nFreeCrawl provides structured logging with configurable levels:\n- ERROR: Critical failures\n- WARNING: Recoverable issues\n- INFO: General operations\n- DEBUG: Detailed troubleshooting\n\n## 🔧 
Development\n\n### Running Tests\n\n**With uvx:**\n```bash\n# Basic functionality test\nuvx freecrawl-mcp --test\n```\n\n**Local development:**\n```bash\n# Basic functionality test\nuv run freecrawl-mcp --test\n```\n\n### Code Structure\n- **Core server**: `FreeCrawlServer` class\n- **Browser management**: `BrowserPool` for resource pooling\n- **Content extraction**: `ContentExtractor` with multiple strategies\n- **Caching**: `CacheManager` with SQLite backend\n- **Rate limiting**: `RateLimiter` with token bucket algorithm\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the technical specification for details.\n\n## 🤝 Contributing\n\n1. Fork the repository at https://github.com/dylan-gluck/freecrawl-mcp\n2. Create a feature branch\n3. Set up local development: `uv sync`\n4. Run tests: `uv run freecrawl-mcp --test`\n5. Submit a pull request\n\n## 📚 Technical Specification\n\nFor detailed technical information, see `ai_docs/FREECRAWL_TECHNICAL_SPEC.md`.\n\n---\n\n**FreeCrawl MCP Server** - Self-hosted web scraping for the modern web 🚀\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdylan-gluck%2Ffreecrawl-mcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdylan-gluck%2Ffreecrawl-mcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdylan-gluck%2Ffreecrawl-mcp/lists"}