# ⚠️ NOTICE
> **MCP SERVER CURRENTLY UNDER DEVELOPMENT**
> **NOT READY FOR PRODUCTION USE**
> **WILL UPDATE WHEN OPERATIONAL**

# Crawl4AI MCP Server
🚀 High-performance MCP Server for Crawl4AI - Enable AI assistants to access web scraping, crawling, and deep research via Model Context Protocol. Faster and more efficient than FireCrawl!
## Overview
This project implements a custom Model Context Protocol (MCP) Server that integrates with Crawl4AI, an open-source web scraping and crawling library. The server is deployed as a remote MCP server on CloudFlare Workers, allowing AI assistants like Claude to access Crawl4AI's powerful web scraping capabilities.
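As a rough illustration of that architecture, the sketch below shows the general shape of such a Worker: it accepts JSON-RPC requests from an MCP client and dispatches tool calls to a Crawl4AI adapter. The routing, the `Env` shape, and the `callCrawl4ai` helper are hypothetical placeholders, not the project's actual code:

```typescript
// Hypothetical sketch of an MCP-style Worker entry point (not the project's actual src/index.ts).
export interface Env {
  CRAWL_DATA: KVNamespace; // KV binding from wrangler.toml (type via @cloudflare/workers-types)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const rpc = (await request.json()) as { id: number; method: string; params?: any };

    // Dispatch MCP "tools/call" requests to the Crawl4AI adapter; reject anything else.
    if (rpc.method === "tools/call") {
      const result = await callCrawl4ai(rpc.params.name, rpc.params.arguments, env);
      return Response.json({ jsonrpc: "2.0", id: rpc.id, result });
    }
    return Response.json({
      jsonrpc: "2.0",
      id: rpc.id,
      error: { code: -32601, message: "Method not found" },
    });
  },
};

// Placeholder for the adapter role played by src/crawl4ai-adapter.ts.
async function callCrawl4ai(tool: string, args: unknown, env: Env): Promise<unknown> {
  throw new Error("Not implemented in this sketch");
}
```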
## Documentation
For comprehensive details about this project, please refer to the following documentation:
- [Migration Plan](docs/MIGRATION_PLAN.md) - Detailed plan for migrating from Firecrawl to Crawl4AI
- [Enhanced Architecture](docs/ENHANCED_ARCHITECTURE.md) - Multi-tenant architecture with cloud provider flexibility
- [Implementation Guide](docs/IMPLEMENTATION_GUIDE.md) - Technical implementation details and code examples
- [Codebase Simplification](docs/SIMPLIFICATION.md) - Details on code simplification and best practices implemented
- [Docker Setup Guide](docs/DOCKER.md) - Instructions for Docker setup for local development and production

## Features
### Web Data Acquisition
- 🌐 **Single Webpage Scraping**: Extract content from individual webpages
- 🕸️ **Web Crawling**: Crawl websites with configurable depth and page limits
- 🗺️ **URL Discovery**: Map and discover URLs from a starting point
- 🕸️ **Asynchronous Crawling**: Crawl entire websites efficiently

### Content Processing
- 🔍 **Deep Research**: Conduct comprehensive research across multiple pages
- 📊 **Structured Data Extraction**: Extract specific data using CSS selectors or LLM-based extraction
- 🔎 **Content Search**: Search through previously crawled content

### Integration & Security
- 🔌 **MCP Integration**: Seamless integration with MCP clients (Claude Desktop, etc.)
- 🔒 **Authentication Options**: Secure access via OAuth or API key (Bearer token)
- ⚡ **High Performance**: Optimized for speed and efficiency

## Project Structure
```plaintext
crawl4ai-mcp/
├── src/
│   ├── index.ts            # Main entry point with OAuth provider setup
│   ├── auth-handler.ts     # Authentication handler
│   ├── mcp-server.ts       # MCP server implementation
│   ├── crawl4ai-adapter.ts # Adapter for Crawl4AI API
│   ├── tool-schemas/       # MCP tool schema definitions
│   │   └── [...].ts        # Tool schemas
│   ├── handlers/
│   │   ├── crawl.ts        # Web crawling implementation
│   │   ├── search.ts       # Search functionality
│   │   └── extract.ts      # Content extraction
│   └── utils/              # Utility functions
├── tests/                  # Test cases
├── .github/                # GitHub configuration
├── wrangler.toml           # CloudFlare Workers configuration
├── tsconfig.json           # TypeScript configuration
├── package.json            # Node.js dependencies
└── README.md               # Project documentation
```

## Getting Started
### Prerequisites
- [Node.js](https://nodejs.org/) (v18 or higher)
- [npm](https://www.npmjs.com/)
- [Wrangler](https://developers.cloudflare.com/workers/wrangler/install-and-update/) (CloudFlare Workers CLI)
- A CloudFlare account

### Installation
1. Clone the repository:
```bash
git clone https://github.com/BjornMelin/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
```

2. Install dependencies:
```bash
npm install
```

3. Set up CloudFlare KV namespace:
```bash
wrangler kv:namespace create CRAWL_DATA
```

4. Update `wrangler.toml` with the KV namespace ID:
```toml
kv_namespaces = [
  { binding = "CRAWL_DATA", id = "your-namespace-id" }
]
```
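Once the binding is configured, the Worker can read and write crawl results through the `CRAWL_DATA` namespace. A minimal sketch, assuming a Worker `Env` type with the binding; the `crawl:` key scheme here is illustrative, not the project's actual storage layout:

```typescript
// Illustrative KV usage for the CRAWL_DATA binding (hypothetical key scheme).
interface Env {
  CRAWL_DATA: KVNamespace;
}

// Store a crawl result under a crawl ID, expiring after one day.
async function saveCrawl(env: Env, crawlId: string, result: unknown): Promise<void> {
  await env.CRAWL_DATA.put(`crawl:${crawlId}`, JSON.stringify(result), {
    expirationTtl: 60 * 60 * 24,
  });
}

// Retrieve a previously stored crawl result, or null if missing/expired.
async function getCrawl(env: Env, crawlId: string): Promise<unknown | null> {
  return env.CRAWL_DATA.get(`crawl:${crawlId}`, "json");
}
```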
## Development

### Local Development
#### Using NPM
1. Start the development server:
```bash
npm run dev
```

2. The server will be available at the local URL that Wrangler prints (by default, http://localhost:8787).
#### Using Docker
You can also use Docker for local development, which includes the Crawl4AI API and a debug UI:
1. Set up environment variables:
```bash
cp .env.example .env
# Edit .env file with your API key
```

2. Start the Docker development environment:
```bash
docker-compose up -d
```

3. Access the services:
- MCP Server and Crawl4AI debug UI, at the local ports configured in `docker-compose.yml`

See the [Docker Setup Guide](docs/DOCKER.md) for more details.
### Testing
The project includes a comprehensive test suite using Jest. To run tests:
```bash
# Run all tests
npm test

# Run tests with watch mode during development
npm run test:watch

# Run tests with coverage report
npm run test:coverage

# Run only unit tests
npm run test:unit

# Run only integration tests
npm run test:integration
```

When running in Docker:
```bash
docker-compose exec mcp-server npm test
```
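For orientation, a unit test in this suite might look like the sketch below. The `normalizeUrl` helper is invented purely for illustration and is not part of the project's source:

```typescript
// tests/example.test.ts — a hypothetical Jest test shape for this suite.
// normalizeUrl is an invented helper used purely for illustration.
function normalizeUrl(input: string): string {
  const url = new URL(input);
  url.hash = ""; // drop fragments so equivalent pages dedupe
  return url.toString();
}

describe("normalizeUrl", () => {
  it("strips URL fragments", () => {
    expect(normalizeUrl("https://example.com/page#section")).toBe("https://example.com/page");
  });

  it("rejects invalid URLs", () => {
    expect(() => normalizeUrl("not a url")).toThrow();
  });
});
```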
## Deployment

1. Deploy to CloudFlare Workers:
```bash
npm run deploy
```

2. Your server will be available at the CloudFlare Workers URL assigned to your deployed worker.
## Usage with MCP Clients
This server implements the Model Context Protocol, allowing AI assistants to access its tools.
### Authentication
Authentication is being built out along these lines (a Bearer-token sketch follows the list):

- OAuth authentication with `workers-oauth-provider`
- API key authentication using Bearer tokens
- A login page and token management
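The header parsing below is standard; the `API_KEY` secret name and this exact check are assumptions, not the project's confirmed implementation:

```typescript
// Hypothetical Bearer-token check for incoming requests.
// Assumes an API_KEY secret configured via `wrangler secret put API_KEY`.
interface Env {
  API_KEY: string;
}

function isAuthorized(request: Request, env: Env): boolean {
  const header = request.headers.get("Authorization") ?? "";
  const [scheme, token] = header.split(" ");
  return scheme === "Bearer" && token === env.API_KEY;
}

// Usage inside a fetch handler:
// if (!isAuthorized(request, env)) return new Response("Unauthorized", { status: 401 });
```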
### Connecting to an MCP Client

1. Use the CloudFlare Workers URL assigned to your deployed worker
2. In Claude Desktop or other MCP clients, add this server as a tool source

### Available Tools
- `crawl`: Crawl web pages from a starting URL
- `getCrawl`: Retrieve crawl data by ID
- `listCrawls`: List all crawls or filter by domain
- `search`: Search indexed documents by query
- `extract`: Extract structured content from a URL
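Under the hood, an MCP client issues a JSON-RPC `tools/call` request for one of these tools. The request shape below follows the MCP specification; the argument names (`url`, `depth`) are assumptions about this server's schema rather than confirmed parameters:

```typescript
// Hypothetical invocation of the `crawl` tool over HTTP.
// Replace serverUrl with your deployed Worker's address.
async function callCrawlTool(serverUrl: string, bearerToken: string) {
  const response = await fetch(serverUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${bearerToken}`,
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "tools/call",
      params: {
        name: "crawl",
        arguments: { url: "https://example.com", depth: 2 }, // assumed argument schema
      },
    }),
  });
  return response.json();
}
```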
## Configuration

The server can be configured by modifying environment variables in `wrangler.toml` (a sketch of reading them in the Worker follows the list):
- `MAX_CRAWL_DEPTH`: Maximum depth for web crawling (default: 3)
- `MAX_CRAWL_PAGES`: Maximum pages to crawl (default: 100)
- `API_VERSION`: API version string (default: "v1")
- `OAUTH_CLIENT_ID`: OAuth client ID for authentication
- `OAUTH_CLIENT_SECRET`: OAuth client secret for authentication
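In a Worker, these arrive as string bindings on `env`. A minimal sketch of reading them with fallbacks to the documented defaults; the `Env` shape here is assumed, not taken from the project's source:

```typescript
// Hypothetical typed access to the configuration variables listed above.
interface Env {
  MAX_CRAWL_DEPTH?: string;
  MAX_CRAWL_PAGES?: string;
  API_VERSION?: string;
}

function getCrawlLimits(env: Env) {
  return {
    // Fall back to the documented defaults when a variable is unset.
    maxDepth: Number(env.MAX_CRAWL_DEPTH ?? "3"),
    maxPages: Number(env.MAX_CRAWL_PAGES ?? "100"),
    apiVersion: env.API_VERSION ?? "v1",
  };
}
```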
## Roadmap

The project is being developed with these components in mind:
1. **Project Setup and Configuration**: CloudFlare Worker setup, TypeScript configuration
2. **MCP Server and Tool Schemas**: Implementation of MCP server with tool definitions
3. **Crawl4AI Adapter**: Integration with the Crawl4AI functionality
4. **OAuth Authentication**: Secure authentication implementation
5. **Performance Optimizations**: Enhancing speed and reliability
6. **Advanced Extraction Features**: Improving structured data extraction capabilities

## Contributing
Contributions are welcome! Please check the open issues or create a new one before starting work on a feature or bug fix. See the [Contributing Guidelines](CONTRIBUTING.md) for details.
## Support
If you encounter issues or have questions:
- Open an issue on the GitHub repository
- Check the [Crawl4AI documentation](https://crawl4ai.com/docs)
- Refer to the [Model Context Protocol specification](https://github.com/anthropics/model-context-protocol)

## How to Cite
If you use Crawl4AI MCP Server in your research or projects, please cite it using the following BibTeX entry:
```bibtex
@software{crawl4ai_mcp_2025,
  author  = {Melin, Bjorn},
  title   = {Crawl4AI MCP Server: High-performance Web Crawling for AI Assistants},
  url     = {https://github.com/BjornMelin/crawl4ai-mcp-server},
  version = {1.0.0},
  year    = {2025},
  month   = {5}
}
```

## License
[MIT](LICENSE)