https://github.com/code-hex/light-research-mcp
# LLM Researcher

A lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.

Built with **TypeScript**, **tsup**, and **vitest** for a modern development experience.

## Features

- **MCP Server Support**: Provides Model Context Protocol server for LLM integration
- **Free Operation**: Uses DuckDuckGo HTML endpoint (no API costs)
- **GitHub Code Search**: Search GitHub repositories for code examples and implementation patterns
- **Smart Content Extraction**: Playwright + @mozilla/readability for clean content
- **LLM-Optimized Output**: Sanitized Markdown (h1-h3, bold, italic, links only)
- **Rate Limited**: Respects DuckDuckGo with 1 req/sec limit
- **Cross-Platform**: Works on macOS, Linux, and WSL
- **Multiple Modes**: CLI, MCP server, search, direct URL, and interactive modes
- **Type Safe**: Full TypeScript implementation with strict typing
- **Modern Tooling**: Built with tsup bundler and vitest testing

## Installation

### Prerequisites

- Node.js 20.0.0 or higher
- No local Chrome installation required (uses Playwright's bundled Chromium)

### Setup

```bash
# Clone or download the project
cd light-research-mcp

# Install dependencies (using pnpm)
pnpm install

# Build the project
pnpm build

# Install Playwright browsers
pnpm install-browsers

# Optional: Link globally for system-wide access
pnpm link --global
```

## Usage

### MCP Server Mode

Use as a Model Context Protocol server to provide search and content extraction tools to LLMs:

```bash
# Start MCP server (stdio transport)
llmresearcher --mcp

# The server provides these tools to MCP clients:
# - github_code_search: Search GitHub repositories for code
# - duckduckgo_web_search: Search the web with DuckDuckGo
# - extract_content: Extract detailed content from URLs
```

#### Setting up with Claude Code

```bash
# Add as an MCP server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Or with project scope for team sharing
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# List configured servers
claude mcp list

# Check server status
claude mcp get light-research-mcp
```
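When added with `-s project`, the server is recorded in a `.mcp.json` file at the project root so teammates pick it up automatically. The entry has roughly this shape (the path below is a placeholder; the exact schema depends on your client version):

```json
{
  "mcpServers": {
    "light-research-mcp": {
      "command": "node",
      "args": ["/path/to/light-research-mcp/dist/bin/llmresearcher.js", "--mcp"]
    }
  }
}
```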

#### MCP Tool Usage Examples

Once configured, you can use these tools in Claude:

```
> Search for React hooks examples on GitHub
Tool: github_code_search
Query: "useState useEffect hooks language:javascript"

> Search for TypeScript best practices
Tool: duckduckgo_web_search
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)

> Extract content from a search result
Tool: extract_content
URL: https://example.com/article-from-search-results
```

### Command Line Interface

```bash
# Search mode - Search DuckDuckGo and interactively browse results
llmresearcher "machine learning transformers"

# GitHub Code Search mode - Search GitHub for code
llmresearcher -g "useState hooks language:typescript"

# Direct URL mode - Extract content from specific URL
llmresearcher -u https://example.com/article

# Interactive mode - Enter interactive search session
llmresearcher

# Verbose logging - See detailed operation logs
llmresearcher -v "search query"

# MCP Server mode - Start as Model Context Protocol server
llmresearcher --mcp
```

## Development

### Scripts

```bash
# Build the project
pnpm build

# Build in watch mode (for development)
pnpm dev

# Run tests
pnpm test

# Run tests in CI mode (single run)
pnpm test:run

# Type checking
pnpm type-check

# Clean build artifacts
pnpm clean

# Install Playwright browsers
pnpm install-browsers
```

### Interactive Commands

When in search results view:
- **1-10**: Select a result by number
- **b** or **back**: Return to search results
- **open \<n\>**: Open result #n in an external browser
- **q** or **quit**: Exit the program

When viewing content:
- **b** or **back**: Return to search results
- **/\<term\>**: Search for a term within the extracted content
- **open**: Open current page in external browser
- **q** or **quit**: Exit the program
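The dispatch logic for these commands can be sketched as follows. This is an illustrative sketch only; the names `Command` and `parseCommand` are hypothetical and not the actual `src/cli.ts` API:

```typescript
// Parse one line of interactive input into a command.
// Mirrors the command list above; names here are illustrative only.
type Command =
  | { kind: "select"; index: number }  // 1-10: select result by number
  | { kind: "back" }                   // b / back
  | { kind: "quit" }                   // q / quit
  | { kind: "open"; index?: number }   // open [n]: external browser
  | { kind: "find"; term: string }     // /term: search within content
  | { kind: "unknown"; raw: string };

function parseCommand(line: string): Command {
  const input = line.trim();
  if (/^(b|back)$/i.test(input)) return { kind: "back" };
  if (/^(q|quit)$/i.test(input)) return { kind: "quit" };
  if (input.startsWith("/")) return { kind: "find", term: input.slice(1) };
  const open = input.match(/^open(?:\s+(\d+))?$/i);
  if (open) return { kind: "open", index: open[1] ? Number(open[1]) : undefined };
  const n = Number(input);
  if (Number.isInteger(n) && n >= 1 && n <= 10) return { kind: "select", index: n };
  return { kind: "unknown", raw: input };
}
```

A single parser like this lets both views share one input loop, with each view ignoring commands that don't apply to it.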

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```env
USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10
```

### Configuration File

Create `~/.llmresearcherrc` in your home directory:

```json
{
  "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  "timeout": 30000,
  "maxRetries": 3,
  "rateLimitDelay": 1000,
  "cacheEnabled": true,
  "maxResults": 10
}
```

### Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `userAgent` | `Mozilla/5.0 (compatible; LLMResearcher/1.0)` | User agent for HTTP requests |
| `timeout` | `30000` | Request timeout in milliseconds |
| `maxRetries` | `3` | Maximum retry attempts for failed requests |
| `rateLimitDelay` | `1000` | Delay between requests in milliseconds |
| `cacheEnabled` | `true` | Enable/disable local caching |
| `maxResults` | `10` | Maximum search results to display |
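Settings resolve in a fixed order: built-in defaults, overridden by the RC file, overridden by environment variables. A minimal sketch of this merge (the real logic lives in `src/config.ts`; the function names here are illustrative):

```typescript
// Configuration keys mirror the options table above.
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

const defaults: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

// Convert the string-valued environment variables to typed overrides.
function envOverrides(env: Record<string, string | undefined>): Partial<ResearcherConfig> {
  const out: Partial<ResearcherConfig> = {};
  if (env.USER_AGENT) out.userAgent = env.USER_AGENT;
  if (env.TIMEOUT) out.timeout = Number(env.TIMEOUT);
  if (env.MAX_RETRIES) out.maxRetries = Number(env.MAX_RETRIES);
  if (env.RATE_LIMIT_DELAY) out.rateLimitDelay = Number(env.RATE_LIMIT_DELAY);
  if (env.CACHE_ENABLED) out.cacheEnabled = env.CACHE_ENABLED === "true";
  if (env.MAX_RESULTS) out.maxResults = Number(env.MAX_RESULTS);
  return out;
}

// Later sources win; keys absent from a source fall through to earlier ones.
function resolveConfig(
  rcFile: Partial<ResearcherConfig>,
  env: Partial<ResearcherConfig>
): ResearcherConfig {
  return { ...defaults, ...rcFile, ...env };
}
```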

## Architecture

### Core Components

1. **MCPResearchServer** (`src/mcp-server.ts`)
   - Model Context Protocol server implementation
   - Three main tools: `github_code_search`, `duckduckgo_web_search`, `extract_content`
   - JSON-based responses for LLM consumption

2. **DuckDuckGoSearcher** (`src/search.ts`)
   - HTML scraping of DuckDuckGo search results with locale support
   - URL decoding for `/l/?uddg=` format links
   - Rate limiting and retry logic

3. **GitHubCodeSearcher** (`src/github-code-search.ts`)
   - GitHub Code Search API integration via the `gh` CLI
   - Advanced query support with language, repo, and file filters
   - Authentication and rate limiting

4. **ContentExtractor** (`src/extractor.ts`)
   - Playwright-based page rendering with resource blocking
   - @mozilla/readability for main content extraction
   - DOMPurify sanitization and Markdown conversion

5. **CLIInterface** (`src/cli.ts`)
   - Interactive command-line interface
   - Search result navigation
   - Content viewing and text search

6. **Configuration** (`src/config.ts`)
   - Environment and RC file configuration loading
   - Verbose logging support
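The URL decoding mentioned for `DuckDuckGoSearcher` handles DuckDuckGo's redirect wrapper, which presents result links as `/l/?uddg=<percent-encoded target>`. A simplified sketch using the standard WHATWG `URL` API (the actual logic lives in `src/search.ts`):

```typescript
// Resolve a DuckDuckGo redirect link back to its target URL.
// Plain links (no /l/ wrapper) are passed through unchanged.
function decodeDuckDuckGoLink(href: string): string {
  const url = new URL(href, "https://duckduckgo.com");
  if (url.pathname === "/l/") {
    const target = url.searchParams.get("uddg");
    if (target) return target; // searchParams.get() already percent-decodes
  }
  return href;
}
```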

### Content Processing Pipeline

#### MCP Server Mode
1. **Search**:
   - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination
   - GitHub: Code Search API → Format results → JSON response with code snippets
2. **Extract**: URL from search results → Playwright navigation → Content extraction
3. **Process**: @mozilla/readability → DOMPurify sanitization → Clean JSON output
4. **Output**: Structured JSON for LLM consumption

#### CLI Mode
1. **Search**: DuckDuckGo HTML endpoint → Parse results → Display numbered list
2. **Extract**: Playwright navigation → Resource blocking → JS rendering
3. **Process**: @mozilla/readability → DOMPurify sanitization → Turndown Markdown
4. **Output**: Clean Markdown with h1-h3, **bold**, *italic*, [links](url) only

### Security Features

- **Resource Blocking**: Prevents loading of images, CSS, fonts for speed and security
- **Content Sanitization**: DOMPurify removes scripts, iframes, and dangerous elements
- **Limited Markdown**: Only allows safe formatting elements (h1-h3, strong, em, a)
- **Rate Limiting**: Respects DuckDuckGo's rate limits with exponential backoff
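The rate-limiting behavior (1 req/sec pacing, exponential backoff on failure) combines roughly as follows. This is an illustrative sketch, not the actual `src/search.ts` code; `fetchFn` stands in for the real request:

```typescript
// Exponential backoff schedule: base delay doubled per attempt.
function backoffDelay(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt; // attempt 0 -> 1s, 1 -> 2s, 2 -> 4s
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry a request up to maxRetries times, backing off between failures.
async function withRetry<T>(
  fetchFn: () => Promise<T>,
  maxRetries = 3,
  baseMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchFn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) await sleep(backoffDelay(attempt, baseMs));
    }
  }
  throw lastError;
}
```

With the defaults from the configuration table (`maxRetries: 3`, `rateLimitDelay: 1000`), a request that keeps hitting 429s waits roughly 1s, 2s, then 4s before giving up.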

## Examples

### MCP Server Usage with Claude Code

#### 1. GitHub Code Search

```
You: "Find React hook examples for state management"

Claude uses github_code_search tool:
{
  "query": "useState useReducer state management language:javascript",
  "results": [
    {
      "title": "facebook/react/packages/react/src/ReactHooks.js",
      "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
      "snippet": "function useState(initialState) {\n  return dispatcher.useState(initialState);\n}"
    }
  ],
  "pagination": {
    "currentPage": 1,
    "hasNextPage": true,
    "nextPageToken": "2"
  }
}
```

#### 2. Web Search with Locale

```
You: "Search for Vue.js tutorials in Japanese"

Claude uses duckduckgo_web_search tool:
{
  "query": "Vue.js チュートリアル 入門",
  "locale": "jp-jp",
  "results": [
    {
      "title": "Vue.js入門ガイド",
      "url": "https://example.com/vue-tutorial",
      "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
    }
  ]
}
```

#### 3. Content Extraction

```
You: "Extract the full content from that Vue.js tutorial"

Claude uses extract_content tool:
{
  "url": "https://example.com/vue-tutorial",
  "title": "Vue.js入門ガイド",
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
}
```

### CLI Examples

#### Basic Search

```bash
$ llmresearcher "python web scraping"

🔍 Search Results:
══════════════════════════════════════════════════

1. Python Web Scraping Tutorial
URL: https://realpython.com/python-web-scraping-practical-introduction/
Complete guide to web scraping with Python using requests and Beautiful Soup...

2. Web Scraping with Python - BeautifulSoup and requests
URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
Learn how to scrape websites with Python, Beautiful Soup, and requests...

══════════════════════════════════════════════════
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser

> 1

📥 Extracting content from: Python Web Scraping Tutorial

📄 Content:
══════════════════════════════════════════════════

**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z

──────────────────────────────────────────────────

# Python Web Scraping: A Practical Introduction

Web scraping is the process of collecting and parsing raw data from the web...

## What Is Web Scraping?

Web scraping is a technique to automatically access and extract large amounts...

══════════════════════════════════════════════════
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser

> /beautiful soup

🔍 Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')
```

### Direct URL Mode

```bash
$ llmresearcher -u https://docs.python.org/3/tutorial/

📄 Content:
══════════════════════════════════════════════════

**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z

──────────────────────────────────────────────────

# The Python Tutorial

Python is an easy to learn, powerful programming language...

## An Informal Introduction to Python

In the following examples, input and output are distinguished...
```

### Verbose Mode

```bash
$ llmresearcher -v "nodejs tutorial"

[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...
```

## Testing

### Running Tests

```bash
# Run tests in watch mode
pnpm test

# Run tests once (CI mode)
pnpm test:run

# Run tests with coverage
pnpm test -- --coverage
```

### Test Coverage

The test suite includes:

- **Unit Tests**: Individual component testing
  - `search.test.ts`: DuckDuckGo search functionality, URL decoding, rate limiting
  - `extractor.test.ts`: Content extraction, Markdown conversion, resource management
  - `config.test.ts`: Configuration validation and environment handling
- **Integration Tests**: End-to-end workflow testing
  - `integration.test.ts`: Complete search-to-extraction workflows, error handling, cleanup

### Test Features

- **Fast**: Powered by vitest for quick feedback
- **Type-safe**: Full TypeScript support in tests
- **Isolated**: Each test cleans up its resources
- **Comprehensive**: Covers search, extraction, configuration, and integration scenarios

## Troubleshooting

### Common Issues

**"Browser not found" Error**
```bash
pnpm install-browsers
```

**Rate Limiting Issues**
- The tool automatically paces requests with a 1-second delay
- On 429 errors it retries automatically with exponential backoff

**Content Extraction Failures**
- Some sites may block automated access
- The tool includes fallback extraction methods (main → body content)
- Use verbose mode (`-v`) to see detailed error information

**Permission Denied (Unix/Linux)**
```bash
chmod +x bin/llmresearcher.js
```

### Performance Optimization

The tool is optimized for speed:
- **Resource Blocking**: Automatically blocks images, CSS, fonts
- **Network Idle**: Waits for JavaScript to complete rendering
- **Content Caching**: Supports local caching to avoid repeated requests
- **Minimal Dependencies**: Uses lightweight, focused libraries

## Project Details

### Project Structure

```
light-research-mcp/
├── dist/                     # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js  # CLI entry point (executable)
│   └── *.js                  # Compiled TypeScript modules
├── src/                      # TypeScript source files
│   ├── bin.ts                # CLI entry point
│   ├── index.ts              # Main LLMResearcher class
│   ├── mcp-server.ts         # MCP server implementation
│   ├── search.ts             # DuckDuckGo search implementation
│   ├── github-code-search.ts # GitHub Code Search implementation
│   ├── extractor.ts          # Content extraction with Playwright
│   ├── cli.ts                # Interactive CLI interface
│   ├── config.ts             # Configuration management
│   └── types.ts              # TypeScript type definitions
├── test/                     # Test files (vitest)
│   ├── search.test.ts        # Search functionality tests
│   ├── extractor.test.ts     # Content extraction tests
│   ├── config.test.ts        # Configuration tests
│   ├── mcp-locale.test.ts    # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts # MCP content extractor tests
│   └── integration.test.ts   # End-to-end integration tests
├── tsconfig.json             # TypeScript configuration
├── tsup.config.ts            # Build configuration
├── vitest.config.ts          # Test configuration
├── package.json
└── README.md
```

### Dependencies

#### Runtime Dependencies
- **@modelcontextprotocol/sdk**: Model Context Protocol server implementation
- **@mozilla/readability**: Content extraction from HTML
- **cheerio**: HTML parsing for search results
- **commander**: CLI argument parsing
- **dompurify**: HTML sanitization
- **dotenv**: Environment variable loading
- **jsdom**: DOM manipulation for server-side processing
- **playwright**: Browser automation for JS rendering
- **turndown**: HTML to Markdown conversion

#### Development Dependencies
- **typescript**: TypeScript compiler
- **tsup**: Fast TypeScript bundler
- **vitest**: Fast unit test framework
- **@types/***: TypeScript type definitions

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## Roadmap

### Planned Features

- **Enhanced MCP Tools**: Additional specialized search tools for documentation, APIs, etc.
- **Caching Layer**: SQLite-based URL → Markdown caching with 24-hour TTL
- **Search Engine Abstraction**: Support for Brave Search, Bing, and other engines
- **Content Summarization**: Optional AI-powered content summarization
- **Export Formats**: JSON, plain text, and other output formats
- **Batch Processing**: Process multiple URLs from file input
- **SSE Transport**: Support for Server-Sent Events MCP transport

### Performance Improvements

- **Parallel Processing**: Concurrent content extraction for multiple results
- **Smart Caching**: Intelligent cache invalidation based on content freshness
- **Memory Optimization**: Streaming content processing for large documents