{"id":29885646,"url":"https://github.com/code-hex/light-research-mcp","last_synced_at":"2025-10-13T18:58:50.991Z","repository":{"id":299755174,"uuid":"1003053569","full_name":"Code-Hex/light-research-mcp","owner":"Code-Hex","description":"A lightweight MCP server for LLM orchestration with DuckDuckGo/GitHub Code search and content extraction","archived":false,"fork":false,"pushed_at":"2025-06-16T14:49:30.000Z","size":89,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-29T04:33:42.231Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Code-Hex.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-16T14:49:19.000Z","updated_at":"2025-07-09T08:53:41.000Z","dependencies_parsed_at":"2025-06-18T06:12:04.936Z","dependency_job_id":"89618861-46e3-46a0-a901-2925894030f5","html_url":"https://github.com/Code-Hex/light-research-mcp","commit_stats":null,"previous_names":["code-hex/light-research-mcp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Code-Hex/light-research-mcp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Code-Hex%2Flight-research-mcp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Code-Hex%2Flight-research-mcp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Code-Hex%2Flight-research-mcp/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/Code-Hex%2Flight-research-mcp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Code-Hex","download_url":"https://codeload.github.com/Code-Hex/light-research-mcp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Code-Hex%2Flight-research-mcp/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268067678,"owners_count":24190406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-31T02:00:08.723Z","response_time":66,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-31T16:02:14.510Z","updated_at":"2025-10-13T18:58:45.941Z","avatar_url":"https://github.com/Code-Hex.png","language":"TypeScript","funding_links":[],"categories":["Search \u0026 Data Extraction"],"sub_categories":["How to Submit"],"readme":"# LLM Researcher\n\nA lightweight MCP (Model Context Protocol) server for LLM orchestration that provides efficient web content search and extraction capabilities. 
This CLI tool enables LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.\n\nBuilt with **TypeScript**, **tsup**, and **vitest** for a modern development experience.\n\n## Features\n\n- **MCP Server Support**: Provides a Model Context Protocol server for LLM integration\n- **Free Operation**: Uses the DuckDuckGo HTML endpoint (no API costs)\n- **GitHub Code Search**: Search GitHub repositories for code examples and implementation patterns\n- **Smart Content Extraction**: Playwright + @mozilla/readability for clean content\n- **LLM-Optimized Output**: Sanitized Markdown (h1-h3, bold, italic, links only)\n- **Rate Limited**: Respects DuckDuckGo with a 1 req/sec limit\n- **Cross-Platform**: Works on macOS, Linux, and WSL\n- **Multiple Modes**: CLI, MCP server, search, direct URL, and interactive modes\n- **Type Safe**: Full TypeScript implementation with strict typing\n- **Modern Tooling**: Built with tsup bundler and vitest testing\n\n## Installation\n\n### Prerequisites\n\n- Node.js 20.0.0 or higher\n- No local Chrome installation required (uses Playwright's bundled Chromium)\n\n### Setup\n\n```bash\n# Clone or download the project\ncd light-research-mcp\n\n# Install dependencies (using pnpm)\npnpm install\n\n# Build the project\npnpm build\n\n# Install Playwright browsers\npnpm install-browsers\n\n# Optional: Link globally for system-wide access\npnpm link --global\n```\n\n## Usage\n\n### MCP Server Mode\n\nUse as a Model Context Protocol server to provide search and content extraction tools to LLMs:\n\n```bash\n# Start MCP server (stdio transport)\nllmresearcher --mcp\n\n# The server provides these tools to MCP clients:\n# - github_code_search: Search GitHub repositories for code\n# - duckduckgo_web_search: Search the web with DuckDuckGo\n# - extract_content: Extract detailed content from URLs\n```\n\n#### Setting up with Claude Code\n\n```bash\n# Add as an MCP server to Claude Code\nclaude mcp add light-research-mcp 
/path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp\n\n# Or with project scope for team sharing\nclaude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp\n\n# List configured servers\nclaude mcp list\n\n# Check server status\nclaude mcp get light-research-mcp\n```\n\n#### MCP Tool Usage Examples\n\nOnce configured, you can use these tools in Claude:\n\n```\n\u003e Search for React hooks examples on GitHub\nTool: github_code_search\nQuery: \"useState useEffect hooks language:javascript\"\n\n\u003e Search for TypeScript best practices\nTool: duckduckgo_web_search  \nQuery: \"TypeScript best practices 2024\"\nLocale: us-en (or wt-wt for no region)\n\n\u003e Extract content from a search result\nTool: extract_content\nURL: https://example.com/article-from-search-results\n```\n\n### Command Line Interface\n\n```bash\n# Search mode - Search DuckDuckGo and interactively browse results\nllmresearcher \"machine learning transformers\"\n\n# GitHub Code Search mode - Search GitHub for code\nllmresearcher -g \"useState hooks language:typescript\"\n\n# Direct URL mode - Extract content from specific URL\nllmresearcher -u https://example.com/article\n\n# Interactive mode - Enter interactive search session\nllmresearcher\n\n# Verbose logging - See detailed operation logs\nllmresearcher -v \"search query\"\n\n# MCP Server mode - Start as Model Context Protocol server\nllmresearcher --mcp\n```\n\n## Development\n\n### Scripts\n\n```bash\n# Build the project\npnpm build\n\n# Build in watch mode (for development)\npnpm dev\n\n# Run tests\npnpm test\n\n# Run tests in CI mode (single run)\npnpm test:run\n\n# Type checking\npnpm type-check\n\n# Clean build artifacts\npnpm clean\n\n# Install Playwright browsers\npnpm install-browsers\n```\n\n### Interactive Commands\n\nWhen in search results view:\n- **1-10**: Select a result by number\n- **b** or **back**: Return to search results\n- **open \\\u003cn\u003e**: Open result #n in 
external browser\n- **q** or **quit**: Exit the program\n\nWhen viewing content:\n- **b** or **back**: Return to search results\n- **/\\\u003cterm\u003e**: Search for term within the extracted content\n- **open**: Open current page in external browser\n- **q** or **quit**: Exit the program\n\n## Configuration\n\n### Environment Variables\n\nCreate a `.env` file in the project root:\n\n```env\nUSER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)\nTIMEOUT=30000\nMAX_RETRIES=3\nRATE_LIMIT_DELAY=1000\nCACHE_ENABLED=true\nMAX_RESULTS=10\n```\n\n### Configuration File\n\nCreate `~/.llmresearcherrc` in your home directory:\n\n```json\n{\n  \"userAgent\": \"Mozilla/5.0 (compatible; LLMResearcher/1.0)\",\n  \"timeout\": 30000,\n  \"maxRetries\": 3,\n  \"rateLimitDelay\": 1000,\n  \"cacheEnabled\": true,\n  \"maxResults\": 10\n}\n```\n\n### Configuration Options\n\n| Option | Default | Description |\n|--------|---------|-------------|\n| `userAgent` | `Mozilla/5.0 (compatible; LLMResearcher/1.0)` | User agent for HTTP requests |\n| `timeout` | `30000` | Request timeout in milliseconds |\n| `maxRetries` | `3` | Maximum retry attempts for failed requests |\n| `rateLimitDelay` | `1000` | Delay between requests in milliseconds |\n| `cacheEnabled` | `true` | Enable/disable local caching |\n| `maxResults` | `10` | Maximum search results to display |\n\n## Architecture\n\n### Core Components\n\n1. **MCPResearchServer** (`src/mcp-server.ts`)\n   - Model Context Protocol server implementation\n   - Three main tools: github_code_search, duckduckgo_web_search, extract_content\n   - JSON-based responses for LLM consumption\n\n2. **DuckDuckGoSearcher** (`src/search.ts`)\n   - HTML scraping of DuckDuckGo search results with locale support\n   - URL decoding for `/l/?uddg=` format links\n   - Rate limiting and retry logic\n\n3. 
**GitHubCodeSearcher** (`src/github-code-search.ts`)\n   - GitHub Code Search API integration via gh CLI\n   - Advanced query support with language, repo, and file filters\n   - Authentication and rate limiting\n\n4. **ContentExtractor** (`src/extractor.ts`)\n   - Playwright-based page rendering with resource blocking\n   - @mozilla/readability for main content extraction\n   - DOMPurify sanitization and Markdown conversion\n\n5. **CLIInterface** (`src/cli.ts`)\n   - Interactive command-line interface\n   - Search result navigation\n   - Content viewing and text search\n\n6. **Configuration** (`src/config.ts`)\n   - Environment and RC file configuration loading\n   - Verbose logging support\n\n### Content Processing Pipeline\n\n#### MCP Server Mode\n1. **Search**: \n   - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination\n   - GitHub: Code Search API → Format results → JSON response with code snippets\n2. **Extract**: URL from search results → Playwright navigation → Content extraction\n3. **Process**: @mozilla/readability → DOMPurify sanitization → Clean JSON output\n4. **Output**: Structured JSON for LLM consumption\n\n#### CLI Mode  \n1. **Search**: DuckDuckGo HTML endpoint → Parse results → Display numbered list\n2. **Extract**: Playwright navigation → Resource blocking → JS rendering\n3. **Process**: @mozilla/readability → DOMPurify sanitization → Turndown Markdown\n4. **Output**: Clean Markdown with h1-h3, **bold**, *italic*, [links](url) only\n\n### Security Features\n\n- **Resource Blocking**: Prevents loading of images, CSS, fonts for speed and security\n- **Content Sanitization**: DOMPurify removes scripts, iframes, and dangerous elements\n- **Limited Markdown**: Only allows safe formatting elements (h1-h3, strong, em, a)\n- **Rate Limiting**: Respects DuckDuckGo's rate limits with exponential backoff\n\n## Examples\n\n### MCP Server Usage with Claude Code\n\n#### 1. 
GitHub Code Search\n\n```\nYou: \"Find React hook examples for state management\"\n\nClaude uses github_code_search tool:\n{\n  \"query\": \"useState useReducer state management language:javascript\",\n  \"results\": [\n    {\n      \"title\": \"facebook/react/packages/react/src/ReactHooks.js\",\n      \"url\": \"https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js\",\n      \"snippet\": \"function useState(initialState) {\\n  return dispatcher.useState(initialState);\\n}\"\n    }\n  ],\n  \"pagination\": {\n    \"currentPage\": 1,\n    \"hasNextPage\": true,\n    \"nextPageToken\": \"2\"\n  }\n}\n```\n\n#### 2. Web Search with Locale\n\n```\nYou: \"Search for Vue.js tutorials in Japanese\"\n\nClaude uses duckduckgo_web_search tool:\n{\n  \"query\": \"Vue.js チュートリアル 入門\",\n  \"locale\": \"jp-jp\",\n  \"results\": [\n    {\n      \"title\": \"Vue.js入門ガイド\",\n      \"url\": \"https://example.com/vue-tutorial\",\n      \"snippet\": \"Vue.jsの基本的な使い方を学ぶチュートリアル...\"\n    }\n  ]\n}\n```\n\n#### 3. Content Extraction\n\n```\nYou: \"Extract the full content from that Vue.js tutorial\"\n\nClaude uses extract_content tool:\n{\n  \"url\": \"https://example.com/vue-tutorial\",\n  \"title\": \"Vue.js入門ガイド\",\n  \"extractedAt\": \"2024-01-15T10:30:00.000Z\",\n  \"content\": \"# Vue.js入門ガイド\\n\\nVue.jsは...\\n\\n## インストール\\n\\n...\"\n}\n```\n\n### CLI Examples\n\n#### Basic Search\n\n```bash\n$ llmresearcher \"python web scraping\"\n\n🔍 Search Results:\n══════════════════════════════════════════════════\n\n1. Python Web Scraping Tutorial\n   URL: https://realpython.com/python-web-scraping-practical-introduction/\n   Complete guide to web scraping with Python using requests and Beautiful Soup...\n\n2. 
Web Scraping with Python - BeautifulSoup and requests\n   URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/\n   Learn how to scrape websites with Python, Beautiful Soup, and requests...\n\n══════════════════════════════════════════════════\nCommands: [1-10] select result | b) back | q) quit | open \u003cn\u003e) open in browser\n\n\u003e 1\n\n📥 Extracting content from: Python Web Scraping Tutorial\n\n📄 Content:\n══════════════════════════════════════════════════\n\n**Python Web Scraping Tutorial**\nSource: https://realpython.com/python-web-scraping-practical-introduction/\nExtracted: 2024-01-15T10:30:00.000Z\n\n──────────────────────────────────────────────────\n\n# Python Web Scraping: A Practical Introduction\n\nWeb scraping is the process of collecting and parsing raw data from the web...\n\n## What Is Web Scraping?\n\nWeb scraping is a technique to automatically access and extract large amounts...\n\n══════════════════════════════════════════════════\nCommands: b) back to results | /\u003cterm\u003e) search in text | q) quit | open) open in browser\n\n\u003e /beautiful soup\n\n🔍 Found 3 matches for \"beautiful soup\":\n──────────────────────────────────────────────────\nLine 15: Beautiful Soup is a Python library for parsing HTML and XML documents.\nLine 42: from bs4 import BeautifulSoup\nLine 67: soup = BeautifulSoup(html_content, 'html.parser')\n```\n\n### Direct URL Mode\n\n```bash\n$ llmresearcher -u https://docs.python.org/3/tutorial/\n\n📄 Content:\n══════════════════════════════════════════════════\n\n**The Python Tutorial**\nSource: https://docs.python.org/3/tutorial/\nExtracted: 2024-01-15T10:35:00.000Z\n\n──────────────────────────────────────────────────\n\n# The Python Tutorial\n\nPython is an easy to learn, powerful programming language...\n\n## An Informal Introduction to Python\n\nIn the following examples, input and output are distinguished...\n```\n\n### Verbose Mode\n\n```bash\n$ llmresearcher -v \"nodejs tutorial\"\n\n[VERBOSE] 
Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial\u0026kl=us-en\n[VERBOSE] Response: 200 in 847ms\n[VERBOSE] Parsed 10 results\n[VERBOSE] Launching browser...\n[VERBOSE] Blocking resource: https://example.com/style.css\n[VERBOSE] Blocking resource: https://example.com/image.png\n[VERBOSE] Navigating to page...\n[VERBOSE] Page loaded in 1243ms\n[VERBOSE] Processing content with Readability...\n[VERBOSE] Readability extraction successful\n[VERBOSE] Closing browser...\n```\n\n## Testing\n\n### Running Tests\n\n```bash\n# Run tests in watch mode\npnpm test\n\n# Run tests once (CI mode)\npnpm test:run\n\n# Run tests with coverage\npnpm test -- --coverage\n```\n\n### Test Coverage\n\nThe test suite includes:\n\n- **Unit Tests**: Individual component testing\n  - `search.test.ts`: DuckDuckGo search functionality, URL decoding, rate limiting\n  - `extractor.test.ts`: Content extraction, Markdown conversion, resource management\n  - `config.test.ts`: Configuration validation and environment handling\n\n- **Integration Tests**: End-to-end workflow testing\n  - `integration.test.ts`: Complete search-to-extraction workflows, error handling, cleanup\n\n### Test Features\n\n- **Fast**: Powered by vitest for quick feedback\n- **Type-safe**: Full TypeScript support in tests\n- **Isolated**: Each test cleans up its resources\n- **Comprehensive**: Covers search, extraction, configuration, and integration scenarios\n\n## Troubleshooting\n\n### Common Issues\n\n**\"Browser not found\" Error**\n```bash\npnpm install-browsers\n```\n\n**Rate Limiting Issues**\n- The tool automatically handles rate limiting with 1-second delays\n- If you encounter 429 errors, the tool will automatically retry with exponential backoff\n\n**Content Extraction Failures**\n- Some sites may block automated access\n- The tool includes fallback extraction methods (main → body content)\n- Use verbose mode (`-v`) to see detailed error information\n\n**Permission Denied (Unix/Linux)**\n```bash\nchmod +x 
dist/bin/llmresearcher.js\n```\n\n### Performance Optimization\n\nThe tool is optimized for speed:\n- **Resource Blocking**: Automatically blocks images, CSS, fonts\n- **Network Idle**: Waits for JavaScript to complete rendering\n- **Content Caching**: Supports local caching to avoid repeated requests\n- **Minimal Dependencies**: Uses lightweight, focused libraries\n\n## Development\n\n### Project Structure\n\n```\nlight-research-mcp/\n├── dist/                      # Built JavaScript files (generated)\n│   ├── bin/\n│   │   └── llmresearcher.js   # CLI entry point (executable)\n│   └── *.js                   # Compiled TypeScript modules\n├── src/                       # TypeScript source files\n│   ├── bin.ts                 # CLI entry point\n│   ├── index.ts               # Main LLMResearcher class\n│   ├── mcp-server.ts          # MCP server implementation\n│   ├── search.ts              # DuckDuckGo search implementation\n│   ├── github-code-search.ts  # GitHub Code Search implementation\n│   ├── extractor.ts           # Content extraction with Playwright\n│   ├── cli.ts                 # Interactive CLI interface\n│   ├── config.ts              # Configuration management\n│   └── types.ts               # TypeScript type definitions\n├── test/                      # Test files (vitest)\n│   ├── search.test.ts         # Search functionality tests\n│   ├── extractor.test.ts      # Content extraction tests\n│   ├── config.test.ts         # Configuration tests\n│   ├── mcp-locale.test.ts     # MCP locale functionality tests\n│   ├── mcp-content-extractor.test.ts # MCP content extractor tests\n│   └── integration.test.ts    # End-to-end integration tests\n├── tsconfig.json              # TypeScript configuration\n├── tsup.config.ts             # Build configuration\n├── vitest.config.ts           # Test configuration\n├── package.json\n└── README.md\n```\n\n### Dependencies\n\n#### Runtime Dependencies\n- **@modelcontextprotocol/sdk**: Model Context Protocol server 
implementation\n- **@mozilla/readability**: Content extraction from HTML\n- **cheerio**: HTML parsing for search results\n- **commander**: CLI argument parsing\n- **dompurify**: HTML sanitization\n- **dotenv**: Environment variable loading\n- **jsdom**: DOM manipulation for server-side processing\n- **playwright**: Browser automation for JS rendering\n- **turndown**: HTML to Markdown conversion\n\n#### Development Dependencies\n- **typescript**: TypeScript compiler\n- **tsup**: Fast TypeScript bundler\n- **vitest**: Fast unit test framework\n- **@types/***: TypeScript type definitions\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests if applicable\n5. Submit a pull request\n\n## Roadmap\n\n### Planned Features\n\n- **Enhanced MCP Tools**: Additional specialized search tools for documentation, APIs, etc.\n- **Caching Layer**: SQLite-based URL → Markdown caching with 24-hour TTL\n- **Search Engine Abstraction**: Support for Brave Search, Bing, and other engines\n- **Content Summarization**: Optional AI-powered content summarization\n- **Export Formats**: JSON, plain text, and other output formats\n- **Batch Processing**: Process multiple URLs from file input\n- **SSE Transport**: Support for Server-Sent Events MCP transport\n\n### Performance Improvements\n\n- **Parallel Processing**: Concurrent content extraction for multiple results\n- **Smart Caching**: Intelligent cache invalidation based on content freshness\n- **Memory Optimization**: Streaming content processing for large documents","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode-hex%2Flight-research-mcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcode-hex%2Flight-research-mcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode-hex%2Flight-research-mcp/lists"}