{"id":18710653,"url":"https://github.com/apify/rag-web-browser","last_synced_at":"2025-11-03T16:30:37.884Z","repository":{"id":253395530,"uuid":"842902659","full_name":"apify/rag-web-browser","owner":"apify","description":"RAG Web Browser is a tool to provide your RAG pipelines with up-to-date information from the web.","archived":false,"fork":false,"pushed_at":"2024-10-31T20:08:29.000Z","size":789,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-31T21:19:05.508Z","etag":null,"topics":["crawling","llm","scraper","serp"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apify.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-15T10:57:04.000Z","updated_at":"2024-10-31T20:59:44.000Z","dependencies_parsed_at":"2024-10-31T21:18:52.151Z","dependency_job_id":null,"html_url":"https://github.com/apify/rag-web-browser","commit_stats":null,"previous_names":["apify/actor-serp-content-crawler","apify/rag-web-browser"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Frag-web-browser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Frag-web-browser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Frag-web-browser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Frag-web-browser/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apify","download_url":"https://codeload.github.com/apify/rag-web-browser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223515307,"owners_count":17158352,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","llm","scraper","serp"],"created_at":"2024-11-07T12:35:09.720Z","updated_at":"2025-11-03T16:30:37.837Z","avatar_url":"https://github.com/apify.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌐 RAG Web Browser\n\n[![RAG Web Browser](https://apify.com/actor-badge?actor=apify/rag-web-browser)](https://apify.com/apify/rag-web-browser)\n\nThis Actor provides web browsing functionality for AI agents and LLM applications,\nsimilar to the [web browsing](https://openai.com/index/introducing-chatgpt-search/) feature in ChatGPT.\nIt accepts a search phrase or a URL, queries Google Search, then crawls web pages from the top search results, cleans the HTML, converts it to text or Markdown,\nand returns it back for processing by the LLM application.\nThe extracted text can then be injected into prompts and retrieval augmented generation (RAG) pipelines, to provide your LLM application with up-to-date context from the web.\n\n## Main features\n\n- 🚀 **Quick response times** for great user experience\n- ⚙️ Supports **dynamic JavaScript-heavy websites** using a headless browser\n- 🔄 **Flexible scraping** with Browser mode for complex websites or Plain HTML mode for faster scraping\n- 🕷 Automatically 
**bypasses anti-scraping protections** using proxies and browser fingerprints\n- 📝 Output formats include **Markdown**, plain text, and HTML\n- 🔌 Supports **OpenAPI and MCP** for easy integration\n- 🪟 It's **open source**, so you can review and modify it\n\n## Example\n\nFor a search query like `fast web browser in RAG pipelines`, the Actor will return an array with the content of the top results from Google Search, which looks like this:\n\n```json\n[\n    {\n        \"crawl\": {\n            \"httpStatusCode\": 200,\n            \"httpStatusMessage\": \"OK\",\n            \"loadedAt\": \"2024-11-25T21:23:58.336Z\",\n            \"uniqueKey\": \"eM0RDxDQ3q\",\n            \"requestStatus\": \"handled\"\n        },\n        \"searchResult\": {\n            \"title\": \"apify/rag-web-browser\",\n            \"description\": \"Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications or LLM agents to provide up-to-date ....\",\n            \"url\": \"https://github.com/apify/rag-web-browser\"\n        },\n        \"metadata\": {\n            \"title\": \"GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...\",\n            \"description\": \"RAG Web Browser is an Apify Actor to feed your LLM applications ...\",\n            \"languageCode\": \"en\",\n            \"url\": \"https://github.com/apify/rag-web-browser\"\n        },\n        \"markdown\": \"# apify/rag-web-browser: RAG Web Browser is an Apify Actor ...\"\n    }\n]\n```\n\nIf you enter a specific URL such as `https://openai.com/index/introducing-chatgpt-search/`, the Actor will extract\nthe web page content directly like this:\n\n```json\n[{\n    \"crawl\": {\n        \"httpStatusCode\": 200,\n        \"httpStatusMessage\": \"OK\",\n        \"loadedAt\": \"2024-11-21T14:04:28.090Z\"\n    },\n    \"metadata\": {\n        \"url\": \"https://openai.com/index/introducing-chatgpt-search/\",\n        \"title\": \"Introducing ChatGPT search | 
OpenAI\",\n        \"description\": \"Get fast, timely answers with links to relevant web sources\",\n        \"languageCode\": \"en-US\"\n    },\n    \"markdown\": \"# Introducing ChatGPT search | OpenAI\\n\\nGet fast, timely answers with links to relevant web sources.\\n\\nChatGPT can now search the web in a much better way than before. ...\"\n}]\n```\n\n## ⚙️ Usage\n\nThe RAG Web Browser can be used in two ways: **as a standard Actor** by passing it an input object with the settings,\nor in the **Standby mode** by sending it an HTTP request.\n\nSee the [Performance Optimization](#-performance-optimization) section below for detailed benchmarks and configuration recommendations to achieve optimal response times.\n\n### Normal Actor run\n\nYou can run the Actor \"normally\" via the Apify API, schedule, integrations, or manually in Console.\nOn start, you pass the Actor an input JSON object with settings including the search phrase or URL,\nand it stores the results to the default dataset.\nThis mode is useful for testing and evaluation, but might be too slow for production applications and RAG pipelines,\nbecause it takes some time to start the Actor's Docker container and a web browser.\nAlso, one Actor run can only handle one query, which isn't efficient.\n\n### Standby web server\n\nThe Actor also supports the [**Standby mode**](https://docs.apify.com/platform/actors/running/standby),\nwhere it runs an HTTP web server that receives requests with the search phrases and responds with the extracted web content.\nThis mode is preferred for production applications, because if the Actor is already running, it will\nreturn the results much faster. 
Additionally, in the Standby mode the Actor can handle multiple requests\nin parallel, and thus utilizes the computing resources more efficiently.\n\nTo use RAG Web Browser in the Standby mode, simply send an HTTP GET request to the following URL:\n\n```\nhttps://rag-web-browser.apify.actor/search?token=\u003cAPIFY_API_TOKEN\u003e\u0026query=hello+world\n```\n\nwhere `\u003cAPIFY_API_TOKEN\u003e` is your [Apify API token](https://console.apify.com/settings/integrations).\nNote that you can also pass the API token using the `Authorization` HTTP header with Basic authentication for increased security.\n\nThe response is a JSON array with objects containing the web content from the found web pages, as shown in the example [above](#example).\n\n#### Query parameters\n\nThe `/search` GET HTTP endpoint accepts all the input parameters [described on the Actor page](https://apify.com/apify/rag-web-browser/input-schema). Object parameters like `proxyConfiguration` should be passed as url-encoded JSON strings.\n\n\n## 🔌 Integration with LLMs\n\nRAG Web Browser has been designed for easy integration with LLM applications, GPTs, OpenAI Assistants, and RAG pipelines using function calling.\n\n### OpenAPI schema\n\nHere you can find the [OpenAPI 3.1.0 schema](https://apify.com/apify/rag-web-browser/api/openapi)\nor [OpenAPI 3.0.0 schema](https://raw.githubusercontent.com/apify/rag-web-browser/refs/heads/master/docs/standby-openapi-3.0.0.json)\nfor the Standby web server. 
Note that the OpenAPI definition contains\nall available query parameters, but only `query` is required.\nYou can remove all the other parameters from the definition if their default values are right for your application,\nin order to reduce the number of LLM tokens necessary and to reduce the risk of hallucinations in function calling.\n\n### OpenAI Assistants\n\nWhile OpenAI's ChatGPT and GPTs support web browsing natively, [Assistants](https://platform.openai.com/docs/assistants/overview) currently don't.\nWith RAG Web Browser, you can easily add the web search and browsing capability to your custom AI assistants and chatbots.\nFor detailed instructions,\nsee the [OpenAI Assistants integration](https://docs.apify.com/platform/integrations/openai-assistants#real-time-search-data-for-openai-assistant) in the Apify documentation.\n\n### OpenAI GPTs\n\nYou can easily add the RAG Web Browser to your GPTs by creating a custom action. Here's a quick guide:\n\n1. Go to [**My GPTs**](https://chatgpt.com/gpts/mine) on the ChatGPT website and click **+ Create a GPT**.\n2. Complete all required details in the form.\n3. Under the **Actions** section, click **Create new action**.\n4. In the Action settings, set **Authentication** to **API key** and choose Bearer as **Auth Type**.\n5. 
In the **schema** field, paste the [OpenAPI 3.1.0 schema](https://raw.githubusercontent.com/apify/rag-web-browser/refs/heads/master/docs/standby-openapi-3.1.0.json)\n   of the Standby web server HTTP API.\n\n![Apify-RAG-Web-Browser-custom-action](https://raw.githubusercontent.com/apify/rag-web-browser/refs/heads/master/docs/apify-gpt-custom-action.png)\n\nLearn more about [adding custom actions to your GPTs with Apify Actors](https://blog.apify.com/add-custom-actions-to-your-gpts/) on Apify Blog.\n\n### Anthropic: Model Context Protocol (MCP) Server\n\nThe RAG Web Browser Actor can also be used as an [MCP server](https://github.com/modelcontextprotocol) and integrated with AI applications and agents, such as Claude Desktop.\nFor example, in Claude Desktop, you can configure the MCP server in its settings to perform web searches and extract content.\nAlternatively, you can develop a custom MCP client to interact with the RAG Web Browser Actor.\n\nIn the Standby mode, the Actor runs an HTTP server that supports the MCP protocol via SSE (Server-Sent Events).\n\n1. Initiate SSE connection:\n    ```shell\n    curl https://rag-web-browser.apify.actor/sse?token=\u003cAPIFY_API_TOKEN\u003e\n    ```\n   On connection, you'll receive a `sessionId`:\n    ```text\n    event: endpoint\n    data: /message?sessionId=5b2\n    ```\n\n1. Send a message to the server by making a POST request with the `sessionId`, your `APIFY_API_TOKEN`, and your query:\n    ```shell\n    curl -X POST \"https://rag-web-browser.apify.actor/message?sessionId=5b2\u0026token=\u003cAPIFY_API_TOKEN\u003e\" -H \"Content-Type: application/json\" -d '{\n      \"jsonrpc\": \"2.0\",\n      \"id\": 1,\n      \"method\": \"tools/call\",\n      \"params\": {\n        \"arguments\": { \"query\": \"recent news about LLMs\", \"maxResults\": 1 },\n        \"name\": \"rag-web-browser\"\n      }\n    }'\n    ```\n   For the POST request, the server will respond with:\n    ```text\n    Accepted\n    ```\n\n1. 
Receive a response at the initiated SSE connection:\n   The server invoked `Actor` and its tool using the provided query and sent the response back to the client via SSE.\n\n    ```text\n    event: message\n    data: {\"result\":{\"content\":[{\"type\":\"text\",\"text\":\"[{\\\"searchResult\\\":{\\\"title\\\":\\\"Language models recent news\\\",\\\"description\\\":\\\"Amazon Launches New Generation of LLM Foundation Model...\\\"}}\n    ```\n\nYou can try the MCP server using the [MCP Tester Client](https://apify.com/jiri.spilka/tester-mcp-client) available on Apify. In the MCP client, simply enter the URL `https://rag-web-browser.apify.actor/sse` in the Actor input field, click **Run**, and interact with the server in a UI.\nTo learn more about MCP servers, check out the blog post [What is Anthropic's Model Context Protocol](https://blog.apify.com/what-is-model-context-protocol/).\n\n## ⏳ Performance optimization\n\nTo get the most value from RAG Web Browser in your LLM applications,\nalways use the Actor via the [Standby web server](#standby-web-server) as described above,\nand see the tips in the following sections.\n\n### Scraping tool\n\nThe **most critical performance decision** is selecting the appropriate scraping method for your use case:\n\n- **For static websites**: Use `scrapingTool=raw-http` to achieve up to 2x faster performance. 
This lightweight method directly fetches HTML without JavaScript processing.\n\n- **For dynamic websites**: Use the default `scrapingTool=browser-playwright` when targeting sites with JavaScript-rendered content or interactive elements.\n\nThis single parameter choice can significantly impact both response times and content quality, so select based on your target websites' characteristics.\n\n### Request timeout\n\nMany user-facing RAG applications impose a time limit on external functions to provide a good user experience.\nFor example, OpenAI Assistants and GPTs have a limit of [45 seconds](https://platform.openai.com/docs/actions/production#timeouts) for custom actions.\n\nTo ensure the web search and content extraction is completed within the required timeout,\nyou can set the `requestTimeoutSecs` query parameter.\nIf this timeout is exceeded, **the Actor makes the best effort to return results it has scraped up to that point**\nin order to provide your LLM application with at least some context.\n\nHere are specific situations that might occur when the timeout is reached:\n\n- The Google Search query failed =\u003e the HTTP request fails with a 5xx error.\n- The requested `query` is a single URL that failed to load =\u003e the HTTP request fails with a 5xx error.\n- The requested `query` is a search term, but one of the target web pages failed to load =\u003e the response contains at least\n  the `searchResult` for that page, which contains a URL, title, and description.\n- One of the target pages hasn't loaded dynamic content (within the `dynamicContentWaitSecs` deadline)\n  =\u003e the Actor extracts content from the currently loaded HTML.\n\n\n### Reducing response time\n\nFor low-latency applications, it's recommended to run the RAG Web Browser in Standby mode\nwith the default settings, i.e. 
with 8 GB of memory and maximum of 24 requests per run.\nNote that on the first request, the Actor takes a little time to respond (cold start).\n\nAdditionally, you can adjust the following query parameters to reduce the response time:\n\n- `scrapingTool`: Use `raw-http` for static websites or `browser-playwright` for dynamic websites.\n- `maxResults`: The lower the number of search results to scrape, the faster the response time. Just note that the LLM application might not have sufficient context for the prompt.\n- `dynamicContentWaitSecs`: The lower the value, the faster the response time. However, the important web content might not be loaded yet, which will reduce the accuracy of your LLM application.\n- `removeCookieWarnings`: If the websites you're scraping don't have cookie warnings or if their presence can be tolerated, set this to `false` to slightly improve latency.\n- `debugMode`: If set to `true`, the Actor will store latency data to results so that you can see where it takes time.\n\n\n### Cost vs. 
throughput\n\nWhen running the RAG Web Browser as a Standby web server, the Actor can process a number of requests in parallel.\nThis number is determined by the following [Standby mode](https://docs.apify.com/platform/actors/running/standby) settings:\n\n- **Max requests per run** and **Desired requests per run** - Determine how many requests can be sent by the system to one Actor run.\n- **Memory** - Determines how much memory and CPU resources the Actor run has available, and thus how many web pages it can open and process in parallel.\n\nAdditionally, the Actor manages its internal pool of web browsers to handle the requests.\nIf the Actor memory or CPU is at capacity, the pool automatically scales down, and requests\nabove the capacity are delayed.\n\nBy default, these Standby mode settings are optimized for quick response time:\n8 GB of memory and a maximum of 24 requests per run gives approximately 340 MB per web page.\nIf you prefer to optimize the Actor for cost, you can **Create task** for the Actor in Apify Console\nand override these settings. 
Just note that requests might take longer and so you should\nincrease `requestTimeoutSecs` accordingly.\n\n\n### Benchmark\n\nBelow is a typical latency breakdown for RAG Web Browser with **maxResults** set to either `1` or `3`, and various memory settings.\nThese settings allow for processing all search results in parallel.\nThe numbers below are based on the following search terms: \"apify\", \"Donald Trump\", \"boston\".\nResults were averaged for the three queries.\n\n| Memory (GB) | Max results | Latency (sec) |\n|-------------|-------------|---------------|\n| 4           | 1           | 22            |\n| 4           | 3           | 31            |\n| 8           | 1           | 16            |\n| 8           | 3           | 17            |\n\nPlease note that these results are only indicative and may vary based on the search term, target websites, and network latency.\n\n## 💰 Pricing\n\nThe RAG Web Browser is free of charge, and you only pay for the Apify platform consumption when it runs.\nThe main driver of the price is the Actor compute units (CUs), which are proportional to the amount of Actor run memory\nand run time (1 CU = 1 GB memory x 1 hour).\n\n## ⓘ Limitations and feedback\n\nThe Actor uses [Google Search](https://www.google.com/) in the United States with English language,\nand so queries like \"_best nearby restaurants_\" will return search results from the US.\n\nIf you need other regions or languages, or have some other feedback,\nplease [submit an issue](https://console.apify.com/actors/3ox4R101TgZz67sLr/issues) in Apify Console to let us know.\n\n\n## 👷🏼 Development\n\nThe RAG Web Browser Actor is open source and available on [GitHub](https://github.com/apify/rag-web-browser),\nso that you can modify and develop it yourself. 
Here are the steps to run it locally on your computer.\n\nDownload the source code:\n\n```bash\ngit clone https://github.com/apify/rag-web-browser\ncd rag-web-browser\n```\n\nInstall [Playwright](https://playwright.dev) with dependencies:\n\n```bash\nnpx playwright install --with-deps\n```\n\nAnd then you can run it locally using [Apify CLI](https://docs.apify.com/cli) as follows:\n\n```bash\nAPIFY_META_ORIGIN=STANDBY apify run -p\n```\n\nThe server will start on `http://localhost:3000` and you can send requests to it, for example:\n\n```bash\ncurl \"http://localhost:3000/search?query=example.com\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapify%2Frag-web-browser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapify%2Frag-web-browser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapify%2Frag-web-browser/lists"}