{"id":13454108,"url":"https://github.com/mishushakov/llm-scraper","last_synced_at":"2025-05-14T03:06:42.328Z","repository":{"id":234769546,"uuid":"789480255","full_name":"mishushakov/llm-scraper","owner":"mishushakov","description":"Turn any webpage into structured data using LLMs","archived":false,"fork":false,"pushed_at":"2024-08-30T17:36:16.000Z","size":209,"stargazers_count":4828,"open_issues_count":14,"forks_count":281,"subscribers_count":27,"default_branch":"main","last_synced_at":"2025-05-07T00:14:08.420Z","etag":null,"topics":["ai","artificial-intelligence","browser","browser-automation","gpt","gpt-4","langchain","llama","llm","openai","playwright","puppeteer","scraper"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mishushakov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-20T17:06:57.000Z","updated_at":"2025-05-06T20:23:22.000Z","dependencies_parsed_at":"2024-04-20T18:43:18.094Z","dependency_job_id":"3f22c093-600a-4408-a581-fdb57eb52af1","html_url":"https://github.com/mishushakov/llm-scraper","commit_stats":{"total_commits":77,"total_committers":3,"mean_commits":"25.666666666666668","dds":"0.038961038961038974","last_synced_commit":"4e601130c60d030b3e29f25c4cac0207afe2bdd5"},"previous_names":["mishushakov/llm-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mishushakov%2Fllm-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/mishushakov%2Fllm-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mishushakov%2Fllm-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mishushakov%2Fllm-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mishushakov","download_url":"https://codeload.github.com/mishushakov/llm-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253501851,"owners_count":21918326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","browser","browser-automation","gpt","gpt-4","langchain","llama","llm","openai","playwright","puppeteer","scraper"],"created_at":"2024-07-31T08:00:50.946Z","updated_at":"2025-05-14T03:06:42.282Z","avatar_url":"https://github.com/mishushakov.png","language":"TypeScript","funding_links":[],"categories":["🤖 AI-Powered Scraping","TypeScript","Research \u0026 Data Analysis","网络服务","ai","Browser-extensions","browser","Repos","openai","AI Web Scrapers/Crawlers"],"sub_categories":["网络爬虫","Dev Tools"],"readme":"# LLM Scraper\n\n\u003cimg width=\"1800\" alt=\"Screenshot 2024-04-20 at 23 11 16\" src=\"https://github.com/mishushakov/llm-scraper/assets/10400064/ab00e048-a9ff-43b6-81d5-2e58090e2e65\"\u003e\n\nLLM Scraper is a TypeScript library that allows you to extract structured data from **any** webpage using LLMs.\n\n\u003e [!IMPORTANT]\n\u003e [Code-generation](#code-generation) is now supported in LLM Scraper.\n\n\u003e [!TIP]\n\u003e Under the hood, it uses 
function calling to convert pages to structured data. You can find more about this approach [here](https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction).\n\n### Features\n\n- Supports **Local (Ollama, GGUF)**, OpenAI, Vercel AI SDK Providers\n- Schemas defined with Zod\n- Full type-safety with TypeScript\n- Based on Playwright framework\n- Streaming objects\n- **NEW** [Code-generation](#code-generation)\n- Supports 4 formatting modes:\n  - `html` for loading raw HTML\n  - `markdown` for loading markdown\n  - `text` for loading extracted text (using [Readability.js](https://github.com/mozilla/readability))\n  - `image` for loading a screenshot (multi-modal only)\n\n**Make sure to give it a star!**\n\n\u003cimg width=\"165\" alt=\"Screenshot 2024-04-20 at 22 13 32\" src=\"https://github.com/mishushakov/llm-scraper/assets/10400064/11e2a79f-a835-48c4-9f85-5c104ca7bb49\"\u003e\n\n## Getting started\n\n1. Install the required dependencies from npm:\n\n   ```\n   npm i zod playwright llm-scraper\n   ```\n\n2. Initialize your LLM:\n\n   **OpenAI**\n\n   ```\n   npm i @ai-sdk/openai\n   ```\n\n   ```js\n   import { openai } from '@ai-sdk/openai'\n\n   const llm = openai.chat('gpt-4o')\n   ```\n\n   **Groq**\n\n   ```\n   npm i @ai-sdk/openai\n   ```\n\n   ```js\n   import { createOpenAI } from '@ai-sdk/openai'\n   const groq = createOpenAI({\n     baseURL: 'https://api.groq.com/openai/v1',\n     apiKey: process.env.GROQ_API_KEY,\n   })\n\n   const llm = groq('llama3-8b-8192')\n   ```\n\n   **Ollama**\n\n   ```\n   npm i ollama-ai-provider\n   ```\n\n   ```js\n   import { ollama } from 'ollama-ai-provider'\n\n   const llm = ollama('llama3')\n   ```\n\n   **GGUF**\n\n   ```js\n   import { LlamaModel } from 'node-llama-cpp'\n\n   const llm = new LlamaModel({ modelPath: 'model.gguf' })\n   ```\n\n3. 
Create a new scraper instance, passing it the LLM:\n\n   ```js\n   import LLMScraper from 'llm-scraper'\n\n   const scraper = new LLMScraper(llm)\n   ```\n\n## Example\n\nIn this example, we're extracting top stories from Hacker News:\n\n```ts\nimport { chromium } from 'playwright'\nimport { z } from 'zod'\nimport { openai } from '@ai-sdk/openai'\nimport LLMScraper from 'llm-scraper'\n\n// Launch a browser instance\nconst browser = await chromium.launch()\n\n// Initialize LLM provider\nconst llm = openai.chat('gpt-4o')\n\n// Create a new LLMScraper\nconst scraper = new LLMScraper(llm)\n\n// Open new page\nconst page = await browser.newPage()\nawait page.goto('https://news.ycombinator.com')\n\n// Define schema to extract contents into\nconst schema = z.object({\n  top: z\n    .array(\n      z.object({\n        title: z.string(),\n        points: z.number(),\n        by: z.string(),\n        commentsURL: z.string(),\n      })\n    )\n    .length(5)\n    .describe('Top 5 stories on Hacker News'),\n})\n\n// Run the scraper\nconst { data } = await scraper.run(page, schema, {\n  format: 'html',\n})\n\n// Show the result from LLM\nconsole.log(data.top)\n\nawait page.close()\nawait browser.close()\n```\n\n## Streaming\n\nReplace the `run` call with `stream` to get a partial object stream (Vercel AI SDK only).\n\n```ts\n// Run the scraper in streaming mode\nconst { stream } = await scraper.stream(page, schema)\n\n// Stream the result from LLM\nfor await (const data of stream) {\n  console.log(data.top)\n}\n```\n\n## Code-generation\n\nUsing the `generate` function, you can generate a reusable Playwright script that scrapes the contents according to a schema.\n\n```ts\n// Generate code and run it on the page\nconst { code } = await scraper.generate(page, schema)\nconst result = await page.evaluate(code)\nconst data = schema.parse(result)\n\n// Show the parsed result (the schema above defines a `top` field)\nconsole.log(data.top)\n```\n\n## Contributing\n\nAs an open-source project, we welcome contributions 
from the community. If you are experiencing any bugs or want to add some improvements, please feel free to open an issue or pull request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmishushakov%2Fllm-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmishushakov%2Fllm-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmishushakov%2Fllm-scraper/lists"}