https://github.com/threepointone/llm-scraper-worker
- Host: GitHub
- URL: https://github.com/threepointone/llm-scraper-worker
- Owner: threepointone
- Created: 2025-01-07T14:11:01.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-01-09T22:43:39.000Z (12 months ago)
- Last Synced: 2025-06-29T21:37:42.243Z (6 months ago)
- Language: TypeScript
- Size: 283 KB
- Stars: 52
- Watchers: 1
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
llm-scraper-worker
---
A port of [llm-scraper](https://github.com/mishushakov/llm-scraper) to Cloudflare Workers, using the [Browser Rendering API](https://developers.cloudflare.com/browser-rendering) and the [AI SDK](https://sdk.vercel.ai/).
### Usage
Set up a browser binding in your `wrangler.toml`:
```toml
# ...
browser = { binding = "MYBROWSER" }
```
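The Worker code that follows reads two values from `env`: the `MYBROWSER` browser binding declared above and an `OPENAI_API_KEY` secret. As a minimal sketch of failing fast when either is missing, assuming those two names are the only ones required, a small check could look like this. `REQUIRED_BINDINGS` and `missingBindings` are hypothetical helpers for illustration, not part of llm-scraper-worker or Wrangler.

```typescript
// Binding names come from the README; the helper itself is hypothetical.
const REQUIRED_BINDINGS = ["MYBROWSER", "OPENAI_API_KEY"] as const;

// Returns the names of required bindings that are absent or empty in `env`.
function missingBindings(env: Record<string, unknown>): string[] {
  return REQUIRED_BINDINGS.filter((name) => {
    const value = env[name];
    return value === undefined || value === null || value === "";
  });
}
```

A Worker could call `missingBindings(env)` at the top of its handler and return an error response listing the missing names before launching a browser.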
```ts
import { z } from "zod";
import LLMScraper from "llm-scraper-worker";
import puppeteer from "@cloudflare/puppeteer";
import { createOpenAI } from "@ai-sdk/openai";

// ...later, in your worker...

// Launch a browser instance
const browser = await puppeteer.launch(env.MYBROWSER);

// Initialize LLM provider
const openai = createOpenAI({
  apiKey: env.OPENAI_API_KEY, // set this up in .dev.vars / secrets
});
const llm = openai.chat("gpt-4o");

// Create a new LLMScraper
const scraper = new LLMScraper(llm);

// Open new page
const page = await browser.newPage();
await page.goto("https://news.ycombinator.com");

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: "html",
});

await page.close();
await browser.close();

// Show the result from LLM
console.log(data);
```
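This prints output similar to the following (the actual stories depend on the Hacker News front page at the time of the run):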
```json
{
  "top": [
    {
      "title": "A 2-ply minimax chess engine in 84,688 regular expressions",
      "points": 245,
      "by": "ilya_m",
      "commentsURL": "https://news.ycombinator.com/item?id=42619652"
    },
    {
      "title": "Stimulation Clicker",
      "points": 2365,
      "by": "meetpateltech",
      "commentsURL": "https://news.ycombinator.com/item?id=42611536"
    },
    {
      "title": "AI and Startup Moats",
      "points": 37,
      "by": "vismit2000",
      "commentsURL": "https://news.ycombinator.com/item?id=42620994"
    },
    {
      "title": "How I program with LLMs",
      "points": 370,
      "by": "stpn",
      "commentsURL": "https://news.ycombinator.com/item?id=42617645"
    },
    {
      "title": "First time a Blender-made production has won the Golden Globe",
      "points": 155,
      "by": "jgilias",
      "commentsURL": "https://news.ycombinator.com/item?id=42620656"
    }
  ]
}
```
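The usage snippet closes the page and browser only on the success path, so a failed navigation or a rejected `scraper.run` call would skip both `close()` calls. One way to guard against that is a small wrapper built on `try`/`finally`. This is a sketch under stated assumptions: `withPage`, `PageLike`, and `BrowserLike` are hypothetical names, and the two interfaces describe only the methods the snippet already calls (`newPage`, `goto`, `close`), so the block stays self-contained.

```typescript
// Hypothetical structural types: just the methods used in the README snippet.
interface PageLike {
  goto(url: string): Promise<unknown>;
  close(): Promise<void>;
}

interface BrowserLike<P extends PageLike> {
  newPage(): Promise<P>;
  close(): Promise<void>;
}

// Opens a page, navigates to `url`, runs `fn`, and always closes the page
// and the browser afterwards, whether `fn` resolves or throws.
async function withPage<P extends PageLike, T>(
  browser: BrowserLike<P>,
  url: string,
  fn: (page: P) => Promise<T>
): Promise<T> {
  try {
    const page = await browser.newPage();
    try {
      await page.goto(url);
      return await fn(page);
    } finally {
      await page.close();
    }
  } finally {
    await browser.close();
  }
}
```

With the objects from the usage snippet, a call might look like `withPage(browser, "https://news.ycombinator.com", (page) => scraper.run(page, schema, { format: "html" }))`. The nested `finally` blocks mean the browser is still closed if `newPage()` itself fails.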