Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ainoya/cloudflare-dom-distiller
A Cloudflare Workers-based API for extracting and converting web page content to Markdown using DOM-Distiller and Turndown.
cloudflare cloudflare-workers puppeteer
- Host: GitHub
- URL: https://github.com/ainoya/cloudflare-dom-distiller
- Owner: ainoya
- Created: 2024-06-11T09:39:04.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-09-29T00:22:55.000Z (3 months ago)
- Last Synced: 2024-09-30T15:18:01.286Z (3 months ago)
- Topics: cloudflare, cloudflare-workers, puppeteer
- Language: TypeScript
- Homepage:
- Size: 302 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Cloudflare DOM Distiller
This repository implements an API, running on Cloudflare Workers, that retrieves the content of target web pages.
## Features
- **Cloudflare Workers & Browser Rendering**: Uses Cloudflare Workers and Cloudflare Browser Rendering to load the target page and fetch its content.
- **Readability**: Uses Readability to extract page content and remove unnecessary information.
- **DOM-Distiller**: When a request sets the option `useReadability: false`, dom-distiller is used instead of Readability to extract page content and remove unnecessary information.
- **Turndown**: Converts the extracted HTML to Markdown format for better readability.
## Example Usage
To run the API in development mode:
```bash
npx wrangler dev --remote
```
You can make a request to your local server and verify that the content of the target web page is converted to Markdown format:
```bash
$ curl -H 'Content-Type: application/json' \
-X POST http://localhost:8787/distill \
-d '{"url": "https://blog.samaltman.com/gpt-4o", "markdown": true}'
{"body":"There ... to the team that poured so much work into making this happen!"}
```
## Endpoint: `/distill`
### Request Format
- **url**: The URL of the target web page to fetch content from.
- **markdown**: Boolean value to indicate whether the content should be converted to Markdown format.
### Response Format
- **body**: The extracted content of the web page, converted to Markdown when `markdown` is `true`.
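
The request and response shapes above can be sketched as a small TypeScript client. This is a minimal sketch, not code from the repository: the names `DistillRequest`, `DistillResponse`, `FetchLike`, `buildDistillRequest`, and `distill` are hypothetical, the placement of `useReadability` as a top-level JSON field is an assumption based on the Features list, and the behavior on non-2xx responses is assumed rather than documented.

```typescript
// Hypothetical types inferred from the Request/Response Format sections.
interface DistillRequest {
  url: string;              // target page to fetch content from
  markdown?: boolean;       // convert the extracted content to Markdown
  useReadability?: boolean; // assumption: false selects dom-distiller
}

interface DistillResponse {
  body: string;
}

// Minimal fetch-compatible signature so the client can be exercised
// without a network. The global fetch() satisfies this type.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the POST request for /distill without sending it.
function buildDistillRequest(baseUrl: string, req: DistillRequest) {
  return {
    endpoint: baseUrl.replace(/\/+$/, "") + "/distill",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(req),
    },
  };
}

// Send the request and return the `body` field of the JSON response.
async function distill(
  baseUrl: string,
  req: DistillRequest,
  fetchImpl: FetchLike,
): Promise<string> {
  const { endpoint, init } = buildDistillRequest(baseUrl, req);
  const res = await fetchImpl(endpoint, init);
  if (!res.ok) {
    // Assumption: any non-2xx status is treated as a failure.
    throw new Error(`distill request failed with HTTP ${res.status}`);
  }
  const data = (await res.json()) as Partial<DistillResponse>;
  if (typeof data.body !== "string") {
    throw new Error("response JSON has no string `body` field");
  }
  return data.body;
}
```

Against a local `wrangler dev` server, a call might look like `await distill("http://localhost:8787", { url: "https://blog.samaltman.com/gpt-4o", markdown: true }, fetch)`. Passing the fetch implementation as a parameter keeps the sketch testable with a stub.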
## References
- [mixmark-io/turndown: An HTML to Markdown converter written in JavaScript](https://github.com/mixmark-io/turndown)
- [mozilla/readability: A standalone version of the readability lib](https://github.com/mozilla/readability)
- [chromium/dom-distiller: Distills the DOM](https://github.com/chromium/dom-distiller)
- [Puppeteer · Browser Rendering docs](https://developers.cloudflare.com/browser-rendering/platform/puppeteer/)