Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/b4rtaz/html2llm

An experimental project to convert HTML websites into a format compatible with large language models (LLMs), enabling seamless website navigation and content reading.
https://github.com/b4rtaz/html2llm

automation browser-automation llm vision yolov8

Last synced: 7 days ago
JSON representation

An experimental project to convert HTML websites into a format compatible with large language models (LLMs), enabling seamless website navigation and content reading.

Host: GitHub
URL: https://github.com/b4rtaz/html2llm
Owner: b4rtaz
License: mit
Created: 2024-11-30T15:13:28.000Z (2 months ago)
Default Branch: main
Last Pushed: 2024-12-01T12:52:34.000Z (about 2 months ago)
Last Synced: 2025-01-17T14:59:14.365Z (13 days ago)
Topics: automation, browser-automation, llm, vision, yolov8
Language: TypeScript
Homepage: https://b4rtaz.github.io/html2llm/app-website.html
Size: 10.1 MB
Stars: 18
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

![html2llm](.github/cover.png)

# html2llm

This project is an experiment aimed at converting an HTML website into a format understandable by large language models (LLMs). The output can be used for various purposes, such as website navigation or content reading. The project incorporates elements of Microsoft's [OmniParser](https://github.com/microsoft/OmniParser) release and operates in the browser using WebAssembly. Surprisingly, it performs quite efficiently, with inference taking less than 300ms on my Mac M1.

Demos:

* [⭕ OmniParser WebAssembly](https://b4rtaz.github.io/html2llm/omni-parser-webassembly.html) - a demo of YOLOv8 icon detection using WebAssembly
* [📺 App Website](https://b4rtaz.github.io/html2llm/app-website.html) - a demo of detecting UI elements by combining YOLOv8 with DOM tree traversal

## 🚧 Idea

The OmniParser released by Microsoft operates in three steps:

`OCR -> Icon Detection -> Icon/Box Captioning`

This approach enables control over almost any interface. However, it comes with a significant computational cost, particularly in the final step, which is the most resource-intensive part of the pipeline. The icon detection step requires 6.1MB of weights, while the icon captioning step demands 1GB of weights.

Interestingly, in a browser environment, the first and last step can be skipped because we can traverse the DOM tree to extract this information directly. Surprisingly, the second step, which uses YOLOv8, performs efficiently in the browser thanks to [WebAssembly](https://github.com/Hyuto/yolov8-onnxruntime-web).

From the universal approach, we derived the following process:

`Screenshot Capturing -> Icon Detection (OmniParser WebAssembly) -> Icon/Box Captioning via Traversing DOM Tree`

Now we have two problems:

* how to capture a screenshot of the website ([captureVisibleTab](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/tabs/captureVisibleTab) via a browser extension, [getScreenshotAs](https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/TakesScreenshot.html) via Selenium, etc.)
* how to resolve found bounding boxes to useful information (this is definitely not trivial, this part is resolved in this project by the [element extractor](html2llm/src/element-extractor/element-extractor.ts)).

This project is on a very early stage.

## 🚀 How to Run on Any Page?

You can do it by using the [Playwright App Demo](./demos/playwright-app/).

1. Clone the repository.
2. Install all dependencies `pnpm install`.
3. Run `cd demos/playwright-app`.
4. Run `pnpm start `. For example, `pnpm start https://www.google.com`.

## 💡 License

This project is released under the MIT license.

The used part of the OmniParser is released under the [Creative Commons Attribution 4.0 International license](https://github.com/microsoft/OmniParser/blob/master/LICENSE).