Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rasdani/--headful
Make a web browser multimodal, give it eyes and ears.
- Host: GitHub
- URL: https://github.com/rasdani/--headful
- Owner: rasdani
- Created: 2023-10-27T20:08:17.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-24T15:11:47.000Z (about 1 year ago)
- Last Synced: 2024-04-24T02:47:35.544Z (9 months ago)
- Language: Python
- Homepage:
- Size: 104 KB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# --headful
Make a web browser multimodal, give it eyes and ears.

Make it `--headful`, the opposite of `--headless`.

Use a vision/UI model (currently GPT4-V) to recognize website elements.
Use GPT function calling to understand commands.
Use Playwright to direct your browser.
Use Whisper to record a captioned UI element dataset for finetuning.

## Setup
- Install the requirements.
- Run `playwright install chromium` to install the browser.
- Fire up `server.py`.
- Run `python main.py` in a different terminal.

## Breakdown
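The Breakdown list mentions using bounding boxes to simulate a click in the middle of a UI element. Below is a minimal sketch of that one step, not the project's actual code. It assumes each box arrives as a dictionary with `x`, `y`, `width`, and `height` in viewport pixels (the shape Playwright's `bounding_box()` returns; the payload the Vimium fork really sends may differ) and that `page` is a Playwright `Page`. The helper names `bbox_center` and `click_center` are hypothetical.

```python
from typing import Dict, Tuple


def bbox_center(box: Dict[str, float]) -> Tuple[float, float]:
    """Return the (x, y) midpoint of a box given as x, y, width, height."""
    # Assumed box format: top-left corner plus size, in viewport pixels.
    return box["x"] + box["width"] / 2, box["y"] + box["height"] / 2


def click_center(page, box: Dict[str, float]) -> Tuple[float, float]:
    """Click the middle of `box` on a Playwright page and return the point used."""
    x, y = bbox_center(box)
    # Playwright's Mouse.click(x, y) clicks at viewport coordinates.
    page.mouse.click(x, y)
    return x, y
```

With a page opened through Playwright's sync API, `click_center(page, {"x": 100, "y": 40, "width": 80, "height": 20})` would click at the point `(140.0, 50.0)`. Clicking by coordinates sidesteps the key bindings entirely, which is presumably why it serves as a fallback when the Vimium hint fails.
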
- GPT function calling and `instructor` translate the user request into a URL to visit.
- Hitting 'f' fires up a Vimium fork and highlights clickable elements on the page.
- The Vimium fork renders bounding boxes for clickable elements and sends their coordinates to a Flask server.
- Even clicking via Vimium key bindings fails on rare occasions. The bounding boxes can be used instead to simulate a click in the middle of a UI element.
- A screenshot is taken and sent to GPT4-Vision together with the user request.
- GPT4-Vision selects a webpage element and the user can confirm the click.

In progress:
- Record and transcribe audio during normal browsing to create a captioned UI dataset for finetuning.

## Takeaways & resources
Check out the 'experiments' branch or these links for a bunch of things I tried out:
- [UIED](https://github.com/MulongXie/UIED)
  - Accurate bounding boxes, but not for all UI elements.
- Adept's [Fuyu-8B](https://huggingface.co/adept/fuyu-8b) to detect the UI element bounding box for a user request.
  - See the [HF space](https://huggingface.co/spaces/adept/fuyu-8b-demo/blob/beaba43434072de08478e7a1f90e621ece81aa93/app.py#L67) for how to properly prompt for bounding boxes. Works primarily for written text, so more akin to OCR.
  - Tried hard-coding `` tags into their transformers class to force it to return bounding boxes for non-text UI elements. Got bounding boxes for any request, but they were not accurate.
- [RICO dataset](https://github.com/google-research-datasets/uibert/tree/main/ref_exp)
  - This UI detection task is called RefExp.
- [pix2struct](https://github.com/google-research/pix2struct/tree/main)-refexp [base](https://huggingface.co/gitlost-murali/pix2struct-refexp-base) [large](https://huggingface.co/gitlost-murali/pix2struct-refexp-large)
  - more lightweight than vision LLMs/large multimodal models
  - finetuning/inference should be easier/faster
- GPT4-Vision dropped
  - hit or miss with Vimium labels
  - tried passing before and after screenshots in case labels occlude UI elements
  - tried bounding boxes as a visual aid
  - mobile user agent for slimmed-down websites
  - tried a single highlighted UI element per image and batched requests for simple yes/no classification; turns out ChatCompletion doesn't support batching ([without workarounds](https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py))!
  - tried cutouts of single UI elements
  - some of these are quite slow and all are still hit or miss
- One could try one of the more recent large OSS multimodal models
  - e.g. [BakLLaVA](https://huggingface.co/SkunkworksAI/BakLLaVA-1), [CogVLM](https://github.com/THUDM/CogVLM)
  - finetune them with your own captioned data

## Similar projects
- [globot](https://github.com/Globe-Engineer/globot)
- [vimGPT](https://github.com/ishan0102/vimGPT)