https://github.com/kyegomez/screenai

Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding"

ai artificial-intelligence attention attention-is-all-you-need gpt-4 machine-learning ml pytorch tensorflow

Host: GitHub
URL: https://github.com/kyegomez/screenai
Owner: kyegomez
License: mit
Created: 2024-02-08T06:48:49.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-01-27T10:03:27.000Z (24 days ago)
Last Synced: 2025-02-09T03:01:59.239Z (11 days ago)
Topics: ai, artificial-intelligence, attention, attention-is-all-you-need, gpt-4, machine-learning, ml, pytorch, tensorflow
Language: Python
Homepage: https://discord.gg/GYbXvDGevY
Size: 2.18 MB
Stars: 317
Watchers: 8
Forks: 31
Open Issues: 4
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

README

[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)

# Screen AI

Implementation of the ScreenAI model from the paper ["A Vision-Language Model for UI and Infographics Understanding"](https://arxiv.org/abs/2402.04615). The flow is:

image + text -> patches -> ViT -> embed + concat -> attention + FFN -> cross-attention + FFN + self-attention -> output
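The flow can also be read as a sequence of tensor shapes. The following is a minimal sketch rather than the package's code: the function name `trace_flow` and the stage labels are hypothetical, and it assumes non-overlapping square patches that are flattened, projected to `dim`, and concatenated with the text tokens along the sequence axis. The final output stage is left out because its shape is not stated here.

```python
def trace_flow(batch, channels, image_size, patch_size, dim, text_len):
    """Return (stage, shape) pairs for a hypothetical pass through the flow."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    # Assumption: the image is cut into a square grid of non-overlapping patches.
    num_patches = (image_size // patch_size) ** 2
    # Each flattened patch holds every pixel of every channel in that patch.
    patch_dim = channels * patch_size * patch_size
    # Assumption: image and text tokens are joined along the sequence axis.
    seq_len = num_patches + text_len

    return [
        ("image", (batch, channels, image_size, image_size)),
        ("text", (batch, text_len, dim)),
        ("flattened patches", (batch, num_patches, patch_dim)),
        ("ViT output", (batch, num_patches, dim)),
        ("embed + concat", (batch, seq_len, dim)),
        ("attention + FFN", (batch, seq_len, dim)),
        ("cross-attention + FFN + self-attention", (batch, seq_len, dim)),
    ]


# Sizes taken from the usage example: a 224x224 RGB image and one 512-d text token.
for stage, shape in trace_flow(
    batch=1, channels=3, image_size=224, patch_size=16, dim=512, text_len=1
):
    print(f"{stage:<40} {shape}")
```

With these sizes the image becomes 196 patches of 768 values each, and the joined sequence has 197 tokens of width 512.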

## Install

`pip3 install screenai`

## Usage

```python
import torch
from screenai.main import ScreenAI

# Create a tensor for the image
image = torch.rand(1, 3, 224, 224)

# Create a tensor for the text
text = torch.randn(1, 1, 512)

# Create an instance of the ScreenAI model with specified parameters
model = ScreenAI(
    patch_size=16,
    image_size=224,
    dim=512,
    depth=6,
    heads=8,
    vit_depth=4,
    multi_modal_encoder_depth=4,
    llm_decoder_depth=4,
    mm_encoder_ff_mult=4,
)

# Perform forward pass of the model with the given text and image tensors
out = model(text, image)

# Print the output tensor
print(out)
```
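Before instantiating the model, it can help to sanity-check the constructor arguments. The sketch below is an illustration only: `derived_sizes` is a hypothetical helper that is not part of `screenai`, and it assumes the conventional meanings of the parameters (attention splits `dim` evenly across `heads`, and `mm_encoder_ff_mult` scales the feed-forward hidden width). The package may handle these differently.

```python
def derived_sizes(image_size: int, patch_size: int, dim: int, heads: int, ff_mult: int) -> dict:
    """Validate a configuration and return sizes derived from it (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if dim % heads != 0:
        raise ValueError("dim must be divisible by heads")

    return {
        # Number of patches along one side of the image.
        "patches_per_side": image_size // patch_size,
        # Assumption: each attention head works on an equal slice of dim.
        "head_dim": dim // heads,
        # Assumption: the feed-forward hidden width is dim * ff_mult.
        "ff_hidden_dim": dim * ff_mult,
    }


# Values from the usage example above.
sizes = derived_sizes(image_size=224, patch_size=16, dim=512, heads=8, ff_mult=4)
print(sizes)
# {'patches_per_side': 14, 'head_dim': 64, 'ff_hidden_dim': 2048}
```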

## License

MIT

## Citation

```bibtex
@misc{baechler2024screenai,
    title={ScreenAI: A Vision-Language Model for UI and Infographics Understanding},
    author={Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma},
    year={2024},
    eprint={2402.04615},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```

## Todo

- [ ] Implement the encoder and decoder layer stacks with `nn.ModuleList`
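
As a rough idea of what the item above could look like, here is a minimal sketch of a layer stack held in `nn.ModuleList`. It is not the repository's implementation: `EncoderStack` is a hypothetical class, and PyTorch's built-in `nn.TransformerEncoderLayer` stands in for whatever layers the encoder and decoder actually use.

```python
import torch
from torch import nn


class EncoderStack(nn.Module):
    """Hypothetical stack of identical Transformer encoder layers."""

    def __init__(self, dim: int = 512, heads: int = 8, depth: int = 4, ff_mult: int = 4):
        super().__init__()
        # nn.ModuleList registers each layer as a submodule, so its parameters
        # appear in .parameters() and follow .to(device). A plain Python list
        # would hide them from the optimizer.
        self.layers = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d_model=dim,
                    nhead=heads,
                    dim_feedforward=dim * ff_mult,
                    batch_first=True,  # inputs are (batch, sequence, dim)
                )
                for _ in range(depth)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply the layers in order; each one preserves the input shape.
        for layer in self.layers:
            x = layer(x)
        return x


stack = EncoderStack()
tokens = torch.randn(1, 197, 512)  # (batch, sequence, dim), sizes chosen for illustration
out = stack(tokens)
print(out.shape)
```

Keeping the layers in a `nn.ModuleList` rather than `nn.Sequential` leaves room to pass extra arguments (for example a mask or cross-attention context) to each layer inside the loop.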