https://github.com/replicate/cog-vila
Cog wrapper for VILA
- Host: GitHub
- URL: https://github.com/replicate/cog-vila
- Owner: replicate
- License: apache-2.0
- Created: 2024-03-13T13:55:09.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-13T14:05:04.000Z (about 2 years ago)
- Last Synced: 2025-06-06T05:06:00.363Z (9 months ago)
- Language: Python
- Size: 3.2 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
## VILA
Cog wrapper for VILA, a visual language model (VLM) pretrained on interleaved image-text data. See the [paper](https://arxiv.org/abs/2312.07533), [official repo](https://github.com/Efficient-Large-Model/VILA) and Replicate [demos](https://replicate.com/adirik/vila-13b) for details.
## How to use the API
You need to have Cog and Docker installed to run this model locally. To build the Docker container with Cog and run a prediction:
```
cog predict -i image=@sample_images/1.jpg -i prompt="Can you describe this image?"
```
To start a server and send requests to your locally or remotely deployed API:
```
cog run -p 5000 python -m cog.server.http
```
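Once the server is up, predictions are requested over HTTP. Below is a minimal request-building sketch in Python, assuming the standard Cog prediction endpoint (`POST /predictions`) on port 5000; the dummy bytes stand in for a real image file:

```python
import base64

def image_to_data_uri(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URI; Cog's HTTP API accepts
    file inputs as either URLs or data URIs."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Replace the dummy bytes with real data, e.g.
# open("sample_images/1.jpg", "rb").read().
payload = {
    "input": {
        "image": image_to_data_uri(b"<raw image bytes>"),
        "prompt": "Can you describe this image?",
    }
}

# Send the payload to the running server, e.g.:
# requests.post("http://localhost:5000/predictions", json=payload)
```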
To use VILA, provide an image and a text prompt. The response is generated by decoding the model's output using beam search with the specified parameters. The input arguments to the API are as follows:
- **image:** The image to discuss.
- **prompt:** The query to generate a response for.
- **top_p:** When decoding text, samples from the smallest set of tokens whose cumulative probability exceeds top_p; lower values ignore less likely tokens.
- **temperature:** When decoding text, higher values make the output more random and creative; lower values make it more deterministic.
- **num_beams:** Number of beams to use for beam search when decoding text; higher values are slower but can yield better responses.
- **max_tokens:** Maximum number of tokens to generate.
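To make the sampling parameters concrete, here is a rough illustration of top-p filtering and temperature scaling over a toy token distribution (this is not the model's actual implementation, and the probabilities are made up):

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. Illustrative only."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in items:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept}

def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Re-weight probabilities: temperature > 1 flattens the distribution
    (more creative), temperature < 1 sharpens it. Illustrative only."""
    weights = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    z = sum(weights.values())
    return {tok: w / z for tok, w in weights.items()}

# Toy distribution: with top_p=0.9, "zebra" is filtered out.
probs = {"cat": 0.5, "dog": 0.3, "axolotl": 0.15, "zebra": 0.05}
print(top_p_filter(probs, top_p=0.9))
print(apply_temperature(probs, temperature=2.0))
```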
## References
```
@misc{lin2023vila,
  title={VILA: On Pre-training for Visual Language Models},
  author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
  year={2023},
  eprint={2312.07533},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```