https://github.com/ssube/label-prompt-caption
- Host: GitHub
- URL: https://github.com/ssube/label-prompt-caption
- Owner: ssube
- Created: 2024-09-09T00:26:42.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-09T17:18:46.000Z (almost 2 years ago)
- Last Synced: 2025-04-09T02:19:47.696Z (about 1 year ago)
- Topics: annotations, captioning, captioning-images, dataset, llama3, llm, vlm
- Language: Python
- Homepage:
- Size: 127 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Label-Prompt-Caption Studio
- [Label-Prompt-Caption Studio](#label-prompt-caption-studio)
  - [What](#what)
    - [Method](#method)
    - [Models](#models)
  - [Why](#why)
  - [How](#how)
    - [Setup](#setup)
    - [Usage](#usage)
    - [Configuration](#configuration)
    - [Metadata](#metadata)
  - [TODOs](#todos)
## What
This is a Gradio UI for captioning small and medium datasets containing hundreds or thousands of images using a variety
of natural language and keyword/tag captioning models.
I have prepared [a dataset of animals in hats](https://huggingface.co/datasets/ssube/animals-in-hats) that can be used
to demonstrate the labels and the UI.
### Method
The name describes the captioning method:
1. Some **Labels** for critical details are applied to each image by humans or AI, such as `animal: duck` and `hat: red rain hat`
2. A **Prompt** is created for each image using a prompt template and the image labels, such as `Describe this image of a duck wearing a red rain hat.`
3. A **Caption** is generated by passing the prompt to Florence, Joy, or another captioning model, such as `a cartoon picture of duck wearing red rain hat. The image is a digital drawing in a cartoon style, featuring a cheerful, anthropomorphic duck character.`
You can add a prefix or suffix to the caption using [a Jinja
template](https://jinja.palletsprojects.com/en/3.0.x/templates/). Caption and prompt templates use the same syntax and
have access to the same image labels. Caption templates also receive the `{{ caption }}` variable, which holds the
caption returned by the model.
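As a rough illustration of the label, prompt, and caption flow, the sketch below renders both templates with the `jinja2` package. The template strings and the stand-in model output are made up for this example, and no captioning model is actually called.

```python
from jinja2 import Template

# Labels from step 1, taken from the example above.
labels = {"animal": "duck", "hat": "red rain hat"}

# Hypothetical templates; in the UI these are configured per group.
prompt_template = Template("Describe this image of a {{ animal }} wearing a {{ hat }}.")
caption_template = Template("picture of a {{ animal }}. {{ caption }}")

# Step 2: build the prompt from the labels.
prompt = prompt_template.render(**labels)

# Step 3 would send `prompt` to a captioning model; a fixed string stands in here.
model_output = "A cheerful cartoon duck in a red rain hat."

# The caption template sees the same labels plus `caption`.
caption = caption_template.render(caption=model_output, **labels)

print(prompt)
print(caption)
```

Because the prefix lives in the caption template, it can be changed for a whole group without re-running the model.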
### Models
This uses a few different ML models:
- https://huggingface.co/microsoft/Florence-2-large-ft for captioning
- https://huggingface.co/spaces/fancyfeast/joy-caption-pre-alpha and https://huggingface.co/meta-llama/Meta-Llama-3.1-8B for captioning
  - compatible with other Llama models, including ones that do not require personal information
- https://github.com/vikhyat/moondream for captioning and question answering
## Why
Using labels to describe critical details and passing those labels on to the prompt helps the captioning model to
avoid mistakes and hallucinations.
Mistakes in the captions can cause problems later during training, especially if large or focal details are
mis-identified. Providing additional detail in the prompt can also help the captioning model to identify concepts it
is not familiar with.
## How
The required labels are defined for each group, along with templates for the image caption and templates for each
captioning model's prompt. The image labels are used to format those templates, providing more information to the
captioning models.
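One way to picture the required-label check is a small helper that compares a group's required labels against an image's annotations. This is only a sketch: `missing_labels` is a hypothetical name, and the annotation shape is borrowed from the `meta.yaml` example in the Metadata section.

```python
def missing_labels(required_labels, annotations):
    """Return the required labels that have no non-empty annotation value."""
    present = {a["label"] for a in annotations if a.get("value")}
    return [label for label in required_labels if label not in present]


# Example data in the shape used by meta.yaml.
required = ["style", "animal", "hat"]
annotations = [
    {"label": "animal", "value": "duck"},
    {"label": "hat", "value": "red rain hat"},
]

# `style` has not been annotated yet, so it is reported as missing.
print(missing_labels(required, annotations))
```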
### Setup
Clone this repository:
```shell
> git clone git@github.com:ssube/label-prompt-caption.git
```
Set up a virtual environment:
```shell
> python3 -m venv venv
```
Install the requirements in the virtual environment:
```shell
> source venv/bin/activate
> pip3 install -r requirements.txt
```
### Usage
Using the virtual environment, run the server:
```shell
> source venv/bin/activate
> python3 -m lpc
```
Open the web UI in your browser. A link to the web UI will be shown in the logs, usually http://127.0.0.1:7860/.
1. On the `Dataset` tab, enter the `Base Path` for your dataset.
    1. This is the top-level directory which contains all of the images and group sub-directories.
2. Press the `Load Groups` button
    1. This will scan the dataset directory for images matching the `Image Formats`
3. Press the `View Group` button next to a group
4. Switch to the `Group` tab
5. Press the `Load Group` button
    1. This will load four additional sections: `Group Captions`, `Group Prompts`, `Group Taxonomy`, and `Group Images`
6. Provide a `Caption Template`
    1. You can use the template to add a prefix to every caption in the group, like `picture of {{ subject }}. {{ caption }}`
    2. The `{{ caption }}` variable will be set to the captioning model's output
7. Provide one or more `Group Prompts`
    1. For Florence, you can use one of `<CAPTION>`, `<DETAILED_CAPTION>`, or `<MORE_DETAILED_CAPTION>`
    2. For Joy, `Write a detailed description for this image of {{ subject }}.` is a good default, but you can modify
       the prompt to include more details, the mood of the image, or any other helpful information.
8. Add any required labels to the `Group Taxonomy`
    1. These should include any variables in your `Caption Template` and `Group Prompts`, like `subject` in this example
    2. You do not need to include `caption` here
9. Select an image
10. Switch to the `Image` tab
11. Add annotations for any missing labels
    1. In this example, the `subject` might be a `dog`
12. The `Image Prompts` will show your `Group Prompts` templated with the labels and values from the image annotations
13. Press one of the `Caption with Florence` or `Caption with Joy` buttons
14. The `Image Caption` should update with a new caption describing your selected image
15. Modify the caption until it accurately describes the image
    1. The `Shuffle Phrases` button will split the caption on commas and randomly shuffle the resulting phrases
    2. The `Remove Newlines` button will remove any newlines in the caption
    3. The `Strip Partial Phrases` button will remove any text after the last `.`, in case the captioning model returned
       an incomplete phrase at the end of the caption
16. Press the `Save Image Caption` button to save the caption to a `.txt` file
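
The three caption cleanup buttons in step 15 can be approximated with plain string operations. The functions below are a minimal sketch based on the descriptions above, not the project's actual code; the function names are hypothetical.

```python
import random


def shuffle_phrases(caption, rng=random):
    """Split the caption on commas, shuffle the phrases, and rejoin them."""
    phrases = [p.strip() for p in caption.split(",") if p.strip()]
    rng.shuffle(phrases)
    return ", ".join(phrases)


def remove_newlines(caption):
    """Join the non-empty lines of the caption with single spaces."""
    return " ".join(line.strip() for line in caption.splitlines() if line.strip())


def strip_partial_phrase(caption):
    """Drop any text after the last period; leave period-free captions unchanged."""
    end = caption.rfind(".")
    return caption if end == -1 else caption[: end + 1]


print(strip_partial_phrase("A duck in a red rain hat. The duck is"))
print(remove_newlines("a cartoon duck\nwearing a hat\n"))
print(shuffle_phrases("duck, red rain hat, cartoon"))
```

Passing a seeded `random.Random` instance as `rng` makes the shuffle repeatable, which can be handy when comparing captions.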
### Configuration
If you are not comfortable sharing your contact information with Meta, you can use an alternative Llama model by
setting the `LPC_LLAMA_MODEL` environment variable. For example:
```shell
> export LPC_LLAMA_MODEL=cognitivecomputations/dolphin-2.9.4-llama3.1-8b
```
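On the Python side, an override like this is typically read with a fallback default. The sketch below assumes the default is the Meta Llama model listed under Models; the helper name and the fallback logic are illustrative and may not match how the project reads the variable.

```python
import os

# Assumed default: the Llama model listed in the Models section.
DEFAULT_LLAMA_MODEL = "meta-llama/Meta-Llama-3.1-8B"


def llama_model_name(environ=os.environ):
    """Return the model ID from LPC_LLAMA_MODEL, or the default if unset or empty."""
    return environ.get("LPC_LLAMA_MODEL") or DEFAULT_LLAMA_MODEL


print(llama_model_name())
```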
### Metadata
For ease of editing, the metadata is stored in a `meta.yaml` file in each directory where images were found:
```yaml
group:
  caption: a {{ style }} picture of {{ animal }} wearing {{ hat }}. {{ caption }}
  prompt:
    Florence:
    Joy: Please write a detailed description of this {{ animal }} wearing {{ hat }}.
    Moondream: Describe this image in detail.
  required_labels:
    - style
    - animal
    - hat
images:
  00092-2473709667.png:
    annotations:
      - bounding_box: null
        label: animal
        value: duck
      - bounding_box: null
        label: style
        value: cartoon
      - bounding_box: null
        label: hat
        value: red rain hat
```
Images are stored by filename only, relative to the dataset and group, so that directories can be moved around and
shared without changing the metadata.
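
To show how this structure might be consumed, the sketch below turns one image's annotation list into a label dictionary and resolves the image path from a base path and group name. The dictionary mirrors the YAML above as it would look after parsing; the helper names, the base path, and the `ducks` group are hypothetical.

```python
from pathlib import Path

# One entry from the `images` section of meta.yaml, as it would look after parsing.
image_meta = {
    "annotations": [
        {"bounding_box": None, "label": "animal", "value": "duck"},
        {"bounding_box": None, "label": "style", "value": "cartoon"},
        {"bounding_box": None, "label": "hat", "value": "red rain hat"},
    ]
}


def labels_from_annotations(annotations):
    """Flatten the annotation list into a {label: value} dict for templating."""
    return {a["label"]: a["value"] for a in annotations}


def image_path(base_path, group, filename):
    """Resolve a filename stored in meta.yaml against the dataset and group."""
    return Path(base_path) / group / filename


labels = labels_from_annotations(image_meta["annotations"])
print(labels)
print(image_path("/data/animals-in-hats", "ducks", "00092-2473709667.png"))
```

Since only the filename is stored, moving the dataset only changes the `base_path` argument and the metadata stays untouched.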
## TODOs
- Group captioning with batching
- Implement the previous/next buttons
- Switch the group/image tab after selecting a group/image
- Group-level default labels (mark the whole directory as `style=cartoon`)