https://github.com/mahmood-anaam/vinvl_bert
vinvl_bert is a Python package for generating Arabic image captions using Bidirectional Transformers (BiT). By leveraging pre-trained deep learning models, it produces accurate, high-quality captions for Arabic datasets.
- Host: GitHub
- URL: https://github.com/mahmood-anaam/vinvl_bert
- Owner: Mahmood-Anaam
- License: MIT
- Created: 2024-12-24T16:51:48.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-31T21:48:32.000Z (5 months ago)
- Last Synced: 2024-12-31T22:25:18.364Z (5 months ago)
- Topics: bertforimagecaptioning, image-captioning, pytorch-transformers, vinvl
- Language: Jupyter Notebook
- Homepage: https://github.com/Mahmood-Anaam/vinvl_bert.git
- Size: 2.79 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# vinvl_bert
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mahmood-Anaam/vinvl_bert/blob/main/notebooks/vinvl_bert_demo.ipynb)
## Overview
`vinvl_bert` is a vision-language model specifically tailored for **Arabic image captioning**, inspired by the methodologies outlined in the paper [Arabic Image Captioning using Pre-training of Deep Bidirectional Transformers](https://www.researchgate.net/publication/375946517_Arabic_Image_Captioning_using_Pre-training_of_Deep_Bidirectional_Transformers). The model leverages **pre-trained Bidirectional Transformers (BiT)**, integrating visual features from images with textual data to generate precise and contextually relevant captions in Arabic. This approach employs object tags as anchor points to facilitate semantic alignment between image regions and text, making it highly effective for Arabic datasets. The repository is optimized to support various **vision-language tasks** and offers extensive customization options.
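To make the anchor-point idea concrete, here is a minimal, illustrative sketch (not the package's actual API; the tokens and tags below are invented for the example) of how detected object tags join the caption tokens in the text stream an OSCAR/VinVL-style BERT encoder sees, while region features enter as a parallel visual input:

```python
# Illustrative only: object tags act as anchor points that link caption
# words to image regions. The tag strings and tokens here are made up.
caption_tokens = ["رجل", "يركب", "حصانا"]  # caption word pieces (Arabic)
object_tags = ["man", "horse", "field"]  # tags predicted by the object detector
text_stream = ["[CLS]"] + caption_tokens + ["[SEP]"] + object_tags + ["[SEP]"]
print(text_stream)  # region features would be fed alongside this sequence
```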
## Features
- **Integration with pre-trained models**: Easily use pre-trained models for captioning tasks.
- **Image and text feature fusion**: Incorporates image region features with textual data.
- **Flexible configurations**: Supports various decoding methods and customizations for generation (see the sketch after this list).
- **Supports constrained beam search (CBS)**: Enables fine-grained control over output captions.
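As a rough sketch of what "various decoding methods" means in practice, using only configuration fields that appear in the Quick Start below (an assumption-level example, not an exhaustive list of options), switching between beam search, greedy decoding, and nucleus sampling comes down to a few fields:

```python
from vinvl_bert.configs import VinVLBertConfig

cfg = VinVLBertConfig()

# Beam search (as in the Quick Start): deterministic, keeps 5 hypotheses
cfg.do_sample = False
cfg.num_beams = 5

# Greedy decoding: beam search degenerates to greedy with a single beam
# cfg.do_sample = False
# cfg.num_beams = 1

# Nucleus sampling: draw each token from the top-p probability mass
# cfg.do_sample = True
# cfg.num_beams = 1
# cfg.top_p = 0.9
# cfg.temperature = 0.8
```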
## Installation
#### Option 1: Install via `pip`
```bash
pip install git+https://github.com/Mahmood-Anaam/vinvl_bert.git --quiet
```
#### Option 2: Clone Repository and Install in Editable Mode
```bash
git clone https://github.com/Mahmood-Anaam/vinvl_bert.git
cd vinvl_bert
pip install -e .
```
#### Option 3: Use Conda Environment
```bash
git clone https://github.com/Mahmood-Anaam/vinvl_bert.git
cd vinvl_bert
conda env create -f environment.yml
conda activate vinvl_bert
pip install -e .
```
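Whichever option you pick, a quick import confirms the package is visible in your environment (a trivial sanity check; nothing is assumed beyond the package name):

```python
# Should run without an ImportError after a successful installation
import vinvl_bert
print("vinvl_bert imported successfully")
```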
## Quick Start
Here’s how to get started with `vinvl_bert`:
```python
import torch
from PIL import Image
import requests
from vinvl_bert.feature_extractors import VinVLFeatureExtractor
from vinvl_bert.pipelines import VinVLBertPipeline
from vinvl_bert.configs import VinVLBertConfig

# Configure settings
cfg = VinVLBertConfig()
cfg.model_id = "jontooy/AraBERT32-Flickr8k" # model id from huggingface
cfg.device = "cuda" if torch.cuda.is_available() else "cpu" # Device for computation (GPU/CPU)

# Image and object detection settings
cfg.add_od_labels = True # Whether to add object detection labels to input
cfg.max_img_seq_length = 50 # Maximum sequence length for image features

# Generation settings
cfg.is_decode = True # Enable decoding (generation mode)
cfg.do_sample = False # Whether to use sampling for generation
cfg.max_gen_length = 50 # Maximum length for generated text
cfg.num_beams = 5 # Number of beams for beam search
cfg.temperature = 1.0 # Temperature for sampling (lower values make output more deterministic)
cfg.top_k = 50 # Top-k sampling (0 disables it)
cfg.top_p = 1.0 # Top-p (nucleus) sampling (1.0 disables it)
cfg.repetition_penalty = 1.0 # Penalty for repeating words (1.0 disables it)
cfg.length_penalty = 1.0 # Penalty for sequence length (used in beam search)
cfg.num_keep_best = 3 # Number of best sequences to keep

# Load an example image
img_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(img_url, stream=True).raw)

# Extract image features
feature_extractor = VinVLFeatureExtractor(device=cfg.device, add_od_labels=cfg.add_od_labels)
image_features = feature_extractor([image]) # standalone feature extraction

# Generate a caption (the pipeline accepts PIL images directly)
pipeline = VinVLBertPipeline(cfg)
features, captions = pipeline([image])
print("Generated Caption:", captions[0])
```
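Because the pipeline takes a list of images, captioning a batch is a small extension of the example above. This continuation reuses `pipeline`, `Image`, and `requests` from the Quick Start; the second URL is simply another COCO val2017 image chosen for illustration:

```python
# Batch captioning sketch, continuing the Quick Start session above
img_urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "http://images.cocodataset.org/val2017/000000397133.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in img_urls]
features, captions = pipeline(images)
for i, caps in enumerate(captions):
    print(f"Image {i}:", caps) # with num_keep_best=3, expect up to 3 candidates each
```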
## Customization
You can fine-tune or modify configurations in `VinVLBertConfig` to suit specific tasks, such as:
- Adjusting sequence lengths for text and images.
- Modifying beam search parameters for generation (a configuration sketch follows this list).
- Enabling or disabling constrained beam search (CBS) for specific constraints.
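For example, here is a hedged sketch, restricted to fields that already appear in the Quick Start (exact trade-offs depend on your data and model), of trading caption length and beam width against decoding speed:

```python
from vinvl_bert.configs import VinVLBertConfig

cfg = VinVLBertConfig()
cfg.max_img_seq_length = 40 # fewer image-feature positions: faster but coarser
cfg.max_gen_length = 30     # cap the length of generated captions
cfg.num_beams = 3           # narrower beam: faster decoding
cfg.length_penalty = 1.0    # sequence-length penalty used in beam search
cfg.num_keep_best = 1       # return only the single best caption per image
```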
## Limitations
This repository is a utility for integrating pre-trained models for Arabic image captioning. It is not a full-fledged library for vision-language tasks and assumes familiarity with PyTorch and Transformers.