https://github.com/ggldnl/clip
CLIP-like model fine-tuned for the SemEval-2023 Visual-WSD task
- Host: GitHub
- URL: https://github.com/ggldnl/clip
- Owner: ggldnl
- Created: 2024-02-08T11:03:36.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-10T12:44:35.000Z (about 1 year ago)
- Last Synced: 2025-01-14T02:47:39.857Z (5 months ago)
- Topics: bert, bert-multilingual, clip, contrastive-loss, transformer, vision-transformer
- Language: Python
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# CLIP
CLIP-like model fine-tuned for the SemEval-2023 Visual-WSD task. The model consists of a vision transformer (google/vit-base-patch16-224) as the image encoder and an encoder-only transformer (distilbert-base-multilingual-cased) as the text encoder. The embeddings produced by the two encoders are projected into a shared space by two projection layers, and the CLIP contrastive loss pulls matching (positive) image-text pairs together and pushes non-matching (negative) pairs apart. The model downloads pretrained versions of both encoders and freezes them, keeping only the projection layers trainable.

The model exposes two methods for inference: top_k_images(sentence, images), which, given a sentence and a set of candidate images, returns the k images most similar to the sentence, and top_k_texts(image, sentences), which, given an image and a set of sentences, returns the k textual descriptions most similar to the image.
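The sketch below illustrates this setup: frozen pretrained encoders, trainable projection heads, a symmetric CLIP-style contrastive loss, and the two retrieval methods. It is a minimal approximation based on the description above, not the repository's actual code; the projection dimension, temperature, pooling strategy, and class name are assumptions.

```python
# Minimal sketch of the described architecture (assumed names and hyperparameters).
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoImageProcessor, AutoTokenizer, DistilBertModel, ViTModel


class CLIPLike(nn.Module):
    def __init__(self, proj_dim: int = 256):
        super().__init__()
        # Pretrained encoders are downloaded and frozen; only the projection heads train.
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-multilingual-cased")
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # Projection layers map both modalities into the same embedding space.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, proj_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, proj_dim)

        self.image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

    def encode_images(self, images) -> torch.Tensor:
        inputs = self.image_processor(images=images, return_tensors="pt")
        feats = self.image_encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return F.normalize(self.image_proj(feats), dim=-1)

    def encode_texts(self, sentences) -> torch.Tensor:
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        feats = self.text_encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return F.normalize(self.text_proj(feats), dim=-1)

    def clip_loss(self, image_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
        # Symmetric contrastive loss: matching (i, i) pairs converge,
        # all other pairs in the batch diverge.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(len(logits), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    @torch.no_grad()
    def top_k_images(self, sentence: str, images, k: int = 1):
        sims = self.encode_images(images) @ self.encode_texts([sentence]).t()
        return sims.squeeze(-1).topk(k).indices.tolist()

    @torch.no_grad()
    def top_k_texts(self, image, sentences, k: int = 1):
        sims = self.encode_texts(sentences) @ self.encode_images([image]).t()
        return sims.squeeze(-1).topk(k).indices.tolist()
```

Under these assumptions, inference would look roughly like `model.top_k_images("a glass of wine", candidate_images, k=3)`, returning the indices of the three candidate images closest to the sentence in the shared embedding space.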