https://github.com/ggldnl/clip
CLIP-like model fine-tuned for the SemEval-2023 Visual-WSD task
- Host: GitHub
- URL: https://github.com/ggldnl/clip
- Owner: ggldnl
- Created: 2024-02-08T11:03:36.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-10T12:44:35.000Z (about 1 year ago)
- Last Synced: 2025-01-14T02:47:39.857Z (5 months ago)
- Topics: bert, bert-multilingual, clip, contrastive-loss, transformer, vision-transformer
- Language: Python
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# CLIP
CLIP-like model fine-tuned for the SemEval-2023 Visual-WSD task. The model consists of a vision transformer (google/vit-base-patch16-224) as the image encoder and an encoder-only transformer (distilbert-base-multilingual-cased) as the text encoder. The embeddings produced by the two encoders are projected into a shared space by two projection layers, and the CLIP contrastive loss pulls matching (positive) image-text pairs together and pushes non-matching (negative) pairs apart. The model downloads pretrained versions of both encoders and freezes them, keeping only the projection layers trainable.

The model exposes two methods for inference: top_k_images(sentence, images), which, given a sentence and a set of candidate images, returns the k images most similar to the sentence, and top_k_texts(image, sentences), which, given an image and a set of sentences, returns the k textual descriptions most similar to the image.
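The sketch below illustrates this setup: frozen pretrained encoders, trainable projection heads, a symmetric CLIP-style contrastive loss, and the two retrieval methods. It is a minimal approximation based on the description above, not the repository's actual code; the projection dimension, temperature, pooling strategy, and class name are assumptions.

```python
# Minimal sketch of the described architecture (assumed names and hyperparameters).
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoImageProcessor, AutoTokenizer, DistilBertModel, ViTModel


class CLIPLike(nn.Module):
    def __init__(self, proj_dim: int = 256):
        super().__init__()
        # Pretrained encoders are downloaded and frozen; only the projection heads train.
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-multilingual-cased")
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # Projection layers map both modalities into the same embedding space.
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, proj_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, proj_dim)

        self.image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

    def encode_images(self, images) -> torch.Tensor:
        inputs = self.image_processor(images=images, return_tensors="pt")
        feats = self.image_encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return F.normalize(self.image_proj(feats), dim=-1)

    def encode_texts(self, sentences) -> torch.Tensor:
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        feats = self.text_encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return F.normalize(self.text_proj(feats), dim=-1)

    def clip_loss(self, image_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
        # Symmetric contrastive loss: matching (i, i) pairs converge,
        # all other pairs in the batch diverge.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(len(logits), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    @torch.no_grad()
    def top_k_images(self, sentence: str, images, k: int = 1):
        sims = self.encode_images(images) @ self.encode_texts([sentence]).t()
        return sims.squeeze(-1).topk(k).indices.tolist()

    @torch.no_grad()
    def top_k_texts(self, image, sentences, k: int = 1):
        sims = self.encode_texts(sentences) @ self.encode_images([image]).t()
        return sims.squeeze(-1).topk(k).indices.tolist()
```

Under these assumptions, inference would look roughly like `model.top_k_images("a glass of wine", candidate_images, k=3)`, returning the indices of the three candidate images closest to the sentence in the shared embedding space.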