https://github.com/rathod-shubham/clip-classifier
CLIP is a multi-modal, zero-shot, open-source model. Without being fine-tuned for a specific task, given an image and a set of text descriptions, it predicts the description that best matches the image.
- Host: GitHub
- URL: https://github.com/rathod-shubham/clip-classifier
- Owner: RATHOD-SHUBHAM
- Created: 2023-09-24T04:00:39.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2024-06-05T23:29:57.000Z (over 1 year ago)
- Last Synced: 2025-01-31T15:12:57.513Z (8 months ago)
- Topics: ai, artificial-intelligence, artificial-neural-networks, classification, computer-vision, deep-learning, machine-learning, machine-learning-algorithms, python, python3
- Language: Jupyter Notebook
- Homepage:
- Size: 91.5 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
# CLIP
* What is CLIP?
* Contrastive Language-Image Pre-training (CLIP for short) is a state-of-the-art model introduced by OpenAI in February 2021.
* CLIP is a neural network trained on about 400 million (image, text) pairs.
* Training uses a contrastive learning approach that maps text and images into a shared embedding space, allowing tasks like image classification to be done with text-image similarity (see the sketch below).
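The snippet below is a minimal, illustrative sketch of this zero-shot setup using the Hugging Face `transformers` pipeline with the public `openai/clip-vit-base-patch32` checkpoint. It is not the exact code from this repo's notebook; the image path and candidate labels are placeholders.

```python
# Zero-shot image classification sketch: CLIP scores each candidate text
# description against the image and returns the best match.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# "photo.jpg" and the labels are placeholders.
predictions = classifier(
    "photo.jpg",
    candidate_labels=["a photo of a dog", "a photo of a cat", "a photo of a car"],
)
print(predictions[0])  # highest-scoring label and its probability
```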
* CLIP Architecture:
* Two encoders are jointly trained to predict the correct pairings within a batch of (image, text) training examples.
* The text encoder's backbone is a Transformer; the base size has 63 million parameters, 12 layers, and a 512-wide model with 8 attention heads.
* The image encoder, on the other hand, can use either a Vision Transformer (ViT) or a ResNet-50 as its backbone, and is responsible for generating the feature representation of the image. A short sketch of the two encoders follows this list.
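To make the two-encoder idea concrete, here is a sketch (again using the public Hugging Face CLIP checkpoint rather than this repo's notebook) that embeds the text candidates and the image separately, then compares them by cosine similarity. The file name and labels are placeholders.

```python
# The text Transformer and the ViT image encoder each produce an embedding in
# the same space; classification reduces to cosine similarity between them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a diagram", "a dog", "a cat"]
image = Image.open("photo.jpg")  # placeholder path

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Normalize and compare: the highest cosine similarity is the predicted label.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(texts[similarity.argmax().item()])
```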
* Run Code:
* Install:
    1. !pip install git+https://github.com/PrithivirajDamodaran/ZSIC.git
    2. !pip install streamlit
* Run app: streamlit run app.py (a rough sketch of such an app follows below)
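The repo's app.py is not reproduced here; the following is only a rough sketch of what a Streamlit front end for this kind of classifier could look like. It substitutes the Hugging Face zero-shot pipeline for the ZSIC wrapper (whose exact API is not shown above), and every widget label, file name, and model name is illustrative.

```python
# app.py - illustrative Streamlit front end for zero-shot classification.
import streamlit as st
from PIL import Image
from transformers import pipeline

@st.cache_resource
def load_classifier():
    # Load the CLIP zero-shot pipeline once and reuse it across reruns.
    return pipeline("zero-shot-image-classification",
                    model="openai/clip-vit-base-patch32")

st.title("CLIP Zero-Shot Image Classifier")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
labels_text = st.text_input("Candidate labels (comma-separated)", "dog, cat, car")

if uploaded is not None and labels_text.strip():
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Input image")

    labels = [label.strip() for label in labels_text.split(",") if label.strip()]
    results = load_classifier()(image, candidate_labels=labels)

    # Show each label with its score, best match first.
    for result in results:
        st.write(f"{result['label']}: {result['score']:.3f}")
```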
---
# Image Search
## SentenceTransformers
SentenceTransformers provides models that embed images and text into the same vector space.
This makes it possible to find similar images and to implement image search.

## clip-ViT-B-32
This is the CLIP image & text model, which maps text and images to a shared vector space.
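As an illustration (not code from this repository), a text-to-image search over a small collection could look like the sketch below; the image paths and query string are placeholders.

```python
# Text-to-image search sketch with the clip-ViT-B-32 SentenceTransformers model:
# images and the text query share one vector space, so search is just a
# nearest-neighbour lookup by cosine similarity.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed a small image collection (paths are placeholders).
image_paths = ["images/beach.jpg", "images/city.jpg", "images/forest.jpg"]
image_embeddings = model.encode([Image.open(p) for p in image_paths],
                                convert_to_tensor=True)

# Embed a text query into the same space and rank images by similarity.
query_embedding = model.encode("a photo of a sandy beach", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, image_embeddings, top_k=3)[0]

for hit in hits:
    print(image_paths[hit["corpus_id"]], round(hit["score"], 3))
```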
## Usage
1. Git clone the repository.
2. cd ImageSearch
3. pip install -r requirements.txt

## Docker Image
* [Image](https://hub.docker.com/repository/docker/gibbo96/text2image/general)

---