https://github.com/kausmeows/clothsy
Transformer based search/rec engine to fetch Amazon URLs for similar clothing items given a text description
https://github.com/kausmeows/clothsy
fastapi nlp transformer
Last synced: 9 months ago
JSON representation
Transformer based search/rec engine to fetch Amazon URLs for similar clothing items given a text description
- Host: GitHub
- URL: https://github.com/kausmeows/clothsy
- Owner: kausmeows
- Created: 2023-05-21T10:12:25.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-06-07T12:42:40.000Z (about 3 years ago)
- Last Synced: 2024-02-29T13:32:10.471Z (over 2 years ago)
- Topics: fastapi, nlp, transformer
- Language: Jupyter Notebook
- Homepage: https://huggingface.co/spaces/kausmos/clothsy
- Size: 83.7 MB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
---
title: Clothsy
emoji: 👕
colorFrom: purple
colorTo: purple
sdk: gradio
sdk_version: 3.24.1
app_file: main.py
pinned: false
---
## [HF Space Demo](https://huggingface.co/spaces/kausmos/clothsy)

## [Working Demo](https://youtu.be/LZ-mWgL5qx4)
[](https://youtu.be/LZ-mWgL5qx4)
## Data Collection
To scrape quality clothing data containing proper description and url for the product I used `Apify's` [Amazon Product Scraper](https://blog.apify.com/step-by-step-guide-to-scraping-amazon/#step-1-go-to-amazon-product-scraper-on-apify-store)
By creating an account and logging into the console we can input links of the amazon fashion category like- `Men's Fashion -> Shirts`
I downloaded all the scraped data for various clothing categories into a CSV file with columns `url|title|description`
Apify Console

The full data consists of 2900 different clothing products of men and women, it can be found at `data/clothing_similarity_search.csv`
## Data Cleaning
I used `pandas` to clean the data and preprocess the text data by cleaning it (remove special characters, lowercasing, etc.), and possibly by applying some form of text normalization (like stemming or lemmatization).
## Making Embeddings
`sentence-transformers` has been used to make embeddings for the cleaned data. I used `all-MiniLM-L6-v2` model to make the embeddings. The model card can be found [here](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
```py
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)
The choice of this model selection was based on its small size and good accuracy which favors the API response speed
```
The embeddings generated for the whole dataset has been saved into a `.npy` at `/data/embeddings.npy` file which can be loaded and used for similarity search retrieval. This makes sure searching takes place via vector-similarity which is faster.
I used the `cosine similarity` to find the similarity between the embeddings of the query and the embeddings of the products.
## API
Used `FastAPI` to create the API. The API has a single endpoint `/predict` which takes a query string and returns the top 5 most similar products as json
```py
We hit the endpoint http://0.0.0.0:8080/predict with a JSON payload as
{
"query": "Men's winter jacket black and white"
}
This will return
{
"similar_urls": [
"https://www.amazon.in/dp/B082L3BGGM",
"https://www.amazon.in/dp/B08KWFRY6W",
"https://www.amazon.in/dp/B08Q3VBFPD",
"https://www.amazon.com/dp/B07S1LMK58",
"https://www.amazon.in/dp/B0B8YY38VF"
]
}
```
## Deployment
I used `Docker` to containerize the API. Was trying to use Google Cloud Functions to deploy the endpoint but faced some issues since it was my first time using GCP:-
- Wasn't able to load the `embeddings.npy` file from cloud storage into the cloud function. Some help on this would be appreciated.
## Running Locally
- Clone the repo
- Make a virtual environment
- Install the dependencies `pip install -r requirements.txt`
- Run the server `python main.py`