https://github.com/amanpriyanshu/threaded-sparse-tfidf
A multithreaded TF-IDF vectorizer for similarity search that uses sparse matrices for its computations.
- Host: GitHub
- URL: https://github.com/amanpriyanshu/threaded-sparse-tfidf
- Owner: AmanPriyanshu
- License: mit
- Created: 2022-04-24T18:43:52.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-04-27T10:02:56.000Z (over 3 years ago)
- Last Synced: 2025-03-12T01:05:08.739Z (7 months ago)
- Language: Python
- Size: 20.7 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Threaded-Sparse-TFIDF
A multithreaded TF-IDF vectorizer for similarity search that uses sparse matrices for its computations.

[PyPi](https://pypi.org/project/Threaded-Sparse-TFIDF/)
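To make the idea concrete, here is a minimal, standard-library-only sketch of partitioned TF-IDF similarity search. It is not this package's implementation: the function names (`tf_idf_rows`, `vectorise_query`, `score_partition`, `rank`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for illustration, and each sparse row is represented as a `{term: weight}` dict rather than a real sparse matrix.

```py
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tf_idf_rows(docs):
    """Return (rows, idf): one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows, idf


def vectorise_query(query, idf):
    """Build the query's sparse row, ignoring terms unseen in the corpus."""
    counts = Counter(t for t in query.lower().split() if t in idf)
    total = sum(counts.values())
    row = {t: (c / total) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
    return {t: w / norm for t, w in row.items()}


def score_partition(query_row, rows, start, stop):
    """Cosine similarity of the query against rows[start:stop]."""
    # Rows are unit length, so the dot product over shared terms is the cosine.
    return [
        (i, sum(w * rows[i].get(t, 0.0) for t, w in query_row.items()))
        for i in range(start, stop)
    ]


def rank(query, docs, num_workers=2):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    rows, idf = tf_idf_rows(docs)
    query_row = vectorise_query(query, idf)
    # Split the documents into one contiguous partition per worker.
    size = max(1, math.ceil(len(rows) / num_workers))
    bounds = [(s, min(s + size, len(rows))) for s in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(lambda b: score_partition(query_row, rows, *b), bounds))
    scores = [pair for part in parts for pair in part]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    movies = [
        "science fiction super hero movie",
        "romantic comedy in paris",
        "a science documentary about space",
    ]
    print(rank("science fiction hero", movies, num_workers=2))
```

The sketch only illustrates how the work is partitioned. In CPython, pure-Python arithmetic like this does not get faster with more threads because of the global interpreter lock, so any real speed-up depends on the scoring step running in code that releases the lock.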
## Usage:
```py
from TF_IDF import TF_IDF_Vectorizer

tf_idf = TF_IDF_Vectorizer(use_cached=True, print_output=False)
k = 4  # number of workers; illustrative value, not a documented default
_, ranking = tf_idf.get_similarity_score("science fiction super hero movie", num_workers=k)
```

## Performance:
### Table:
|num_workers|time  |partition_size|
|-----------|------|--------------|
|1          |1.1118|6.7785        |
|2          |0.8195|3.4149        |
|3          |0.7357|2.2773        |
|4          |0.7233|1.7081        |
|5          |0.7376|1.3556        |
|6          |0.7682|1.1307        |
|7          |0.7641|0.9618        |
|8          |0.7513|0.8506        |
|9          |0.7795|0.7587        |
|10         |0.8141|0.6807        |
|11         |0.8003|0.6195        |
|12         |0.8441|0.5697        |
|13         |0.8491|0.5258        |
|14         |0.9322|0.4874        |
|15         |0.8824|0.4573        |

## Data
A subset of the **Information Retrieval Dataset - Internet Movie Database (IMDB)**, restricted to movies from after 2007.