https://github.com/kanishknavale/text-mining-with-tf-idf-and-cosine-similarity
A simple Python repository for perceptron-based text mining: linguistic preprocessing of the dataset, text classification, and retrieval of similar texts for a given query.
cosine-similarity-scores information-retreival l2-regularization lemmatization linguistics machine-learning nltk optimization perceptron text-classification text-mining tf-idf tokenization torch-sparse-matrix
- Host: GitHub
- URL: https://github.com/kanishknavale/text-mining-with-tf-idf-and-cosine-similarity
- Owner: KanishkNavale
- License: mit
- Created: 2021-03-05T10:32:36.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-03-25T11:38:00.000Z (about 3 years ago)
- Last Synced: 2025-01-06T04:13:10.699Z (5 months ago)
- Topics: cosine-similarity-scores, information-retreival, l2-regularization, lemmatization, linguistics, machine-learning, nltk, optimization, perceptron, text-classification, text-mining, tf-idf, tokenization, torch-sparse-matrix
- Language: Jupyter Notebook
- Homepage:
- Size: 7.34 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Text Mining with TF-IDF & Cosine Similarity
A simple Python repository for perceptron-based text mining: linguistic preprocessing of the dataset, text classification, and retrieval of similar texts for a given query.
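
As a rough illustration of the retrieval idea named in the title, the sketch below builds TF-IDF vectors and ranks documents by cosine similarity to a query. It uses only the standard library. The whitespace tokenizer, the smoothed IDF formula, the toy documents, and all function names are assumptions made for this example and are not taken from the repository, which lists NLTK tokenization and lemmatization among its topics.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace split instead of NLTK preprocessing.
    return text.lower().split()


def vectorize(tokens, idf):
    # Raw term count times IDF; terms outside the vocabulary are ignored.
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_tfidf(documents):
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (one common variant), so no term gets a zero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def cosine_similarity(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents, idf, vectors):
    query_vector = vectorize(tokenize(query), idf)
    scores = [cosine_similarity(query_vector, vector) for vector in vectors]
    best = max(range(len(documents)), key=scores.__getitem__)
    return documents[best], scores[best]


# Hypothetical toy corpus.
documents = [
    "super product top quality",
    "delivery was slow and boring",
    "top service super fast",
]
idf, vectors = build_tfidf(documents)
best_doc, best_score = most_similar("slow delivery", documents, idf, vectors)
print(best_doc, round(best_score, 3))
```

Sparse dictionaries keep the example dependency-free; a real corpus would more likely use a sparse matrix.
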
New implementation: added PyTorch-based optimization that works around the buggy loading of a sparse `csr_matrix` into a CUDA tensor.
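
One common way to move a SciPy `csr_matrix` onto a GPU is to go through the COO format, which maps directly onto `torch.sparse_coo_tensor`. The sketch below shows that route. It is an assumption about how such a conversion can be done, not the repository's confirmed code, and the helper name `csr_to_torch_sparse` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp
import torch


def csr_to_torch_sparse(matrix, device="cpu"):
    """Convert a SciPy CSR matrix into a coalesced torch sparse COO tensor."""
    coo = matrix.tocoo()
    # torch expects a 2 x nnz int64 index tensor: row indices, then column indices.
    indices = torch.from_numpy(np.vstack((coo.row, coo.col)).astype(np.int64))
    values = torch.from_numpy(coo.data.astype(np.float32))
    tensor = torch.sparse_coo_tensor(indices, values, size=coo.shape)
    return tensor.coalesce().to(device)


# Fall back to the CPU when no CUDA device is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical 2 x 3 feature matrix standing in for TF-IDF features.
features = sp.csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.5]]))
sparse_features = csr_to_torch_sparse(features, device)

# Sparse-dense product, as used when scoring documents against a weight vector.
weights = torch.ones(3, 1, device=device)
scores = torch.sparse.mm(sparse_features, weights)
print(scores.cpu().flatten().tolist())
```
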
## Outcomes
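
The results in this section compare vanilla optimization with L2-regularized optimization. As a minimal sketch of what that difference means, the example below trains a single-layer perceptron with a sigmoid output by gradient descent on the cross-entropy loss, with and without an L2 penalty. The loss, the hyperparameters, the toy data, and the name `train_perceptron` are assumptions for illustration; the repository's actual training loop may differ.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_perceptron(X, y, l2_strength=0.0, learning_rate=0.5, epochs=200):
    """Gradient descent on mean cross-entropy plus an optional L2 penalty."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(epochs):
        predictions = sigmoid(X @ weights)
        gradient = X.T @ (predictions - y) / n_samples
        # Gradient of the penalty 0.5 * l2_strength * ||weights||^2.
        gradient += l2_strength * weights
        weights -= learning_rate * gradient
    return weights


# Hypothetical toy data: one row per document, one column per term.
terms = ["good", "great", "bad", "poor"]
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
])
y = np.array([1.0, 1.0, 0.0, 0.0])

vanilla = train_perceptron(X, y)
penalized = train_perceptron(X, y, l2_strength=0.1)

for term, w, w_l2 in zip(terms, vanilla, penalized):
    print(f"{term:>6}  vanilla={w:+.3f}  l2={w_l2:+.3f}")
```

On this toy data the penalized weights stay smaller in magnitude than the vanilla ones, which matches the general pattern in the weight tables of this section, where the L2 columns show lower top weights.
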
1. NumPy implementation,

Plots (images not available): Vanilla Optimization, Optimization with L2-Regularization.

Top 5 weighted terms,
|Terms|Weights|Terms: L2|Weights: L2|
|:--:|:--:|:--:|:--:|
|langeweile|7.094|top|5.8911|
|geilo|7.0535|langeweile|5.8396|
|best|6.7828|geilo|5.7615|
|love|6.376|perfekt|5.6325|
|exzellent|6.3534|super|5.6279|

2. PyTorch implementation,

Plots (images not available): Vanilla Optimization, Optimization with L2-Regularization, Histogram: Weights, Penalized Weights.

Top 5 weighted terms,
|Terms|Weights|Terms: L2|Weights: L2|
|:--:|:--:|:--:|:--:|
|erfolgreichen|20.5452|cool|8.8814|
|anmeldungen|20.0064|geil|8.0933|
|angemessene|19.658|super|6.7332|
|eonfach|19.5906|top|5.4004|
|verarbeitung|19.5136|gut|4.8924|

## Dependencies
Install dependencies using:
```bash
pip3 install -r requirements.txt
```

## Contact
* Email: [email protected]
* Website: