Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/faisalahmedbijoy/document-similarity-using-doc2vec-and-gensim
Document Similarity Measurement Using Doc2Vec and Gensim Library
https://github.com/faisalahmedbijoy/document-similarity-using-doc2vec-and-gensim
doc2vec-model doc2vec-word2vec docuemntation-tool gensim gensim-library ner nlp nsl
Last synced: about 4 hours ago
JSON representation
Document Similarity Measurement Using Doc2Vec and Gensim Library
- Host: GitHub
- URL: https://github.com/faisalahmedbijoy/document-similarity-using-doc2vec-and-gensim
- Owner: FaisalAhmedBijoy
- Created: 2022-07-07T18:36:16.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-24T19:11:55.000Z (about 1 year ago)
- Last Synced: 2023-09-25T02:02:13.142Z (about 1 year ago)
- Topics: doc2vec-model, doc2vec-word2vec, docuemntation-tool, gensim, gensim-library, ner, nlp, nsl
- Language: Python
- Homepage:
- Size: 119 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Document-similarity-using-doc2vec-and-gensim
Python implementation of a document similarity checking using Doc2Vec.
## File structure
```bash
Document-similarity-using-doc2vec-and-gensim/
├── data/
│ ├── 20news-bydate.tar.gz
│ ├── 20news-bydate-test
│ └── 20news-bydate-train
├── models/
│ ├── doc2vec_model.bin
│ ├── doc2vec_model.model
│ ├── doc2vec_vector.txt
│ └── doc2vec_model.bin.dv.vectors.npy
├── dataset_preprocess.py
├── inference.py
├── README.md
├── requirements.txt
└── train.py
```
- **data/train_data.txt**: Training data file
- **models/doc2vec_model.bin**: Trained Doc2Vec model file
- **models/doc2vec_model.bin.dv.vectors.npy**: Document vectors file for the trained model
- **README.md**: Project documentation file
- **requirements.txt**: Required Python packages
- **inference.py**: Script to check similarity between two documents
- **train.py**: Script to train the Doc2Vec model## Installation
Install the dependencies using pip:
```bash
gensim==4.2.0
nltk==3.5
numpy==1.23.1
numpy==1.23.2
pandas==1.2.0
scikit_learn==0.23.2
```
Install the required packages:```bash
pip install -r requirements.txt
```## Training the Doc2Vec model
```bash
python train.py
```## Inference
Check the similarity between two documents
```bash
python inference.py
```