https://github.com/sminerport/word2vecskipgramtensorflow
Word2Vec Skip-Gram model implementation using TensorFlow 2.0 to learn word embeddings from a small Wikipedia dataset (text8). Includes training, evaluation, and cosine similarity-based nearest neighbors
- Host: GitHub
- URL: https://github.com/sminerport/word2vecskipgramtensorflow
- Owner: sminerport
- License: MIT
- Created: 2023-04-11T21:50:00.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-21T13:36:21.000Z (4 months ago)
- Last Synced: 2025-01-21T14:30:27.619Z (4 months ago)
- Topics: cosine-similarity, language-model, machine-learning, natural-language-processing, neural-networks, skip-gram, tensorflow, tensorflow2, text8-dataset, wikipedia, word-embeddings, word2vec
- Language: Python
- Homepage: https://scottminer.netlify.app
- Size: 10.7 KB
- Stars: 1
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Word2Vec (Word Embedding) with TensorFlow 2.0
This repository contains an implementation of the Word2Vec algorithm using TensorFlow 2.0 to compute vector representations of words. The Word2Vec model used is the Skip-Gram model, which is trained on a small chunk of Wikipedia articles (the text8 dataset).
## Background
Word2Vec is a popular word embedding technique that represents words as vectors in a high-dimensional space. These embeddings can be used in various natural language processing tasks, such as sentiment analysis, document classification, and machine translation. The main idea behind Word2Vec is that words with similar meanings tend to occur in similar contexts.
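To make this concrete, the Skip-Gram variant turns running text into (center word, context word) training pairs drawn from a sliding window. A toy illustration (the sentence and window size here are hypothetical, not code from this repository):

```python
# Toy illustration of skip-gram training pairs (hypothetical sentence and window size).
sentence = "the cat sat on the mat".split()
window = 2  # context words taken from each side of the center word

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```

The model learns to predict the context word from the center word, so words that appear in similar contexts end up with similar vectors.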
For more information on Word2Vec, please refer to the following research paper:
[Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space" (2013)](https://arxiv.org/pdf/1301.3781.pdf)

## Getting Started
To run the Word2Vec implementation, simply clone this repository and execute the `word2vec.py` script using Python 3.
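For example (the directory name below is git's default for this repository):

```bash
git clone https://github.com/sminerport/word2vecskipgramtensorflow.git
cd word2vecskipgramtensorflow
python word2vec.py
```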
### Prerequisites
- Python 3
- TensorFlow 2.0
- NumPy
- urllib (Python standard library)
- zipfile (Python standard library)

## Text8 Dataset
This project uses the Text8 dataset for its natural language processing tasks. The dataset is not included in the repository to keep its size small. You can fetch it with the provided Python script or download it manually.
### Download using Python script
1. Run the `download_text8.py` script in the project folder:
```bash
python download_text8.py
```
This script will download the `text8.zip` file and save it in the `text8_dataset` folder. If the file already exists, it won't be downloaded again.

### Manual download
1. Download the text8.zip file from the following URL: http://mattmahoney.net/dc/text8.zip
2. Create a folder named `text8_dataset` in the project directory and move the downloaded `text8.zip` file into it.
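For reference, this download step needs only the standard library. A minimal sketch (the folder and file names follow the conventions above; the repository's actual `download_text8.py` may be organized differently):

```python
import os
import urllib.request

# Hypothetical stand-alone version of the download step.
URL = "http://mattmahoney.net/dc/text8.zip"
DEST_DIR = "text8_dataset"
DEST_FILE = os.path.join(DEST_DIR, "text8.zip")

os.makedirs(DEST_DIR, exist_ok=True)
if not os.path.exists(DEST_FILE):
    print(f"Downloading {URL} ...")
    urllib.request.urlretrieve(URL, DEST_FILE)
else:
    print(f"{DEST_FILE} already exists; skipping download.")
```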
## Implementation Details
- The text8 dataset of Wikipedia articles is downloaded and processed to create a vocabulary of words.
- Rare words with occurrences below the specified threshold are removed.
- A Skip-Gram model is trained on the dataset for a specified number of steps using Stochastic Gradient Descent (SGD) optimization and Noise Contrastive Estimation (NCE) loss.
- The model's performance is evaluated periodically by finding the nearest neighbors of a set of test words, ranked by the cosine similarity of their vector representations (see the sketch below).
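The README describes these steps without showing code. As a rough sketch of how the pieces typically fit together in TensorFlow 2.0 (all names, hyperparameters, and structure below are illustrative assumptions, not the repository's actual `word2vec.py`):

```python
import collections
import numpy as np
import tensorflow as tf

def build_vocab(words, min_count=5):
    """Keep words occurring at least min_count times; everything else maps to 'UNK'."""
    counts = collections.Counter(words)
    vocab = ['UNK'] + [w for w, c in counts.most_common() if c >= min_count]
    return {word: idx for idx, word in enumerate(vocab)}

# Hypothetical hyperparameters; the repository may use different values.
vocab_size = 50000
embedding_dim = 200
num_sampled = 64        # negative samples drawn per batch for the NCE loss

# Trainable variables: the input embeddings plus the NCE output weights and biases.
embeddings = tf.Variable(tf.random.uniform([vocab_size, embedding_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal(
    [vocab_size, embedding_dim], stddev=1.0 / np.sqrt(embedding_dim)))
nce_biases = tf.Variable(tf.zeros([vocab_size]))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step(center_words, context_words):
    """One SGD step on a batch of (center, context) skip-gram pairs (integer ids)."""
    with tf.GradientTape() as tape:
        embedded = tf.nn.embedding_lookup(embeddings, center_words)
        loss = tf.reduce_mean(tf.nn.nce_loss(
            weights=nce_weights,
            biases=nce_biases,
            labels=tf.expand_dims(context_words, -1),
            inputs=embedded,
            num_sampled=num_sampled,
            num_classes=vocab_size))
    variables = [embeddings, nce_weights, nce_biases]
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

def nearest_neighbors(word_id, k=8):
    """Ids of the k words closest to word_id by cosine similarity of embeddings."""
    normalized = tf.math.l2_normalize(embeddings, axis=1)
    query = tf.nn.embedding_lookup(normalized, [word_id])
    similarity = tf.matmul(query, normalized, transpose_b=True)  # [1, vocab_size]
    # Drop the top hit, which is the query word itself (similarity 1.0).
    return tf.argsort(similarity[0], direction='DESCENDING')[1:k + 1]
```

NCE loss avoids computing the full softmax over the vocabulary by contrasting each true context word with a small number of sampled negatives, which is what makes training on text8 practical.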
## Author

- Aymeric Damien - [GitHub](https://github.com/aymericdamien)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.