https://github.com/ashly1991/word2vec-tf2

Word2Vec Skipgram with negative sampling in TensorFlow 2. Self-supervised embeddings, efficient sampled softmax, and analogies evaluation.
embeddings jupyter-notebook natural-language-processing negative-sampling nlp self-supervised-learning skipgram tensorflow word2vec


Host: GitHub
URL: https://github.com/ashly1991/word2vec-tf2
Owner: Ashly1991
License: mit
Created: 2025-09-23T14:21:49.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-09-23T14:28:37.000Z (9 months ago)
Last Synced: 2025-09-23T14:42:42.659Z (9 months ago)
Topics: embeddings, jupyter-notebook, natural-language-processing, negative-sampling, nlp, self-supervised-learning, skipgram, tensorflow, word2vec
Language: Jupyter Notebook
Homepage:
Size: 0 Bytes
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Word2Vec — Skipgram with Negative Sampling (TensorFlow 2)

This project implements **Word2Vec** to learn **word embeddings** from raw text using a **self‑supervised** objective. It focuses on **Skipgram** with **negative sampling**, discusses why full softmax is inefficient with large vocabularies, and explores simple intrinsic evaluations (nearest neighbors, analogies).
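
As a rough illustration of the prediction task, the sketch below builds the positive (target, context) pairs for one short sentence. The function name `skipgram_pairs`, the `pad_token` argument, and the example sentence are hypothetical and not taken from the notebook, which may generate its pairs differently (and also adds negative samples).

```python
def skipgram_pairs(tokens, window_size, pad_token=None):
    """Return (target, context) pairs for each token and its neighbours
    within `window_size` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        if target == pad_token:
            continue  # never use padding as a target
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            # skip the target itself and any padding in the window
            if j != i and tokens[j] != pad_token:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical example sentence, chosen only for illustration.
sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window_size=1)
for target, context in pairs:
    print(f"{target} -> {context}")
print(len(pairs), "positive pairs")
```

Words at the edges of a sequence have fewer neighbours, so the pair count is slightly lower than `2 * window_size` per token.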

## Key Points
- Example of **self‑supervised learning**: define a prediction task directly from unlabeled text to learn useful representations.
- Understand why **full softmax** is problematic with large vocabularies and how **negative sampling** / sampled losses help.
- Build and analyze **word embeddings**.
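
To make the cost argument concrete, here is a minimal pure-Python sketch that compares the two losses for a single (target, context) pair. All names, the vocabulary size, and the uniform choice of negatives are assumptions for illustration; word2vec implementations typically draw negatives from a smoothed unigram distribution, and the notebook's TensorFlow code will look different.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def full_softmax_loss(target_vec, context_vecs, true_idx):
    """Cross-entropy over the whole vocabulary: one dot product per word."""
    scores = [dot(target_vec, c) for c in context_vecs]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[true_idx]


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def negative_sampling_loss(target_vec, context_vecs, true_idx, negative_idxs):
    """Negative-sampling loss: only 1 + k dot products, independent of vocab size."""
    loss = -log_sigmoid(dot(target_vec, context_vecs[true_idx]))
    for n in negative_idxs:
        loss -= log_sigmoid(-dot(target_vec, context_vecs[n]))
    return loss


rng = random.Random(0)
vocab_size, dim, num_negatives = 10_000, 8, 5  # illustrative sizes
context_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
target_vec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
true_idx = 42
# Uniform negatives for simplicity (an assumption, see lead-in).
negative_idxs = [rng.randrange(vocab_size) for _ in range(num_negatives)]

print("dot products, full softmax:     ", vocab_size)
print("dot products, negative sampling:", 1 + num_negatives)
print("full softmax loss:     ", full_softmax_loss(target_vec, context_vecs, true_idx))
print("negative sampling loss:", negative_sampling_loss(target_vec, context_vecs, true_idx, negative_idxs))
```

The two losses are different objectives, so their values are not directly comparable; the point is that the work per training pair grows with the vocabulary only in the full-softmax case.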

## Questions for Understanding
1. Given “I like to cuddle dogs”, how many skipgrams are created with window size 2?
2. In general, how does the number of skipgrams (input‑target pairs) scale with the size of the dataset?
3. Why isn’t computing the **full softmax** a good idea?
4. For a fixed (target, context) pair, are the **negative samples** re‑drawn each time or reused?
5. For the Shakespeare dataset, do we create (target, context) pairs across **line breaks** (last word of one line + first of next)?
6. Are skipgrams generated for **padding** tokens (index 0)?
7. The **sampling table** is created without reading the text—how does it decide probabilities?
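
As background for the last question, one way to build a sampling table from word rank alone is sketched below. It assumes the notebook uses something like `tf.keras.preprocessing.sequence.make_sampling_table`, which the README does not confirm, and it re-implements in plain Python the Zipf-based approximation described in the Keras documentation as best I understand it. The function name `zipf_sampling_table` is hypothetical, and the constants should be treated as assumptions.

```python
import math


def zipf_sampling_table(size, sampling_factor=1e-5):
    """Keep-probability per word rank, assuming Zipf-distributed frequencies.

    Index i is the i-th most frequent word; index 0 is treated like rank 1.
    Frequent words (low rank) get a low probability of being kept.
    """
    gamma = 0.577  # approximate Euler-Mascheroni constant
    table = []
    for rank in range(size):
        r = max(rank, 1)
        # Approximate inverse frequency of the word at this rank under Zipf's law.
        inv_freq = r * (math.log(r) + gamma) + 0.5 - 1.0 / (12.0 * r)
        table.append(min(1.0, math.sqrt(sampling_factor * inv_freq)))
    return table


table = zipf_sampling_table(1000)
for rank in (1, 10, 100, 999):
    print(f"rank {rank:>3}: keep probability {table[rank]:.5f}")
```

Because the table depends only on rank, it is only meaningful if word indices are assigned in order of decreasing frequency.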

## Possible Improvements & Extensions
- **Skip padding**: prevent generating skipgrams for padding tokens.
- **Re‑draw negatives** each iteration instead of reusing one fixed set per pair, to reduce bias.
- **Avoid true‑context collisions** in negatives (e.g., use `tf.nn.sampled_softmax_loss` with `remove_accidental_hits=True`; this requires refactoring).
- **Analogies**: e.g., `king - man + woman ≈ queen`, compute via **cosine similarity** over embeddings.
- **Scale up**: larger vocabulary/corpora; compare **naive full softmax** vs **negative sampling** efficiency as vocab grows.
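
The analogy idea can be sketched in a few lines. The embeddings below are hand-made 2-d toy vectors (not learned by the notebook), and the helper names `cosine` and `analogy` are hypothetical. With real embeddings, the same query runs over the trained embedding matrix.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(embeddings, a, b, c, top_k=1):
    """Words closest (by cosine) to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    query = [vb - va + vc
             for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [(cosine(query, vec), word)
                  for word, vec in embeddings.items()
                  if word not in (a, b, c)]
    return [word for _, word in sorted(candidates, reverse=True)[:top_k]]


# Toy vectors made up for illustration: (royalty, gender).
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

# king - man + woman
print(analogy(toy, "man", "king", "woman"))
```

Excluding the three query words matters in practice, since one of them is often the nearest neighbour of the query vector.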

## Optional: CBOW Variant
- Build **CBOW** (predict the center word from surrounding context).
- Create windows (e.g., with `tf.data.Dataset.window`), average context embeddings, keep negative sampling.
- Compare CBOW vs Skipgram.
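
A minimal sketch of the CBOW data side, using plain Python lists instead of `tf.data.Dataset.window` to keep it self-contained. The function names, the example sentence, and the toy embedding values are all hypothetical. The averaged context vector would then be scored against the center word and its negatives, as in the Skipgram setup.

```python
def cbow_examples(tokens, window_size):
    """Return (context_words, center_word) for each position with a neighbour."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        context = left + right
        if context:
            examples.append((context, center))
    return examples


def average_embedding(words, embeddings):
    """Element-wise mean of the embeddings of `words`."""
    vectors = [embeddings[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


tokens = "we learn word vectors".split()
# Toy 2-d embeddings, made up for illustration.
toy_embeddings = {
    "we": [1.0, 0.0],
    "learn": [0.5, 0.5],
    "word": [0.0, 1.0],
    "vectors": [1.0, 1.0],
}

examples = cbow_examples(tokens, window_size=1)
for context, center in examples:
    print(context, "->", center, "| averaged context:",
          average_embedding(context, toy_embeddings))
```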

## How to Run
```bash
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab word2vec-skipgram.ipynb
```

## Requirements
```
tensorflow==2.13.0
numpy
matplotlib
jupyterlab
tqdm
```
(Adjust versions as needed for your environment.)

## Notes
- For reproducible negatives, set a random seed before training and re‑sample the negatives inside the training loop.
- Monitor embedding quality with **nearest neighbors** and **analogy tests**; results generally improve with a larger corpus and longer training.
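
The seeding note above can be sketched as follows, assuming a plain Python loop rather than the notebook's actual training code. The names `draw_negatives` and `run_epochs` are hypothetical, negatives are drawn uniformly for simplicity, and id 0 is assumed to be reserved for padding. The sketch also rejects candidates equal to the true context word.

```python
import random


def draw_negatives(rng, vocab_size, num_negatives, context):
    """Draw word ids uniformly from 1..vocab_size-1, rejecting the true context."""
    negatives = []
    while len(negatives) < num_negatives:
        candidate = rng.randrange(1, vocab_size)  # id 0 assumed to be padding
        if candidate != context:
            negatives.append(candidate)
    return negatives


def run_epochs(pairs, vocab_size, num_negatives, num_epochs, seed):
    """Re-sample negatives for every pair in every epoch from one seeded RNG."""
    rng = random.Random(seed)  # seeding once makes the whole run repeatable
    history = []
    for epoch in range(num_epochs):
        for target, context in pairs:
            negatives = draw_negatives(rng, vocab_size, num_negatives, context)
            # A real loop would compute the loss and update the embeddings here.
            history.append((epoch, target, context, negatives))
    return history


pairs = [(3, 7), (7, 3), (7, 12)]  # hypothetical (target, context) word ids
first = run_epochs(pairs, vocab_size=50, num_negatives=4, num_epochs=2, seed=123)
second = run_epochs(pairs, vocab_size=50, num_negatives=4, num_epochs=2, seed=123)
print("same seed gives identical negatives:", first == second)
for epoch, target, context, negatives in first:
    print(epoch, target, context, negatives)
```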

## License
MIT — see `LICENSE`.
