https://github.com/ashly1991/word2vec-tf2
Word2Vec Skipgram with negative sampling in TensorFlow 2. Self-supervised embeddings, efficient sampled softmax, and analogies evaluation.
- Host: GitHub
- URL: https://github.com/ashly1991/word2vec-tf2
- Owner: Ashly1991
- License: mit
- Created: 2025-09-23T14:21:49.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-23T14:28:37.000Z (9 months ago)
- Last Synced: 2025-09-23T14:42:42.659Z (9 months ago)
- Topics: embeddings, jupyter-notebook, natural-language-processing, negative-sampling, nlp, self-supervised-learning, skipgram, tensorflow, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Word2Vec — Skipgram with Negative Sampling (TensorFlow 2)
This project implements **Word2Vec** to learn **word embeddings** from raw text using a **self‑supervised** objective. It focuses on **Skipgram** with **negative sampling**, discusses why full softmax is inefficient with large vocabularies, and explores simple intrinsic evaluations (nearest neighbors, analogies).
## Key Points
- Example of **self‑supervised learning**: define a prediction task directly from unlabeled text to learn useful representations.
- Understand why **full softmax** is problematic with large vocabularies (its cost per training example grows linearly with vocabulary size) and how **negative sampling** / sampled losses help.
- Build and analyze **word embeddings**.
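The cost difference between the two losses can be made concrete with a small sketch. The code below is plain Python rather than TensorFlow and is not taken from the notebook; the function names, the toy vocabulary size, and the uniformly drawn negatives are all assumptions made for illustration. It only shows that full softmax needs one dot product per vocabulary word, while negative sampling needs one per positive plus one per sampled negative.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def full_softmax_loss(center_vec, out_vecs, context_id):
    """Cross-entropy over the whole vocabulary: one dot product per word."""
    logits = [dot(center_vec, w) for w in out_vecs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[context_id]

def negative_sampling_loss(center_vec, out_vecs, context_id, negative_ids):
    """Logistic loss on one positive pair and k sampled negatives."""
    loss = -log_sigmoid(dot(center_vec, out_vecs[context_id]))
    for j in negative_ids:
        loss -= log_sigmoid(-dot(center_vec, out_vecs[j]))
    return loss

# Hypothetical toy setup: random vectors, uniformly drawn negatives
rng = random.Random(0)
vocab_size, dim, k = 10_000, 16, 5
out_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
center_vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
negative_ids = [rng.randrange(1, vocab_size) for _ in range(k)]

full = full_softmax_loss(center_vec, out_vecs, context_id=42)
sampled = negative_sampling_loss(center_vec, out_vecs, 42, negative_ids)
print(f"full softmax: {vocab_size} dot products, loss={full:.3f}")
print(f"negative sampling: {1 + k} dot products, loss={sampled:.3f}")
```

The two loss values are not directly comparable, since they are different objectives; the point is the number of output vectors each one has to touch per training pair.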
## Questions for Understanding
1. Given “I like to cuddle dogs”, how many skipgrams are created with window size 2?
2. In general, how does the number of skipgrams relate to dataset size (as input‑target pairs)?
3. Why isn’t computing the **full softmax** a good idea?
4. For a fixed (target, context) pair, are the **negative samples** re‑drawn each time or reused?
5. For the Shakespeare dataset, do we create (target, context) pairs across **line breaks** (last word of one line + first of next)?
6. Are skipgrams generated for **padding** tokens (index 0)?
7. The **sampling table** is created without reading the text—how does it decide probabilities?
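Questions 1, 2, and 6 can be explored with a short counting sketch. The helper below is a hypothetical plain-Python function, not the notebook's pair generator; it counts positive (target, context) pairs only and, as an assumption, lets the caller name a padding token to skip.

```python
def skipgram_pairs(tokens, window_size, pad_token=None):
    """Return all positive (target, context) pairs within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        if target == pad_token:
            continue  # no pairs with padding as the target
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != pad_token:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "I like to cuddle dogs".split()
pairs = skipgram_pairs(sentence, window_size=2)
print(len(pairs))  # 14 (2 + 3 + 4 + 3 + 2 contexts for the five positions)

# With padding id 0 excluded, only the two real tokens form pairs
print(skipgram_pairs([3, 7, 0, 0], window_size=2, pad_token=0))  # [(3, 7), (7, 3)]
```

Away from sequence boundaries each token contributes `2 * window_size` pairs, so the number of positive pairs grows roughly linearly with the number of tokens in the corpus.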
## Possible Improvements & Extensions
- **Skip padding**: prevent generating skipgrams for padding tokens.
- **Re‑draw negatives** on every training step or epoch, so a given (target, context) pair is not always contrasted with the same fixed negatives.
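- **Avoid true‑context collisions** in negatives, so that a word which is an actual context of the target is not also drawn as a negative (e.g., use `tf.nn.sampled_softmax_loss` with `remove_accidental_hits=True`; switching to it requires refactoring the loss).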
- **Analogies**: e.g., `king - man + woman ≈ queen`, compute via **cosine similarity** over embeddings.
- **Scale up**: larger vocabulary/corpora; compare **naive full softmax** vs **negative sampling** efficiency as vocab grows.
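The analogy item above can be sketched as follows. This is a minimal plain-Python illustration, not the notebook's evaluation code: the `analogy` and `cosine` helpers are hypothetical names, and the 2-d vectors are hand-made so that the arithmetic works out exactly; trained embeddings will be noisier.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(embeddings, a, b, c, top_k=1):
    """Rank words by cosine similarity to vec(a) - vec(b) + vec(c)."""
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Exclude the three query words, which otherwise tend to rank first
    candidates = [w for w in embeddings if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(query, embeddings[w]), reverse=True)
    return ranked[:top_k]

# Hand-made toy vectors (axis 0: royalty, axis 1: gender), not trained embeddings
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}
print(analogy(toy, "king", "man", "woman"))  # ['queen']
```

With real embeddings, `embeddings` would map each vocabulary word to its learned vector, and it is usually more informative to look at the top few results than at the single best match.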
## Optional: CBOW Variant
- Build **CBOW** (predict the center word from surrounding context).
- Create windows (e.g., with `tf.data.Dataset.window`), average context embeddings, keep negative sampling.
- Compare CBOW vs Skipgram.
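The CBOW input side described above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's pipeline: the function names are hypothetical, only positions with a full window on both sides are used, and the embedding table is a toy list of lists rather than a trainable TensorFlow variable.

```python
def cbow_examples(token_ids, window_size):
    """Return (context_ids, center_id) for every position with a full window."""
    examples = []
    for i in range(window_size, len(token_ids) - window_size):
        context = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
        examples.append((context, token_ids[i]))
    return examples

def average_context(embedding_table, context_ids):
    """Average the embedding rows of the context words into one vector."""
    dim = len(embedding_table[0])
    return [
        sum(embedding_table[t][d] for t in context_ids) / len(context_ids)
        for d in range(dim)
    ]

token_ids = [1, 2, 3, 4, 5]
embedding_table = [[float(i), float(-i)] for i in range(6)]  # toy 2-d table

examples = cbow_examples(token_ids, window_size=1)
print(examples)  # [([1, 3], 2), ([2, 4], 3), ([3, 5], 4)]

hidden = average_context(embedding_table, examples[0][0])
print(hidden)  # [2.0, -2.0]
```

The averaged vector plays the role that the single target embedding plays in Skipgram: it is scored against the output vector of the true center word and of the sampled negatives.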
## How to Run
```bash
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab word2vec-skipgram.ipynb
```
## Requirements
```
tensorflow==2.13.0
numpy
matplotlib
jupyterlab
tqdm
```
(Adjust versions as needed for your environment.)
## Notes
- For reproducible negatives: set a random seed and re‑sample within the training loop.
- Monitor embedding quality with **nearest neighbors** and **analogy tests**; results typically improve with more data and longer training.
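Seeded re-sampling inside the loop might look like the sketch below. It is plain Python with hypothetical helper names, and it makes two simplifying assumptions that may differ from the notebook: negatives are drawn uniformly over word ids (rather than from a frequency-based distribution), and id 0 is reserved for padding. The actual loss computation and parameter update are left out.

```python
import random

def draw_negatives(rng, vocab_size, num_negatives, forbidden):
    """Sample distinct word ids uniformly, skipping padding (0) and forbidden ids."""
    negatives = []
    # Assumes vocab_size is much larger than num_negatives, so this terminates quickly
    while len(negatives) < num_negatives:
        candidate = rng.randrange(1, vocab_size)  # id 0 is reserved for padding
        if candidate not in forbidden and candidate not in negatives:
            negatives.append(candidate)
    return negatives

def run_epochs(pairs, vocab_size, num_negatives, epochs, seed):
    """Re-draw negatives for every pair in every epoch from one seeded generator."""
    rng = random.Random(seed)
    history = []
    for _ in range(epochs):
        for target, context in pairs:
            negatives = draw_negatives(rng, vocab_size, num_negatives, {target, context})
            history.append((target, context, negatives))  # a training step would go here
    return history

pairs = [(3, 7), (7, 3), (7, 12)]
first = run_epochs(pairs, vocab_size=1000, num_negatives=5, epochs=2, seed=42)
second = run_epochs(pairs, vocab_size=1000, num_negatives=5, epochs=2, seed=42)
print(first == second)  # True
```

Because a single generator is seeded once and then advanced through the loop, each pair sees fresh negatives in every epoch while the whole run stays repeatable.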
## License
MIT — see `LICENSE`.