https://github.com/kapshaul/nlp-wordvector
This repository explores word vectors in NLP, including tokenization, vocabulary building, and generating vectors with PPMI and GloVe, using t-SNE to visualize semantic relationships.
- Host: GitHub
- URL: https://github.com/kapshaul/nlp-wordvector
- Owner: kapshaul
- Created: 2024-08-20T16:04:04.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-09-10T02:49:30.000Z (about 1 year ago)
- Last Synced: 2025-01-19T17:59:35.137Z (9 months ago)
- Topics: embeddings, glove, natural-language-processing, nlp, t-sne, word2vec
- Language: Python
- Homepage:
- Size: 1.32 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Word Vector in Natural Language Processing
## Overview
This project delves into the foundational aspects of natural language processing, focusing on the creation and analysis of word vectors, distributed representations of words, and the exploration of inherent biases in these representations. The AG News Benchmark dataset is used for implementing tokenization, vocabulary building, and investigating various techniques for generating and analyzing word vectors.
### 1. Tokenization and Vocabulary Building
The project begins by transforming raw text into token sequences, experimenting with different tokenization strategies, including lemmatization. A vocabulary is then built from token frequencies, using a cutoff heuristic to keep the vocabulary size computationally manageable.
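As a rough illustration of this preprocessing step (not the repository's exact code), tokenization with optional lemmatization could look like the following sketch, assuming NLTK's word tokenizer and WordNet lemmatizer:

```python
# Illustrative sketch only; the repository may use a different tokenizer or lemmatizer.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)    # tokenizer models (newer NLTK versions may also need "punkt_tab")
nltk.download("wordnet", quiet=True)  # lemmatizer data

lemmatizer = WordNetLemmatizer()

def tokenize(text, lemmatize=True):
    # Lowercase, split into word tokens, and optionally lemmatize each token.
    tokens = nltk.word_tokenize(text.lower())
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

print(tokenize("Stocks were rallying as investors bought shares."))
```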
**Figure 1**: Token frequency distribution (top) and cumulative fraction covered (bottom)
Figure 1 shows the effect of applying a cutoff heuristic where tokens with a frequency of 12 or higher are retained, capturing 96% of the tokens in the dataset. This threshold was chosen for computational feasibility, as it allows the co-occurrence matrix $C$ to remain approximately 1GB in size. Expanding the vocabulary beyond this point would significantly increase memory requirements, potentially exceeding available resources. The figure illustrates how this cutoff effectively balances the coverage of the dataset with the constraints of computational capacity.
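The cutoff heuristic itself is straightforward to express; a minimal sketch is shown below. The threshold of 12 matches Figure 1, but the function and variable names are illustrative rather than the repository's.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_freq=12):
    # Count token frequencies across the whole corpus.
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Keep only tokens that occur at least `min_freq` times.
    vocab = {tok for tok, n in counts.items() if n >= min_freq}
    # Fraction of all token occurrences covered by the retained vocabulary.
    covered = sum(n for tok, n in counts.items() if tok in vocab) / sum(counts.values())
    return vocab, covered

# Example with a tiny toy corpus and a cutoff of 1.
docs = [["stocks", "rallied", "on", "wall", "street"], ["rebels", "clashed", "with", "troops"]]
vocab, covered = build_vocab(docs, min_freq=1)
```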
### 2. Frequency-Based Word Vectors
Frequency-based word vectors are explored using *Positive Pointwise Mutual Information (PPMI)*. This involves constructing a co-occurrence matrix from the corpus, computing PPMI values, and then reducing the dimensionality of the word vectors with techniques such as Truncated SVD. The resulting word vectors are visualized with *t-SNE* to better understand the semantic relationships they capture.
**Figure 2**: t-SNE visualization

**Figure 3**: t-SNE clusters: War (left), Technology (middle), and Politics (right)
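As a rough sketch of this pipeline (not the repository's code), PPMI, dimensionality reduction, and t-SNE could be chained as follows; the co-occurrence matrix here is a random toy stand-in for the real counts built from AG News:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

def ppmi(C, eps=1e-12):
    # C[i, j] holds the co-occurrence count of word i with context word j.
    total = C.sum()
    p_ij = C / total
    p_i = C.sum(axis=1, keepdims=True) / total
    p_j = C.sum(axis=0, keepdims=True) / total
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)  # clip negative PMI values to zero -> PPMI

# Toy stand-in for the real co-occurrence counts.
rng = np.random.default_rng(0)
C_counts = rng.poisson(1.0, size=(500, 500)).astype(float)

vectors = TruncatedSVD(n_components=100, random_state=0).fit_transform(ppmi(C_counts))
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)  # 2-D points for plotting
```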
### 3. Learning-Based Word Vectors with GloVe
The GloVe algorithm is implemented to generate word vectors by modeling word co-occurrences as a weighted log-bilinear regression problem. The process includes deriving gradients, optimizing the objective via stochastic gradient descent, and visualizing the resulting word vectors. The behavior of the loss during training is monitored to ensure proper convergence.
The GloVe objective can be written as a sum of weighted squared-error terms, one for each word pair in the vocabulary,
$$
J = \overbrace{\sum_{i,j \in V}}^{\text{sum over word pairs}} \underbrace{f(C_{ij})}_{\text{weight}} \bigl(\overbrace{w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log C_{ij}}^{\text{error term}}\bigr)^2
$$
where each word $i$ is associated with a word vector $w_i$, a context vector $\tilde{w}_i$, and word/context biases $b_i$ and $\tilde{b}_i$.
The $f(C_{ij})$ term is a weighting that keeps frequent co-occurrences from dominating the objective and is defined as
$$
f(C_{ij}) = \min\left(1, C_{ij}/100\right)^{0.75}
$$
The gradients of the objective, taken term by term for a single word pair $(i, j)$, are as follows:
$$
\nabla_{w_i} J = 2\, f(C_{ij})\bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log C_{ij}\bigr)\,\tilde{w}_j
$$
$$
\nabla_{\tilde{w}_j} J = 2\, f(C_{ij})\bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log C_{ij}\bigr)\,w_i
$$
$$
\nabla_{b_i} J = 2\, f(C_{ij})\bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log C_{ij}\bigr)
$$
$$
\nabla_{\tilde{b}_j} J = 2\, f(C_{ij})\bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log C_{ij}\bigr)
$$
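These per-pair gradients translate directly into stochastic updates. Below is a minimal, illustrative sketch of a single SGD step in NumPy; it follows the gradients above but is not the repository's implementation (that lives in `build_glove_vectors.py`), and the learning rate and initialization are arbitrary.

```python
import numpy as np

def glove_sgd_step(i, j, C_ij, W, W_tilde, b, b_tilde, lr=0.05):
    """One SGD update for a single (word, context) pair, following the gradients above."""
    f = min(1.0, C_ij / 100.0) ** 0.75                           # weighting f(C_ij)
    err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(C_ij)   # error term
    g = 2.0 * f * err                                            # common factor in all four gradients
    wi, wj = W[i].copy(), W_tilde[j].copy()
    W[i]       -= lr * g * wj       # gradient w.r.t. w_i  is g * w~_j
    W_tilde[j] -= lr * g * wi       # gradient w.r.t. w~_j is g * w_i
    b[i]       -= lr * g            # gradient w.r.t. b_i
    b_tilde[j] -= lr * g            # gradient w.r.t. b~_j
    return f * err ** 2             # this pair's contribution to the loss

# Toy usage: vocabulary of 1000 words, 50-dimensional vectors.
rng = np.random.default_rng(0)
V, d = 1000, 50
W, W_tilde = rng.normal(scale=0.1, size=(V, d)), rng.normal(scale=0.1, size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
loss = glove_sgd_step(3, 7, C_ij=25.0, W=W, W_tilde=W_tilde, b=b, b_tilde=b_tilde)
```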
During training, the loss was monitored to verify that the optimization converged. The tail of the training log is shown below:
```python
2024-04-17 04:09:49 INFO Iter 14400 / 15227: avg. loss over last 100 batches = 0.046686563985831216
2024-04-17 04:09:49 INFO Iter 14500 / 15227: avg. loss over last 100 batches = 0.04769956457112328
2024-04-17 04:09:49 INFO Iter 14600 / 15227: avg. loss over last 100 batches = 0.04687950216720886
2024-04-17 04:09:49 INFO Iter 14700 / 15227: avg. loss over last 100 batches = 0.04827717854832922
2024-04-17 04:09:49 INFO Iter 14800 / 15227: avg. loss over last 100 batches = 0.047144581882744535
2024-04-17 04:09:49 INFO Iter 14900 / 15227: avg. loss over last 100 batches = 0.047903630422071866
2024-04-17 04:09:49 INFO Iter 15000 / 15227: avg. loss over last 100 batches = 0.04676183418646468
2024-04-17 04:09:49 INFO Iter 15100 / 15227: avg. loss over last 100 batches = 0.048071157216658514
2024-04-17 04:09:49 INFO Iter 15200 / 15227: avg. loss over last 100 batches = 0.04732485846561704
```

### 4. Exploring Bias in Word Vectors
A significant focus of this project is the exploration of biases that can be inherent in word vectors. Relationships learned by word2vec are analyzed, revealing how these vectors can reinforce gender, racial, or other societal biases. This highlights the importance of understanding and addressing these biases, particularly in the deployment of NLP models in real-world applications.
The following examples illustrate how word2vec reinforces gender stereotypes in medicine:
```python
>>> analogy('man', 'doctor', 'woman')
man : doctor :: woman : ?
[('gynecologist', 0.709), ('nurse', 0.648), ('doctors', 0.647), ('physician', 0.644), ('pediatrician', 0.625), ('nurse_practitioner', 0.622), ('obstetrician', 0.607), ('ob_gyn', 0.599), ('midwife', 0.593), ('dermatologist', 0.574)]

>>> analogy('woman', 'doctor', 'man')
woman : doctor :: man : ?
[('physician', 0.646), ('doctors', 0.586), ('surgeon', 0.572), ('dentist', 0.552), ('cardiologist', 0.541), ('neurologist', 0.527), ('neurosurgeon', 0.525), ('urologist', 0.525), ('Doctor', 0.524), ('internist', 0.518)]
```

These results show that word2vec tends to associate female doctors with roles in nursing or specializations focused on women’s or children’s health, thus reinforcing gender stereotypes in the medical field.
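For reference, analogy queries like the ones above can be reproduced with gensim's word2vec interface. The sketch below is illustrative only and assumes the pretrained GoogleNews vectors; the `analogy` helper shown in the output above presumably comes from the repository's `Exploring_learned_biases.py` and may differ in detail.

```python
# Illustrative reproduction of the analogy queries above; not the repository's exact code.
# Assumes the pretrained GoogleNews vectors (~1.6 GB download on first use).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

def analogy(a, b, c, topn=10):
    # Solves "a : b :: c : ?" via vector arithmetic (b - a + c).
    print(f"{a} : {b} :: {c} : ?")
    return wv.most_similar(positive=[b, c], negative=[a], topn=topn)

print(analogy("man", "doctor", "woman"))
print(analogy("woman", "doctor", "man"))
```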
---
## Installation
To get started, clone the repository and install the required dependencies:
```bash
git clone https://github.com/kapshaul/NLP-WordVector.git
cd NLP-WordVector
pip install -r requirements.txt
```

## Implementation
1. *Tokenization and Vocabulary Building*: run `build_freq_vectors.py`.
2. *Frequency-Based Word Vectors* and *Learning-Based Word Vectors with GloVe*: run `build_glove_vectors.py`.
3. *Exploring Bias in Word Vectors*: run `Exploring_learned_biases.py`.