# The Foundations of Text Representation
## How Early Text Representation Models Shaped Today’s AI

__Image Credit to [Word2Vec vs BERT](https://www.saltdatalabs.com/blog/word2vec-vs-bert)__
In the last few years, large language models (LLMs) have radically transformed how we interact with technology, from search engines to virtual assistants. Yet behind LLMs' remarkable grasp of language lies a long history of evolving text representation techniques that shaped the foundation of modern Natural Language Processing (NLP). This blog post takes a journey through the pioneering approaches that dominated before LLMs emerged, exploring the humble beginnings of Bag of Words (BoW), the innovations of Word2Vec, and how these earlier methods helped us build the AI systems that today power some of the world's largest companies.
As artificial intelligence (AI) continues to integrate deeper into industries from healthcare to finance, understanding these historical methods isn't just academic—it's essential for grasping where NLP is heading in the future.
## Tokenization: The First Step in Text Processing
Before we can even begin to represent text in a way that machines can understand, we need to break it down into smaller, manageable pieces. This is where tokenization comes in. Tokenization is the process of splitting a large paragraph or text into smaller units, typically words or phrases, called tokens. These tokens become the basic building blocks for further text analysis.
Tokenization might seem like a simple task, but it's more complex than just splitting text by spaces. Consider sentences with contractions like "don't" or "it's," or compound words like "mother-in-law." Different languages also have different rules for tokenization. For instance, in Chinese, words often consist of multiple characters with no spaces in between.
Here’s an illustration of tokenization using the popular Natural Language Toolkit (NLTK) in Python:
```python
import nltk
# Download necessary resources
nltk.download('punkt')
# Sample text
text = "Tokenization isn't as easy as it seems."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
print(tokens)
```
**Output:**
```
['Tokenization', 'is', "n't", 'as', 'easy', 'as', 'it', 'seems', '.']
```
### Key Insights:
- **Handling Contractions:** Notice how "isn't" is split into "is" and "n't", reflecting the underlying meaning more accurately.
- **Punctuation Handling:** The period at the end of the sentence is treated as a separate token, which can be crucial in certain NLP tasks like sentiment analysis or named entity recognition.
### **Why use Punkt?**
Punkt is the pretrained model behind NLTK's sentence tokenizer, and `nltk.word_tokenize` uses it to find sentence boundaries before splitting each sentence into words with the Treebank word tokenizer. Here’s why it’s essential:
1. **Sentence Boundary Detection**: Punkt recognizes where sentences end, handling edge cases like abbreviations that would confuse a simple split on periods or spaces.
2. **Language-Agnostic**: Punkt is an unsupervised algorithm, and NLTK ships pretrained Punkt models for several languages besides English.
3. **Handling Edge Cases**: Consider sentences like *"Dr. Smith isn't available."* A naive approach might treat the period in "Dr." as the end of a sentence, but Punkt recognizes "Dr." as an abbreviation, not a sentence boundary.
4. **Contractions**: The Treebank word tokenizer that `word_tokenize` applies after Punkt splits contractions like "isn't" into two tokens, "is" and "n't", which is important for many NLP tasks that analyze individual word units.
In short, the Punkt resource, together with the word-level tokenizer built into `nltk.word_tokenize`, makes tokenization more accurate by handling linguistic nuances and complex scenarios that basic splitting functions miss.
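To see this in action, here is a minimal sketch, assuming the standard English Punkt model that ships with NLTK, which first splits the *"Dr. Smith"* example into sentences with `nltk.sent_tokenize` (backed by Punkt) and then into words:
```python
import nltk

# Punkt provides the pretrained sentence-boundary model behind sent_tokenize
nltk.download('punkt')

text = "Dr. Smith isn't available. Please call back tomorrow."

# Sentence splitting first (Punkt), then word splitting (Treebank word tokenizer)
sentences = nltk.sent_tokenize(text)
print(sentences)
# Expected: two sentences, with "Dr." kept inside the first one
# rather than being treated as a sentence boundary

for sentence in sentences:
    print(nltk.word_tokenize(sentence))
```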
### Key points to remember about tokenization:
- It's the first step in text preprocessing for NLP tasks.
- It breaks down text into smaller units (tokens), which can be words, phrases, or even sentences.
- The choice of token depends on the task at hand. For instance, if you're analyzing sentiment, you might want to consider phrases (like "not good") instead of just words; see the bigram sketch after this list.
- Tokenization isn't always straightforward, especially for languages other than English. Specialized tokenizers might be needed for different languages or tasks.
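As a small illustration of that last point about token choice, here is a sketch, using an invented example sentence, that pairs word tokens into bigrams with NLTK's `ngrams` helper so a phrase like "not good" survives as a single unit:
```python
import nltk
from nltk.util import ngrams

nltk.download('punkt')

text = "The movie was not good."
tokens = nltk.word_tokenize(text)

# Build bigrams (pairs of adjacent tokens) so short phrases stay together
bigrams = [" ".join(pair) for pair in ngrams(tokens, 2)]
print(bigrams)
# One of the bigrams is "not good", which a purely word-level bag would split apart
```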
## What about JSON?
When dealing with JSON (JavaScript Object Notation) data, tokenization becomes more nuanced. JSON is often used to represent structured data, containing nested objects, arrays, and key-value pairs. Simply splitting text by spaces won’t work here—you’ll need to preserve both the structure and the meaning of the data.
Let’s consider an example:
```json
{
  "user": "JohnDoe",
  "age": 30,
  "hobbies": ["reading", "coding", "gaming"],
  "address": {
    "street": "123 Main St",
    "city": "Sampletown",
    "zipcode": "12345"
  }
}
```
Tokenizing such an object means not just breaking it into individual words, but also handling keys, values, arrays, and nested objects while maintaining the structure. Here’s a simple Python example that uses the `json` module to tokenize and traverse this structure.
```python
import json

# Sample JSON data
data = '''{
    "user": "JohnDoe",
    "age": 30,
    "hobbies": ["reading", "coding", "gaming"],
    "address": {
        "street": "123 Main St",
        "city": "Sampletown",
        "zipcode": "12345"
    }
}'''

# Parse the JSON data
json_data = json.loads(data)

# Recursively tokenize JSON objects
def tokenize_json(obj):
    """Tokenizes a JSON object into a list of tokens.

    Args:
        obj: The JSON object to tokenize.

    Returns:
        A list of tokens representing the JSON object.
    """
    tokens = []  # Initialize an empty list to store the tokens
    if isinstance(obj, dict):  # If the object is a dictionary
        for key, value in obj.items():  # Iterate over key-value pairs
            tokens.append(f'KEY: {key}')  # Add the key as a token
            tokens.extend(tokenize_json(value))  # Recursively tokenize the value
    elif isinstance(obj, list):  # If the object is a list
        for item in obj:  # Iterate over items
            tokens.extend(tokenize_json(item))  # Recursively tokenize each item
    else:  # If the object is a scalar value
        tokens.append(f'VALUE: {str(obj)}')  # Add the value as a token
    return tokens  # Return the list of tokens

# Tokenize the parsed JSON
tokens = tokenize_json(json_data)

# Output the tokens
for token in tokens:
    print(token)
```
**Explanation:**
- **Recursive Tokenization:** The function `tokenize_json` recursively traverses the JSON structure, identifying keys and values at every level, including nested objects and arrays.
- **Differentiating Between Keys and Values:** In the output, we can clearly see the distinction between JSON keys (like `"user"`, `"age"`, etc.) and their corresponding values.
- **Handling Nested Data:** This method ensures that even deeply nested structures like the `"address"` field are tokenized correctly.
**Output:**
```
KEY: user
VALUE: JohnDoe
KEY: age
VALUE: 30
KEY: hobbies
VALUE: reading
VALUE: coding
VALUE: gaming
KEY: address
KEY: street
VALUE: 123 Main St
KEY: city
VALUE: Sampletown
KEY: zipcode
VALUE: 12345
```
- **KEY tokens:** These represent the names of keys or properties within the JSON object. For example, in the JSON object `{"name": "Alice", "age": 30}`, the tokens would be `KEY: name` and `KEY: age`.
- **VALUE tokens:** These represent the values associated with the keys. In the same example, the VALUE tokens would be `VALUE: Alice` and `VALUE: 30`.
---
### Nuances of Tokenizing JSON:
1. **Preserving Structure:** One of the key challenges when tokenizing JSON data is maintaining its hierarchical structure. Simply flattening everything into a list of words (as you might with regular text) would lose the relationships between keys and values, which are crucial for many tasks.
2. **Handling Arrays:** In JSON, arrays can store multiple values, and tokenization needs to distinguish between each element within the array. For instance, the `"hobbies"` field is an array of strings, each representing a separate token, but we must ensure the array structure is respected.
3. **Keys vs. Values:** JSON consists of key-value pairs, so it’s important to clearly differentiate between keys (which act as labels) and values (which hold the actual data). When tokenizing, you may want to process keys differently from values, especially in tasks like data extraction or information retrieval.
4. **Nested Objects:** JSON often contains nested objects. For example, the `"address"` field is itself an object containing more key-value pairs. Tokenizing such nested structures requires a recursive approach to ensure the hierarchy is preserved.
---
#### Why Tokenizing JSON Matters:
Tokenizing JSON data is critical in real-world applications where textual data is structured (like web APIs, logs, or configuration files). Unlike free-form text, JSON has a specific format that must be maintained during tokenization. Proper tokenization ensures the machine can interpret both the structure and the content correctly, which is essential for tasks like:
- **Data extraction:** When pulling information from APIs or databases, maintaining the key-value relationships is crucial (see the sketch after this list).
- **Entity recognition:** If you're building a model to recognize specific entities (like addresses or user information), you'll need to tokenize structured data accurately.
- **Preprocessing for Machine Learning:** Structured JSON data might feed into ML pipelines, and improper tokenization could cause models to misinterpret the input.
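As a sketch of the data-extraction point, the snippet below re-pairs the `KEY:`/`VALUE:` tokens produced by the `tokenize_json` function above into flat (key, value) records; it assumes the `tokens` list from the earlier example and is only one of many ways to consume such tokens:
```python
# Re-pair KEY/VALUE tokens into flat (key, value) records.
# Assumes `tokens` is the list produced by tokenize_json() above.
pairs = []
current_key = None
for token in tokens:
    if token.startswith('KEY: '):
        current_key = token[len('KEY: '):]
    else:  # a VALUE token
        pairs.append((current_key, token[len('VALUE: '):]))

print(pairs)
# e.g. [('user', 'JohnDoe'), ('age', '30'), ('hobbies', 'reading'), ...]
```
Note that this simple pairing flattens the hierarchy (the nested `address` keys lose their parent), which is exactly the kind of trade-off the nuances above describe.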
---
Tokenization is a crucial step in text representation as it determines how the text will be broken down and understood by the subsequent models. Whether you're using Bag of Words, Word2Vec, or even more advanced models like GPT-4, tokenization is the first step in making text understandable to machines.

__Image Credit to https://aiml.com/what-are-the-advantages-and-disadvantages-of-bag-of-words-model/__
## Bag of Words (BoW): A Simple Start
The Bag of Words (BoW) model was one of the earliest attempts to represent text in a way that computers could understand. This method converts text into a ‘bag’ of words, focusing only on the frequency of words while ignoring their order. While BoW is remarkably simple, its limitations are clear. It strips away context and meaning, treating words like "cat" and "dog" as no more related than "airplane" and "bicycle." Despite these drawbacks, BoW became a staple in early text classification tasks like spam filtering and sentiment analysis.
The limitations of BoW became more obvious as language models became essential for applications like chatbots and machine translation, where nuance and word relationships are crucial. Nevertheless, it laid the groundwork for more sophisticated models to come.
### 1. Bag of Words (BoW)
BoW simply represents a document as a frequency count of words. Here's an example using `CountVectorizer` from `sklearn`.
```python
from sklearn.feature_extraction.text import CountVectorizer
# Sample documents
documents = ["I love deep learning", "Deep learning is fun", "I love NLP"]
# Convert to BoW representation
vectorizer = CountVectorizer()
bow_rep = vectorizer.fit_transform(documents)
# Output BoW matrix
print(bow_rep.toarray())
print(vectorizer.get_feature_names_out())
```
```
[[1 0 0 1 1 0]
 [1 1 1 1 0 0]
 [0 0 0 0 1 1]]
['deep' 'fun' 'is' 'learning' 'love' 'nlp']
```
**Explanation:**
- `CountVectorizer` converts the text documents into a sparse matrix where each row represents a document and each column represents a word in the vocabulary.
- This basic BoW model doesn't consider word order or semantics.
**Key points to remember about CountVectorizer:**
- Vocabulary creation: It automatically creates a vocabulary of unique words from the input text.
- Sparse matrix representation: It represents the BoW representation as a sparse matrix, which is efficient for handling large datasets with many unique words.
- Customization: You can customize the CountVectorizer using various parameters, such as stop_words, ngram_range, and max_features, to tailor the BoW representation to your specific needs.
A sparse matrix is a matrix where most of the elements are zero. This is often the case in datasets with high dimensionality, such as those encountered in natural language processing or machine learning.
In the context of the Bag-of-Words (BoW) representation, a sparse matrix is used to represent the occurrence of words in a document. Each row of the matrix corresponds to a document, and each column corresponds to a unique word in the vocabulary. The value at the intersection of a row and column indicates the number of times that word appears in the corresponding document.
Because most documents will only contain a small subset of the total vocabulary, the resulting matrix will be very sparse, with many zero values. This makes it inefficient to store the entire matrix in a dense format. Instead, sparse matrix representations are used to store only the non-zero elements of the matrix, along with their corresponding row and column indices. This can significantly reduce the memory footprint of the BoW representation, making it feasible to work with large datasets.
There are several different sparse matrix representations, including:
* **Compressed Sparse Row (CSR) format:** This format stores the matrix as three arrays: one for the non-zero values, one for the column indices of the non-zero values, and one for the row offsets of the non-zero values.
* **Compressed Sparse Column (CSC) format:** This format is similar to CSR, but it stores the matrix by column instead of by row.
* **Coordinate List (COO) format:** This format stores the matrix as three arrays: one for the row indices of the non-zero values, one for the column indices of the non-zero values, and one for the non-zero values themselves.
The choice of which sparse matrix representation to use depends on the specific application and the type of operations that will be performed on the matrix.
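To make this concrete, here is a minimal sketch, reusing the three sample documents from the BoW example above, that inspects the CSR matrix returned by `CountVectorizer` and converts it to COO with SciPy:
```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["I love deep learning", "Deep learning is fun", "I love NLP"]
bow_rep = CountVectorizer().fit_transform(documents)

# scikit-learn returns a CSR matrix: non-zero values, column indices, and row offsets
print(type(bow_rep))    # a scipy.sparse CSR matrix (exact class name varies by version)
print(bow_rep.data)     # the non-zero counts
print(bow_rep.indices)  # column index of each non-zero count
print(bow_rep.indptr)   # row offsets into data/indices

# COO format instead stores explicit (row, column, value) triples
coo = bow_rep.tocoo()
print(list(zip(coo.row, coo.col, coo.data)))
```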
### Limitations of Bag of Words:
While the BoW model is simple and easy to implement, it does have its limitations:
- It doesn't capture the order of words or the context in which they're used. This can be problematic for tasks where the order of words is important, such as sentiment analysis or machine translation (see the short demonstration after this list).
- It treats each word as an independent entity and doesn't capture the semantic relationships between words.
- It can result in a high-dimensional feature vector due to the large size of the vocabulary, which can lead to computational challenges.
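To make the word-order limitation concrete, here is a small demonstration, using two invented sentences with opposite meanings, showing that they receive identical BoW vectors:
```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["the dog bit the man", "the man bit the dog"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents).toarray()

print(vectorizer.get_feature_names_out())  # ['bit' 'dog' 'man' 'the']
print(bow)
# Both rows are [1 1 1 2]: the two sentences are indistinguishable,
# even though who bit whom is reversed.
```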
## Continuous Bag of Words (CBOW): The Quest for Context
To address the lack of context in BoW, researchers introduced the Continuous Bag of Words (CBOW) model. Part of the Word2Vec family, CBOW predicts a target word from its surrounding words (the context) and, in doing so, learns word embeddings: dense vector representations that encode semantic meaning. This is a significant leap forward, allowing machines to better understand the relationships between words, although CBOW still struggles with subtleties like polysemy (where a word has multiple meanings).
With models like CBOW, AI began to move closer to understanding text in a way more akin to how humans do—grasping word meanings based on their surroundings. This breakthrough became a springboard for further advances in text representation.
```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

# Vocabulary
vocab = {"winter": 0, "summer": 1, "hot": 2, "cold": 3, "snow": 4, "sun": 5}

# Convert words to a tensor of vocabulary indices
def words_to_tensor(words):
    return torch.tensor([vocab[word] for word in words])

# Example dataset class
class CBOWDataset(Dataset):
    def __init__(self, context_data, target_data):
        self.context_data = [words_to_tensor(context) for context in context_data]
        self.target_data = [vocab[target] for target in target_data]

    def __len__(self):
        return len(self.target_data)

    def __getitem__(self, idx):
        return self.context_data[idx], self.target_data[idx]

# Example CBOW model
class CBOWModel(pl.LightningModule):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
        self.linear = torch.nn.Linear(embedding_dim, vocab_size)

    def forward(self, context):
        embedded = self.embeddings(context).mean(dim=1)  # Average the context embeddings
        output = self.linear(embedded)  # Project the averaged embedding to vocabulary logits
        return output

    def training_step(self, batch, batch_idx):
        context, target = batch
        output = self(context)
        loss = torch.nn.functional.cross_entropy(output, target.view(-1))
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

if __name__ == "__main__":
    # Sample training data
    context_data = [["winter", "snow"], ["summer", "sun"], ["hot", "summer"], ["cold", "winter"], ["sun", "hot"], ["snow", "cold"]]
    target_data = ["cold", "hot", "sun", "snow", "summer", "winter"]

    train_dataset = CBOWDataset(context_data, target_data)
    train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
    model = CBOWModel(vocab_size=len(vocab), embedding_dim=10)

    # Train the model
    trainer = pl.Trainer(max_epochs=100)
    trainer.fit(model, train_loader)

    # Print embeddings for every word in the vocabulary
    for word, index in vocab.items():
        word_tensor = torch.tensor(index).long()
        word_embed = model.embeddings(word_tensor)
        print(f"Embedding for '{word}': {word_embed.detach().numpy()}")

    # Example usage for prediction
    CTX = ["winter", "snow"]
    context = words_to_tensor(CTX)  # Context: ["winter", "snow"]
    print(CTX)
    model.eval()
    with torch.no_grad():
        output = model(context.view(1, -1))
        predicted_index = torch.argmax(output)
    print(f"Predicted next word: {list(vocab.keys())[list(vocab.values()).index(predicted_index.item())]}")
```
```
Embedding for 'winter': [-0.41915172 -0.623203 -0.82449454 2.5393567 0.4014106 -1.2217898
0.8545555 1.1546286 -1.6388881 1.0636564 ]
Embedding for 'summer': [ 0.27508304 0.14038266 -0.60545456 2.6588614 -0.67323846 -0.41758934
-1.9973164 -0.25613934 0.6041437 0.02313808]
Embedding for 'hot': [ 1.5088211 0.11288307 -2.1109881 1.1869487 -0.72843784 0.39130706
-1.167067 0.02461074 1.6819512 0.893444 ]
Embedding for 'cold': [ 0.46896353 0.09806713 0.886707 -2.3495743 -1.8937181 -1.7938056
0.5059122 0.73013675 0.63746464 0.454989 ]
Embedding for 'snow': [-0.98443097 0.36648035 -1.244046 0.8789777 -0.73171115 -0.8969204
-0.6760932 1.4665215 -1.4990703 -1.3799495 ]
Embedding for 'sun': [-0.7282811 -0.45672363 -1.2333208 1.1166155 -0.8474363 1.9662368
-1.5384043 0.01905673 0.08274112 -0.20752785]
['winter', 'snow']
Predicted next word: cold
```
The Continuous Bag of Words (CBOW) model is a popular method for training word embeddings. Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation, which is crucial for many natural language processing (NLP) tasks. Word embeddings generated by CBOW can be used to measure semantic similarity between words or documents. For example, you can calculate the cosine similarity between two word vectors to determine how similar their meanings are.
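As a quick sketch of that idea, the snippet below computes cosine similarities between the toy embeddings learned above; it assumes the trained `model` and the `vocab` dictionary from the CBOW example, and on such a tiny corpus the numbers are noisy:
```python
import torch
import torch.nn.functional as F

# Assumes `model` and `vocab` from the CBOW example above
def embedding(word):
    return model.embeddings(torch.tensor(vocab[word]).long())

# Cosine similarity ranges from -1 (opposite) to 1 (very similar)
print(F.cosine_similarity(embedding("winter"), embedding("snow"), dim=0).item())
print(F.cosine_similarity(embedding("winter"), embedding("sun"), dim=0).item())
```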
## Word2Vec: The Revolution Before the Revolution
The Word2Vec models, including CBOW, marked a seismic shift in NLP. Developed by Google in 2013, these shallow neural networks could produce high-quality word embeddings efficiently. For the first time, NLP models could understand relationships between words based on their usage patterns across vast datasets. Word2Vec helped NLP models grasp concepts like the famous "king - man + woman = queen" analogy, encapsulating the semantic relationship between words in vector space.
Yet, despite their power, Word2Vec models had limitations. They were still relatively shallow, meaning they struggled with tasks that required a deep understanding of word order, syntax, or polysemy. This set the stage for more complex models that could tackle those challenges head-on.
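For reference, here is a hedged sketch of the classic analogy query using the Gensim library's Word2Vec implementation; the tiny `corpus` below is invented and far too small for the analogy to actually emerge, so treat the code as an illustration of the API rather than a reproduction of the result:
```python
from gensim.models import Word2Vec

# A stand-in corpus: a list of tokenized sentences. In practice you would train
# on millions of sentences for analogies like king - man + woman ≈ queen to appear.
corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"],
          ["the", "man", "walks"],
          ["the", "woman", "walks"]]

# sg=0 selects the CBOW training objective discussed above; sg=1 would use skip-gram
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# The analogy query: vectors closest to (king + woman - man)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```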
## Capturing Meaning in Text
Capturing the meaning in text and structured data like JSON is one of the most challenging aspects of text analysis. It involves understanding the context, semantics, syntax, and structure of the data. Several techniques are used to capture meaning in text and structured data:
- **Tokenization:** This is the process of breaking down text into smaller units, called tokens. In the context of JSON, tokenization involves not just breaking down text into individual words, but also handling keys, values, arrays, and nested objects while maintaining the structure. This is crucial for machines to interpret both the structure and the content correctly.
- **Part-of-Speech (POS) Tagging:** This involves labeling each word in the text as a noun, verb, adjective, etc. POS tagging helps in understanding the grammatical structure of the text and the role of each word in the sentence.
- **Named Entity Recognition (NER):** NER involves identifying and classifying named entities in the text, such as persons, organizations, locations, etc. It helps in understanding who or what the text is about. In the context of JSON, NER can be used to identify and classify entities within the structured data.
- **Dependency Parsing:** This involves analyzing the grammatical structure of the sentence to determine the relationships between words. Dependency parsing helps in understanding how words in a sentence relate to each other. In structured data like JSON, understanding the relationships between different elements is crucial for accurately interpreting the data.
- **Semantic Analysis:** This involves understanding the meaning of the text. Techniques used in semantic analysis include word sense disambiguation (determining the correct meaning of a word based on context), semantic role labeling (identifying the semantic roles of words in a sentence), and semantic similarity (measuring the semantic similarity between words or sentences). In the context of JSON, semantic analysis can involve understanding the meaning and context of the keys and values in the structured data.
By combining these techniques, we can capture the meaning in both text and structured data, enabling more accurate and nuanced analysis and interpretation.
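As a brief illustration of the first few techniques, here is a minimal NLTK sketch combining tokenization, POS tagging, and named entity recognition; the sentence is invented, and the resource names are NLTK's standard downloads:
```python
import nltk

# Standard NLTK resources for tokenization, POS tagging, and NER
for resource in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(resource)

text = "Alice moved to Sampletown and joined Acme Corp in 2023."

tokens = nltk.word_tokenize(text)   # tokenization
tagged = nltk.pos_tag(tokens)       # part-of-speech tagging, e.g. ('Alice', 'NNP')
entities = nltk.ne_chunk(tagged)    # named entity recognition over the POS-tagged tokens

print(tagged)
print(entities)
```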
## Conclusion: The Future Built on the Past
In 2024, as AI begins to reshape industries, politics, and daily life, it’s easy to forget the early days of NLP when methods like BoW dominated the field. Understanding the evolution of text representation, from these simpler models to GPT and beyond, helps us appreciate the complex AI systems we rely on today. As we look to the future, where NLP models could enable even more personalized and nuanced interactions between humans and machines, one thing is clear: the past is prologue.
The next wave of AI innovation will build upon these foundations, and it’s crucial to understand where we’ve come from to anticipate where we’re headed.