https://github.com/incept5/ai-learning
This repository contains a collection of demos and examples for learning about Natural Language Processing (NLP) and Large Language Models (LLMs). It includes examples of tokenization, embeddings, sentiment analysis, and more.
- Host: GitHub
- URL: https://github.com/incept5/ai-learning
- Owner: Incept5
- Created: 2025-05-01T13:00:30.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-01T13:32:29.000Z (about 1 year ago)
- Last Synced: 2025-06-12T07:43:11.635Z (about 1 year ago)
- Language: Python
- Size: 160 KB
- Stars: 0
- Watchers: 6
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# NLP and LLM Learning Repository
This repository contains a collection of demos and examples for learning about Natural Language Processing (NLP) and Large Language Models (LLMs). It includes examples of tokenization, embeddings, sentiment analysis, and more.
## Repository Structure
The repository is organized into the following main sections:
### Tokenization
The `tokenization` directory contains demos and examples related to tokenization, which is the process of breaking text into tokens that can be processed by language models.
- [TOKENIZATION_MODELS.md](tokenization/TOKENIZATION_MODELS.md) - Guide to different tokenization models
- [llm-tokenization-demo.py](tokenization/llm-tokenization-demo.py) - Comprehensive demo comparing GPT-2 and BERT tokenizers
- [simple_tokenization.py](tokenization/simple_tokenization.py) - Basic tokenization using BERT tokenizer
- [simple_tokenization_02.py](tokenization/simple_tokenization_02.py) - Comparison between GPT-2 and OPT tokenizers
- [tiktoken_demo.py](tokenization/tiktoken_demo.py) - Demo of OpenAI's tiktoken library
- [sentencepiece_demo.py](tokenization/sentencepiece_demo.py) - Demo of Google's SentencePiece tokenizer
- [tokenization_comparison.py](tokenization/tokenization_comparison.py) - Comparison of different tokenization approaches
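To make the idea of breaking text into tokens concrete, here is a minimal, dependency-free sketch of greedy longest-match subword splitting, loosely in the style of WordPiece. It is not taken from the scripts above: `TOY_VOCAB`, `wordpiece_tokenize`, and `tokenize` are hypothetical names, and the vocabulary is invented for illustration. Real tokenizers such as BERT's use learned vocabularies with tens of thousands of entries.

```python
# Toy WordPiece-style tokenizer: greedy longest-match against a tiny vocabulary.
# TOY_VOCAB and both function names are hypothetical, for illustration only.
TOY_VOCAB = {"token", "##ization", "##izer", "##s", "is", "fun", "[UNK]"}


def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into subword pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


if __name__ == "__main__":
    print(tokenize("Tokenization is fun"))
```

With this toy vocabulary, "Tokenization" splits into `token` and `##ization`, while a word with no matching pieces maps to `[UNK]`.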
### Embeddings
The `embeddings` directory contains demos and examples related to text embeddings, which are vector representations of text that capture semantic meaning.
- [EMBEDDING_MODELS.md](embeddings/EMBEDDING_MODELS.md) - Guide to different embedding models
- [embeddings_demo.py](embeddings/embeddings_demo.py) - Demo using SentenceTransformers models
- [huggingface_embeddings_demo.py](embeddings/huggingface_embeddings_demo.py) - Demo using Hugging Face Transformers models
- [openai_embeddings_demo.py](embeddings/openai_embeddings_demo.py) - Demo using OpenAI's embedding models
- [ollama_embedding.py](embeddings/ollama_embedding.py) - Demo using Ollama's local embedding models
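Embeddings are usually compared with cosine similarity, so that texts with similar meaning score close to 1. The sketch below shows only that comparison step, using the standard library. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration and do not come from any model; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (an assumption for illustration, not model output).
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9],
}

if __name__ == "__main__":
    for other in ("kitten", "car"):
        score = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS[other])
        print(f"cat vs {other}: {score:.3f}")
```

In this toy data, "cat" scores much higher against "kitten" than against "car", which is the behavior the demos above explore with real models.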
### Other Demos
- [gp2_demo.py](gp2_demo.py) - Demo showcasing GPT-2 limitations and hallucinations
- [sentiment_01.py](sentiment_01.py) - Simple sentiment analysis using a pre-trained model
## Setup Instructions
### Prerequisites
- Python 3.8 or higher
- pip (Python package installer)
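If you are unsure which interpreter a shell is using, a quick check like the following can confirm the Python 3.8 minimum. This is a small sketch; `MIN_VERSION` and `meets_minimum` are hypothetical names and not part of the repository.

```python
import sys

MIN_VERSION = (3, 8)  # minimum stated in the prerequisites above


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Compare (major, minor) as a tuple so that 3.10 correctly ranks above 3.8."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {current}: {status}")
```

Comparing tuples rather than version strings avoids the common mistake of ranking "3.10" below "3.8".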
### Installation
1. Clone this repository:
```bash
git clone https://github.com/Incept5/ai-learning.git
cd ai-learning
```
2. Create and activate a virtual environment (recommended):
```bash
python -m venv venv
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
```
3. Install the required packages:
```bash
pip install -r requirements.txt
```
4. Set up environment variables (for API-based demos):
```bash
# Copy the example .env file
cp .env.example .env
# Edit the .env file with your API keys
# Get your OpenAI API key from: https://platform.openai.com/account/api-keys
# Get your Hugging Face token from: https://huggingface.co/settings/tokens
```
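Before running the API-based demos, it can help to confirm that the `.env` file actually contains the keys you expect. The sketch below parses simple `KEY=VALUE` lines with the standard library only. The variable names `OPENAI_API_KEY` and `HF_TOKEN` are assumptions made for illustration; check `.env.example` for the names the demos really use. The helper names are hypothetical as well.

```python
import os

# Assumed variable names; confirm the real ones in .env.example.
REQUIRED_KEYS = ["OPENAI_API_KEY", "HF_TOKEN"]


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines, comments, and malformed lines."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            found = parse_env_lines(handle)
        print("Missing keys:", missing_keys(found) or "none")
    else:
        print("No .env file found in the current directory.")
```

This only reports which keys are missing. It never prints the key values, which keeps secrets out of terminal logs.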
## Running the Demos
### Tokenization Demos
```bash
# Basic tokenization demo
python tokenization/simple_tokenization.py
# Comprehensive tokenization comparison
python tokenization/llm-tokenization-demo.py
# OpenAI's tiktoken demo
python tokenization/tiktoken_demo.py
# SentencePiece demo
python tokenization/sentencepiece_demo.py
```
### Embedding Demos
```bash
# List available SentenceTransformers models
python embeddings/embeddings_demo.py --list-models
# Run embedding demo with a specific model
python embeddings/embeddings_demo.py --model all-mpnet-base-v2
# OpenAI embeddings demo (requires API key)
python embeddings/openai_embeddings_demo.py
# Hugging Face embeddings demo
python embeddings/huggingface_embeddings_demo.py
# Ollama embeddings demo (requires Ollama installation)
python embeddings/ollama_embedding.py
```
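The `--list-models` and `--model` flags shown above follow a common command-line pattern. As a rough sketch of how a script might expose such flags with `argparse`, consider the following. It is not the actual implementation of `embeddings_demo.py`: the `KNOWN_MODELS` list, the default model, and the function names are assumptions for illustration.

```python
import argparse

# Hypothetical model list; the real script may offer a different set.
KNOWN_MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]


def build_parser():
    """Build a parser exposing the two flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Embedding demo (sketch)")
    parser.add_argument(
        "--list-models",
        action="store_true",
        help="print the known model names and exit",
    )
    parser.add_argument(
        "--model",
        default=KNOWN_MODELS[0],  # assumed default, for illustration only
        help="name of the model to use",
    )
    return parser


def main(argv):
    """Handle the parsed flags; a real demo would load the model here."""
    args = build_parser().parse_args(argv)
    if args.list_models:
        for name in KNOWN_MODELS:
            print(name)
        return 0
    print(f"Would load model: {args.model}")
    return 0


# Explicit argument lists keep the example independent of the real command line.
main(["--list-models"])
main(["--model", "all-mpnet-base-v2"])
```

Note that `argparse` converts the dash in `--list-models` to an underscore, so the parsed value is read as `args.list_models`.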
### Other Demos
```bash
# GPT-2 limitations demo
python gp2_demo.py
# Sentiment analysis demo
python sentiment_01.py
```
## Additional Resources
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [SentenceTransformers Documentation](https://www.sbert.net/)
- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference)
- [Ollama Documentation](https://ollama.com/docs)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.