Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/furk4neg3/ibm_tokenization_techniques
Practice with tokenizers from IBM's Generative AI Advance Fine-Tuning for LLMs course.
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/furk4neg3/ibm_tokenization_techniques
- Owner: furk4neg3
- Created: 2024-11-09T19:03:30.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-09T19:24:09.000Z (about 2 months ago)
- Last Synced: 2024-11-09T20:17:53.773Z (about 2 months ago)
- Language: Jupyter Notebook
- Size: 26.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# IBM NLP Tokenization Lab
This repository contains my practice implementations of various tokenization techniques as part of an IBM AI lab project. The exercises cover the fundamentals of tokenization, including word, subword, and sentence tokenization, to prepare text data for machine learning models. Implemented using Python and Hugging Face's Transformers library, this lab is a foundational project for understanding the preprocessing steps crucial to natural language processing (NLP).
## Table of Contents
- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Project Details](#project-details)
- [Key Learnings](#key-learnings)
- [References](#references)

## Overview
Tokenization is an essential step in natural language processing, breaking down text into meaningful pieces for easier analysis and processing by machine learning models. In this lab, I implemented tokenization techniques commonly used in NLP, exploring methods like:
- Word tokenization
- Subword tokenization (BPE, WordPiece)
- Sentence tokenization

This project provided hands-on practice with both basic and advanced tokenizers, deepening my understanding of their role in preparing text data for various NLP models.
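The lab itself uses Hugging Face tokenizers; as a minimal standard-library sketch of the word- and sentence-level ideas above (the regexes here are illustrative assumptions, not the library's actual behavior):

```python
import re

text = "Tokenizers split text. Subword units handle rare words!"

# Word-level: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence-level: a naive split after end-of-sentence punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentences)
```

Real tokenizers handle many cases this sketch ignores (abbreviations, contractions, Unicode), which is why the lab relies on library implementations.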
## Technologies Used
- **Python**: Primary programming language
- **Hugging Face Transformers**: Library for model tokenization and NLP tasks
- **Jupyter Notebook**: For documenting and demonstrating code execution

## Project Details
1. **Word Tokenization**: Explored tokenizers that split text into words using whitespace and punctuation as delimiters.
2. **Subword Tokenization**: Implemented Byte Pair Encoding (BPE) and WordPiece methods, commonly used in modern language models.
3. **Sentence Tokenization**: Tokenized text at the sentence level, useful for tasks involving sentence-based processing.

Each technique includes code samples and explanations of its applications in NLP.
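To illustrate how BPE builds a subword vocabulary, here is a toy sketch of the merge loop (a simplified standard-library version for demonstration; the corpus and merge count are made up, and this is not the Hugging Face implementation used in the lab):

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every occurrence of the given pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in words.items()}

# Toy corpus: each word is a space-separated symbol sequence with a frequency.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(5):
    best = max(get_pair_counts(words), key=get_pair_counts(words).get)
    words = merge_pair(best, words)
    print("merged:", best)
```

Each iteration merges the most frequent adjacent pair, so frequent fragments like `est` and whole words like `low` gradually become single vocabulary units, which is how BPE-based models represent rare words from known subwords.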
## Key Learnings
- Understanding different tokenization methods and their use cases
- Implementing tokenizers to handle diverse language data
- Gaining proficiency with Hugging Face’s Transformers library
- Applying tokenization to real-world data for model training

## References
- [IBM AI Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ai-engineer?)
- [Generative AI Engineering with LLMs Specialization](https://www.coursera.org/specializations/generative-ai-engineering-with-llms)
- [Generative AI and LLMs: Architecture and Data Preparation](https://www.coursera.org/learn/generative-ai-llm-architecture-data-preparation?specialization=ai-engineer)