https://github.com/furk4neg3/ibm_tokenization_techniques

Practice with tokenizers from IBM's Generative AI Advance Fine-Tuning for LLMs course.

# IBM NLP Tokenization Lab

This repository contains my practice implementations of several tokenization techniques, written for an IBM AI lab. The exercises cover word, subword, and sentence tokenization as ways to prepare text data for machine learning models. The lab is implemented in Python with Hugging Face's Transformers library and serves as an introduction to the preprocessing steps that natural language processing (NLP) depends on.

## Table of Contents

- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Project Details](#project-details)
- [Key Learnings](#key-learnings)
- [References](#references)

## Overview

Tokenization is an essential step in natural language processing: it breaks text into smaller units (tokens) that machine learning models can process. In this lab, I implemented tokenization techniques commonly used in NLP:

- Word tokenization
- Subword tokenization (BPE, WordPiece)
- Sentence tokenization

This project provided hands-on practice with both basic and advanced tokenizers and showed how each one prepares text data for NLP models.
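To make the word and sentence levels concrete, here is a minimal sketch using only Python's standard `re` module. It is an illustration, not the code from the lab notebook: the function names `word_tokenize` and `sentence_tokenize` are hypothetical, and the rules are deliberately naive (for example, an abbreviation such as "Dr." would be treated as a sentence end).

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single non-space punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


def sentence_tokenize(text: str) -> list[str]:
    """Split text wherever '.', '!' or '?' is followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "Tokenization is useful. It splits text into pieces!"
print(sentence_tokenize(sample))
print(word_tokenize(sample))
```

Production tokenizers handle many more edge cases (contractions, abbreviations, Unicode punctuation), so a rule-based splitter like this is best treated as a baseline for comparison.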

## Technologies Used

- **Python**: Primary programming language
- **Hugging Face Transformers**: Library for model tokenization and NLP tasks
- **Jupyter Notebook**: For documenting and demonstrating code execution

## Project Details

1. **Word Tokenization**: Explored tokenizers that split text into words using whitespace and punctuation as delimiters.
2. **Subword Tokenization**: Implemented Byte Pair Encoding (BPE) and WordPiece methods, commonly used in modern language models.
3. **Sentence Tokenization**: Tokenized text at the sentence level, useful for tasks involving sentence-based processing.

Each technique includes code samples and explanations of its applications in NLP.
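As a rough illustration of the subword step, the sketch below learns Byte Pair Encoding merges from a tiny word list and applies them to a new word. It assumes a toy, character-level setup with no end-of-word marker; the helper names (`merge_pair`, `learn_bpe`, `apply_bpe`) are hypothetical, and this is not the Hugging Face implementation or the notebook's code.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` inside `symbols`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters with a frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: list) -> list:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "lowest"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)
print(apply_bpe("lowly", merges))
```

The design point to notice is that frequent character sequences become single tokens, while rarer material stays split into smaller pieces, so an unseen word can still be represented.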

## Key Learnings

- Understanding different tokenization methods and their use cases
- Implementing tokenizers to handle diverse language data
- Gaining proficiency with Hugging Face’s Transformers library
- Applying tokenization to real-world data for model training
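On the point about handling diverse language data: one reason subword tokenizers cope with unfamiliar words is that they can fall back to smaller known pieces, or to an unknown token when nothing matches. The sketch below shows a WordPiece-style greedy longest-match-first lookup under stated assumptions: the vocabulary `toy_vocab` is invented for illustration, `##` marks a piece that continues a word (the convention used by BERT-style vocabularies), and training the vocabulary is not shown.

```python
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedily match the longest vocabulary piece from left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary, chosen only to demonstrate the lookup.
toy_vocab = {"token", "##ization", "##ize", "un", "##related"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unrelated", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With a real pretrained vocabulary the same lookup usually finds some decomposition, which is why the unknown token appears far less often than it would with a word-level vocabulary.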

## References

- [IBM AI Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ai-engineer?)
- [Generative AI Engineering with LLMs Specialization](https://www.coursera.org/specializations/generative-ai-engineering-with-llms)
- [Generative AI and LLMs: Architecture and Data Preparation](https://www.coursera.org/learn/generative-ai-llm-architecture-data-preparation?specialization=ai-engineer)