https://github.com/furk4neg3/ibm_nlp_tokenization_and_dataloaders

This repository contains my practice implementations of various tokenization techniques and creating dataloaders as part of an IBM AI lab project.
# IBM NLP Tokenization and Data Loader Lab

This repository contains practice implementations of tokenization techniques and data loader design for NLP, part of an IBM AI lab project. The project covers fundamental tokenization methods, including word, subword, and sentence tokenization, along with custom data loaders designed to handle large text datasets effectively in machine learning workflows. Implemented using Python and Hugging Face's Transformers library, this lab offers a comprehensive view of text preprocessing and data management in NLP.

## Table of Contents

- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Project Details](#project-details)
- [Key Learnings](#key-learnings)
- [References](#references)

## Overview

Tokenization and efficient data handling are critical in natural language processing. This project implements tokenization techniques such as:

- Word tokenization
- Subword tokenization (BPE, WordPiece)
- Sentence tokenization
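
The simplest of these methods can be illustrated with a minimal, regex-based sketch of word- and sentence-level tokenization (pure Python, illustrative only; the lab itself relies on Hugging Face tokenizers, and the function names below are hypothetical):

```python
import re

def word_tokenize(text):
    # Words are runs of word characters; punctuation marks become their own tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize(text):
    # Naive split: cut after sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

For example, `word_tokenize("Hello, world!")` yields `["Hello", ",", "world", "!"]`. Real tokenizers handle many edge cases (abbreviations, contractions, Unicode) that this sketch ignores.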

In addition, I designed data loaders to preprocess and batch data for model training, optimizing data throughput and memory management.
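
A data loader of that kind can be sketched with PyTorch's `Dataset` and `DataLoader` utilities. The toy ID sequences and the `pad_collate` helper below are hypothetical stand-ins, not the lab's actual code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Wraps pre-tokenized token-ID sequences (toy data for illustration)."""
    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx], dtype=torch.long)

def pad_collate(batch, pad_id=0):
    # Pad every sequence in the batch to the length of the longest one.
    max_len = max(seq.size(0) for seq in batch)
    padded = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    for i, seq in enumerate(batch):
        padded[i, : seq.size(0)] = seq
    return padded

loader = DataLoader(TextDataset([[1, 2, 3], [4, 5], [6]]),
                    batch_size=2, collate_fn=pad_collate)
```

Padding inside the collate function, rather than padding the whole dataset up front, keeps memory usage proportional to the longest sequence in each batch instead of the longest sequence in the corpus.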

## Technologies Used

- **Python**: Primary programming language
- **Hugging Face Transformers**: Library for model tokenization and NLP tasks
- **Jupyter Notebook**: For documenting and demonstrating code execution
- **PyTorch**: For data loader creation and handling batched data

## Project Details

1. **Tokenization Techniques**:
   - **Word Tokenization**: Splits text into words at whitespace and punctuation boundaries.
   - **Subword Tokenization**: Implements Byte Pair Encoding (BPE) and WordPiece methods.
   - **Sentence Tokenization**: Splits text at the sentence level for sentence-based processing.

2. **Data Loaders**:
   - Designed data loaders compatible with PyTorch for batch processing.
   - Enabled efficient memory usage and data streaming to optimize model training.
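
The BPE method from item 1 can be sketched as a small merge-learning loop: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. This is a toy version of the algorithm (the `bpe_merges` helper is hypothetical, not part of the lab):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # words: iterable of tuples of symbols, e.g. tuple("low") -> ("l", "o", "w").
    vocab = Counter(words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

On the classic toy corpus `low`, `lower`, `lowest`, the first two learned merges are `("l", "o")` and then `("lo", "w")`, so `low` collapses to a single symbol. Production tokenizers (e.g. Hugging Face's) add pre-tokenization, special tokens, and much faster data structures on top of this core idea.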

## Key Learnings

- Practical application of tokenization techniques across common NLP tasks.
- Design and implementation of custom data loaders for handling large-scale text data.
- Proficiency with Hugging Face’s Transformers library and PyTorch data utilities.
- Building scalable text preprocessing and batching workflows for NLP.

## References

- [IBM AI Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ai-engineer)
- [Generative AI Engineering with LLMs Specialization](https://www.coursera.org/specializations/generative-ai-engineering-with-llms)
- [Generative AI and LLMs: Architecture and Data Preparation](https://www.coursera.org/learn/generative-ai-llm-architecture-data-preparation?specialization=generative-ai-engineering-with-llms)