Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/furk4neg3/ibm_nlp_tokenization_and_dataloaders
This repository contains my practice implementations of various tokenization techniques and data loaders, built as part of an IBM AI lab project.
- Host: GitHub
- URL: https://github.com/furk4neg3/ibm_nlp_tokenization_and_dataloaders
- Owner: furk4neg3
- Created: 2024-11-10T10:50:54.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-10T10:57:09.000Z (about 2 months ago)
- Last Synced: 2024-11-10T11:32:01.374Z (about 2 months ago)
- Language: Jupyter Notebook
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# IBM NLP Tokenization and Data Loader Lab
This repository contains practice implementations of tokenization techniques and data loader design for NLP, part of an IBM AI lab project. The project covers fundamental tokenization methods, including word, subword, and sentence tokenization, along with custom data loaders designed to handle large text datasets effectively in machine learning workflows. Implemented using Python and Hugging Face's Transformers library, this lab offers a comprehensive view of text preprocessing and data management in NLP.
## Table of Contents
- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Project Details](#project-details)
- [Key Learnings](#key-learnings)
- [References](#references)

## Overview
Tokenization and efficient data handling are critical in natural language processing. This project implements tokenization techniques such as:
- Word tokenization
- Subword tokenization (BPE, WordPiece)
- Sentence tokenization

In addition, I designed data loaders to preprocess and batch data for model training, optimizing data throughput and memory management.
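To make the distinction concrete, here is a minimal sketch comparing naive word tokenization with subword (WordPiece) tokenization. It assumes the Hugging Face `bert-base-uncased` checkpoint; the exact models used in the lab notebooks may differ.

```python
# Word vs. subword tokenization -- a minimal sketch.
# Assumes: pip install transformers
from transformers import AutoTokenizer

text = "Tokenization underpins modern NLP pipelines."

# Naive word tokenization: split on whitespace.
word_tokens = text.split()
print(word_tokens)
# ['Tokenization', 'underpins', 'modern', 'NLP', 'pipelines.']

# Subword (WordPiece) tokenization: out-of-vocabulary words are broken
# into known pieces, marked with the '##' continuation prefix.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(text))
# e.g. ['token', '##ization', 'under', '##pins', 'modern', 'nl', '##p', 'pipelines', '.']
```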
## Technologies Used
- **Python**: Primary programming language
- **Hugging Face Transformers**: Library for model tokenization and NLP tasks
- **Jupyter Notebook**: For documenting and demonstrating code execution
- **PyTorch**: For data loader creation and handling batched data

## Project Details
1. **Tokenization Techniques**:
- **Word Tokenization**: Tokenizers that split text into words using whitespace and punctuation.
- **Subword Tokenization**: Implemented Byte Pair Encoding (BPE) and WordPiece methods.
- **Sentence Tokenization**: Tokenization at the sentence level for sentence-based processing.
2. **Data Loaders**:
- Designed data loaders compatible with PyTorch for batch processing.
- Enabled efficient memory usage and data streaming to optimize model training (see the sketch below).
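The following is a minimal sketch of such a loader, assuming PyTorch's `Dataset`/`DataLoader` utilities and a Hugging Face tokenizer; the names `TextDataset` and `collate_fn` are illustrative rather than taken from the lab notebooks.

```python
# A batched text data loader -- a minimal sketch.
# Assumes: pip install torch transformers
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class TextDataset(Dataset):
    """Wraps a list of raw strings; tokenization is deferred to batching."""
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

def collate_fn(batch):
    # Tokenize on the fly and pad only to the longest sequence in this
    # batch, keeping memory usage proportional to batch content.
    return tokenizer(batch, padding=True, truncation=True, return_tensors="pt")

dataset = TextDataset(["short text", "a somewhat longer example sentence"])
loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

for batch in loader:
    print(batch["input_ids"].shape)  # (batch_size, padded_seq_len)
```

Deferring tokenization to the collate function keeps only raw strings in memory and lets each batch be padded independently, rather than padding the whole corpus to one global maximum length.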
## Key Learnings
- Practical application of various tokenization techniques in NLP workflows.
- Design and implementation of custom data loaders for handling large-scale text data.
- Proficiency with Hugging Face’s Transformers library and PyTorch data utilities.
- Building scalable text preprocessing and batching workflows for NLP.

## References
- [IBM AI Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ai-engineer)
- [Generative AI Engineering with LLMs Specialization](https://www.coursera.org/specializations/generative-ai-engineering-with-llms)
- [Generative AI and LLMs: Architecture and Data Preparation](https://www.coursera.org/learn/generative-ai-llm-architecture-data-preparation?specialization=generative-ai-engineering-with-llms)