https://github.com/furk4neg3/ibm_nlp_tokenization_and_classification

# IBM NLP Tokenization and Document Classification Lab

This repository contains my implementations of tokenization techniques and document classification as part of an IBM AI lab project. The exercises cover fundamental NLP preprocessing methods, including word and subword tokenization, as well as applying classification models to the processed text data using Hugging Face's Transformers library.

## Table of Contents

- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Project Details](#project-details)
- [Key Learnings](#key-learnings)
- [References](#references)

## Overview

Tokenization and document classification are essential steps in natural language processing. This project implements tokenization techniques such as:

- Word tokenization
- Subword tokenization (BPE, WordPiece)
- Sentence-level tokenization

Additionally, a classification model is applied to the tokenized text data, demonstrating how NLP models process and classify documents based on their content.
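The tokenization strategies listed above can be illustrated with a minimal, dependency-free sketch. The toy vocabulary below is an assumption for demonstration only (the actual notebooks use pretrained Hugging Face tokenizers); the greedy longest-match loop mirrors how WordPiece-style subword tokenization works, with `##` marking word-internal pieces:

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Word tokenization: split text into words and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Toy subword vocabulary (hypothetical); "##" marks word-internal pieces.
VOCAB = {"token", "##ization", "##s", "and", "class", "##ification", "[UNK]"}

def wordpiece_tokenize(word: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy longest-match-first subword tokenization (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark non-initial pieces
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching piece: unknown token
        start = end
    return pieces

print(word_tokenize("Tokenization and classification."))
print(wordpiece_tokenize("tokenization"))  # -> ['token', '##ization']
```

Splitting rare words into frequent subword pieces is what lets a fixed-size vocabulary cover an open-ended set of words, which is why BPE and WordPiece dominate modern NLP pipelines.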

## Technologies Used

- **Python**: Primary programming language
- **Hugging Face Transformers**: Library for model tokenization and NLP tasks
- **Jupyter Notebook**: For documenting and demonstrating code execution

## Project Details

1. **Tokenization Techniques**:
- **Word Tokenization**: Splits text into individual words.
- **Subword Tokenization**: Implements the Byte Pair Encoding (BPE) and WordPiece methods.
- **Sentence Tokenization**: Splits text at the sentence level, useful for tasks requiring sentence-level analysis.
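Unlike WordPiece's fixed vocabulary lookup, BPE *learns* its vocabulary by repeatedly merging the most frequent adjacent symbol pair in a corpus. The sketch below shows that learning step on an assumed toy word-frequency table; production tokenizers (e.g. the Hugging Face `tokenizers` library) implement the same idea with heavy optimization:

```python
from collections import Counter

def learn_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols (initially single characters).
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the corpus.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
print(merges)  # first merges fuse the frequent "lo" then "low" stems
```

Each learned merge becomes one vocabulary entry, so `num_merges` directly controls the trade-off between vocabulary size and sequence length.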

2. **Document Classification**:
- Applied a classification model to categorize text documents based on their content.
- Trained a Hugging Face Transformer model to perform document classification, providing hands-on experience with text analysis.
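The lab itself fine-tunes a Hugging Face Transformer for this step. As a dependency-free stand-in that still shows the core idea of classifying tokenized documents by content, here is a minimal multinomial Naive Bayes sketch over bag-of-words counts; the labels and training documents are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs: list[tuple[list[str], str]]):
    """Count per-class document and token frequencies."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def classify(tokens, class_counts, token_counts, vocab):
    """Pick the class with the highest log-posterior (Laplace smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical pre-tokenized training documents with sentiment labels.
train = [
    (["great", "fascinating", "model"], "positive"),
    (["loved", "great", "results"], "positive"),
    (["boring", "poor", "results"], "negative"),
    (["poor", "model"], "negative"),
]
params = train_nb(train)
print(classify(["great", "model"], *params))  # -> positive
```

A fine-tuned Transformer replaces these independent token counts with contextual representations, but the interface is the same: tokenized text in, predicted label out.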

## Key Learnings

- Practical understanding of various tokenization methods and their importance in NLP.
- Hands-on experience with document classification using Hugging Face’s Transformers library.
- Exposure to foundational NLP workflows, including data preparation and model training.

## References

- [IBM AI Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ai-engineer)
- [Generative AI Engineering with LLMs Specialization](https://www.coursera.org/specializations/generative-ai-engineering-with-llms)
- [Gen AI Foundational Models for NLP & Language Understanding](https://www.coursera.org/learn/gen-ai-foundational-models-for-nlp-and-language-understanding?specialization=generative-ai-engineering-with-llms)