Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/furk4neg3/ibm_nlp_tokenization_and_classification
This repository contains my implementations of tokenization techniques and document classification as part of an IBM AI lab project.
- Host: GitHub
- URL: https://github.com/furk4neg3/ibm_nlp_tokenization_and_classification
- Owner: furk4neg3
- Created: 2024-11-10T12:27:41.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-10T12:29:29.000Z (about 2 months ago)
- Last Synced: 2024-11-10T13:28:35.311Z (about 2 months ago)
- Language: Jupyter Notebook
- Size: 1.31 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# IBM NLP Tokenization and Document Classification Lab
This repository contains my implementations of tokenization techniques and document classification as part of an IBM AI lab project. The exercises cover fundamental NLP preprocessing methods, including word and subword tokenization, as well as applying classification models on processed text data using Hugging Face's Transformers library.
## Table of Contents
- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Project Details](#project-details)
- [Key Learnings](#key-learnings)
- [References](#references)

## Overview
Tokenization and document classification are essential steps in natural language processing. This project implements tokenization techniques such as:
- Word tokenization
- Subword tokenization (BPE, WordPiece)
- Sentence-level tokenization

Additionally, a classification model is applied to the tokenized text data, demonstrating how NLP models process and classify documents based on their content.
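The workflow described above (tokenize, then classify) can be sketched in plain Python. This is an illustrative toy, not the Hugging Face Transformers pipeline the notebooks use: a regex word tokenizer, a single Byte Pair Encoding merge step, and a small Naive Bayes document classifier. All function names and the sample data are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

def word_tokenize(text):
    # Naive word-level tokenization: words and punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def bpe_merge_step(corpus):
    # One merge step of Byte Pair Encoding. `corpus` maps a word, represented
    # as a tuple of symbols, to its frequency; the most frequent adjacent
    # symbol pair is merged into a single new symbol everywhere it occurs.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

def train_nb(labeled_docs):
    # labeled_docs: list of (token_list, label) pairs.
    label_counts = Counter(label for _, label in labeled_docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab

def classify(tokens, model):
    # Multinomial Naive Bayes with add-one (Laplace) smoothing.
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    def score(label):
        s = math.log(label_counts[label] / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        return s + sum(math.log((token_counts[label][t] + 1) / denom) for t in tokens)
    return max(label_counts, key=score)

# Toy usage: train on two labeled documents, classify a new one.
docs = [(["ball", "goal", "team"], "sports"),
        (["oven", "recipe", "flour"], "cooking")]
model = train_nb(docs)
print(classify(word_tokenize("A recipe with flour"), model))  # prints "cooking"
```

In practice the lab delegates all of this to pretrained tokenizers and models from the Transformers library; the sketch only shows the mechanics those components automate.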
## Technologies Used
- **Python**: Primary programming language
- **Hugging Face Transformers**: Library for model tokenization and NLP tasks
- **Jupyter Notebook**: For documenting and demonstrating code execution

## Project Details
1. **Tokenization Techniques**:
   - **Word Tokenization**: Tokenizers that split text into individual words.
   - **Subword Tokenization**: Implements Byte Pair Encoding (BPE) and WordPiece methods.
   - **Sentence Tokenization**: Tokenizes text at the sentence level, useful for tasks requiring sentence-level analysis.
2. **Document Classification**:
   - Applied a classification model to categorize text documents based on their content.
   - Trained a Hugging Face Transformer model to perform document classification, providing hands-on experience with text analysis.

## Key Learnings
- Practical understanding of various tokenization methods and their importance in NLP.
- Hands-on experience with document classification using Hugging Face’s Transformers library.
- Exposure to foundational NLP workflows, including data preparation and model training.

## References
- [IBM AI Engineering Professional Certificate](https://www.coursera.org/professional-certificates/ai-engineer?)
- [Generative AI Engineering with LLMs Specialization](https://www.coursera.org/specializations/generative-ai-engineering-with-llms)
- [Gen AI Foundational Models for NLP & Language Understanding](https://www.coursera.org/learn/gen-ai-foundational-models-for-nlp-and-language-understanding?specialization=generative-ai-engineering-with-llms)