https://github.com/rahulunair/xpu_text_classifier
Compare different text classifiers on Intel discrete GPUs
- Host: GitHub
- URL: https://github.com/rahulunair/xpu_text_classifier
- Owner: rahulunair
- License: apache-2.0
- Created: 2023-06-22T22:14:23.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-06-29T16:39:16.000Z (over 2 years ago)
- Last Synced: 2025-01-19T21:48:25.232Z (about 1 year ago)
- Topics: alchemist, bart, bert-model, intel-arc, intel-gpu, intel-gpu-max
- Language: Python
- Homepage:
- Size: 32.2 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# xpu_text_classifier: Custom Text Classification on Intel dGPUs
xpu_text_classifier lets you fine-tune transformer models on custom datasets for multi-class or multi-label classification. Supported models include popular transformer architectures such as BERT, BART, and DistilBERT. It uses the Hugging Face Trainer to handle training and Intel Extension for PyTorch to run on Intel discrete GPUs (dGPUs).
## Table of Contents
- [Installation](#installation)
- [Preparing Your Dataset](#preparing-your-dataset)
- [Usage](#usage)
- [Monitoring GPU Usage](#monitoring-gpu-usage)
- [Additional Details](#additional-details)
## Installation
Before you start, ensure you have PyTorch and Intel Extension for PyTorch installed.
To install xpu_text_classifier:
1. Clone the transformers_xpu repository from GitHub:
```bash
git clone https://github.com/rahulunair/transformers_xpu.git
cd transformers_xpu
```
2. Install the package:
```bash
python setup.py install
```
3. Install the required dependencies:
```bash
pip install datasets scikit-learn
```
4. Optionally, install Weights & Biases to monitor your training process:
```bash
pip install wandb
```
## Preparing Your Dataset
The dataset should be in a format that Hugging Face's `load_dataset` function can read, such as CSV or JSON. It should have two columns: `text` and `label`.
For multi-class classification tasks, each label is a single integer. For multi-label classification tasks, each label is a list of integers.
Multi-Class Classification Example:
| text | label |
|-----------------|-------|
| This is text 1 | 0 |
| This is text 2 | 2 |
| This is text 3 | 1 |
Multi-Label Classification Example:
| text | label |
|-----------------|--------------|
| This is text 1 | [0, 1] |
| This is text 2 | [1, 2] |
| This is text 3 | [0, 2] |
After preparing your dataset, save it as JSON or CSV in a directory. The name of this directory is the value you pass as the `dataset_name` parameter of `TextClassifier`.
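As a minimal sketch of the layout described above, the following writes a multi-class dataset as CSV and a multi-label dataset as JSON Lines using only the standard library. The directory names `my_dataset` and `my_multilabel_dataset` and the file names `train.csv` and `train.jsonl` are assumptions for illustration; check `classifier.py` for the file names it actually expects.

```python
import csv
import json
from pathlib import Path

# Hypothetical output directories; the directory is what you pass as dataset_name.
multi_class_dir = Path("my_dataset")
multi_label_dir = Path("my_multilabel_dataset")
multi_class_dir.mkdir(exist_ok=True)
multi_label_dir.mkdir(exist_ok=True)

# Multi-class: one integer label per row.
multi_class_rows = [
    {"text": "This is text 1", "label": 0},
    {"text": "This is text 2", "label": 2},
    {"text": "This is text 3", "label": 1},
]
with open(multi_class_dir / "train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(multi_class_rows)

# Multi-label: a list of integers per row. JSON Lines keeps the list as a
# real list instead of a string, which CSV cannot do.
multi_label_rows = [
    {"text": "This is text 1", "label": [0, 1]},
    {"text": "This is text 2", "label": [1, 2]},
    {"text": "This is text 3", "label": [0, 2]},
]
with open(multi_label_dir / "train.jsonl", "w", encoding="utf-8") as f:
    for row in multi_label_rows:
        f.write(json.dumps(row) + "\n")
```

JSON is used for the multi-label case because a CSV cell would store `[0, 1]` as text that has to be parsed again after loading.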
## Usage
The script `custom_finetune.py` in the root directory is your entry point for training a model. By default, it uses the `distilbert-base-uncased` model and a dataset with 30 labels built from a Gutenberg e-book.
You can either edit `custom_finetune.py` or create a new Python file that follows these steps.
Import `TextClassifier` from the `classifier` module:
```python
import torch
import intel_extension_for_pytorch
from classifier import TextClassifier
```
Instantiate the classifier:
```python
classifier = TextClassifier(
    model_name="distilbert-base-uncased",
    dataset_name="path/to/your/dataset_directory",  # directory where you saved your dataset
    num_labels=2,
    task_type="multi_class",
)
```
Start Training:
```python
classifier.train(epochs=10, batch=16, use_bf16=False)
```
The model name and the number of labels (classes) are set when you create the `TextClassifier`. The number of epochs, the batch size, and whether to use BF16 precision are passed to `train`, as shown in `custom_finetune.py`.
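If you would rather set these values from the command line than edit the script, a thin `argparse` wrapper is one option. The sketch below is not part of the repository: the flag names and the helper functions `parse_args`, `split_kwargs`, and `main` are hypothetical, and only the keyword arguments shown in the snippets above are assumed to exist.

```python
import argparse


def parse_args(argv=None):
    """Parse illustrative command-line flags (names are not from the repo)."""
    parser = argparse.ArgumentParser(description="Fine-tune a text classifier")
    parser.add_argument("--model-name", default="distilbert-base-uncased")
    parser.add_argument("--dataset-name", required=True)
    parser.add_argument("--num-labels", type=int, default=2)
    parser.add_argument("--task-type", default="multi_class")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("--use-bf16", action="store_true")
    return parser.parse_args(argv)


def split_kwargs(args):
    """Separate constructor arguments from train() arguments."""
    init_kwargs = {
        "model_name": args.model_name,
        "dataset_name": args.dataset_name,
        "num_labels": args.num_labels,
        "task_type": args.task_type,
    }
    train_kwargs = {
        "epochs": args.epochs,
        "batch": args.batch,
        "use_bf16": args.use_bf16,
    }
    return init_kwargs, train_kwargs


def main(argv=None):
    # Imported lazily so argument parsing works without the XPU stack installed.
    import torch  # noqa: F401
    import intel_extension_for_pytorch  # noqa: F401  (same imports as the snippet above)
    from classifier import TextClassifier

    init_kwargs, train_kwargs = split_kwargs(parse_args(argv))
    classifier = TextClassifier(**init_kwargs)
    classifier.train(**train_kwargs)
```

Calling `main()` at the bottom of a file in the repository root would then start a run with, for example, `--dataset-name my_dataset --num-labels 3 --use-bf16`.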
To train on a single GPU:
```bash
python custom_finetune.py
```
To train using all available GPUs:
```bash
export MASTER_ADDR=127.0.0.1
# Adjust this path to the Python environment where oneccl_bindings_for_pytorch is installed
source /home/orange/pytorch_xpu_orange/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/env/setvars.sh
mpirun -n 4 python custom_finetune.py
```
Replace `4` with the number of GPUs available on your system.
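When a script is launched with `mpirun`, each process needs to know its rank and the total process count before PyTorch distributed can initialize. Whether `custom_finetune.py` or the Hugging Face Trainer already handles this is not stated here. The sketch below shows a pattern commonly used with Intel MPI, which exposes `PMI_RANK` and `PMI_SIZE` to each process. The helper name `mpi_to_torch_env` and the default port `29500` are assumptions.

```python
import os


def mpi_to_torch_env(env=None):
    """Copy Intel MPI rank/size variables into the names torch.distributed reads.

    Hypothetical helper for illustration. Falls back to a single-process
    setup (rank 0, world size 1) when the MPI variables are absent.
    """
    env = os.environ if env is None else env
    env["RANK"] = str(env.get("PMI_RANK", 0))
    env["WORLD_SIZE"] = str(env.get("PMI_SIZE", 1))
    # Keep values that were already exported, such as MASTER_ADDR above.
    env.setdefault("MASTER_ADDR", "127.0.0.1")
    env.setdefault("MASTER_PORT", "29500")
    return int(env["RANK"]), int(env["WORLD_SIZE"])
```

Passing a plain dictionary instead of `os.environ` makes the mapping easy to check without launching `mpirun`.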
## Monitoring GPU Usage
To monitor the GPU usage:
```bash
xpu-smi dump -m5,18 # VRAM utilization
```
## Additional Details
The `custom_finetune.py` script fetches an e-book from Gutenberg and prepares a dataset for the training task. The dataset is stored in the directory specified by `dataset_name` as a CSV file with two columns: `text` and `label`.
**Please note**: the transformer models expect labels to be integers. If your labels are strings, encode them as integers before passing them to `TextClassifier`:
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels = ['cat', 'mat', 'bat', 'cat', 'bat']
encoded_labels = le.fit_transform(labels)
```
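To turn predicted integers back into the original strings, it helps to keep the mapping around. Here is a minimal sketch in plain Python that mirrors `LabelEncoder`'s behavior of assigning ids in sorted label order; the names `label2id` and `id2label` are chosen for illustration.

```python
labels = ["cat", "mat", "bat", "cat", "bat"]

# Assign ids in sorted order, matching how LabelEncoder orders its classes.
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
id2label = {idx: label for label, idx in label2id.items()}

encoded_labels = [label2id[label] for label in labels]
print(encoded_labels)  # [1, 2, 0, 1, 0]

# Decode integer predictions back into label strings.
predictions = [0, 2]
print([id2label[idx] for idx in predictions])  # ['bat', 'mat']
```

With scikit-learn, `le.inverse_transform` performs the same decoding step.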
For more details on `TextClassifier`, refer to `classifier.py`.
Remember to check the script and adjust the parameters (model type, dataset, epochs, batch size, etc.) according to your needs.
Happy fine-tuning!