{"id":20229294,"url":"https://github.com/rahulunair/xpu_text_classifier","last_synced_at":"2025-06-12T23:34:53.645Z","repository":{"id":177119314,"uuid":"657356097","full_name":"rahulunair/xpu_text_classifier","owner":"rahulunair","description":"Compare different text classifiers on Intel discrete GPUs","archived":false,"fork":false,"pushed_at":"2023-06-29T16:39:16.000Z","size":33,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-19T21:48:25.232Z","etag":null,"topics":["alchemist","bart","bert-model","intel-arc","intel-gpu","intel-gpu-max"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rahulunair.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-22T22:14:23.000Z","updated_at":"2024-08-08T13:50:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"8552a70b-a9f4-4b41-8f82-8132bb7ff697","html_url":"https://github.com/rahulunair/xpu_text_classifier","commit_stats":null,"previous_names":["rahulunair/xpu_text_classifier"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulunair%2Fxpu_text_classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulunair%2Fxpu_text_classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulunair%2Fxpu_text_classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/rahulunair%2Fxpu_text_classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rahulunair","download_url":"https://codeload.github.com/rahulunair/xpu_text_classifier/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241670724,"owners_count":20000456,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alchemist","bart","bert-model","intel-arc","intel-gpu","intel-gpu-max"],"created_at":"2024-11-14T07:35:10.639Z","updated_at":"2025-03-03T13:23:28.929Z","avatar_url":"https://github.com/rahulunair.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# xpu_text_classifier: Custom Text Classification on Intel dGPUs\n\nxpu_text_classifier lets you fine-tune transformer models on custom datasets for multi-class or multi-label classification tasks. Supported models include popular transformer architectures such as BERT, BART, and DistilBERT. It uses the Hugging Face Trainer to handle training and Intel Extension for PyTorch to run on Intel dGPUs.\n\n## Table of Contents\n- [Installation](#installation)\n- [Preparing Your Dataset](#preparing-your-dataset)\n- [Usage](#usage)\n- [Monitoring GPU Usage](#monitoring-gpu-usage)\n- [Additional Details](#additional-details)\n\n## Installation\n\nBefore you start, ensure you have PyTorch and Intel Extension for PyTorch installed.\n\nTo install xpu_text_classifier:\n\n1. 
Clone the transformers_xpu repository from GitHub:\n\n    ```bash\n    git clone https://github.com/rahulunair/transformers_xpu.git\n    cd transformers_xpu\n    ```\n\n2. Install the package:\n\n    ```bash\n    python setup.py install\n    ```\n\n3. Install the required dependencies:\n\n    ```bash\n    pip install datasets scikit-learn\n    ```\n\n4. Optionally, install Weights \u0026 Biases to monitor your training process:\n\n    ```bash\n    pip install wandb\n    ```\n\n## Preparing Your Dataset\n\nThe dataset should be in a format compatible with Hugging Face's `load_dataset` function, which accepts CSV, JSON, and several other formats. It should have two columns: `text` and `label`.\nFor multi-class classification tasks, each label is a single integer. For multi-label classification tasks, each label is a list of integers.\n\nMulti-Class Classification Example:\n\n| text            | label |\n|-----------------|-------|\n| This is text 1  | 0     |\n| This is text 2  | 2     |\n| This is text 3  | 1     |\n\nMulti-Label Classification Example:\n\n| text            | label        |\n|-----------------|--------------|\n| This is text 1  | [0, 1]       |\n| This is text 2  | [1, 2]       |\n| This is text 3  | [0, 2]       |\n\nAfter preparing your dataset, save it as a JSON or CSV file in a directory. Pass the path to this directory as the `dataset_name` parameter when creating the `TextClassifier`.\n\n## Usage\n\nThe script `custom_finetune.py` in the root directory is your entry point for training a model. 
By default, it uses the `distilbert-base-uncased` model and a Gutenberg dataset with 30 labels.\n\nYou can either edit `custom_finetune.py` or create a new Python file with the following details.\n\nImport `TextClassifier` from the `classifier` module:\n\n```python\nimport torch\nimport intel_extension_for_pytorch  # enables the xpu device in PyTorch\n\nfrom classifier import TextClassifier\n```\n\nInstantiate the classifier:\n\n```python\nclassifier = TextClassifier(\n    model_name=\"distilbert-base-uncased\",\n    dataset_name=\"path/to/your/dataset_directory\",  # use the name of the directory where you saved your dataset\n    num_labels=2,\n    task_type=\"multi_class\",\n)\n```\n\nStart training:\n\n```python\nclassifier.train(epochs=10, batch=16, use_bf16=False)\n```\n\nThe constructor takes the model name and the number of labels (classes); `train` takes the number of epochs, the batch size, and whether to use BF16 precision, as shown in `custom_finetune.py`.\n\nTo train on a single GPU:\n\n```bash\npython custom_finetune.py\n```\n\nTo train using all available GPUs:\n\n```bash\nexport MASTER_ADDR=127.0.0.1\nsource /home/orange/pytorch_xpu_orange/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/env/setvars.sh\nmpirun -n 4 python custom_finetune.py\n```\n\nReplace 4 with the number of GPUs available in your system, and adjust the `setvars.sh` path to match where `oneccl_bindings_for_pytorch` is installed in your environment.\n\n## Monitoring GPU Usage\n\nTo monitor the GPU usage:\n\n```bash\nxpu-smi dump -m5,18  # VRAM utilization\n```\n\n## Additional Details\n\nThe `custom_finetune.py` script fetches an e-book from Gutenberg and prepares a dataset for the training task. The dataset is stored in the directory specified by `dataset_name` as a CSV file with two columns: `text` and `label`.\n\n**Please note** that transformer models expect labels to be integers. 
If your labels are strings, encode them as integers before passing them to the `TextClassifier`:\n\n```python\nfrom sklearn.preprocessing import LabelEncoder\n\nle = LabelEncoder()\nlabels = ['cat', 'mat', 'bat', 'cat', 'bat']\n# Classes are sorted alphabetically: bat -\u003e 0, cat -\u003e 1, mat -\u003e 2\nencoded_labels = le.fit_transform(labels)  # array([1, 2, 0, 1, 0])\n```\n\nFor more details on the `TextClassifier`, refer to `classifier.py`.\n\nRemember to check the script and adjust the parameters (model type, dataset, epochs, batch size, etc.) according to your needs.\n\nHappy fine-tuning!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahulunair%2Fxpu_text_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frahulunair%2Fxpu_text_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahulunair%2Fxpu_text_classifier/lists"}