Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/r00tens/text-classifier
Naive Bayes classifier for text classification with CPU and GPU (CUDA)
https://github.com/r00tens/text-classifier
classification classifier cpp cuda machine-learning naive-bayes
Last synced: 9 days ago
JSON representation
Naive Bayes classifier for text classification with CPU and GPU (CUDA)
- Host: GitHub
- URL: https://github.com/r00tens/text-classifier
- Owner: r00tens
- License: mit
- Created: 2024-08-24T18:51:47.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-09-16T16:11:13.000Z (5 months ago)
- Last Synced: 2024-12-10T22:41:21.211Z (2 months ago)
- Topics: classification, classifier, cpp, cuda, machine-learning, naive-bayes
- Language: C++
- Homepage:
- Size: 71.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Text Classifier
## Project Overview
This project implements a text classifier that categorizes text into predefined categories (classes) using the Naive Bayes algorithm. It supports model training by constructing a dictionary (vocabulary) and feature vectors based on a labeled training dataset. The classifier evaluates the performance of the trained model, including metrics such as accuracy, precision, recall, and F1 score using a test dataset. The implementation runs on both the CPU (C++) and NVIDIA GPU (CUDA).
## Features
- Multinomial Naive Bayes Algorithm: A probabilistic algorithm used for text classification.
- Dictionary and Feature Vector Construction: Builds a dictionary from the training data and converts text into sparse feature vectors, where each sparse vector contains the number of occurrences of features (e.g., words) in the text.
- Training and evaluation: allows training the model on an annotated dataset and evaluates its performance on a test set using standard metrics such as accuracy, precision, recall, and F1 score.
- CPU and GPU Support: The implementation can run on both CPU (C++) and GPU (CUDA), utilizing NVIDIA GPUs for accelerated computations.
- Benchmarking: allows (primitive) comparison of CPU and GPU implementation performance on the same dataset.## Prerequisites
- CMake ``3.29``
- CUDA ``12.6``
- Visual Studio ``17 2022`` or Ninja ``1.12.0``## Installation
### Clone the Repository
```
git clone https://github.com/r00tens/text-classifier.git
cd text-classifier
```### Build the Project:
#### Visual Studio 17 2022
```
mkdir build
cd build
cmake ..
cmake --build . --config Release
```#### Ninja
```
mkdir build
cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE Release ..
cmake --build .
```> [!NOTE]
> The project has not yet been tested on any Linux distributions.## Prepare the dataset
### The dataset should be in CSV format, where
- the first row contains column names (e.g., label, text)
- each row corresponds to a single text entry
- columns in a row are separated by a comma#### Example of the dataset format
```
label,text
0,This is a sample text
1,This is another sample text
```## License
This project is licensed under the MIT License - for details, see the [LICENSE](LICENSE) file.