{"id":17632685,"url":"https://github.com/r00tens/text-classifier","last_synced_at":"2026-05-06T10:33:00.316Z","repository":{"id":255595736,"uuid":"847068175","full_name":"r00tens/text-classifier","owner":"r00tens","description":"Naive Bayes classifier for text classification with CPU and GPU (CUDA)","archived":false,"fork":false,"pushed_at":"2024-09-16T16:11:13.000Z","size":75412,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-05T05:44:15.988Z","etag":null,"topics":["classification","classifier","cpp","cuda","machine-learning","naive-bayes"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/r00tens.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-24T18:51:47.000Z","updated_at":"2024-10-10T21:14:07.000Z","dependencies_parsed_at":"2024-10-23T10:17:49.503Z","dependency_job_id":null,"html_url":"https://github.com/r00tens/text-classifier","commit_stats":null,"previous_names":["r00tens/email-classifier"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/r00tens%2Ftext-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/r00tens%2Ftext-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/r00tens%2Ftext-classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/r00tens%2Ftext-classifier/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/r00tens","download_url":"https://codeload.github.com/r00tens/text-classifier/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246273533,"owners_count":20750904,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","classifier","cpp","cuda","machine-learning","naive-bayes"],"created_at":"2024-10-23T01:45:05.913Z","updated_at":"2026-05-06T10:32:55.287Z","avatar_url":"https://github.com/r00tens.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Classifier\n\n## Project Overview\n\nThis project implements a text classifier that categorizes text into predefined categories (classes) using the Naive Bayes algorithm. It supports model training by constructing a dictionary (vocabulary) and feature vectors based on a labeled training dataset. The classifier evaluates the performance of the trained model, including metrics such as accuracy, precision, recall, and F1 score using a test dataset. 
The implementation runs on both the CPU (C++) and NVIDIA GPUs (CUDA).\n\n## Features\n\n- Multinomial Naive Bayes Algorithm: A probabilistic algorithm used for text classification.\n- Dictionary and Feature Vector Construction: Builds a dictionary from the training data and converts each text into a sparse feature vector that stores the number of occurrences of each feature (e.g., word) in the text.\n- Training and Evaluation: Trains the model on a labeled dataset and evaluates its performance on a test set using standard metrics such as accuracy, precision, recall, and F1 score.\n- CPU and GPU Support: The implementation can run on both the CPU (C++) and the GPU (CUDA), using NVIDIA GPUs for accelerated computation.\n- Benchmarking: Allows a (primitive) comparison of CPU and GPU implementation performance on the same dataset.\n\n## Prerequisites\n\n- CMake ``3.29``\n- CUDA ``12.6``\n- Visual Studio ``17 2022`` or Ninja ``1.12.0``\n\n## Installation\n\n### Clone the Repository\n\n```\ngit clone https://github.com/r00tens/text-classifier.git\ncd text-classifier\n```\n\n### Build the Project\n\n#### Visual Studio 17 2022\n\n```\nmkdir build\ncd build\ncmake ..\ncmake --build . 
--config Release\n```\n\n#### Ninja\n\n```\nmkdir build\ncd build\ncmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..\ncmake --build .\n```\n\n\u003e [!NOTE]  \n\u003e The project has not yet been tested on any Linux distribution.\n\n## Prepare the dataset\n\n### The dataset should be in CSV format, where\n\n- the first row contains column names (e.g., label, text)\n- each subsequent row corresponds to a single text entry\n- columns in a row are separated by a comma\n\n#### Example of the dataset format\n\n```\nlabel,text\n0,This is a sample text\n1,This is another sample text\n```\n\n## License\n\nThis project is licensed under the MIT License - for details, see the [LICENSE](LICENSE) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fr00tens%2Ftext-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fr00tens%2Ftext-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fr00tens%2Ftext-classifier/lists"}