{"id":16314296,"url":"https://github.com/swelcker/cmd.csp.classifier","last_synced_at":"2025-04-27T09:45:31.942Z","repository":{"id":95053983,"uuid":"216594414","full_name":"swelcker/cmd.csp.classifier","owner":"swelcker","description":"Simple implementation of text classifier in Java with built in SVM, C4.5, kNN, and naive Bayesian classifiers. Support for common text preprocessors and for CVS format. You can plugin your own classifier, tokenizer, transformer, stopwords, synonyms, and TF-IDF formula etc. Supports automatic validation and confusion matrix. ","archived":false,"fork":false,"pushed_at":"2019-10-22T07:34:42.000Z","size":51,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-16T16:35:57.999Z","etag":null,"topics":["ai","algorithm","framework","java","machine-learning","machine-learning-algorithms","machine-learning-library","machinelearning","ml-validation","text-classification","text-classifier"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/swelcker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-21T14:53:02.000Z","updated_at":"2019-10-22T07:34:44.000Z","dependencies_parsed_at":"2023-06-11T23:39:07.673Z","dependency_job_id":null,"html_url":"https://github.com/swelcker/cmd.csp.classifier","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/swelcker%2Fcmd.csp.classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/swelcker%2Fcmd.csp.classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/swelcker%2Fcmd.csp.classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/swelcker%2Fcmd.csp.classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/swelcker","download_url":"https://codeload.github.com/swelcker/cmd.csp.classifier/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250955224,"owners_count":21513490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","algorithm","framework","java","machine-learning","machine-learning-algorithms","machine-learning-library","machinelearning","ml-validation","text-classification","text-classifier"],"created_at":"2024-10-10T21:53:37.843Z","updated_at":"2025-04-26T07:39:13.038Z","avatar_url":"https://github.com/swelcker.png","language":"Java","readme":"![csplogo](https://user-images.githubusercontent.com/12301571/67168219-4d618900-f3a2-11e9-9460-b79eff997c35.PNG)\n# cmd.csp.classifier\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/swelcker/cmd.csp.classifier/graphs/commit-activity)\n[![GitHub 
release](https://img.shields.io/github/release/swelcker/cmd.csp.classifier.svg)](https://GitHub.com/swelcker/cmd.csp.classifier/releases/)\n[![GitHub tag](https://img.shields.io/github/tag/swelcker/cmd.csp.classifier.svg)](https://GitHub.com/swelcker/cmd.csp.classifier/tags/)\n[![GitHub commits](https://img.shields.io/github/commits-since/swelcker/cmd.csp.classifier/master.svg)](https://GitHub.com/swelcker/cmd.csp.classifier/commit/)\n[![GitHub contributors](https://img.shields.io/github/contributors/swelcker/cmd.csp.classifier.svg)](https://GitHub.com/swelcker/cmd.csp.classifier/graphs/contributors/)\n\n\nA simple text classifier implementation in Java with built-in SVM, C4.5, kNN, and naive Bayesian classifiers.\nSupports common text preprocessors and the CSV format. You can plug in your own classifier, tokenizer, transformer, stop words, synonyms, TF-IDF formula, etc.\nSupports automatic validation and confusion matrices. Used in the Cognitive Service Platform cmd.csp as part of the classifier features.\n\n### Classifier\n\nClassifiers are used to assign class labels to token streams. The toolkit includes:\n\n- kNN classifier. This classifier searches for the k samples nearest to a token stream\n  and then labels the stream using the labels of those samples. Since it must iterate over all\n  samples, it may be very slow for large datasets.\n- Naive Bayesian classifier. This classifier estimates the probability that the stream\n  belongs to a class, assuming that tokens appear independently of one another.\n- TF-IDF classifier. This classifier calculates the angle between the token TF-IDF\n  vector of the stream and the token TF-IDF vector of the class.\n- SVM (libSVM/liblinear) classifier. This classifier uses a support vector machine, which solves a\n  constrained optimization problem. This is the preferred classifier for text in cmd.csp.\n- C4.5 classifier. This classifier uses decision trees to classify objects.\n\n### Prerequisites\n\nThere are no prerequisites. 
\nIncluded dependencies other than the Java core:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.ibm.icu\u003c/groupId\u003e\n    \u003cartifactId\u003eicu4j\u003c/artifactId\u003e\n    \u003cversion\u003e64.2\u003c/version\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003ede.bwaldvogel\u003c/groupId\u003e\n    \u003cartifactId\u003eliblinear\u003c/artifactId\u003e\n    \u003cversion\u003e2.30\u003c/version\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003ecmd.csp\u003c/groupId\u003e\n    \u003cartifactId\u003ecspstemmer\u003c/artifactId\u003e\n    \u003cversion\u003e1.0.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Installing/Usage\n\n1. To use the library, merge the following into your Maven POM (or the equivalent into your Gradle build script):\n\n```xml\n\u003crepository\u003e\n  \u003cid\u003egithub\u003c/id\u003e\n  \u003cname\u003eGitHub swelcker Apache Maven Packages\u003c/name\u003e\n  \u003curl\u003ehttps://maven.pkg.github.com/swelcker\u003c/url\u003e\n\u003c/repository\u003e\n\n\u003cdependency\u003e\n  \u003cgroupId\u003ecmd.csp\u003c/groupId\u003e\n  \u003cartifactId\u003ecspclassifier\u003c/artifactId\u003e\n  \u003cversion\u003e1.0.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n2. Text needs to be tokenized before it is used for training or classification. The toolkit includes:\n- A locale-aware method based on the Java class `java.text.BreakIterator`, recommended if\n  the locale of the text is supported by Java.\n- Splitting with a regular expression that matches separators; the other parts can be kept or\n  discarded, as you prefer.\n- Splitting with a regular expression that matches tokens; the other parts can be kept or\n  discarded, as you prefer.\n\n3. Filters transform one token stream into another. The toolkit includes:\n   - Normalize text, e.g.\n       - Normalize Unicode\n       - Apply transformations provided by icu4j\n       - Upcase\n       - Downcase\n       - Fold case. 
Since there is no one-to-one correspondence between lowercase letters\n         and uppercase letters in many languages, case folding should be used to ignore case.\n       - Stemming, i.e. converting words into their root form. Stemming algorithms from Snowball\n         are included for: Arabic, Danish, Dutch, English, Finnish, French, German,\n         Hungarian, Indonesian, Irish, Italian, Nepali, Norwegian, Portuguese,\n         Romanian, Spanish, Russian, Swedish, Tamil, Turkish\n       - Text replacement based on regular expressions (backreferences are allowed)\n       - User-defined mapping\n   - Remove some tokens from the stream, e.g.\n       - Remove tokens that are whitespace\n       - Remove stop words\n       - Keep only protected words\n       - Remove tokens that match a regular expression\n       - Remove tokens that do not match a regular expression\n   - Map a token into zero or more tokens, e.g.\n       - Insert synonyms\n       - User-defined mapping\n   - Convert the stream into a stream of n-grams formed from the original stream\n\nThen, `import cspclassifier.*;` in your application:\n\n```java\n// Example\nimport cspclassifier.*;\nimport java.io.*;\nimport java.util.*;\n...\nprotected ClassifierFactory classifierFactory;\nprotected Trainable\u003cString\u003e model;\nprotected Classifier\u003cString\u003e classifier;\n\nprotected Category cat = null;\nprotected Map\u003cString, Category\u003e catList = new HashMap\u003cString, Category\u003e();\nprotected Locale locl = Locale.getDefault();\n...\n\nclassifierFactory = Starter.getDefaultClassifierFactory(locl);\nmodel = classifierFactory.createModel();\n\n...\n\n// create the categories and optionally store them in a map so you can reuse them\ncat = new Category(strCategory);\ncatList.put(strCategory, cat);\n\n...\n\n// train the model\nclassifier = classifierFactory.getClassifier(model);\n\n...\n// finally classify a new text to get results\nList\u003cClassificationResult\u003e res = 
classifier.getCandidates(currentText, maxCategories);\n\n\n```\n\n## Built With\n\n* [Maven](https://maven.apache.org/) - Dependency Management\n\n\n## Contributing\n\nPlease read [CONTRIBUTING.md](https://gist.github.com/PurpleBooth/b24679402957c63ec426) for details on our code of conduct, and the process for submitting pull requests to us.\n\n## Versioning\n\nWe use [SemVer](http://semver.org/) for versioning. For the versions available, see the [tags on this repository](https://github.com/swelcker/cmd.csp.classifier/tags). \n\n## Authors\n\n* **Stefan Welcker** - *Modifications based on chungkwong/text-classifier-collection* \n\nSee also the list of [contributors](https://github.com/swelcker/cmd.csp.classifier/contributors) who participated in this project.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswelcker%2Fcmd.csp.classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fswelcker%2Fcmd.csp.classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fswelcker%2Fcmd.csp.classifier/lists"}