{"id":21069525,"url":"https://github.com/wolny/complement-naive-bayes","last_synced_at":"2026-01-05T20:05:00.537Z","repository":{"id":24780880,"uuid":"28194429","full_name":"wolny/complement-naive-bayes","owner":"wolny","description":"Implementation of Complement Naive Bayes text classifier used for automatic categorisation of DaWanda products","archived":false,"fork":false,"pushed_at":"2017-08-20T11:46:43.000Z","size":113,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-16T16:08:42.816Z","etag":null,"topics":["complement-navie-bayes","document-classification","machine-learning","naive-bayes-classifier"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wolny.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-12-18T17:44:58.000Z","updated_at":"2023-11-07T12:38:48.000Z","dependencies_parsed_at":"2022-08-22T15:10:58.962Z","dependency_job_id":null,"html_url":"https://github.com/wolny/complement-naive-bayes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolny%2Fcomplement-naive-bayes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolny%2Fcomplement-naive-bayes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolny%2Fcomplement-naive-bayes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wolny%2Fcomplement-naive-bayes/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/wolny","download_url":"https://codeload.github.com/wolny/complement-naive-bayes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245388761,"owners_count":20607163,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["complement-navie-bayes","document-classification","machine-learning","naive-bayes-classifier"],"created_at":"2024-11-19T18:35:58.701Z","updated_at":"2026-01-05T20:05:00.483Z","avatar_url":"https://github.com/wolny.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"complement-naive-bayes\n======================\n\nImplementation of Complement Naive Bayes text classifier used for automatic categorisation of product listings on eCommerce sites.\nComplement Naive Bayes was chosen over classic Naive Bayes because the distribution of products among\ncategories tends to be _skewed_ (more products in one category than another), which causes classic Naive Bayes to\nprefer categories that had more products during the training phase. 
Complement Naive Bayes performs much better\non skewed training data.\n\n## Usage\n*complement-naive-bayes* can be used as a library that exposes an API for training and labeling new products,\nor as a standalone command-line application.\n### Command line interface\nTo use *complement-naive-bayes* from the command line:\n* clone the repo:\n```\ngit clone https://github.com/wolny/complement-naive-bayes.git\n```\n* go to the project directory and build the executable jar:\n```\ncd complement-naive-bayes\n./gradlew jar\n```\n* invoke **java -jar complement-naive-bayes-{version}.jar** to see the options:\n```\nThe following option is required: -c, --command\nUsage: \u003cmain class\u003e [options]\n  Options:\n  * -c, --command\n       Command for the classifier, can be 'train' for training, 'label' for\n       label assignment, or 'validate' for validating the classifier accuracy\n    -m, --multithreaded\n       Use multi-threaded model (true/false)\n       Default: false\n    -o, --outputModel\n       Output file for the model. 
Option valid only for training.\n       Default: ~/.cbayes/model.json\n    -te, --testDir\n       Input directory containing product files for labeling/validation\n       Default: ~/.cbayes/test\n    -tr, --trainDir\n       Input directory containing product files for training\n       Default: ~/.cbayes/train\n```\n* put your JSON training product files in _trainDir_ and train your model:\n```\njava -jar complement-naive-bayes-{version}.jar -c train --trainDir trainDir\n```\n* put your JSON test product files in _testDir_ and validate your model:\n```\njava -jar complement-naive-bayes-{version}.jar -c validate --testDir testDir\n```\n* put the JSON product files that you want to label in _testDir_ and label your products:\n```\njava -jar complement-naive-bayes-{version}.jar -c label --testDir testDir\n```\n\nImportant note: because all products need to be loaded into memory for training, make sure to run the app with a sufficiently large heap size (`-Xmx\u003cmemory\u003e`).\n#### JSON product files\n*trainDir*/*testDir* must contain product files in JSON format. 
Each file must contain a list of products with\nthe following JSON structure:\n```\n[\n    {\n        \"id\": 1,\n        \"sellerId\": 12,\n        \"category\": 123,\n        \"title\": \"test title1\",\n        \"description\": \"test description1\"\n    },\n    {\n        \"id\": 2,\n        \"sellerId\": 23,\n        \"category\": 123,\n        \"title\": \"test title2\",\n        \"description\": \"test description2\"\n    }\n]\n```\n* For training, the _categoryId, title, description, sellerId_ attributes are required; _sellerId_ is used to filter\nproducts of the same seller from a given category in order to avoid _Seller Bias_.\n* For testing, only the _title, description_ attributes are required.\n* For now only English is supported, but adding support for another language only requires\ncreating a _Tokenizer_ for that language and training the model using this _Tokenizer_.\n\n### API\nThe following snippet shows how to use an already trained model to label a sample product:\n```java\n// read Naive Bayes model from JSON file\nString pathToModel = \"./model.json\";\nNaiveBayesModel model = NaiveBayesSerializer.readFrom(pathToModel);\n\n// create Complement Naive Bayes classifier\nDocumentClassifier classifier = new WeightNormalizedComplementNaiveBayes(model);\n\n// get the title and description of the product which is to be labeled\nString title = \"...\";\nString description = \"...\";\nString text = title + \" \" + description;\n\n// extract features, MAKE SURE THE SAME EXTRACTOR WAS USED DURING TRAINING PHASE\nDocument document = Extractors.STANDARD_EXTRACTOR.extractFeatureVector(text);\n\n// label document\nLabelingResult labelingResult = classifier.label(document);\n\n// get categories ordered by score\nList\u003cLabelingResult.ScoredCategory\u003e categories = labelingResult.getOrderedCategories();\n\n// print 3 best category suggestions according to the 
model\nSystem.out.println(Lists.newArrayList(Iterables.limit(categories, 3)));\n```\n... or use [the following Play application](https://bitbucket.org/lfundaro/classifier-services)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwolny%2Fcomplement-naive-bayes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwolny%2Fcomplement-naive-bayes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwolny%2Fcomplement-naive-bayes/lists"}