{"id":17060435,"url":"https://github.com/e-oj/language-classifier","last_synced_at":"2025-08-21T06:41:13.236Z","repository":{"id":71200597,"uuid":"160380269","full_name":"e-oj/Language-Classifier","owner":"e-oj","description":"Language classification using decision trees and adaboost.","archived":false,"fork":false,"pushed_at":"2019-02-15T21:29:22.000Z","size":162,"stargazers_count":8,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-12T18:07:24.792Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/e-oj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-04T15:39:32.000Z","updated_at":"2024-05-07T16:40:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"1130957a-5d27-464c-b65b-63cc891d86e8","html_url":"https://github.com/e-oj/Language-Classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-oj%2FLanguage-Classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-oj%2FLanguage-Classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-oj%2FLanguage-Classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-oj%2FLanguage-Classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/e-oj","download_url":"https://codeload.github.com/
e-oj/Language-Classifier/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248610341,"owners_count":21132921,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T10:43:50.862Z","updated_at":"2025-04-12T18:08:32.745Z","avatar_url":"https://github.com/e-oj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Language Classifier\n### Language classification using decision trees and AdaBoost.\n\nThis classifier distinguishes between two or more languages. It's currently set up to differentiate English from \nDutch, but you can classify other languages by modifying the training data. You can also edit the language features to get \nstronger results for new languages. For extra details, check out this [writeup](https://docs.google.com/document/d/1TWwhFmji458pAycIzHSXn9rB8dsC8AZpyY7Qghsrwew/edit?usp=sharing). \u003cb\u003eThis program has no dependencies. Every algorithm used was implemented from scratch.\u003c/b\u003e\n\n\u003cbr\u003e\n\n### Algorithms\nTwo classification methods are used by this program.\n\n#### Decision Tree:\nA decision tree is built using the training data. Entropy is used as a measure of impurity in a given set, and information gain is used to choose the features on which the data is split. The decision tree can be assigned a maximum depth to restrict its growth. 
[More on decision trees](https://www.geeksforgeeks.org/decision-tree/)\n\n#### AdaBoost:\nA boosted ensemble of decision stumps (decision trees with a depth of 1) is built using the training data. Every instance of data is assigned a weight (forming a distribution), and the AdaBoost algorithm is used to adjust the weight of each instance before the next stump in the ensemble is created. [More on AdaBoost](https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c).\n\n\u003cbr\u003e\n\n### Usage\n#### The program has two entry points in the root directory, accessible via the following commands:\n\n\u003cb\u003epython3 classify.py train\u003c/b\u003e `\u003cexamples\u003e` `\u003chypothesisOut\u003e` `\u003clearning-type\u003e`\n - `\u003cexamples\u003e` is a file containing labeled examples (see Training and Test Data for the format).\n - `\u003chypothesisOut\u003e` specifies the filename to write your model to.\n - `\u003clearning-type\u003e` specifies the type of learning algorithm to run: either \"dt\" (decision tree) or \"ada\" (AdaBoost).\n\n\n\u003cb\u003epython3 classify.py predict\u003c/b\u003e `\u003chypothesis\u003e` `\u003cfile\u003e`\n - `\u003chypothesis\u003e` is a pre-trained model created by the train command.\n - `\u003cfile\u003e` is a file containing test data.\n\n\u003cbr\u003e\n\n### Training and Test Data\nHere's the format for training and test data:\n\n```\n\u003clabel\u003e|\u003cdata\u003e\n```\n\nFor English and Dutch, the labels ```en``` and ```nl``` are used, respectively. Here's an excerpt from the test file:\n\n```\nen|two percussionists and a string quartet) performs six of the leader's originals and, although none\nnl|maakte hij zijn debuut op het hoogste niveau. 
Schena mocht dertien keer meespelen in het\nen|1984, where the then-World Wrestling Federation put him over the likes of Johnny Rodz\nnl|Trophy tekende hij een contract in de I-League bij regerend kampioen Dempo SC voor het\nen|Eventually, five princes came to Taketori no Okina's residence to ask for the beautiful Kaguya-hime's\nnl|Romeinen was en het feit dat er van hem geen Germaanse naam is overgeleverd, is\n```\n\nThe full test data can be found in the `/in/test.dat` file. Training data can be found at `/in/train.dat`.\n\n\u003cbr\u003e\n\n### Pre-trained Models\nTrained models that classify English and Dutch can be found in the `/out` directory.\n\n - `/out/ensemble.oj` contains an AdaBoost ensemble.\n - `/out/tree.oj` contains a decision tree.\n \nEither of these files can be used to run the classification job.\n\n\u003cbr\u003e\n\n### Language Features\nThe strength of a model depends heavily on the strength of the features used in training. While some were generic, \nthe vast majority of the features used were geared towards English and Dutch. These features can be modified in the \n```get_features``` function of ```instance.py```. A more detailed explanation of the features can be found \n[in the writeup](https://docs.google.com/document/d/1TWwhFmji458pAycIzHSXn9rB8dsC8AZpyY7Qghsrwew/edit?usp=sharing).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fe-oj%2Flanguage-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fe-oj%2Flanguage-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fe-oj%2Flanguage-classifier/lists"}