{"id":26848976,"url":"https://github.com/crodas/languagedetector","last_synced_at":"2025-05-16T05:04:18.347Z","repository":{"id":7747574,"uuid":"9114871","full_name":"crodas/LanguageDetector","owner":"crodas","description":"PHP Class to detect languages from any free text","archived":false,"fork":false,"pushed_at":"2024-01-08T15:17:16.000Z","size":11017,"stargazers_count":320,"open_issues_count":7,"forks_count":67,"subscribers_count":32,"default_branch":"master","last_synced_at":"2025-05-08T12:27:29.559Z","etag":null,"topics":["detect-languages","languagedetector","paper","php","textrank"],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crodas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2013-03-30T11:27:31.000Z","updated_at":"2024-05-21T18:17:00.000Z","dependencies_parsed_at":"2024-01-15T09:03:02.651Z","dependency_job_id":"1c853e92-0d7d-4d45-9c68-2e39271d036a","html_url":"https://github.com/crodas/LanguageDetector","commit_stats":{"total_commits":50,"total_committers":8,"mean_commits":6.25,"dds":"0.31999999999999995","last_synced_commit":"55590f58ced87c3f7b564edd2ba3dfd4390b6bcf"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crodas%2FLanguageDetector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crodas%2FLanguageDetector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crodas%2FLanguageDetector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/crodas%2FLanguageDetector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crodas","download_url":"https://codeload.github.com/crodas/LanguageDetector/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254471061,"owners_count":22076585,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["detect-languages","languagedetector","paper","php","textrank"],"created_at":"2025-03-30T21:24:00.074Z","updated_at":"2025-05-16T05:04:18.324Z","avatar_url":"https://github.com/crodas.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"LanguageDetector [![Build Status](https://travis-ci.org/crodas/LanguageDetector.png)](https://travis-ci.org/crodas/LanguageDetector) [![Flattr this git repo](http://api.flattr.com/button/flattr-badge-large.png)](https://flattr.com/submit/auto?user_id=crodas\u0026url=https://github.com/crodas/LanguageDetector\u0026title=Language%20Detector%20Library\u0026language=en\u0026tags=github\u0026category=software)\n================\n\nPHP Class to detect languages from any free text.\n\nIt follows the approach described in the [paper](http://scholar.google.com.py/scholar?q=N-Gram-Based+Text+Categorization): a given text is tokenized into [N-Grams](http://en.wikipedia.org/wiki/N-gram) (we clean up whitespace before doing this step). 
Then we sort the `tokens` and compare them against a language `model`.\n\nHow it works\n------------\n\nThe first thing we need is a `language model` (which looks like [this file](https://github.com/crodas/LanguageDetector/blob/master/example/datafile.php)) that texts are compared against at classification time. This process must be done *before* anything else, and the model can be generated with a script similar to [this file](https://github.com/crodas/LanguageDetector/blob/master/example/learn.php).\n\n```php\n// register the autoloader\nrequire 'lib/LanguageDetector/autoload.php';\n\n// it could use a little bit of memory, but it's fine\n// because this process runs once.\nini_set('memory_limit', '1G');\n\n// we load the configuration (which will be serialized\n// later into our language model file)\n$config = new LanguageDetector\\Config;\n\n$c = new LanguageDetector\\Learn($config);\nforeach (glob(__DIR__ . '/samples/*') as $file) { \n    // feed with examples ('language', 'text');\n    $c-\u003eaddSample(basename($file), file_get_contents($file));\n}\n\n// some callback so we know where the process is \n$c-\u003eaddStepCallback(function($lang, $status) {\n    echo \"Learning {$lang}: $status\\n\";\n});\n\n// save it in `datafile`. \n// we currently support the `php` serialization but it's trivial\n// to add other formats, just extend `\\LanguageDetector\\Format\\AbstractFormat`. 
 \n// You can check an example at https://github.com/crodas/LanguageDetector/blob/master/lib/LanguageDetector/Format/PHP.php\n$c-\u003esave(AbstractFormat::initFormatByPath('language.php'));\n```\n\nOnce we have our language model file (in this case `language.php`) we're ready to classify texts by their language.\n\n```php\n// register the autoloader\nrequire 'lib/LanguageDetector/autoload.php';\n\n// we load the language model; it creates\n// the $config object for us.\n$detect = LanguageDetector\\Detect::initByPath('language.php');\n\n$lang = $detect-\u003edetect(\"Agricultura (-ae, f.), sensu latissimo, \nest summa omnium artium et scientiarum et technologiarum quae de \nterris colendis et animalibus creandis curant, ut poma, frumenta, \ncharas, carnes, textilia, et aliae res e terra bene producantur. \nSpecialius, agronomia est ars et scientia quae terris colendis student, \nagricultio autem animalibus creandis.\");\n\nvar_dump($lang);\n```\n\nAnd that's it.\n\nAlgorithms\n----------\n\nThe project is designed to work with modules, which means you can provide your own algorithms for `sorting` and `comparing` the N-Grams. By default the library implements [PageRank](http://en.wikipedia.org/wiki/PageRank) as the `sorting` algorithm, and *out of place* (described in the paper) as the `comparing` algorithm. \n\nIn order to supply your own algorithms, you must change the `$config` at the *learning stage* to load your own classes (which, by the way, should implement some interfaces).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrodas%2Flanguagedetector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrodas%2Flanguagedetector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrodas%2Flanguagedetector/lists"}