https://github.com/crodas/languagedetector
PHP Class to detect languages from any free text
- Host: GitHub
- URL: https://github.com/crodas/languagedetector
- Owner: crodas
- Created: 2013-03-30T11:27:31.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2024-01-08T15:17:16.000Z (about 2 years ago)
- Last Synced: 2025-05-08T12:27:29.559Z (9 months ago)
- Topics: detect-languages, languagedetector, paper, php, textrank
- Language: PHP
- Size: 10.5 MB
- Stars: 320
- Watchers: 32
- Forks: 67
- Open Issues: 7
Metadata Files:
- Readme: README.md
README
LanguageDetector
================
PHP Class to detect languages from any free text.
It follows the approach described in the [paper](http://scholar.google.com.py/scholar?q=N-Gram-Based+Text+Categorization) on N-gram-based text categorization: a given text is tokenized into [N-Grams](http://en.wikipedia.org/wiki/N-gram) (whitespace is cleaned up before this step), the resulting `tokens` are sorted, and the sorted list is compared against a language `model`.
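To make the tokenization step concrete, here is a minimal sketch of building a ranked character N-gram profile. It is not the library's code: `ngramProfile()` and its parameters are hypothetical names, and it ranks N-grams by plain frequency (as in the paper) rather than with the library's default sorting algorithm.

```php
<?php
/**
 * Hypothetical helper, NOT part of the LanguageDetector API.
 * Returns the N-grams of $text ordered from most to least frequent.
 */
function ngramProfile(string $text, int $minN = 1, int $maxN = 3, int $limit = 300): array
{
    // Clean up whitespace: collapse runs into a single space and trim the ends.
    $clean = preg_replace('/\s+/u', ' ', $text);
    $text  = trim($clean === null ? $text : $clean);

    // Split into UTF-8 characters so multibyte letters stay intact.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        $chars = str_split($text); // fallback for invalid UTF-8 input
    }
    $total = count($chars);

    // Count every N-gram of length $minN..$maxN.
    $counts = [];
    for ($n = $minN; $n <= $maxN; $n++) {
        for ($i = 0; $i + $n <= $total; $i++) {
            $gram = implode('', array_slice($chars, $i, $n));
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }

    // Most frequent first; the order of equally frequent N-grams is not specified here.
    arsort($counts);

    // PHP turns numeric string keys into integers, so cast back to strings.
    return array_slice(array_map('strval', array_keys($counts)), 0, $limit);
}
```

For instance, `ngramProfile("aaa b", 1, 2)` starts with `'a'` (three occurrences) followed by `'aa'` (two occurrences); the remaining N-grams each occur once.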
How it works
------------
The first thing we need is a `language model` (which looks like [this file](https://github.com/crodas/LanguageDetector/blob/master/example/datafile.php)), against which texts are compared at classification time. The model must be generated *before* anything else, with a script similar to [this file](https://github.com/crodas/LanguageDetector/blob/master/example/learn.php).
```php
// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// it could use a little bit of memory, but it's fine
// because this process runs once.
ini_set('memory_limit', '1G');

// we load the configuration (which will be serialized
// later into our language model file)
$config = new LanguageDetector\Config;

$c = new LanguageDetector\Learn($config);
foreach (glob(__DIR__ . '/samples/*') as $file) {
    // feed with examples ('language', 'text');
    // here the file name is used as the language label
    $c->addSample(basename($file), file_get_contents($file));
}

// some callback so we know where the process is
$c->addStepCallback(function($lang, $status) {
    echo "Learning {$lang}: $status\n";
});

// save it in the data file (`language.php`).
// we currently support the `php` serialization but it's trivial
// to add other formats, just extend `\LanguageDetector\Format\AbstractFormat`.
// You can check an example at https://github.com/crodas/LanguageDetector/blob/master/lib/LanguageDetector/Format/PHP.php
// NOTE: `AbstractFormat` is unqualified here, so it needs a matching `use`
// import (or a fully qualified name) for its namespace in the library.
$c->save(AbstractFormat::initFormatByPath('language.php'));
```
Once we have our language model file (in this case `language.php`) we're ready to classify texts by their language.
```php
// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// we load the language model; it creates
// the $config object for us.
$detect = LanguageDetector\Detect::initByPath('language.php');

$lang = $detect->detect("Agricultura (-ae, f.), sensu latissimo,
est summa omnium artium et scientiarum et technologiarum quae de
terris colendis et animalibus creandis curant, ut poma, frumenta,
charas, carnes, textilia, et aliae res e terra bene producantur.
Specialius, agronomia est ars et scientia quae terris colendis student,
agricultio autem animalibus creandis.");

var_dump($lang);
```
And that's it.
Algorithms
----------
The project is designed to work with modules, which means you can provide your own algorithms for `sorting` and `comparing` the N-Grams. By default the library implements [PageRank](http://en.wikipedia.org/wiki/PageRank) as the `sorting` algorithm, and *out of place* (described in the paper) as the `comparing` algorithm.
In order to supply your own algorithms, you must change the `$config` at the *learning stage* to load your own classes (which, by the way, should implement some interfaces).
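As an illustration of the *out of place* comparison mentioned above, here is a minimal sketch. It is not the library's implementation: `outOfPlaceDistance()` and `closestLanguage()` are hypothetical names, profiles are assumed to be plain lists of N-grams ordered from highest to lowest rank, and the penalty for an N-gram missing from the model (the model's length) is an assumption made for this sketch.

```php
<?php
/**
 * Hypothetical helper, NOT the library's code.
 * Sums, for each N-gram in the text profile, how far its rank is
 * from the rank of the same N-gram in the model profile.
 */
function outOfPlaceDistance(array $textProfile, array $modelProfile): int
{
    // Map each model N-gram to its rank (0 = highest ranked).
    $modelRank  = array_flip(array_values($modelProfile));
    // Assumed penalty for N-grams the model does not contain.
    $maxPenalty = count($modelProfile);

    $distance = 0;
    foreach (array_values($textProfile) as $rank => $gram) {
        $distance += isset($modelRank[$gram])
            ? abs($rank - $modelRank[$gram])
            : $maxPenalty;
    }
    return $distance;
}

/**
 * Hypothetical helper: returns the language whose model profile is
 * closest to the text profile, or null when no models are given.
 * $models maps a language label to its ranked N-gram list.
 */
function closestLanguage(array $textProfile, array $models): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($models as $lang => $profile) {
        $d = outOfPlaceDistance($textProfile, $profile);
        if ($d < $bestDistance) {
            $best = (string) $lang;
            $bestDistance = $d;
        }
    }
    return $best;
}
```

A lower distance means the text's N-gram ranking is closer to the model's, so the language with the smallest distance is the best guess under this measure.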