{"id":28574377,"url":"https://github.com/tripleee/libexttextcat","last_synced_at":"2025-08-08T04:07:01.449Z","repository":{"id":11131887,"uuid":"13494833","full_name":"tripleee/libexttextcat","owner":"tripleee","description":"Clone of libreoffice libexttextcat","archived":false,"fork":false,"pushed_at":"2013-10-11T09:24:21.000Z","size":1992,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-10T21:26:00.891Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://cgit.freedesktop.org/libreoffice/libexttextcat/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tripleee.png","metadata":{"files":{"readme":"README","changelog":"ChangeLog","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-10-11T09:22:31.000Z","updated_at":"2014-02-06T01:31:42.000Z","dependencies_parsed_at":"2022-09-04T06:21:47.065Z","dependency_job_id":null,"html_url":"https://github.com/tripleee/libexttextcat","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tripleee/libexttextcat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tripleee%2Flibexttextcat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tripleee%2Flibexttextcat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tripleee%2Flibexttextcat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tripleee%2Flibexttextcat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tripleee","download_url":"https://codeload.github.com/tripleee/libexttextc
at/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tripleee%2Flibexttextcat/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269361150,"owners_count":24404309,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-08T02:00:09.200Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-10T21:20:53.742Z","updated_at":"2025-08-08T04:07:01.439Z","avatar_url":"https://github.com/tripleee.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"libexttextcat is an N-Gram-Based Text Categorization library primarily intended\nfor language guessing.\n\nFundamentally this is an adaption of wiseguys libtextcat extended to be UTF-8\naware. See README.libtextcat for details on original libtextcat.\n\nBuilding:\n\n * ./configure\n * make\n * make check\n\nthe tests can be run under valgrind's memcheck with export VALGRIND=memcheck,\ne.g.\n\n * export VALGRIND=memcheck\n * make check\n\nQuickstart: language guesser\n  \n Assuming that you have successfully compiled the library, you need some\nlanguage models to start guessing languages. 
A collection of over 150 language\nmodels, mostly derived from using the included \"createfp\" utility on UDHR\ntranslations, is bundled, with a matching configuration file, in the langclass\ndirectory:\n\n  * cd langclass/LM\n  * ../../src/testtextcat ../fpdb.conf\n  \t \nPaste some text onto the command line, and watch it get classified.\n     \nUsing the API:\n  \nClassifying the language of a text buffer can be as easy as:\n\n #include \"textcat.h\"\n ...\n void *h = textcat_Init( \"fpdb.conf\" );\n ...\n printf( \"Language: %s\\n\", textcat_Classify(h, buffer, 400) );\n ...\n textcat_Done(h);\n      \nCreating your own fingerprints:\n  \nThe createfp program allows you to easily create your own document\nfingerprints. Just feed it an example document on standard input, and store the\nstandard output:\n\nPut the names of your fingerprints in a configuration file, add some id's and\nyou're ready to classify.\n\nHere's a worked example. The UN Declaration of Human Rights is available in a\nmassive pile of translations[4], and unicode.org makes many of these\navailable as plain text[5], so...\n\n% cd langclass/ShortTexts/\n% wget http://unicode.org/udhr/d/udhr_abk.txt\n% tail -n+7 udhr_abk.txt \u003e ab.txt #skip english header, name is using BCP-47\n% cd ../LM\n% ../../src/createfp \u003c ../ShortTexts/ab.txt \u003e ab.lm\n% echo \"ab.lm       ab--utf8\" \u003e\u003e ../fpdb.conf\n\nEventually we'll drop fpdb.conf and assume the name of the fingerprint .lm file\nis the correct BCP-47 tag for the language it detects.\n    \nPerformance tuning:\n\nThis library was made with efficiency in mind. There are a couple of\nparameters you may wish to tweak if you intend to use it for other\ntasks than language guessing.\n\nThe most important thing is buffer size. 
For reliable language\nguessing the classifier only needs a couple of hundred bytes at most.\nSo don't feed it 100KB of text unless you are creating a fingerprint.\n\nIf you insist on feeding the classifier lots of text, try fiddling\nwith TABLEPOW, which determines the size of the hash table that is\nused to store the n-grams. Making it too small will result in many\nhash table clashes; making it too large will cause wild memory\nbehaviour. Both are bad for performance.\n\nPutting the most probable models at the top of the list in your config\nfile improves performance, because this will raise the threshold for\nlikely candidates more quickly.\n\nSince the speed of the classifier is roughly linear with respect to\nthe number of models, you should consider how many models you really\nneed. In the case of language guessing: do you really want to recognize\nevery language ever invented?\n\nAcknowledgements\n\nUTF-8 conversion and adaption for OpenOffice.org, Jocelyn Merand.\nOriginal libTextCat, Frank Scheelen \u0026 Rob de Wit at wise-guys.nl.\nOriginal language models, copyright Gertjan van Noord.\n\nReferences:\n\n[1] The document that started it all can be downloaded at John M.\nTrenkle's site: N-Gram-Based Text Categorization\n\nhttp://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz\n\n[2] The Perl implementation by Gertjan van Noord (code + language\nmodels): downloadable from his website\n\nhttp://odur.let.rug.nl/~vannoord/TextCat/\n\n[3] Original libtextcat implementation at\n\nhttp://software.wise-guys.nl/libtextcat/\n\n[4] http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx\n\n[5] http://unicode.org/udhr/index_by_name.html\n\nContact:\n\nQuestions or patches can be directed to libreoffice@lists.freedesktop.org.\nBugs can be directed to 
https://bugs.freedesktop.org\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftripleee%2Flibexttextcat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftripleee%2Flibexttextcat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftripleee%2Flibexttextcat/lists"}