{"id":17158667,"url":"https://github.com/ethteck/decodetect","last_synced_at":"2025-04-13T13:40:55.273Z","repository":{"id":38311892,"uuid":"196669795","full_name":"ethteck/decodetect","owner":"ethteck","description":"Java text encoding detection library","archived":false,"fork":false,"pushed_at":"2023-03-14T10:59:09.000Z","size":24070,"stargazers_count":2,"open_issues_count":9,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T04:41:30.212Z","etag":null,"topics":["encoding","encodings","java","utf-7"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ethteck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-13T02:38:41.000Z","updated_at":"2025-02-21T15:51:35.000Z","dependencies_parsed_at":"2023-02-09T11:16:08.923Z","dependency_job_id":null,"html_url":"https://github.com/ethteck/decodetect","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ethteck%2Fdecodetect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ethteck%2Fdecodetect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ethteck%2Fdecodetect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ethteck%2Fdecodetect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ethteck","download_url":"https://codeload.github.com/ethteck/decodetect/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","r
epositories_count":248724139,"owners_count":21151556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["encoding","encodings","java","utf-7"],"created_at":"2024-10-14T22:12:15.637Z","updated_at":"2025-04-13T13:40:55.239Z","avatar_url":"https://github.com/ethteck.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Decodetect\nDecodetect is a text encoding detection library designed to support encodings that many other libraries don't. It contains the infrastructure to train and test custom models, and everything is written in pure Java to maximize portability.\n\nModels encode byte bigram frequency counts. At runtime, input data is converted to this same byte bigram frequency format and compared with the trained models via cosine similarity.\n\nThe training data that creates the distributed model is gathered through Wikipedia (see module `train`). 
However, it is possible to supply one's own training data and train a more specialized model as well.\n\n## Usage\nDecodetect can be found at [Maven Central](https://mvnrepository.com/artifact/com.ethteck.decodetect/decodetect-core/).\n\nUsing Decodetect involves simply creating an instance of `Decodetect` and then passing a `byte[]` to `getResults()`:\n\n```java\nbyte[] rawBytes = Files.readAllBytes(somePath);\n\nDecodetect decodetect = new Decodetect();\nDecodetectResult topResult = decodetect.getResults(rawBytes).get(0);\nCharset detectedCharset = topResult.getEncoding();\n\nString decoded = new String(rawBytes, detectedCharset);\n```\n\nEach `DecodetectResult` contains a confidence number in addition to the `Charset` itself. This is a measure of how closely the input bytes resemble the model trained on that encoding. For most use cases, one can just use the first item in the result list.\n\n## Supported Encodings\nDecodetect supports a myriad of encodings for many languages. The bundled model has specific encodings for each language, but all languages support the following encodings as well:\n\n* UTF-7\n* UTF-8\n* UTF-16 BE\n* UTF-16 LE\n* UTF-32 BE\n* UTF-32 LE\n\nFor more information on the encodings and languages supported by Decodetect, see [Encodings.java](core/src/main/java/com/ethteck/decodetect/core/Encodings.java).\n\n\n## Project Structure\nDecodetect can be built simply with Maven. 
The modules are as follows:\n\n* `core` Contains runtime dependencies\n\n* `train` For downloading training data and training models\n\n## Dependencies\nRuntime:\n\n* [jutf7](http://jutf7.sourceforge.net/) for UTF-7 Charset support ([MIT](https://opensource.org/licenses/MIT))\n\nTraining:\n\n* [gson](https://github.com/google/gson) for parsing json to extract text from Wikipedia ([Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0))\n\nTesting:\n\n* [JUnit 5](https://junit.org/junit5/) ([Eclipse 2.0](https://www.eclipse.org/legal/epl-2.0/))\n\n## About\nDecodetect was written by Ethan Roseman and uses the MIT license. See the [license](LICENSE.md) for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fethteck%2Fdecodetect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fethteck%2Fdecodetect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fethteck%2Fdecodetect/lists"}