https://github.com/commoncrawl/language-detection-cld2
Natural language detection, Java bindings for CLD2
language-detection language-identification natural-language
- Host: GitHub
- URL: https://github.com/commoncrawl/language-detection-cld2
- Owner: commoncrawl
- License: apache-2.0
- Created: 2018-06-26T12:44:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-11-09T17:07:23.000Z (over 1 year ago)
- Last Synced: 2024-11-09T18:18:55.078Z (over 1 year ago)
- Topics: language-detection, language-identification, natural-language
- Language: Java
- Homepage:
- Size: 138 KB
- Stars: 14
- Watchers: 15
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.deezer-weslang
- License: LICENSE
Awesome Lists containing this project
- awesome-java - Language Detection CLD2
README
This is a Java wrapper for the CLD2 library (https://code.google.com/p/cld2/),
built using JNA.
Initially the classes CLDHints and Cld2Library were automatically generated
using jnaerator (https://code.google.com/p/jnaerator/). To use it we needed to
remove an include, as it crashed the app.
Then we executed the following command:
$ java -jar jnaerator-0.12-20140604.001151-54-shaded.jar -library Cld2 ~/language_detection/cld2/internal/generated_language.h ~/language_detection/cld2/public/encodings.h ~/language_detection/cld2/public/compact_lang_det.h -o . -v -noJar -noComp -runtime JNA -f -noComments
Then using:
$ nm libcld2_full.so
we obtained the mangled C++ names and used them to replace the names in
Cld2Library. This was necessary because of bug
https://github.com/ochafik/nativelibs4java/issues/515.
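The symbol lookup above can also be scripted. The following is a minimal sketch, not part of the bindings: the class NmSymbolFinder, its method and the sample nm lines are all hypothetical, and the parameter encoding in the sample symbol is made up. It assumes the usual "address type name" layout of nm output and keeps only text-section symbols whose mangled name contains the plain function name.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical, not part of the bindings): picks mangled names out of nm output. */
public class NmSymbolFinder {

    /**
     * Returns the names of text-section symbols (type "T" or "t")
     * whose mangled name contains the given plain function name.
     */
    public static List<String> findMangled(String nmOutput, String functionName) {
        List<String> matches = new ArrayList<>();
        for (String line : nmOutput.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // blank or malformed line
            }
            // Undefined symbols have no address column, so index from the end.
            String type = fields[fields.length - 2];
            String name = fields[fields.length - 1];
            if (type.equalsIgnoreCase("T") && name.contains(functionName)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real addresses and parameter encodings will differ.
        String sample = String.join("\n",
            "0000000000012340 T _ZN4CLD224ExtDetectLanguageSummaryEPKcibPi",
            "                 U malloc",
            "0000000000045678 D _ZN4CLD29kSomeDataE");
        System.out.println(findMangled(sample, "ExtDetectLanguageSummary"));
    }
}
```

In practice the input would be the captured output of the nm command shown above, and the names found would still need to be copied into Cld2Library by hand.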
We also removed a lot of undesired content and made the class protected.
Note also that we replaced the signature of ExtDetectLanguageSummary to use
arrays. We keep only this one function, as the others can easily be
implemented on top of it.
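As a rough illustration of building the other variants on top of a single array-based call, here is a minimal sketch. Everything in it is an assumption made for the example: the CoreDetector interface is a simplified stand-in and not the real signature in Cld2Library, and the DetectorFacade class, its methods and the stub values are hypothetical.

```java
/** Sketch: convenience methods layered over a single array-based detection call. */
public class DetectorFacade {

    /** Hypothetical stand-in for the one bound native function; not the real CLD2 signature. */
    public interface CoreDetector {
        /**
         * Fills language3 and percent3 (length 3 each) and isReliable (length 1),
         * and returns the code of the summary language.
         */
        int detectSummary(String text, int[] language3, int[] percent3, boolean[] isReliable);
    }

    private final CoreDetector core;

    public DetectorFacade(CoreDetector core) {
        this.core = core;
    }

    /** Simplest variant: only the summary language code. */
    public int detect(String text) {
        return core.detectSummary(text, new int[3], new int[3], new boolean[1]);
    }

    /** Variant that returns -1 when the core call flags the result as unreliable. */
    public int detectReliable(String text) {
        boolean[] reliable = new boolean[1];
        int lang = core.detectSummary(text, new int[3], new int[3], reliable);
        return reliable[0] ? lang : -1;
    }

    /** Variant exposing the percentage reported for the top-ranked language. */
    public int topPercent(String text) {
        int[] percent3 = new int[3];
        core.detectSummary(text, new int[3], percent3, new boolean[1]);
        return percent3[0];
    }

    public static void main(String[] args) {
        // Stub for demonstration only: always reports language code 7 at 95 percent.
        CoreDetector stub = (text, language3, percent3, isReliable) -> {
            language3[0] = 7;
            percent3[0] = 95;
            isReliable[0] = !text.isEmpty();
            return 7;
        };
        DetectorFacade facade = new DetectorFacade(stub);
        System.out.println(facade.detect("hello"));
        System.out.println(facade.detectReliable(""));
        System.out.println(facade.topPercent("hello"));
    }
}
```

Using arrays as output parameters lets each wrapper allocate only what it needs and discard the rest, so a single native binding is enough.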
We also extracted both Enumerations to their own classes.
In CLDHints, we replaced the pointers by strings and removed the references to
byreference and bypointer.
IMPORTANT: these bindings have only been tested on linux-x86-64. Other
OSes and flavors are not explicitly supported.
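A caller may want to check for the tested platform before loading the native library. The sketch below is one possible guard, not something the bindings provide: the PlatformCheck class and its method are hypothetical, and it assumes that a 64-bit x86 JVM reports os.arch as either "amd64" or "x86_64".

```java
import java.util.Locale;

/** Hypothetical guard: reports whether the JVM runs on the tested linux-x86-64 platform. */
public class PlatformCheck {

    /** Takes the values of the os.name and os.arch system properties. */
    public static boolean isLinuxX8664(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // Assumption: 64-bit x86 JVMs report either "amd64" or "x86_64".
        return os.startsWith("linux") && (arch.equals("amd64") || arch.equals("x86_64"));
    }

    public static void main(String[] args) {
        boolean tested = isLinuxX8664(System.getProperty("os.name"),
                                      System.getProperty("os.arch"));
        System.out.println(tested ? "tested platform" : "untested platform");
    }
}
```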