https://github.com/njsmith/cctext
https://github.com/njsmith/cctext
Last synced: 3 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/njsmith/cctext
- Owner: njsmith
- Created: 2014-06-19T20:27:52.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2014-06-25T16:33:32.000Z (almost 11 years ago)
- Last Synced: 2025-01-19T09:41:51.558Z (4 months ago)
- Language: C++
- Size: 34.4 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.rst
Awesome Lists containing this project
README
cctext
======These are some work-in-progress scripts for extracting natural
language text from the Common Crawl corpus.Building and running
--------------------This is complicated, because we need to get access to the CLD2
language detection library we need to get java and c++ libraries to
play along with each other, and also because I have no idea what I'm
doing with java build systems::Step 0: get a recent version of gradle. As of 2014-06-25, Debian
ships gradle 1.5, which **will not** work. I'm using gradle **1.12**.Then::
# Get and build CLD2. You must use 'cld2' as the checkout directory,
# because that's where build.gradle expects to find it.
svn checkout http://cld2.googlecode.com/svn/trunk cld2
(cd cld2/internal && ./compile_libs.sh)# Build the CLD2 C++ wrapper shim.
gradle CLD2wrapSharedLibrary# Build the Java code, fetch dependencies, and create an
# invocation script:
gradle installAppNow you should be able to do::
./run.sh
which is a little script that sets up the rest of the stuff that the
above incantations didn't set up (``LD_LIBRARY_PATH``), and starts a
JVM. (It uses rlwrap to get non-terrible command-line editing, so you
might want to install that too.)Right now we just drop into a Jython REPL from which you can play
around with the actual code, because I haven't written a main()
function yet.The CLD2 wrapper
----------------CLD2 is pretty amazing, but it's written in C++, so using it from Java
is really painful. There are four pieces involved in making it work.1. ``src/main/java/org/vorpus/cctext/LanguageDetection.java`` defines the
high-level interface used by the main driver program.2. It works by calling the auto-generated Bridj wrapper in
``lib/CLD2wrap.jar``. This jar file was created by running::java -jar jnaerator-0.11-shaded.jar -library CLD2wrap -mode Jar -o lib -package org.vorpus.cld2wrap -f src/CLD2wrap/headers/cld2wrap.h
You'll have to re-run this if you ever modify ``cld2wrap.h``.
3. The auto-generated wrapper does FFI calls into the hand-written C++
wrapper in ``src/CLD2wrap/``. This is necessary because CLD2's
native API involves bizarre and exotic C++ features
like... ``std::vector`` and ordinary function calls, which Bridj
apparently cannot handle directly.4. The hand-written C++ wrapper links against ``libcld2_full.so``, and
uses CLD's native API directly.A cry for help
--------------This is terrible. If you know how to make it not so terrible please help.