https://github.com/searchivarius/toolsnlp
UIMA-ECD wrappers for some basic NLP tools.
- Host: GitHub
- URL: https://github.com/searchivarius/toolsnlp
- Owner: searchivarius
- Created: 2013-09-11T14:07:21.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2013-12-22T08:59:37.000Z (over 11 years ago)
- Last Synced: 2025-03-17T18:19:57.274Z (about 2 months ago)
- Language: Java
- Homepage:
- Size: 22.6 MB
- Stars: 0
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
ToolsNLP
========

UIMA-ECD wrappers for various basic NLP tools. This was used for a course project. For more details on UIMA-ECD, please see https://github.com/oaqa/oaqa-tutorial
Prerequisites: Java, Python (nltk.util.clean_html should be installed), and the Unix command-line utility html2text

Sub-projects:
1. Ex1: five simple HTML cleaners (regexp, my own cleaner, Apache Tika, NLTK, and Unix html2text). The script launches/runAll.sh runs them all.
2. Ex2: wrappers for sentence segmenters and tokenizers. The script launches/run_ex2.sh runs them.
3. Ex3: a wrapper for the ClearTK/OpenNLP POS tagger.
4. Project: a rudimentary proof-of-concept information extractor. It attempts to extract the following information from Wikipedia descriptions of countries: capital, languages spoken, religion.

Additional requirements:
1. Unix utility html2text
2. Python + nltk.util.clean_html
3. Compiled Senna parser (http://ml.nec-labs.com/senna/).
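
To give a feel for what the regexp-based cleaner in Ex1 does, here is a minimal sketch of a regular-expression HTML cleaner in plain Java. It is not the code from this repository: the class name `RegexHtmlCleaner`, the method `clean`, and the specific patterns are assumptions made for illustration, and it is not wrapped as a UIMA component. It drops script/style blocks and comments, strips the remaining tags, decodes a handful of common entities, and collapses whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the ToolsNLP sources.
public class RegexHtmlCleaner {
    // <script> and <style> elements, including their content (case-insensitive, dot matches newlines).
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining opening or closing tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Runs of whitespace left behind by the removals.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String clean(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last so "&amp;lt;" is not decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p {color: red}</style></head>"
            + "<body><p>Paris is the <b>capital</b> of France.</p></body></html>";
        System.out.println(clean(html)); // prints: Paris is the capital of France.
    }
}
```

A regexp approach like this is fast and dependency-free but brittle on malformed markup, which is presumably why it makes a useful baseline next to a real parser such as Apache Tika.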