https://github.com/chen0040/java-data-text
This package provides Java implementations of text preprocessing methods such as tokenizers, a vocabulary, text filters, and a stemmer.
- Host: GitHub
- URL: https://github.com/chen0040/java-data-text
- Owner: chen0040
- License: mit
- Created: 2017-05-15T12:31:20.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-05-18T04:27:18.000Z (over 7 years ago)
- Last Synced: 2024-12-11T09:23:03.606Z (11 days ago)
- Topics: porter-stemmer-algorithm, stop-words, word-filter
- Language: Java
- Size: 1.15 MB
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# java-data-text
This package provides Java implementations of text preprocessing methods such as tokenizers, a vocabulary, text filters, and a stemmer.
[![Build Status](https://travis-ci.org/chen0040/java-data-text.svg?branch=master)](https://travis-ci.org/chen0040/java-data-text) [![Coverage Status](https://coveralls.io/repos/github/chen0040/java-data-text/badge.svg?branch=master)](https://coveralls.io/github/chen0040/java-data-text?branch=master) [![Documentation Status](https://readthedocs.org/projects/java-data-text/badge/?version=latest)](http://java-data-text.readthedocs.io/en/latest/?badge=latest)
# Install
Add the following dependency to your POM file:
```xml
<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-data-text</artifactId>
  <version>1.0.3</version>
</dependency>
```
# Features
* Porter Stemmer
* Punctuation Filter
* Stop Word Removal
* XML Tag Removal
* IP Address Removal
* Number Removal
* English Tokenizer
# Usage
To use a text filter, create an instance of it and call its `filter(...)` method.
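Because each filter is applied through the same kind of `filter(...)` call, filters can be run one after another on a token list. The sketch below illustrates that idea without using the library: `TokenFilter`, `LOWER_CASE`, `DROP_STOP_WORDS`, `applyAll`, and the four-word stop list are hypothetical stand-ins chosen for this example, not the package's actual types or data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterChainSketch {
    /** Hypothetical stand-in for a token filter; not the library's interface. */
    interface TokenFilter {
        List<String> filter(List<String> words);
    }

    /** Lower-cases every token. */
    static final TokenFilter LOWER_CASE = words -> {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            out.add(w.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    /** Tiny illustrative stop list (an assumption, not the library's list). */
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "of"));

    /** Drops tokens that appear in STOP_WORDS. */
    static final TokenFilter DROP_STOP_WORDS = words -> {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            if (!STOP_WORDS.contains(w)) {
                out.add(w);
            }
        }
        return out;
    };

    /** Applies the filters in order, feeding each result into the next filter. */
    static List<String> applyAll(List<String> words, TokenFilter... filters) {
        List<String> current = words;
        for (TokenFilter f : filters) {
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "is", "a", "pet");
        System.out.println(applyAll(tokens, LOWER_CASE, DROP_STOP_WORDS)); // prints [cat, pet]
    }
}
```

Order matters in this sketch: lower-casing first lets the stop list match "The" as "the".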
### Porter Stemmer
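For orientation, the first rule group of the Porter algorithm (step 1a) handles plural endings: `sses` becomes `ss`, `ies` becomes `i`, `ss` is kept, and a remaining trailing `s` is dropped. The sketch below implements only that one step as a standalone illustration. It is not the library's code, and the class and method names are hypothetical.

```java
public class PorterStep1aSketch {
    /** Applies only step 1a of the Porter algorithm (plural endings). */
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"caresses", "ponies", "ties", "caress", "cats"}) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```

The full algorithm continues with further steps for endings such as `ed` and `ing`. With the library itself, stemming a list of words looks like this: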
```java
import com.github.chen0040.data.text.PorterStemmer;
import com.github.chen0040.data.text.TextFilter;

import java.util.Arrays;
import java.util.List;

TextFilter stemmer = new PorterStemmer();
List<String> words = Arrays.asList(
    "caresses", "ponies", "ties", "caress", "cats",
    "feed", "agreed", "disabled", "matting", "mating",
    "meeting", "milling", "messing", "meetings"
);

// Stem every word, then print each input next to its stem.
List<String> result = stemmer.filter(words);
for (int i = 0; i < words.size(); ++i) {
    System.out.printf("%s -> %s%n", words.get(i), result.get(i));
}
```

### Stop Word Removal
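```java
import com.github.chen0040.data.text.StopWordRemoval;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

// BasicTokenizer and FileUtils are used as in the original snippet, which does
// not show their imports; any InputStream can replace FileUtils.getResource(...).
StopWordRemoval filter = new StopWordRemoval();
// Disable the optional removal of numbers, IP addresses, and XML tags.
filter.setRemoveNumbers(false);
filter.setRemoveIpAddress(false);
filter.setRemoveXmlTag(false);

// Read the whole document into a single string.
InputStream inputStream = FileUtils.getResource("documents.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String content = reader.lines().collect(Collectors.joining("\n"));
reader.close();

List<String> before = BasicTokenizer.doTokenize(content);
List<String> after = filter.filter(before);
```

### Punctuation Filtering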
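```java
import com.github.chen0040.data.text.PunctuationFilter;
import com.github.chen0040.data.text.TextFilter;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

// BasicTokenizer and FileUtils are used as in the original snippet, which does
// not show their imports; any InputStream can replace FileUtils.getResource(...).
TextFilter filter = new PunctuationFilter();

// Read the whole document into a single string.
InputStream inputStream = FileUtils.getResource("documents.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String content = reader.lines().collect(Collectors.joining("\n"));
reader.close();

List<String> before = BasicTokenizer.doTokenize(content);
List<String> after = filter.filter(before);
```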
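To make the effect of punctuation filtering concrete, here is a standalone sketch of one plausible behavior: dropping tokens that contain no letter or digit. This is an assumption made for illustration. The library's `PunctuationFilter` may treat tokens differently, and the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PunctuationSketch {
    /** Keeps only tokens that contain at least one letter or digit. */
    static List<String> dropPunctuationTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.chars().anyMatch(Character::isLetterOrDigit)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", ",", "world", "!", "--", "it's");
        System.out.println(dropPunctuationTokens(tokens)); // prints [Hello, world, it's]
    }
}
```

Tokens such as `it's` survive in this sketch because they still contain letters; only pure punctuation tokens are removed.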