https://github.com/dkpro/jweb1t
Efficient access to Web1T formatted data
- Host: GitHub
- URL: https://github.com/dkpro/jweb1t
- Owner: dkpro
- License: apache-2.0
- Created: 2015-03-27T19:03:15.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2020-10-13T07:50:19.000Z (over 5 years ago)
- Last Synced: 2025-02-13T10:53:12.713Z (over 1 year ago)
- Language: Java
- Homepage:
- Size: 155 KB
- Stars: 2
- Watchers: 10
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# jweb1t
jWeb1T is an open source Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format.
It is based on a binary search algorithm that finds the n-grams and returns their frequency counts in logarithmic time.
Because the corpus is stored in many files, a simple index is used to locate the files that contain the requested n-grams.
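The lookup idea described above can be illustrated with a minimal sketch. It is not jWeb1T's actual implementation: the class `NgramLookupSketch` and its `frequency` method are hypothetical names, and the data is held in an in-memory list rather than in files on disk. It assumes only that each line has the Web1T layout `n-gram<TAB>count` and that lines are sorted by n-gram.

```Java
import java.util.List;

// Hypothetical illustration class; not part of the jWeb1T API.
public class NgramLookupSketch {

    /**
     * Binary search over lines of the form "n-gram<TAB>count".
     * Assumes sortedLines is sorted by n-gram in String.compareTo order.
     * Returns the stored count, or 0 if the n-gram is not present.
     */
    public static long frequency(List<String> sortedLines, String ngram) {
        int lo = 0;
        int hi = sortedLines.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;           // midpoint without int overflow
            String line = sortedLines.get(mid);
            int tab = line.lastIndexOf('\t');    // the count follows the last tab
            int cmp = line.substring(0, tab).compareTo(ngram);
            if (cmp == 0) {
                return Long.parseLong(line.substring(tab + 1));
            } else if (cmp < 0) {
                lo = mid + 1;                    // target sorts after this line
            } else {
                hi = mid - 1;                    // target sorts before this line
            }
        }
        return 0L;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<String> lines = List.of(
            "a test\t120",
            "test case\t45",
            "test phrase\t17",
            "the test\t300");
        System.out.println(frequency(lines, "test phrase"));
        System.out.println(frequency(lines, "no such"));
    }
}
```

Each comparison halves the remaining range, which is where the logarithmic lookup time comes from. In a multi-file corpus, an index would first narrow the search to the file whose key range covers the requested n-gram.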
jWeb1T has been developed by Claudio Giuliano at FBK for the English Lexical Substitution Task at SemEval 2007:
> Claudio Giuliano, Alfio Gliozzo and Carlo Strapparava. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), Prague, 23-24 June 2007.
jWeb1T has been funded by the X-Media Project.
The [UKP Lab at Technische Universität Darmstadt](https://www.ukp.tu-darmstadt.de/ukp-home/) has contributed several bug fixes and updates.
A UIMA wrapper for jWeb1T is available as part of DKPro.
## Getting it
The latest version of jWeb1T is available via Maven Central.
If you use Maven as your build tool, you can add jWeb1T as a dependency in your `pom.xml` file:
```
<dependency>
  <groupId>com.googlecode.jweb1t</groupId>
  <artifactId>com.googlecode.jweb1t</artifactId>
  <version>1.3.0</version>
</dependency>
```
## Usage
### Prerequisites
- Obtain or create data in Web1T format.
- Unzip the data files.
- Delete the zipped files (if still present).
### Creating necessary indexes
```Java
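// PATH_TO_DATA: location of the unzipped Web1T data
// MAX_NGRAM_LEVEL: highest n-gram order to index
JWeb1TIndexer indexer = new JWeb1TIndexer(PATH_TO_DATA, MAX_NGRAM_LEVEL);
indexer.create();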
```
### Getting n-gram counts
```Java
JWeb1TSearcher web1t = new JWeb1TSearcher(
    INDEX_FILE_1,
    INDEX_FILE_2,
    ...
    INDEX_FILE_N
);
// Look up the frequency count of an n-gram
web1t.getFrequency("test phrase");
```