https://github.com/levelfourab/lect

Pipeline for natural language analysis

java natural-language-analysis natural-language-processing

Host: GitHub
URL: https://github.com/levelfourab/lect
Owner: LevelFourAB
Created: 2017-07-04T09:01:09.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2020-10-29T05:47:28.000Z (over 5 years ago)
Last Synced: 2025-07-16T06:02:40.459Z (11 months ago)
Topics: java, natural-language-analysis, natural-language-processing
Language: Java
Homepage:
Size: 181 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 10
Metadata Files:
- Readme: README.md

# Lect

Lect is a pipeline for natural language analysis that can be created from and
executed on different formats such as plain text, HTML and Markdown. Lect
parses the original format into paragraphs, sentences and words while keeping
track of the location in the source.

Lect can be used to build things such as spell and grammar checking,
entity tagging, keyword extraction, summarization algorithms and many other
applications that require robust text handling.

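```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Lect's own imports are omitted here

Source source = PlainTextSource.forString("Simple plain text");

AtomicInteger wordCount = Pipeline.over(source)
  .language(ICULanguage.forLocale(Locale.ENGLISH))
  // The collector is the object handed back by run()
  .collector(new AtomicInteger())
  .with(encounter -> new DefaultHandler() {
    private int count = 0;

    // Count every word token that is encountered
    public void word(Token token) {
      count++;
    }

    // Publish the final count to the collector
    public void done() {
      encounter.collector().set(count);
    }
  })
  .run();

System.out.println(wordCount + " words");
```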

## Paragraphs, sentences and tokens

Three things are currently tracked in a source: paragraphs, sentences and
tokens. The paragraphs in Lect are used to group text content that is logically
connected instead of visually connected. For a format such as HTML or Markdown
this means that explicit paragraphs, headings and list items are all turned
into paragraphs. Handlers receive paragraph boundaries via the `startParagraph`
and `endParagraph` methods.

When a paragraph has been found, the text in the paragraph is run through a
`LanguageParser` to turn it into sentences and tokens. Sentence boundaries are
passed to handlers via `startSentence` and `endSentence`.

Tokens are the individual parts that make up the actual content. Most of the
tokens are emitted for sentences, but white-space tokens can be found between
sentences and paragraphs.

Four types of tokens exist: white-space, words, symbols and special tokens.

* White-space is anything that matches space in the source, within or outside
  sentences.
* Words are anything that could be a word in the language specified.
* Symbols are individual symbols, such as punctuation.
* Special tokens are things such as URLs, e-mails and phone numbers.
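
The four categories above can be illustrated with a small, self-contained
classifier. This is only a sketch and not how Lect classifies tokens:
`TokenKind` and `classify` are hypothetical names, the special-token check is a
deliberately crude heuristic for URLs and e-mail addresses, and anything that
is not white-space, special or alphanumeric is treated as a symbol.

```java
import java.util.regex.Pattern;

public class TokenKindSketch {
  // Hypothetical enum mirroring the four categories described above
  enum TokenKind { WHITESPACE, WORD, SYMBOL, SPECIAL }

  // Crude heuristic (an assumption): http(s) URLs or e-mail shaped strings
  private static final Pattern SPECIAL =
    Pattern.compile("^(https?://\\S+|[^\\s@]+@[^\\s@]+\\.[^\\s@]+)$");

  static TokenKind classify(String text) {
    if(text.isEmpty()) {
      throw new IllegalArgumentException("empty token");
    }
    if(text.chars().allMatch(Character::isWhitespace)) {
      return TokenKind.WHITESPACE;
    }
    if(SPECIAL.matcher(text).matches()) {
      return TokenKind.SPECIAL;
    }
    if(text.codePoints().allMatch(Character::isLetterOrDigit)) {
      return TokenKind.WORD;
    }
    // Everything else, such as punctuation, falls through to symbol
    return TokenKind.SYMBOL;
  }

  public static void main(String[] args) {
    String[] samples = { "Hello", " ", ",", "https://example.com", "me@example.com" };
    for(String sample : samples) {
      System.out.println("'" + sample + "' -> " + classify(sample));
    }
  }
}
```

A real language parser would decide what counts as a word per language; the
alphanumeric check here is just enough to show how the categories relate.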

## Languages

Languages are supported via the interface `LanguageParser` which is responsible
for turning text into sentences and tokens (words, symbols and white-space).

A parser implemented using ICU4J is available that uses `BreakIterator` to split
things into tokens. This parser is suitable for some uses, such as spell
checking, but is not recommended for more advanced NLP tasks.

```java
LanguageFactory lang = ICULanguage.forLocale(Locale.ENGLISH);
```
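
To give a feel for what `BreakIterator`-based splitting looks like, here is a
sketch that uses the JDK's `java.text.BreakIterator`, which offers an API
similar to the ICU4J class of the same name. It is not Lect's parser: `split`
is a hypothetical helper, and the boundaries the JDK reports may differ from
those ICU4J would report.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorSketch {
  // Hypothetical helper: cut the text at every boundary the iterator reports
  static List<String> split(BreakIterator iterator, String text) {
    List<String> parts = new ArrayList<>();
    iterator.setText(text);
    int start = iterator.first();
    for(int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
      parts.add(text.substring(start, end));
    }
    return parts;
  }

  public static void main(String[] args) {
    String paragraph = "Lect parses text. It tracks locations.";

    // First pass: paragraph into sentences
    for(String sentence : split(BreakIterator.getSentenceInstance(Locale.ENGLISH), paragraph)) {
      System.out.println("Sentence: [" + sentence + "]");

      // Second pass: sentence into word, white-space and punctuation segments
      for(String token : split(BreakIterator.getWordInstance(Locale.ENGLISH), sentence)) {
        System.out.println("  Token: [" + token + "]");
      }
    }
  }
}
```

Because the segments are cut at consecutive boundaries, joining them always
reproduces the input, which is what makes location tracking straightforward.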

`TokenizingLanguage` is available for use with two types of tokenizers, one that
splits a paragraph into sentences and one that splits a sentence into tokens:

```java
LanguageFactory lang = TokenizingLanguage.create(Locale.ENGLISH,
  SentenceTestTokenizer::new,
  WhitespaceTokenizer::new
);
```

## Tokenizers

Tokenizers are objects responsible for tokenizing input, such as strings,
into tokens. In Lect they are mostly interesting when implementing a
`LanguageParser`. The `TokenizingLanguage` class makes implementing the parsing
a two-step process: first implement a tokenizer that splits text into
sentences, then a tokenizer that splits sentences into tokens.

A good starting point for custom tokenizers is `OffsetTokenizer` which helps with
creating tokenizers that use `OffsetLocation` for location tracking.
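
The idea of offset-based location tracking can be sketched without any Lect
classes. The example below does not use `OffsetTokenizer` or `OffsetLocation`,
whose APIs are not shown in this README; `Span` and `tokenize` are hypothetical
names for a tokenizer that records where each token starts and ends.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetTokenizerSketch {
  // Hypothetical token carrying its text and [start, end) offsets in the input
  static final class Span {
    final String text;
    final int start;
    final int end;

    Span(String text, int start, int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }
  }

  // Split into alternating runs of white-space and non-white-space,
  // keeping the white-space runs as tokens of their own
  static List<Span> tokenize(String input) {
    List<Span> spans = new ArrayList<>();
    int start = 0;
    while(start < input.length()) {
      boolean whitespace = Character.isWhitespace(input.charAt(start));
      int end = start + 1;
      while(end < input.length() && Character.isWhitespace(input.charAt(end)) == whitespace) {
        end++;
      }
      spans.add(new Span(input.substring(start, end), start, end));
      start = end;
    }
    return spans;
  }

  public static void main(String[] args) {
    for(Span span : tokenize("Simple  plain text")) {
      System.out.println(span.start + "-" + span.end + " [" + span.text + "]");
    }
  }
}
```

Keeping white-space as tokens means every character of the input belongs to
exactly one span, so offsets can always be mapped back to the source.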

## Token matching

Lect includes utilities for matching patterns of tokens. `TokenPattern` can be
used to compile and match a sequence of tokens. Matching is usually done in a
streaming fashion, so it can be used with handlers:

```java
TokenPattern pattern = TokenPattern.compile("symbol='$' word");
TokenMatcher matcher = pattern.matcher();

// Feed tokens one at a time, for example as a handler receives them
if(matcher.add(token)) {
  // The token matched
}
```

Many variants of patterns are supported:

```java
// Match any token
TokenPattern.compile("any");

// Match a word
TokenPattern.compile("word");

// Match against token.getText()
TokenPattern.compile("word='Test'");

// Shortcut to match the text of any type of token
TokenPattern.compile("'Test'");

// Match against TokenProperty.NORMALIZED
TokenPattern.compile("word,normalized='test'");

// Match word followed by symbol
TokenPattern.compile("word symbol");

// Match against regular expression
TokenPattern.compile("word=/test/i");

// Shortcut to match via regex for any type of token
TokenPattern.compile("/test/i");

// Use parentheses to create an optional group of Mrs + period
TokenPattern.compile("(word,normalized='mrs' symbol,text='.',continuation)? word");

// Use brackets to create an OR between tokens or groups
TokenPattern.compile("[word,normalized='mrs' word,normalized='mr'] symbol,text='.',continuation?");
```
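
Streaming matching, where tokens are fed in one at a time, can be sketched as a
small state machine. The class below is purely illustrative and is not Lect's
`TokenMatcher`: it works on plain strings, supports only a fixed sequence of
steps, and restarts on a mismatch. Whether Lect's matcher behaves the same way
on a mismatch is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class StreamingMatchSketch {
  private final List<Predicate<String>> steps;
  private int position = 0;

  @SafeVarargs
  StreamingMatchSketch(Predicate<String>... steps) {
    if(steps.length == 0) {
      throw new IllegalArgumentException("at least one step is required");
    }
    this.steps = Arrays.asList(steps);
  }

  // Feed one token; returns true when this token completes the sequence
  boolean add(String token) {
    if(! steps.get(position).test(token)) {
      // Mismatch: restart and let this token try to begin a new match
      position = 0;
      if(! steps.get(0).test(token)) {
        return false;
      }
    }

    position++;
    if(position == steps.size()) {
      position = 0;
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // Roughly the shape of "symbol='$' word": a dollar sign, then something word-like
    StreamingMatchSketch matcher = new StreamingMatchSketch(
      token -> token.equals("$"),
      token -> token.chars().allMatch(Character::isLetterOrDigit)
    );

    for(String token : new String[] { "costs", "$", "100", "today" }) {
      if(matcher.add(token)) {
        System.out.println("Match completed at: " + token);
      }
    }
  }
}
```

Only the current position is kept as state, which is what makes this style of
matching usable from inside a handler that sees one token at a time.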
