https://github.com/proycon/lexmatch
Simple lexicon matcher against a text
- Host: GitHub
- URL: https://github.com/proycon/lexmatch
- Owner: proycon
- License: gpl-3.0
- Created: 2021-09-20T14:02:53.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-07-03T19:07:18.000Z (over 1 year ago)
- Last Synced: 2025-04-22T10:21:12.957Z (8 months ago)
- Topics: lexical-search, nlp
- Language: Rust
- Homepage: https://git.sr.ht/~proycon/lexmatch
- Size: 47.9 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cli-apps-in-a-csv - lexmatch - This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text, returning their exact positions. It can be used to compute a frequency list for a lexicon on a target corpus. (Text processing)
- awesome-cli-apps - lexmatch - This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text, returning their exact positions. It can be used to compute a frequency list for a lexicon on a target corpus. (Text processing)
README
# Lexmatch
This is a simple lexicon matching tool that, given a lexicon of words or
phrases, identifies all matches in a given target text, returning their exact
positions. It can be used to compute a frequency list for a lexicon on a
target corpus.
The implementation uses suffix arrays or hash tables. The text must be
plain-text UTF-8. The former implementation (the default) is limited to texts
of 2^32 bytes (about 4GB); the latter (`--tokens`/`--cjk`) has no such limit.
The offsets in the output are UTF-8 *byte* positions.
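To give a rough idea of the suffix-array strategy, here is a minimal sketch (not the actual lexmatch implementation) that sorts all suffix start positions and binary-searches them for a pattern:
```
// Minimal sketch of suffix-array matching, not lexmatch's actual code.
// Sort all suffix start positions by the suffix they point at, then
// binary-search for the contiguous range of suffixes starting with the pattern.
fn find_all(text: &str, pattern: &str) -> Vec<usize> {
    // Naive construction; real suffix-array builders are much faster.
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by_key(|&i| &text.as_bytes()[i..]);
    // First suffix that is >= the pattern.
    let lo = sa.partition_point(|&i| &text.as_bytes()[i..] < pattern.as_bytes());
    let mut hits = Vec::new();
    for &start in &sa[lo..] {
        if text.as_bytes()[start..].starts_with(pattern.as_bytes()) {
            hits.push(start); // a UTF-8 *byte* offset, as lexmatch reports
        } else {
            break; // sorted order: no further suffix can match
        }
    }
    hits
}
```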
This tool only does exact (or case-insensitive) matching; if you need fuzzy
matching against lexicons, check out [analiticcl](https://github.com/proycon/analiticcl)
instead.
## Installation
You can build and install the latest stable release using Rust's package manager:
```
cargo install lexmatch
```
or if you want the development version after cloning this repository:
```
cargo install --path .
```
No cargo/rust on your system yet? Run ``sudo apt install cargo`` on Debian/Ubuntu-based systems, ``brew install rust`` on macOS, or use [rustup](https://rustup.rs/).
## Usage
See ``lexmatch --help``.
Simple example:
```
$ lexmatch --lexicon lexicon.lst corpus.txt
```
The lexicon must be plain-text UTF-8 containing one entry per line; an entry
need not be a single word and is not constrained in length. If the lexicon
consists of Tab-Separated Values (TSV), only the first column is considered
and the rest is ignored.
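As an informal illustration, loading such a lexicon amounts to taking the first tab-separated field of each non-empty line (a sketch, not lexmatch's actual code):
```
use std::fs;
use std::io;

// Sketch of loading a lexicon as described above: one entry per line,
// and only the first column of a TSV line is used.
fn load_lexicon(path: &str) -> io::Result<Vec<String>> {
    let data = fs::read_to_string(path)?;
    Ok(data
        .lines()
        .map(|line| line.split('\t').next().unwrap_or("").to_string())
        .filter(|entry| !entry.is_empty())
        .collect())
}
```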
Instead of a lexicon, you can also provide the patterns to query on the command line using ``--query``.
By default, you will get TSV output with a column for the matched text, one
for the occurrence count, and then one begin position (a UTF-8 byte position)
per match, so the number of columns varies per row:
```
$ lexmatch --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4 193 3307 3480 278
bad 3 201 3315 3488
```
Matching is case sensitive by default; add `--no-case` for case-insensitive
behaviour (all input and output will be lowercased, which may in rare cases
cause the UTF-8 offsets to no longer be valid for the original text).
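The offsets can become invalid because Unicode lowercasing is not always byte-length-preserving. A small demonstration in Rust:
```
// Why lowercasing can invalidate byte offsets: some characters grow
// when lowercased, shifting every offset after them.
fn main() {
    let original = "İstanbul"; // Turkish dotted capital I: 2 bytes in UTF-8
    let lowered = original.to_lowercase(); // 'İ' becomes "i" + U+0307: 3 bytes
    assert_eq!(original.len(), 9);
    assert_eq!(lowered.len(), 10); // offsets after the first character shift by one
}
```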
For verbose output, add ``--verbose``. This produces cleaner TSV (tab-separated
values) output that you can easily import into, for example, the [STAM
tools](https://github.com/annotation/stam-tools):
```
$ lexmatch --verbose --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 193 197
good 3307 3311
good 3480 3484
good 278 282
bad 201 204
bad 3315 3318
bad 3488 3491
```
You may provide multiple lexicons as well as multiple text files; in such
cases the output will also identify the lexicon and/or text file. If multiple
lexicons match, they are all returned (delimited by a semicolon). The order of
the results is arbitrary.
If you don't care about the exact positions but rather want to compute a
frequency list with the number of occurrences of each item in the lexicon or
passed through ``--query``, then pass ``--count-only``:
```
$ lexmatch --count-only --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4
bad 3
```
You can configure a minimum frequency threshold using ``--freq``.
Rather than matching all of the lexicon against the text, you can also iterate
over the tokens in the text and check whether they occur in the lexicon. This
uses a hash map instead of a suffix array and is typically faster. It is more
limited, however, and cannot be used with frequency thresholds or counting. It
always produces verbose output (similar to ``--verbose``):
```
$ lexmatch --tokens --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
good 193 197
bad 201 204
good 278 282
good 3307 3311
bad 3315 3318
good 3480 3484
bad 3488 3491
```
Unlike before, the matches are now returned in reading order.
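Conceptually, this mode is a single pass over the text with a hash-set lookup per token. A minimal sketch (treating alphanumeric runs as tokens; lexmatch's actual tokenization rules may differ):
```
use std::collections::HashSet;

// Sketch of the token-based strategy: walk the text once, treat runs of
// alphanumeric characters as tokens, and test each against a hash set.
// Matches naturally come out in reading order.
fn match_tokens<'a>(text: &'a str, lexicon: &HashSet<&str>) -> Vec<(&'a str, usize, usize)> {
    let mut matches = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            start.get_or_insert(i); // token begins here
        } else if let Some(s) = start.take() {
            if lexicon.contains(&text[s..i]) {
                matches.push((&text[s..i], s, i)); // begin/end UTF-8 byte offsets
            }
        }
    }
    if let Some(s) = start {
        if lexicon.contains(&text[s..]) {
            matches.push((&text[s..], s, text.len()));
        }
    }
    matches
}
```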
If you add `--coverage`, you will get an extra final line with some coverage
statistics. This is useful to see how much of the text is covered by your
lexicon.
```
#coverage (tokens) = 7/627 = 0.011164274322169059
```
Coverage can also be computed line-by-line while matching against multiple lexicons. We can also read directly from stdin rather than from a file by passing `-` as the filename:
```
$ echo -e "Is this good or bad?\nIt is quite good." | lexmatch --coverage-matrix --query good --query bad -
Reading text from -...
Line query
Is this good or bad? 0.4
It is quite good. 0.25
```
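The figures above are simply the fraction of tokens on a line that occur in the query set: "Is this good or bad?" has five tokens, two of which match, hence 0.4. A sketch of that computation (again assuming alphanumeric tokenization):
```
use std::collections::HashSet;

// Sketch of the per-line coverage figure: the fraction of a line's tokens
// that appear in the lexicon (or query set).
fn line_coverage(line: &str, lexicon: &HashSet<&str>) -> f64 {
    let tokens: Vec<&str> = line
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect();
    if tokens.is_empty() {
        return 0.0;
    }
    let hits = tokens.iter().filter(|t| lexicon.contains(**t)).count();
    hits as f64 / tokens.len() as f64
}
```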
This can be used as a simple lexicon-based method for language detection:
```
$ echo -e "Do you know what language this is?\nUnd was ist das hier genau?\nÇa va assez bien je crois" | lexmatch -i --coverage-matrix --lexicon ~/exp/en.lst --lexicon ~/exp/de.lst --lexicon ~/exp/fr.lst -
Reading lexicon...
Reading lexicon...
Reading lexicon...
Reading text from -...
Line /home/proycon/exp/en.lst /home/proycon/exp/de.lst /home/proycon/exp/fr.lst Total
do you know what language this is? 1 0 0.14285714285714285 1.1428571428571428
und was ist das hier genau? 0.16666666666666666 0.8333333333333334 0.16666666666666666 1.1666666666666667
ça va assez bien je crois 0.2 0.2 0.6 1
```
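Reusing the `line_coverage` sketch above, a hypothetical detector (not part of lexmatch) would simply pick the lexicon with the highest coverage for each line:
```
use std::collections::HashSet;

// Hypothetical helper built on the `line_coverage` sketch above:
// score a line against each named lexicon and return the best match.
fn detect_language<'a>(line: &str, lexicons: &[(&'a str, HashSet<&str>)]) -> Option<&'a str> {
    lexicons
        .iter()
        .map(|(name, lex)| (*name, line_coverage(line, lex)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap()) // coverage is never NaN
        .map(|(name, _)| name)
}
```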
When using ``--tokens`` (or `--coverage-matrix`), we rely on whitespace and
punctuation to delimit tokens. This does not work for languages such as
Chinese, Japanese, and Korean, which are not delimited in that way. For such
languages, similar linear search behaviour can be attained by passing ``--cjk``
instead, with an integer value representing the maximum character length to
explore. A greedy search will then be performed that favours longer patterns
over shorter ones.
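A sketch of such a greedy longest-match scan (with `max_len` standing in for the value passed to ``--cjk``; not the actual implementation):
```
use std::collections::HashSet;

// Sketch of greedy matching for unsegmented scripts: at each character
// position, try the longest candidate first (up to `max_len` characters)
// and jump past a match once one is found, favouring longer patterns.
fn greedy_match<'a>(text: &'a str, lexicon: &HashSet<&str>, max_len: usize) -> Vec<(&'a str, usize)> {
    // Byte offset of every character start, so candidates stay on char boundaries.
    let starts: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    let mut matches = Vec::new();
    let mut pos = 0; // index into `starts`
    while pos < starts.len() {
        let mut advanced = false;
        for len in (1..=max_len.min(starts.len() - pos)).rev() {
            let end = starts.get(pos + len).copied().unwrap_or(text.len());
            let candidate = &text[starts[pos]..end];
            if lexicon.contains(candidate) {
                matches.push((candidate, starts[pos])); // UTF-8 byte offset
                pos += len;
                advanced = true;
                break;
            }
        }
        if !advanced {
            pos += 1; // no match at this position, advance one character
        }
    }
    matches
}
```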