https://github.com/proycon/lexmatch
Simple lexicon matcher against a text
- Host: GitHub
- URL: https://github.com/proycon/lexmatch
- Owner: proycon
- License: gpl-3.0
- Created: 2021-09-20T14:02:53.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-07-03T19:07:18.000Z (over 1 year ago)
- Last Synced: 2025-04-22T10:21:12.957Z (8 months ago)
- Topics: lexical-search, nlp
- Language: Rust
- Homepage: https://git.sr.ht/~proycon/lexmatch
- Size: 47.9 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cli-apps-in-a-csv - lexmatch - This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text, returning their exact positions. It can be used to compute a frequency list for a lexicon on a target corpus. (Text processing)
- awesome-cli-apps - lexmatch - This is a simple lexicon matching tool that, given a lexicon of words or phrases, identifies all matches in a given target text, returning their exact positions. It can be used to compute a frequency list for a lexicon on a target corpus. (Text processing)
README
# Lexmatch
This is a simple lexicon matching tool that, given a lexicon of words or
phrases, identifies all matches in a given target text, returning their exact
positions. It can be used to compute a frequency list for a lexicon on a
target corpus.
The implementation uses suffix arrays or hash tables. The text must be
plain-text UTF-8. The former implementation (the default) is limited to texts
of 2^32 bytes (about 4GB); the latter (`--tokens`/`--cjk`) has no such limit.
The offsets in the output are UTF-8 *byte* positions.
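To give a rough idea of the suffix-array strategy, here is a minimal sketch (not the actual lexmatch implementation) that sorts all suffix start positions and binary-searches them for a pattern:
```
// Minimal sketch of suffix-array matching, not lexmatch's actual code.
// Sort all suffix start positions by the suffix they point at, then
// binary-search for the contiguous range of suffixes starting with the pattern.
fn find_all(text: &str, pattern: &str) -> Vec<usize> {
    // Naive construction; real suffix-array builders are much faster.
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by_key(|&i| &text.as_bytes()[i..]);
    // First suffix that is >= the pattern.
    let lo = sa.partition_point(|&i| &text.as_bytes()[i..] < pattern.as_bytes());
    let mut hits = Vec::new();
    for &start in &sa[lo..] {
        if text.as_bytes()[start..].starts_with(pattern.as_bytes()) {
            hits.push(start); // a UTF-8 *byte* offset, as lexmatch reports
        } else {
            break; // sorted order: no further suffix can match
        }
    }
    hits
}
```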
This tool only does exact (or case-insensitive) matching; if you need fuzzy
matching against lexicons, check out [analiticcl](https://github.com/proycon/analiticcl)
instead.
## Installation
You can build and install the latest stable release using Rust's package manager:
```
cargo install lexmatch
```
or if you want the development version after cloning this repository:
```
cargo install --path .
```
No cargo/rust on your system yet? Run ``sudo apt install cargo`` on Debian/Ubuntu-based systems, ``brew install rust`` on macOS, or use [rustup](https://rustup.rs/).
## Usage
See ``lexmatch --help``.
Simple example:
```
$ lexmatch --lexicon lexicon.lst corpus.txt
```
The lexicon must be plain-text UTF-8 containing one entry per line; an entry
need not be a single word and is not constrained in length. If the lexicon
consists of Tab-Separated Values (TSV), only the first column is considered
and the rest is ignored.
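As an informal illustration, loading such a lexicon amounts to taking the first tab-separated field of each non-empty line (a sketch, not lexmatch's actual code):
```
use std::fs;
use std::io;

// Sketch of loading a lexicon as described above: one entry per line,
// and only the first column of a TSV line is used.
fn load_lexicon(path: &str) -> io::Result<Vec<String>> {
    let data = fs::read_to_string(path)?;
    Ok(data
        .lines()
        .map(|line| line.split('\t').next().unwrap_or("").to_string())
        .filter(|entry| !entry.is_empty())
        .collect())
}
```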
Instead of a lexicon, you can also provide the patterns to query on the command line using ``--query``.
By default, you will get TSV output with a column for the matched text, one
for the occurrence count, and then one begin position (a UTF-8 byte position)
per match, so the number of columns varies per row:
```
$ lexmatch --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4 193 3307 3480 278
bad 3 201 3315 3488
```
Matching is case sensitive by default; add `--no-case` for case-insensitive
behaviour (all input and output will be lowercased, which may in rare cases
cause the UTF-8 offsets to no longer be valid for the original text).
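The offsets can become invalid because Unicode lowercasing is not always byte-length-preserving. A small demonstration in Rust:
```
// Why lowercasing can invalidate byte offsets: some characters grow
// when lowercased, shifting every offset after them.
fn main() {
    let original = "İstanbul"; // Turkish dotted capital I: 2 bytes in UTF-8
    let lowered = original.to_lowercase(); // 'İ' becomes "i" + U+0307: 3 bytes
    assert_eq!(original.len(), 9);
    assert_eq!(lowered.len(), 10); // offsets after the first character shift by one
}
```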
For verbose output, add ``--verbose``. This produces cleaner TSV (tab-separated
values) output that you can easily import into, for example, the [STAM
tools](https://github.com/annotation/stam-tools):
```
$ lexmatch --verbose --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 193 197
good 3307 3311
good 3480 3484
good 278 282
bad 201 204
bad 3315 3318
bad 3488 3491
```
You may provide multiple lexicons as well as multiple text files; in such
cases the output will also identify the lexicon and/or text file. If multiple
lexicons match, they are all returned (delimited by a semicolon). The order of
the results is arbitrary.
If you don't care about the exact positions but rather want to compute a
frequency list with the number of occurrences of each item in the lexicon or
passed through ``--query``, then pass ``--count-only``:
```
$ lexmatch --count-only --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4
bad 3
```
You can configure a minimum frequency threshold using ``--freq``.
Rather than matching all of the lexicon against the text, you can also iterate
over the tokens in the text and check whether they occur in the lexicon. This
uses a hash map instead of a suffix array and is typically faster. It is more
limited, however, and cannot be used with frequency thresholds or counting. It
always produces verbose output (similar to ``--verbose``):
```
$ lexmatch --tokens --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
good 193 197
bad 201 204
good 278 282
good 3307 3311
bad 3315 3318
good 3480 3484
bad 3488 3491
```
Unlike before, the matches are now returned in reading order.
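Conceptually, this mode is a single pass over the text with a hash-set lookup per token. A minimal sketch (treating alphanumeric runs as tokens; lexmatch's actual tokenization rules may differ):
```
use std::collections::HashSet;

// Sketch of the token-based strategy: walk the text once, treat runs of
// alphanumeric characters as tokens, and test each against a hash set.
// Matches naturally come out in reading order.
fn match_tokens<'a>(text: &'a str, lexicon: &HashSet<&str>) -> Vec<(&'a str, usize, usize)> {
    let mut matches = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            start.get_or_insert(i); // token begins here
        } else if let Some(s) = start.take() {
            if lexicon.contains(&text[s..i]) {
                matches.push((&text[s..i], s, i)); // begin/end UTF-8 byte offsets
            }
        }
    }
    if let Some(s) = start {
        if lexicon.contains(&text[s..]) {
            matches.push((&text[s..], s, text.len()));
        }
    }
    matches
}
```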
If you add `--coverage`, you will get an extra final line with some coverage
statistics. This is useful to see how much of the text is covered by your
lexicon.
```
#coverage (tokens) = 7/627 = 0.011164274322169059
```
Coverage can also be computed line-by-line while matching against multiple lexicons. We can also read directly from stdin rather than from a file by passing `-` as the filename:
```
$ echo -e "Is this good or bad?\nIt is quite good." | lexmatch --coverage-matrix --query good --query bad -
Reading text from -...
Line query
Is this good or bad? 0.4
It is quite good. 0.25
```
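The figures above are simply the fraction of tokens on a line that occur in the query set: "Is this good or bad?" has five tokens, two of which match, hence 0.4. A sketch of that computation (again assuming alphanumeric tokenization):
```
use std::collections::HashSet;

// Sketch of the per-line coverage figure: the fraction of a line's tokens
// that appear in the lexicon (or query set).
fn line_coverage(line: &str, lexicon: &HashSet<&str>) -> f64 {
    let tokens: Vec<&str> = line
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect();
    if tokens.is_empty() {
        return 0.0;
    }
    let hits = tokens.iter().filter(|t| lexicon.contains(**t)).count();
    hits as f64 / tokens.len() as f64
}
```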
This can be used as a simple lexicon-based method for language detection:
```
$ echo -e "Do you know what language this is?\nUnd was ist das hier genau?\nÇa va assez bien je crois" | lexmatch -i --coverage-matrix --lexicon ~/exp/en.lst --lexicon ~/exp/de.lst --lexicon ~/exp/fr.lst -
Reading lexicon...
Reading lexicon...
Reading lexicon...
Reading text from -...
Line /home/proycon/exp/en.lst /home/proycon/exp/de.lst /home/proycon/exp/fr.lst Total
do you know what language this is? 1 0 0.14285714285714285 1.1428571428571428
und was ist das hier genau? 0.16666666666666666 0.8333333333333334 0.16666666666666666 1.1666666666666667
ça va assez bien je crois 0.2 0.2 0.6 1
```
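Reusing the `line_coverage` sketch above, a hypothetical detector (not part of lexmatch) would simply pick the lexicon with the highest coverage for each line:
```
use std::collections::HashSet;

// Hypothetical helper built on the `line_coverage` sketch above:
// score a line against each named lexicon and return the best match.
fn detect_language<'a>(line: &str, lexicons: &[(&'a str, HashSet<&str>)]) -> Option<&'a str> {
    lexicons
        .iter()
        .map(|(name, lex)| (*name, line_coverage(line, lex)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap()) // coverage is never NaN
        .map(|(name, _)| name)
}
```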
When using ``--tokens`` (or `--coverage-matrix`), we rely on whitespace and
punctuation to delimit tokens. This does not work for languages such as
Chinese, Japanese, and Korean, which are not delimited in that way. For such
languages, similar linear search behaviour can be attained by passing ``--cjk``
instead, with an integer value representing the maximum character length to
explore. A greedy search will then be performed that favours longer patterns
over shorter ones.
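A sketch of such a greedy longest-match scan (with `max_len` standing in for the value passed to ``--cjk``; not the actual implementation):
```
use std::collections::HashSet;

// Sketch of greedy matching for unsegmented scripts: at each character
// position, try the longest candidate first (up to `max_len` characters)
// and jump past a match once one is found, favouring longer patterns.
fn greedy_match<'a>(text: &'a str, lexicon: &HashSet<&str>, max_len: usize) -> Vec<(&'a str, usize)> {
    // Byte offset of every character start, so candidates stay on char boundaries.
    let starts: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    let mut matches = Vec::new();
    let mut pos = 0; // index into `starts`
    while pos < starts.len() {
        let mut advanced = false;
        for len in (1..=max_len.min(starts.len() - pos)).rev() {
            let end = starts.get(pos + len).copied().unwrap_or(text.len());
            let candidate = &text[starts[pos]..end];
            if lexicon.contains(candidate) {
                matches.push((candidate, starts[pos])); // UTF-8 byte offset
                pos += len;
                advanced = true;
                break;
            }
        }
        if !advanced {
            pos += 1; // no match at this position, advance one character
        }
    }
    matches
}
```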