https://github.com/apertium/lexd

A lexicon compiler for non-suffixational morphologies
https://github.com/apertium/lexd

apertium-core apertium-tools

Last synced: 3 months ago
JSON representation

A lexicon compiler for non-suffixational morphologies

Host: GitHub
URL: https://github.com/apertium/lexd
Owner: apertium
License: gpl-3.0
Created: 2019-08-26T23:34:48.000Z (about 6 years ago)
Default Branch: main
Last Pushed: 2025-04-10T12:19:46.000Z (7 months ago)
Last Synced: 2025-07-01T21:34:53.737Z (4 months ago)
Topics: apertium-core, apertium-tools
Language: C++
Homepage:
Size: 541 KB
Stars: 13
Watchers: 6
Forks: 4
Open Issues: 4
Metadata Files:
- Readme: README
- Changelog: ChangeLog
- License: COPYING
- Authors: AUTHORS

Awesome Lists containing this project

README

          # Lexd

A lexicon compiler specialising in non-suffixational morphologies.

This module compiles lexicons in a format loosely based on hfst-lexc and produces transducers in ATT format which are equivalent to those produced using the overgenerate-and-constrain approach with hfst-twolc (see [here](https://wiki.apertium.org/wiki/Morphotactic_constraints_with_twol) and [here](https://wiki.apertium.org/wiki/Replacement_for_flag_diacritics)).  However, it is much faster (see below).

See [Usage.md](Usage.md) for the rule file syntax.

## Installation

First, clone this repository.

To build, do

```bash

./autogen.sh

make

make install

```

If installing to a system-wide path, you may want to run `sudo make install` instead for the last step.

To compile a lexicon file into a transducer, do

```bash

lexd lexicon_file att_file

```

To get a speed comparison, do

```bash

make timing-test

```

To run basic feature smoke-tests (fast), do

```bash

make check

```

## Why is it faster?

When dealing with prefixes, the overgenerate-and-constrain approach initially builds a transducer like this:

![transducer that overgenerates](https://github.com/apertium/lexd/raw/main/imgs/lexc.png)

Then composes that with a twolc rule to turn it into somehting like this:

![correct transducer](https://github.com/apertium/lexd/raw/main/imgs/twoc.png)

But compiling the rule needed to do that can take hundreds of times longer than compiling the lexicon.

Lexd, meanwhile, makes multiple copies of the lexical portion and attaches one to each prefix, thus generating the second transducer directly in a similar amount of time to what is required to generate the first one.

| Language |  Wamesa | Hebrew | Navajo | Lingala |

|---|---:|---:|---:|---:|

| Stems |  262 | 127 | 19 | 1470

| Total forms | 12576 | 2540 | 473 | 1649496

| Path restrictions | 14 | 10 | 17 | 19 

| **Lexc + Twolc**

|    Lexc compilation | 25ms | 15ms | 25ms | 230ms 

|    Twolc compilation | 10245ms | 1360ms | 8460ms | 275525ms

|    Rule composition | 2050ms | 225ms | 1705ms | 45550ms 

|    Minimization | 65ms | 5ms | 20ms | 155ms |

|    Total time | 12385ms | 1605ms | 10210ms | 321460ms |

| **Lexd**

|    Lexd compilation | 210ms | 85ms | 10ms | 490ms |

|     Format conversion | 30ms | 5ms | 5ms | 55ms |

|     Total time | 240ms | 90ms | 15ms | 545ms |

|   **Speedup** | 52x | 18x | 681x | 590x |

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/apertium/lexd

Awesome Lists containing this project

README