https://github.com/sajjad-mazaheri/lexprep

Linguistic Data Preparation Toolkit for Wordlists
https://github.com/sajjad-mazaheri/lexprep

data-preparation g2p linguistics nlp psycholinguistics spacy stanza wordlist

Last synced: 4 months ago
JSON representation

Linguistic Data Preparation Toolkit for Wordlists

Host: GitHub
URL: https://github.com/sajjad-mazaheri/lexprep
Owner: sajjad-mazaheri
License: mit
Created: 2026-01-31T19:14:40.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-02-17T14:38:59.000Z (4 months ago)
Last Synced: 2026-02-17T19:36:47.451Z (4 months ago)
Topics: data-preparation, g2p, linguistics, nlp, psycholinguistics, spacy, stanza, wordlist
Language: Python
Homepage: https://lexprep.net/
Size: 1.87 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          


  
lexprep

  

    Linguistic Data Preparation Toolkit for Wordlists

  

  

    G2P - Syllables - POS Tagging - Sampling


    Persian - English - Japanese

  




  

    

  

  

  



---

## What is lexprep?

**lexprep** is a  toolkit designed for **linguistic data preparation**. It processes **wordlists**  and provides:

- **G2P Transcription**: Convert words to phonetic representations

- **Syllable Counting**: Count syllables using orthographic or phonetic methods

- **POS Tagging**: Assign part-of-speech tags to words

- **Sampling**: Stratified sampling and multi-file shuffling

### Key Difference from Other NLP Tools

Most NLP libraries (spaCy, Stanza, etc.) are designed for **text processing** - analyzing sentences and documents.

**lexprep** is designed for **wordlist processing** - working with isolated words in spreadsheets:

| Tool | Input | Use Case |

|------|-------|----------|

| spaCy, Stanza | Running text/sentences | Document analysis, NER, parsing |

| **lexprep** | Wordlists (CSV/XLSX/TXT) | stimulus preparation |

## Supported Languages & Tools

| Language | G2P | Syllables | POS |

|----------|-----|-----------|-----|

| Persian | PersianG2p | Heuristic | Stanza |

| English | g2p-en | pyphen | spaCy |

| Japanese | - | - | Stanza / UniDic |

---

## Installation

```bash

pip install .              # Core only

pip install ".[fa]"        # + Persian

pip install ".[en]"        # + English

pip install ".[ja]"        # + Japanese

pip install ".[fa,en,ja]"  # All languages

```

---

## Quick Start

### Persian

```bash

# G2P transcription

lexprep fa g2p words.xlsx output.xlsx -c word

# Syllable counting (with automatic G2P)

lexprep fa syllables words.xlsx output.xlsx -c word --with-g2p

# POS tagging

lexprep fa pos words.xlsx output.xlsx -c word

```

### English

```bash

# G2P (ARPAbet phonemes)

lexprep en g2p words.xlsx output.xlsx -c word

# Syllable counting

lexprep en syllables words.xlsx output.xlsx -c word

# POS tagging

lexprep en pos words.xlsx output.xlsx -c word

```

### Japanese

```bash

# POS with UniDic (detailed tags)

lexprep ja pos words.xlsx output.xlsx -c word --method unidic

# POS with Stanza (universal tags)

lexprep ja pos words.xlsx output.xlsx -c word --method stanza

```

**Which Japanese method should I use?**

- Use **Stanza** when you want **Universal POS (English tags)** for cross-lingual comparison or UD-style annotation.

- Use **UniDic** when you want **detailed Japanese tags** (pos1..pos4) for linguistic analysis. lexprep also returns `pos_en` (English gloss) for UniDic output automatically via `pos_map`.

### Sampling

```bash

# Stratified sampling

lexprep sample stratified data.xlsx output.xlsx --score-col frequency --n-total 100 --bins 3

# Shuffle multiple files

lexprep sample shuffle-rows file1.xlsx file2.xlsx output_dir/

```

---

## File Formats

| Format | Support |

|--------|---------|

| Excel (.xlsx) | Supported |

| CSV (.csv) | Supported |

| TSV (.tsv) | Supported |

| Plain text (.txt) | Supported (one word per line) |

Use `-c` or `--column-name` to specify the word column.

---

## Web Interface

lexprep includes a **web-based UI** for processing wordlists through your browser:

Features:

- Upload files (Excel, CSV, TSV, TXT)

- Select language and tool

- Process wordlists with all available tools

- Download results automatically

- Fast processing with model caching

---

## Documentation

- [METHODS.md](METHODS.md) - How each algorithm works

- [ACCURACY.md](ACCURACY.md) - Benchmark results and evaluation

- [QUICKSTART.md](QUICKSTART.md) - Quick start guide

- [CONTRIBUTING.md](CONTRIBUTING.md) - Contribution guidelines

---

## Notes

- English POS requires: `python -m spacy download en_core_web_sm`

- Persian and Japanese Stanza models download automatically on first use

- Persian syllables use heuristic methods (see [ACCURACY.md](ACCURACY.md))

---

## Underlying Libraries

lexprep integrates these excellent open-source libraries:

### English

- **[g2p-en](https://github.com/Kyubyong/g2p)** - G2P using CMU Dictionary + neural model (Park & Kim, 2019) - MIT License

- **[pyphen](https://github.com/Kozea/Pyphen)** - TeX hyphenation patterns - LGPL/GPL/MPL

- **[spaCy](https://spacy.io/)** - Industrial-strength NLP (Explosion AI) - MIT License

### Persian

- **[PersianG2p](https://github.com/PasaOpasen/PersianG2P)** - Persian G2P (Demetry Pascal, forked from AzamRabiee) - MIT License

- **[Stanza](https://stanfordnlp.github.io/stanza/)** - Stanford NLP toolkit for POS tagging (Qi et al., 2020) - Apache 2.0

### Japanese

- **[Stanza](https://stanfordnlp.github.io/stanza/)** - Stanford NLP toolkit (Qi et al., 2020) - Apache 2.0

- **[Fugashi](https://github.com/polm/fugashi)** - MeCab wrapper - MIT License

- **[UniDic](https://clrd.ninjal.ac.jp/unidic/)** - Japanese dictionary (NINJAL) - GPL/LGPL/BSD

All libraries are used as dependencies without modification. See their respective licenses for details.

---

## License

MIT License - See [LICENSE](LICENSE) for details.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sajjad-mazaheri/lexprep

Awesome Lists containing this project

README

lexprep