- Host: GitHub
- URL: https://github.com/fmatter/cldflex
- Owner: fmatter
- License: apache-2.0
- Created: 2022-02-03T06:56:39.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-11-07T03:36:46.000Z (over 1 year ago)
- Last Synced: 2025-03-18T07:39:41.401Z (3 months ago)
- Topics: cldf, fieldworks, flex, linguistics
- Language: Python
- Homepage:
- Size: 453 KB
- Stars: 6
- Watchers: 2
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
# cldflex
Convert FLEx data to CLDF-ready CSV.

Many descriptive linguists have annotated language data in a FLEx ([SIL's FieldWorks Language Explorer](https://software.sil.org/fieldworks/)) database, which provides perhaps the most popular and accessible assisted segmentation and annotation workflow.
However, a reasonably complete data export is only available in XML, which is not human-friendly and is not readily converted to other formats.
A data format growing in popularity is the [CLDF standard](https://cldf.clld.org/), a table-based approach with human-readable datasets, designed to be used in [CLLD](https://clld.org/) apps and easily processable by any software that can read [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) files, including [R](https://www.r-project.org/), [pandas](https://pandas.pydata.org/) or spreadsheet applications.
The goal of ``cldflex`` is to convert lexicon and corpus data stored in FLEx to CSV tables, primarily for use in CLDF datasets.

## Installation
`cldflex` is available on [PyPI](https://pypi.org/project/cldflex):
```shell
pip install cldflex
```

## Command line usage
At the moment, there are three commands: ``cldflex corpus`` for `.flextext` files; ``cldflex dictionary`` and `cldflex wordlist` for `.lift` files.
All commands create a number of CSV files.
One can either use [cldfbench](https://github.com/cldf/cldfbench) to create one's own CLDF datasets from these files, or add the `--cldf` argument to create a simple CLDF dataset.
Project-specific [configuration](#configuration) can be passed with `--conf your/config.yaml`, or by creating a file named `cldflex.yaml`.

### `corpus`
Basic usage:
```shell
cldflex corpus texts.flextext
```

Connect the corpus with the lexicon:
```shell
cldflex corpus texts.flextext --lexicon lexicon.lift
```

Create a CLDF dataset:
```shell
cldflex corpus texts.flextext --lexicon lexicon.lift --cldf
```

### `dictionary`
Extract morphemes, morphs, and entries from `lexicon.lift`:
```shell
cldflex dictionary lexicon.lift
```

Create a CLDF dataset with a [`Dictionary`](https://github.com/cldf/cldf/tree/master/modules/Dictionary) module:
```shell
cldflex dictionary lexicon.lift --cldf
```

### `wordlist`
Create a CLDF dataset with a [`Wordlist`](https://github.com/cldf/cldf/tree/master/modules/Wordlist) module:
```shell
cldflex wordlist lexicon.lift --cldf
```

## API usage
The functions corresponding to the commands above are [`cldflex.corpus.convert()`](https://github.com/fmatter/cldflex/blob/4d9962ff53baab68a20ecce34f8623e87f7197ec/src/cldflex/corpus.py#L445) and [`cldflex.lift2csv.convert()`](https://github.com/fmatter/cldflex/blob/4d9962ff53baab68a20ecce34f8623e87f7197ec/src/cldflex/lift2csv.py#L130).

## Configuration
There is no default configuration.
Rather, `cldflex` will guess values for most of the parameters below and tell you what it's doing.
It is suggested to start out configuration-free until something goes wrong or you want to change something.
For CLI usage, create a [YAML](https://yaml.org/) file; for API usage, pass a dict to the `convert` functions.

* `obj_lg`: the object language
* `gloss_lg`: the language used for glossing / translation
* `msa_lg`: the language used for storing POS information
* `lang_id`: the language ID to be used in the created tables
* `glottocode`: used to look up language metadata from Glottolog
* `csv_cell_separator`: if there are multiple values in a cell (allomorphs, polysemy...), they are by default separated by `"; "`
* `form_slices`: set to `false` if you don't want form slices connecting morphs and word forms
* `mappings`: a dictionary specifying name changes of columns in the created CSV files
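
As a rough illustration of the parameters above, the following sketch builds a configuration dict of the kind that could be passed to the `convert` functions, and uses its `csv_cell_separator` to split multi-valued cells when reading a CSV table back in. All values are made-up placeholders, the column names `ID` and `Form` in the sample table are assumptions rather than documented `cldflex` output, and the direction of the `mappings` entry is a guess; the exact `convert` signature should be checked in the source.

```python
import csv
import io

# Keys are the parameters listed above; every value is a hypothetical placeholder.
config = {
    "obj_lg": "xyz",           # placeholder code for the object language
    "gloss_lg": "en",          # language used for glossing / translation
    "msa_lg": "en",            # language used for storing POS information
    "lang_id": "mylang",       # placeholder ID for the created tables
    "glottocode": "abcd1234",  # placeholder, not a real Glottocode
    "csv_cell_separator": "; ",
    "form_slices": False,
    "mappings": {"Form": "Orthographic"},  # hypothetical column rename
}


def split_cell(cell, separator=config["csv_cell_separator"]):
    """Split a multi-valued CSV cell into its individual values."""
    return [value for value in cell.split(separator) if value]


# In-memory stand-in for a CSV file; the columns are assumed for illustration.
sample = io.StringIO("ID,Form\nm1,ta; da\nm2,ka\n")
rows = list(csv.DictReader(sample))
forms = {row["ID"]: split_cell(row["Form"]) for row in rows}
print(forms)  # {'m1': ['ta', 'da'], 'm2': ['ka']}
```

Keeping the separator in one place means that changing `csv_cell_separator` in the configuration only requires the same change in whatever code reads the tables.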