https://github.com/gpizzorno/conllu_tools

A Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora.
https://github.com/gpizzorno/conllu_tools

brat conllu conllu-evaluation conllu-validation latin natural-language-processing nlp tag-conversion tag-normalization text-annotation ud universal-dependencies

Last synced: 2 months ago
JSON representation

A Python toolkit for working with CoNLL-U files, Universal Dependencies treebanks, and annotated corpora.

Host: GitHub
URL: https://github.com/gpizzorno/conllu_tools
Owner: gpizzorno
License: mit
Created: 2025-06-02T18:16:48.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-11-29T17:04:34.000Z (7 months ago)
Last Synced: 2026-01-06T13:43:55.989Z (5 months ago)
Topics: brat, conllu, conllu-evaluation, conllu-validation, latin, natural-language-processing, nlp, tag-conversion, tag-normalization, text-annotation, ud, universal-dependencies
Language: Python
Homepage: https://gpizzorno.github.io/conllu_tools/
Size: 6.27 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Latin NLP Utilities

[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

[![Python](https://img.shields.io/badge/Language-Python-blue.svg)](https://www.python.org)

[![Tests](https://github.com/gpizzorno/latin-nlp-utilities/actions/workflows/tests.yml/badge.svg)](https://github.com/gpizzorno/latin-nlp-utilities/actions/workflows/tests.yml)

[![Documentation](https://img.shields.io/badge/Docs-latest-blue.svg)](https://gpizzorno.github.io/latin-nlp-utilities/)

**Latin NLP Utilities** is a set of convenience tools for working with Latin treebanks and annotated corpora. It provides converters, evaluation scripts, validation tools, and utilities for transforming, validating, and comparing Latin linguistic data in [CoNLL-U](https://universaldependencies.org/format.html) and [brat](https://brat.nlplab.org) standoff formats.

[Read the documentation](https://gpizzorno.github.io/latin-nlp-utilities/)

## Features

- **brat/CoNLL-U Interoperability**: Convert between brat [standoff](https://brat.nlplab.org/standoff.html) and [CoNLL-U](https://universaldependencies.org/format.html)

- **Morphological Feature Utilities**: Normalize and map features across tagsets ([Perseus](https://universaldependencies.org/treebanks/la_perseus/index.html), [ITTB](https://universaldependencies.org/treebanks/la_ittb/index.html), [PROIEL](https://universaldependencies.org/treebanks/la_proiel/index.html), [DALME](https://dalme.org))

- **Validation**: Check CoNLL-U files for format and annotation guideline compliance

- **Evaluation**: Score system outputs against gold-standard CoNLL-U files, including enhanced dependencies

- **Extensible**: Easily add new tagset converters or feature mappings

For detailed information about each feature, see the [User Guide](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/index.html).

## Installation

### Quick Install

```sh

pip install latin-nlp-utilities

```

For detailed installation instructions, including platform-specific guidance and troubleshooting, see the [Installation Guide](https://gpizzorno.github.io/latin-nlp-utilities/installation.html).

## Quick Start

### Convert CoNLL-U to brat

```python

from nlp_utilities.brat import conllu_to_brat

conllu_to_brat(

    conllu_filename='path/to/conllu/yourfile.conllu',

    output_directory='path/to/brat/files',

    sents_per_doc=10,

    output_root=True,

)

# Outputs .ann and .txt files to 'path/to/brat/files', alongside

# annotation.conf, tools.conf, visual.conf, and metadata.json

```

### Convert brat to CoNLL-U

```python

from nlp_utilities.brat import brat_to_conllu

from nlp_utilities.loaders import load_language_data

feature_set = load_language_data('feats', language='la')

brat_to_conllu(

    input_directory='path/to/brat/files',

    output_directory='path/to/conllu',

    ref_conllu='yourfile.conllu',

    feature_set=feature_set,

    output_root=True

)

# Outputs yourfile-from_brat.conllu to 'path/to/conllu'

```

### Validate CoNLL-U Files

```python

from nlp_utilities.conllu import ConlluValidator

validator = ConlluValidator(lang='la', level=2)

reporter = validator.validate_file('path/to/yourfile.conllu')

# Print error count

print(f'Errors found: {reporter.get_error_count()}')

# Inspect first error

sent_id, order, testlevel, error = reporter.errors[0]

print(f'Sentence ID: {sent_id}')  # e.g. 34

print(f'Testing at level: {sent_id}')  # e.g. 2

print(f'Error test level: {error.testlevel}')  # e.g. 1

print(f'Error type: {error.error_type}')  # e.g. "Metadata"

print(f'Test ID: {error.testid}')  # e.g. "text-mismatch"

print(f'Error message: {error.msg}')  # Full error message (see below)

# Print all errors formatted as strings

for error in reporter.format_errors():

    print(error)

# Example output:

# Sentence 34:

# [L2 Metadata text-mismatch] The text attribute does not match the text 

# implied by the FORM and SpaceAfter=No values. Expected: 'Una scala....' 

# Reconstructed: 'Una scala ....' (first diff at position 9)

```

### Evaluate CoNLL-U Files

```python

from nlp_utilities.conllu import ConlluEvaluator

evaluator = ConlluEvaluator(eval_deprels=True, treebank_type='0')

scores = evaluator.evaluate_files(

    gold_path='path/to/gold_standard.conllu',

    system_path='path/to/system_output.conllu',

)

print(f'UAS: {scores["UAS"].f1:.2%}')

print(f'LAS: {scores["LAS"].f1:.2%}')

# Example output:

# UAS: 64.82%

# LAS: 48.16%

```

### Convert Between Tagsets

```python

from nlp_utilities.converters.upos import dalme_to_upos, upos_to_perseus

from nlp_utilities.converters.xpos import ittb_to_perseus, llct_to_perseus

from nlp_utilities.converters.features import feature_string_to_dict, feature_dict_to_string

print(dalme_to_upos('adjective'))

# Returns 'ADJ'

print(upos_to_perseus('NOUN'))

# Returns 'n'

print(ittb_to_perseus('VERB', 'gen4|tem1|mod1'))  

# Returns 'v1sp-----'

print(llct_to_perseus('VERB', 'v|v|3|s|p|i|a|-|-|-', 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|Voice=Act'))

# Returns 'v3spia---'

feat_dict = feature_string_to_dict('Case=Nom|Gender=Neut|Number=Sing')

# Returns a dictionary: 

{'Case': 'Nom', 'Gender': 'Neut', 'Number': 'Sing'}

print(feature_dict_to_string(feat_dict)) 

# Returns 'Case=Nom|Gender=Neut|Number=Sing'

```

### Normalize Morphology

```python

from nlp_utilities.loaders import load_language_data

from nlp_utilities.normalizers import normalize_morphology

feature_set = load_language_data('feats', language='la')

# Normalize morphology with feature reconciliation

# VerbForm is missing from feats but present in ref_feats

xpos, feats = normalize_morphology(

    upos='VERB',

    xpos='v-s-ga-g-',

    feats='Aspect=Perf|Case=Gen|Degree=Pos|Number=Sing|Voice=Act',

    feature_set=feature_set,

    ref_features='Aspect=Perf|Case=Gen|Degree=Pos|Number=Sing|VerbForm=Ger|Voice=Act'

)

print(xpos)

# Returns 'v-stga-g-' (normalized and validated)

print(feats)

# Returns {'Aspect': 'Perf', 'Case': 'Gen', 'Degree': 'Pos', 'Number': 'Sing', 'VerbForm': 'Ger', 'Voice': 'Act'}

```

For more examples and detailed usage, see the [Quickstart Guide](https://gpizzorno.github.io/latin-nlp-utilities/quickstart.html).

## Documentation

The full documentation includes:

- **[Installation Guide](https://gpizzorno.github.io/latin-nlp-utilities/installation.html)**: Detailed installation instructions and troubleshooting

- **[Quickstart Guide](https://gpizzorno.github.io/latin-nlp-utilities/quickstart.html)**: Get started quickly with common tasks

- **[User Guide](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/index.html)**: Comprehensive guides for all features

  - [brat Conversion](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/brat_conversion.html): CoNLL-U ↔ brat conversion

  - [Validation](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/validation.html): Validation framework and recipes

  - [Evaluation](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/evaluation.html): Metrics and evaluation workflows

  - [Converters](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/converters.html): Tagset conversions

  - [Normalization](https://gpizzorno.github.io/latin-nlp-utilities/user_guide/normalization.html): Feature normalization

- **[API Reference](https://gpizzorno.github.io/latin-nlp-utilities/api_reference/index.html)**: Complete API documentation

- **[Developer Guide](https://gpizzorno.github.io/latin-nlp-utilities/developer_guide/index.html)**: Architecture and testing guides for contributors

## Acknowledgments

This toolkit builds upon and extends code from several sources:

- CoNLL-U/brat conversion logic is based on the [tools](https://github.com/nlplab/brat/tree/master/tools) made available by the [brat team](https://brat.nlplab.org/about.html).

- CoNLL-U evaluation is based on the work of Milan Straka and Martin Popel for the [CoNLL 2018 UD shared task](https://universaldependencies.org/conll18/), and Gosse Bouma for the [IWPT 2020 shared task](https://universaldependencies.org/iwpt20/task_and_evaluation.html).

- CoNLL-U validation is based on [work](https://github.com/UniversalDependencies/tools/blob/b3925718ba7205976d80eda7628687218474b541/validate.py) by Filip Ginter and Sampo Pyysalo.

## License

The project is licensed under the [MIT License](LICENSE), allowing free use, modification, and distribution.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/gpizzorno/conllu_tools

Awesome Lists containing this project

README