https://github.com/ermlab/polish-gec-datasets

Polish datsets for grammatical error correction
https://github.com/ermlab/polish-gec-datasets

Last synced: 2 months ago
JSON representation

Polish datsets for grammatical error correction

Host: GitHub
URL: https://github.com/ermlab/polish-gec-datasets
Owner: Ermlab
License: apache-2.0
Created: 2023-10-07T06:58:45.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-10-13T05:54:52.000Z (over 2 years ago)
Last Synced: 2024-03-15T11:21:33.259Z (over 2 years ago)
Size: 6.52 MB
Stars: 10
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Polish GEC test datasets

Polish test datsets for grammatical error correction. 

## Dataset statistics

### Humman annotated test sets

| dataset nam                              | lines | errors fleks | errors orth | errors punct | errors syntax | errors lex |

|------------------------------------------|-------|--------------|-------------|--------------|---------------|------------|

| human_expert_gec_dataset.jsonl           | 1804  | 299          | 599         | 300          | 300           | 299        |

| human_annotators_common_errors_10k.jsonl | 10000 | 2442         | 4925        | 2423         | 2459          |            |

|                                          |       |              |             |              |               |            |

### Synthetic rule-based test sets

| dataset name                           | lines | errors fleks | errors orth | errors punct | errors syntax | errors lex |

|----------------------------------------|-------|--------------|-------------|--------------|---------------|------------|

| synthetic_syntax_10k.jsonl             | 10000 |              |             |              | 10336         |            |

| synthetic_nie_bez_jednostkowe_10k.json | 10000 |              | 10831       |              |               |            |

| synthetic_simple_fleks_10k.json        | 10000 | 10000        |             |              |               |            |

| synthetic_ fleks_morf_regex_10K.jsonl  | 10000 | 15158        |             |              |               |            |

| synthetic_orth_phonteic_10k.jsonl      | 10000 |              | 10000       |              |               |            |

# Acknowledge

This work was supported by NCBiR grant number POIR.01.01.01-00-1128/18-00 

“Technologia kontekstowego rozumienia języka pisanego na potrzeby poprawy błędów oraz automatycznej oceny zrozumiałości tekstu.”

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ermlab/polish-gec-datasets

Awesome Lists containing this project

README