https://github.com/yongzhuo/csc_dataset_de3
dataset of csc, chinese spelling correct / check; 14万中文文本纠错数据集(中文拼写纠错), 主要是"地得的"
https://github.com/yongzhuo/csc_dataset_de3
chinese-spell-check chinese-spell-correction chinese-spelling-correction csc dataset macro-correct text-correct text-correction
Last synced: 4 months ago
JSON representation
dataset of csc, chinese spelling correct / check; 14万中文文本纠错数据集(中文拼写纠错), 主要是"地得的"
- Host: GitHub
- URL: https://github.com/yongzhuo/csc_dataset_de3
- Owner: yongzhuo
- License: apache-2.0
- Created: 2025-01-21T04:06:19.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-04-28T10:04:44.000Z (about 1 year ago)
- Last Synced: 2025-06-10T08:09:24.904Z (12 months ago)
- Topics: chinese-spell-check, chinese-spell-correction, chinese-spelling-correction, csc, dataset, macro-correct, text-correct, text-correction
- Homepage:
- Size: 924 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
---
license: apache-2.0
task_categories:
- text2text-generation
- text-generation
language:
- zh
tags:
- csc
- text-correct
- macro-correct
- chinese-spelling-correct
- corrector
---
# csc_public_de3数据集
## 数据来源
- 1.由人民日报/学习强国/chinese-poetry等高质量数据人工生成;
- 2.来自人民日报高质量语料;
- 3.来自学习强国网站的高质量语料;
- 4.源数据为qwen生成的好词好句;
- 5.古诗词chinese-poetry; 文言文garychowcmu/daizhigev20;
## 测评指标
```
模型[Macropodus/macbert4mdcspell_v1](https://huggingface.co/Macropodus/macbert4mdcspell_v1)
common Sentence Level detection: acc:0.9547, precision:1.0000, recall:0.9547, f1:0.9768
common Sentence Level correction: acc:0.9309, precision:1.0000, recall:0.9309, f1:0.9642
strict Sentence Level detection: acc:0.9547, precision:0.9629, recall:0.9547, f1:0.9588
strict Sentence Level correction: acc:0.9309, precision:0.9389, recall:0.9309, f1:0.9349
common Token Level correction: precision:0.8865, recall:0.9898, f1:0.9353
common TOken Level correction: precision:0.9797, recall:0.9828, f1:0.9812
```
## 数据简介
- 该数据主要为'的地得'纠错;
- 其中训练数据130753条, 验证数据5545条, 测试数据5545条;
- 句子平均长度为36, 最长句子长度为414, 最短为5, 95%的为89, 75%的为46, 60%的为34;
- 每个句子中字的平均错误数为2;
- 全量数据在HF(包括train.json):[https://huggingface.co/datasets/Macropodus/csc_public_de3](https://huggingface.co/datasets/Macropodus/csc_public_de3)
## 数据详情
```
################################################################################################################################
train.json
130753
avg_errors: 2.074751630937722
avg_length: 36.39140210932063
414
5
95%: 89
90%: 69
85%: 58
80%: 51
75%: 46
70%: 41
60%: 34
50%: 29
################################################################################################################################
dev.json
5545
avg_errors: 2.047069431920649
avg_length: 35.767177637511274
278
5
95%: 85
90%: 66
85%: 56
80%: 49
75%: 45
70%: 41
60%: 35
50%: 29
################################################################################################################################
tet.json
5545
avg_errors: 2.028494138863841
avg_length: 35.982326420198376
277
5
95%: 88
90%: 66
85%: 56
80%: 49
75%: 45
70%: 41
60%: 34
50%: 29
```
## Cite
For citing this dataset, you can refer to the present GitHub project. For example, with BibTeX:
```
@software{macro-correct,
url = {https://github.com/yongzhuo/macro-correct},
author = {Yongzhuo Mo},
title = {macro-correct},
year = {2025}
```