https://github.com/scut-dlvclab/megahan97k
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories
- Host: GitHub
- URL: https://github.com/scut-dlvclab/megahan97k
- Owner: SCUT-DLVCLab
- Created: 2024-05-27T11:40:20.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-07-19T08:26:07.000Z (10 months ago)
- Last Synced: 2024-07-19T16:54:39.887Z (10 months ago)
- Language: Python
- Homepage:
- Size: 38.3 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# MegaHan97K Dataset
Pattern Recognition (PR), 2025
* We introduce MegaHan97K, a mega-category, large-scale dataset covering 97,455 Chinese character categories, the largest number among existing datasets.
* With 97,455 categories, MegaHan97K has at least six times as many categories as existing datasets and the largest data volume.
* MegaHan97K is the first dataset to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
* MegaHan97K contains three distinct subsets: handwritten, historical, and synthetic. Each subset contains more character categories than existing datasets, giving it clear advantages in scale and diversity.
* MegaHan97K effectively mitigates the long-tail distribution problem by providing a balanced and sufficient number of samples for each category, supporting robust training and validation of Chinese character recognition (CCR) models.
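
The balance claim in the last point can be checked on any list of per-sample labels by counting samples per category. The following is a minimal sketch, not part of this repository: `category_balance` and `toy_labels` are hypothetical names, and it assumes each sample's label is available as a plain string (the label format used by `MegaHan_Dataloader.py` is not described here).

```python
from collections import Counter
from typing import Dict, Iterable


def category_balance(labels: Iterable[str]) -> Dict[str, float]:
    """Summarize how evenly samples are spread across character categories."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    sizes = sorted(counts.values())
    return {
        "categories": len(sizes),
        "min": sizes[0],
        "max": sizes[-1],
        "mean": sum(sizes) / len(sizes),
        # A ratio close to 1 means little long-tail effect.
        "imbalance_ratio": sizes[-1] / sizes[0],
    }


# Toy labels standing in for per-sample character labels (not real dataset data).
toy_labels = ["永", "永", "永", "和", "和", "龘"]
stats = category_balance(toy_labels)
print(stats)
```

An `imbalance_ratio` far above 1 would indicate that a few categories dominate, which is the long-tail pattern the dataset is designed to avoid.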
## 🔥 Download
| **Setting** | **Dataset** | **Status** |
|-------------------------|----------------|------------|
| **General CCR** | [Baiduyun:k4ch](https://pan.baidu.com/s/1LwIS-K812Q0LjBajpvQeVw?pwd=k4ch)/[OneDrive](https://1drv.ms/u/c/d3b0ec8fe3491f94/EYi4e5_dtLBMmFl9I669KjEBr2PqPWEd7VLxeIzHDlKhgg?e=YXrQEO) | Released |
| **Zero-Shot CCR** | [Baiduyun:bxde](https://pan.baidu.com/s/1tKhrIZK7zmpQq3NNCo5Edw?pwd=bxfe)/[OneDrive](https://1drv.ms/u/c/d3b0ec8fe3491f94/ETsFnx-i6sRJvrVrgnvO3h4BMugmO2TUObjD9ddz3xfEmw?e=IoUcXq) | Released |

## 🛠️ Usage
* Clone this repo:
```bash
git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git
```

* Run the following command to load example samples from the MegaHan97K dataset:
```bash
python MegaHan_Dataloader.py
```

**Note:**
- The MegaHan97K dataset may only be used for non-commercial research purposes. Scholars or organizations who want to use the MegaHan97K dataset should first fill in the [Application Form](./application-form/Application-Form-for-Using-MegaHan97K.docx) and sign the [Legal Commitment](./application-form/Legal-Commitment.docx), then email both to us. When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to show that you (or your team) conduct research in handwriting analysis and recognition, document image processing, or related fields.
- We will send you the decompression password after your application has been received and approved.
- All users must follow all conditions of use; otherwise, the authorization will be revoked.

* To access the entire dataset, first download it, update `data_root` in the `MegaHan_Dataloader.py` script, and then execute:
```bash
python MegaHan_Dataloader.py
```

## 🌄 Gallery
* **Illustration of the handwritten-original data in MegaHan97K**
* **Illustration of the handwritten-augmented data in MegaHan97K**
* **Illustration of the M5HisDoc data in MegaHan97K**
* **Illustration of the Kangxi dictionary data in MegaHan97K**
* **Illustration of the synthetic data in MegaHan97K**
## 💙 Acknowledgement
- [M5HisDoc](https://github.com/HCIILAB/M5HisDoc)
- [FontDiffuser](https://github.com/yeungchenwa/FontDiffuser)

## License
MegaHan97K should be used and distributed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License](https://creativecommons.org/licenses/by-nc-nd/4.0/) for non-commercial research purposes.

## Copyright
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin ([email protected]).
- Copyright 2025, [Deep Learning and Vision Computing Lab (DLVC-Lab)](http://www.dlvc-lab.net), South China University of Technology.