Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/howl-anderson/chinese_models_for_spacy
SpaCy 中文模型 | Models for SpaCy that support Chinese
chinese-nlp nlp nlp-dependency-parsing nlp-machine-learning spacy-models
- Host: GitHub
- URL: https://github.com/howl-anderson/chinese_models_for_spacy
- Owner: howl-anderson
- License: mit
- Created: 2018-05-02T11:05:19.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2025-01-04T19:30:37.000Z (29 days ago)
- Last Synced: 2025-01-26T05:01:48.260Z (7 days ago)
- Topics: chinese-nlp, nlp, nlp-dependency-parsing, nlp-machine-learning, spacy-models
- Language: Jupyter Notebook
- Homepage:
- Size: 709 KB
- Stars: 654
- Watchers: 31
- Forks: 110
- Open Issues: 10
Metadata Files:
- Readme: README.en-US.md
- License: LICENSE.md
README
[Chinese version of the README](README.zh-Hans.md)

------------------------------

# The official Chinese model for SpaCy is now available at https://spacy.io/models/zh. It was developed with reference to this project and shares the same features. The goal of this project, “promoting the development of the SpaCy Chinese model”, has been achieved, so this repository will enter maintenance mode. Future updates will focus only on bug fixes. We would like to thank all users for their long-term attention and support.
# Chinese models for SpaCy
SpaCy (version 2+) models for the Chinese language. These models are rough and still a **work in progress**, but "something is better than nothing".
## Online demo
An online Jupyter notebook demo is available on Binder: [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/howl-anderson/Chinese_models_for_SpaCy/master?filepath=notebooks%2Fdemo.ipynb).
### Features
Some attributes of a `Doc` object for `王小明在北京的清华大学读书`:
![attributes_of_doc](.images/attributes_of_doc.png)
### NER (**New!**)
NER of a `Doc` object for `王小明在北京的清华大学读书`:

![ner_of_doc](.images/ner_of_doc.png)
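The attributes and entities shown in the images above can also be read programmatically. The following is a minimal sketch, assuming the `zh_core_web_sm` model from this repository is already installed (see Getting Started) and can be loaded with `spacy.load`. The helper names `token_rows`, `entity_rows`, and `load_model` are hypothetical and not part of this project.

```python
def token_rows(doc):
    """Collect a few per-token attributes from a spaCy Doc."""
    return [(t.text, t.pos_, t.dep_, t.head.text) for t in doc]


def entity_rows(doc):
    """Collect the named entities recognized in a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def load_model(name="zh_core_web_sm"):
    """Return the loaded pipeline, or None if spaCy or the model is missing."""
    try:
        import spacy  # imported lazily so the helpers work without spaCy
        return spacy.load(name)
    except (ImportError, OSError):
        # spacy.load raises OSError when the model package is not installed.
        return None


if __name__ == "__main__":
    nlp = load_model()
    if nlp is None:
        print("zh_core_web_sm is not installed; see Getting Started.")
    else:
        doc = nlp("王小明在北京的清华大学读书")
        for row in token_rows(doc):
            print(row)
        for row in entity_rows(doc):
            print(row)
```

As the TODO list notes, some attributes (for example `pos_`) may not be reliable with these models, so treat the printed values as a quick inspection rather than a reference output.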
## Getting Started
Models are released as binary files. Users should have basic knowledge of SpaCy version 2+.
### Prerequisites
Python 3 (Python 2 may work, but it is currently not well tested)
### Installing
Download a released model from `releases`:
```
wget -c https://github.com/howl-anderson/Chinese_models_for_SpaCy/releases/download/v2.0.4/zh_core_web_sm-2.0.4.tar.gz
```

Then install the model:

```
pip install zh_core_web_sm-2.0.4.tar.gz
```

## Running demo code
`test.py` contains demo code. After installing the model, download or clone this repo and run:
```bash
python3 ./test.py
```

Then open a web browser at `http://127.0.0.1:5000`. You should see an image similar to this:
![Dependency of doc](.images/dependency_of_doc.png)
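For readers who want to produce a similar view from their own script, here is a minimal sketch. It assumes the visualization comes from spaCy's displaCy server, whose `displacy.serve` defaults to port 5000; the README does not state how `test.py` is implemented, so this is an illustration and not a copy of that file. The function name `serve_dependency_view` and the `--serve` flag are hypothetical.

```python
import sys


def serve_dependency_view(text, model_name="zh_core_web_sm", port=5000):
    """Parse `text` and serve its dependency tree with displaCy.

    The call blocks until the server is interrupted (Ctrl+C).
    """
    # Imported lazily so the function can be defined without spaCy installed.
    import spacy
    from spacy import displacy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    displacy.serve(doc, style="dep", port=port)


# Start the server only when explicitly requested, e.g.:
#   python3 serve_demo.py --serve
if __name__ == "__main__" and "--serve" in sys.argv:
    serve_dependency_view("王小明在北京的清华大学读书")
```

Passing a different `port` is useful when 5000 is already taken on the local machine.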
## How to reproduce the model
See [workflow](workflow.md)
## Corpus Data
The corpus data used in this project is OntoNotes 5.0. Because OntoNotes 5.0 is copyrighted material of the LDC ([Linguistic Data Consortium](https://www.ldc.upenn.edu/)), this project cannot include the data directly. The good news is that OntoNotes 5.0 is free for organizational users: you can set up an account for your company or school and then obtain OntoNotes 5.0 at no cost.
## TODO list
* Attribute `pos_` is not working correctly. This is related to the Language class in SpaCy.
* Attributes `shape_` and `is_alpha` seem meaningless for Chinese; this needs to be confirmed.
* Attribute `is_stop` is not working correctly. This is related to the Language class in SpaCy.
* Attribute `vector` does not seem to be well trained.
* Attribute `is_oov` is totally incorrect. First priority.
* The NER model is not available due to the lack of the LDC corpus. I am working on it.
* Release all the intermediate material to help users build their own models.

## Built With
* TODO
## Contributing
Please read [CONTRIBUTING.md](https://gist.github.com/PurpleBooth/b24679402957c63ec426) for details on our code of conduct, and the process for submitting pull requests to us.
## Versioning
We use [SemVer](http://semver.org/) for versioning. For the versions available, see the `tags` on this repository.
## Authors
* **Xiaoquan Kong** - *Initial work* - [howl-anderson](https://github.com/howl-anderson)
See also the list of `contributors` who participated in this project.
## License
This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.
## Acknowledgments
* TODO