PyCantonese: Cantonese Linguistics and NLP in Python
====================================================
.. image:: https://jacksonllee.com/logos/pycantonese-logo.png
    :width: 250px

Full Documentation: https://pycantonese.org

|

.. image:: https://badge.fury.io/py/pycantonese.svg
    :target: https://pypi.python.org/pypi/pycantonese
    :alt: PyPI version

.. image:: https://img.shields.io/pypi/pyversions/pycantonese.svg
    :target: https://pypi.python.org/pypi/pycantonese
    :alt: Supported Python versions

.. image:: https://circleci.com/gh/jacksonllee/pycantonese.svg?style=shield
    :target: https://circleci.com/gh/jacksonllee/pycantonese
    :alt: CircleCI Builds

|

.. start-sphinx-website-index-page
PyCantonese is a Python library for Cantonese linguistics and natural language
processing (NLP). Currently implemented features (more to come!):

- Accessing and searching corpus data
- Parsing and conversion tools for Jyutping romanization
- Parsing Cantonese text
- Stop words
- Word segmentation
- Part-of-speech tagging
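
To give a feel for what parsing Jyutping romanization involves, here is a
minimal, standalone sketch that splits a Jyutping string into syllables and
each syllable into onset, nucleus, coda, and tone. It uses only the Python
standard library and is not PyCantonese's own implementation. The names
``Syllable`` and ``parse_syllables`` and the simplified sound inventory in the
regular expressions are assumptions made for this illustration.

.. code-block:: python

    import re
    from typing import List, NamedTuple


    class Syllable(NamedTuple):
        """One Jyutping syllable (hypothetical container for this sketch)."""

        onset: str
        nucleus: str
        coda: str
        tone: str


    # Regular syllables: optional onset, vowel nucleus, optional coda, tone 1-6.
    # The inventory below is a simplified assumption, not an authoritative list.
    _REGULAR = re.compile(
        r"(?P<onset>gw|kw|ng|[bpmfdtnlgkhwzcsj])?"
        r"(?P<nucleus>aa|oe|eo|yu|[aeiou])"
        r"(?P<coda>ng|[iumnptk])?"
        r"(?P<tone>[1-6])"
    )

    # Syllabic nasals such as "m4" or "ng5" have no vowel nucleus.
    _SYLLABIC_NASAL = re.compile(
        r"(?P<onset>h)?(?P<nucleus>ng|m)(?P<coda>)(?P<tone>[1-6])"
    )


    def parse_syllables(jyutping: str) -> List[Syllable]:
        """Split a Jyutping string into syllables and their components."""
        # Every syllable is a run of lowercase letters ending in a tone digit.
        chunks = re.findall(r"[a-z]+[1-6]", jyutping)
        if "".join(chunks) != jyutping:
            raise ValueError(f"not a well-formed Jyutping string: {jyutping!r}")

        syllables = []
        for chunk in chunks:
            match = _REGULAR.fullmatch(chunk) or _SYLLABIC_NASAL.fullmatch(chunk)
            if match is None:
                raise ValueError(f"unrecognized syllable: {chunk!r}")
            syllables.append(
                Syllable(
                    onset=match["onset"] or "",
                    nucleus=match["nucleus"],
                    coda=match["coda"] or "",
                    tone=match["tone"],
                )
            )
        return syllables


    if __name__ == "__main__":
        print(parse_syllables("jyut6ping3"))

Run as a script, this prints
``[Syllable(onset='j', nucleus='yu', coda='t', tone='6'), Syllable(onset='p', nucleus='i', coda='ng', tone='3')]``.
For real work, prefer the library's own Jyutping tools described in the full
documentation, which handle cases this sketch does not attempt to cover.
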
.. _download_install:
Download and Install
--------------------
To download and install the most recent stable version::

    $ pip install --upgrade pycantonese

Ready for more?
Check out the Quickstart page in the full documentation at https://pycantonese.org.
Consulting
----------
If your team would like professional assistance in using PyCantonese,
freelance consulting and training services are available for both academic and commercial groups.
Please email Jackson L. Lee.
Support
-------
If you have found PyCantonese useful and would like to offer support,
buying me a coffee would go a long way!
Links
-----
* Source code: https://github.com/jacksonllee/pycantonese
* Bug tracker: https://github.com/jacksonllee/pycantonese/issues
* Social media: Facebook and Twitter
How to Cite
-----------
PyCantonese is authored and maintained by Jackson L. Lee.

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022.
PyCantonese: Cantonese Linguistics and NLP in Python.
*Proceedings of the 13th Language Resources and Evaluation Conference*.

.. code-block:: latex

    @inproceedings{lee-etal-2022-pycantonese,
        title = "PyCantonese: Cantonese Linguistics and NLP in Python",
        author = "Lee, Jackson L. and
          Chen, Litong and
          Lam, Charles and
          Lau, Chaak Ming and
          Tsui, Tsz-Him",
        booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
        month = jun,
        year = "2022",
        publisher = "European Language Resources Association",
        language = "English",
    }

License
-------
MIT License. Please see ``LICENSE.txt`` in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from
its source in terms of format. The original dataset has a CC BY license.
Please see ``pycantonese/data/hkcancor/README.md``
in the GitHub source code for details.

The rime-cantonese data (release 2021.05.16) is
incorporated into PyCantonese for word segmentation and
characters-to-Jyutping conversion.
This data has a CC BY 4.0 license.
Please see ``pycantonese/data/rime_cantonese/README.md``
in the GitHub source code for details.

Logo
----
The PyCantonese logo is the Chinese character 粵 meaning Cantonese,
with artistic design by albino.snowman (Instagram handle).
Acknowledgments
---------------
Wonderful resources with a permissive license that have been incorporated into PyCantonese:
- HKCanCor
- rime-cantonese
Individuals who have contributed pull requests, bug reports, and other feedback
(in alphabetical order of last names):
- @cathug
- Francis Bond
- Jenny Chim
- Eric Dong
- @g-traveller
- @graphemecluster
- Rachel Han
- Ryan Lai
- @laubonghaudoi
- Katrina Li
- Kevin Li
- @ZhanruiLiang
- Hill Ma
- @richielo
- @rylanchiu
- Stephan Stiller
- Robin Yuen
.. end-sphinx-website-index-page
Changelog
---------
Please see ``CHANGELOG.md``.
Setting up a Development Environment
------------------------------------
The latest code under development is available on GitHub at
`jacksonllee/pycantonese <https://github.com/jacksonllee/pycantonese>`_.
To obtain this version for experimental features or for development:

.. code-block:: bash

    $ git clone https://github.com/jacksonllee/pycantonese.git
    $ cd pycantonese
    $ pip install -e ".[dev]"

To run the tests and style checks:

.. code-block:: bash

    $ pytest
    $ flake8 src tests
    $ black --check src tests

To build the documentation website files:

.. code-block:: bash

    $ python docs/source/build_docs.py