https://github.com/jacksonllee/pycantonese
Cantonese Linguistics and NLP
- Host: GitHub
- URL: https://github.com/jacksonllee/pycantonese
- Owner: jacksonllee
- License: mit
- Created: 2014-12-13T20:40:56.000Z (almost 10 years ago)
- Default Branch: main
- Last Pushed: 2024-05-23T03:21:20.000Z (6 months ago)
- Last Synced: 2024-05-23T03:31:34.100Z (6 months ago)
- Topics: cantonese, computational-linguistics, jyutping, linguistics, natural-language-processing, nlp, part-of-speech-tagging, pycantonese, python, stop-words, word-segmentation
- Language: Python
- Homepage: https://pycantonese.org
- Size: 15.1 MB
- Stars: 335
- Watchers: 21
- Forks: 38
- Open Issues: 9
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
PyCantonese: Cantonese Linguistics and NLP in Python
====================================================

.. image:: https://jacksonllee.com/logos/pycantonese-logo.png
   :width: 250px

Full Documentation: https://pycantonese.org

|

.. image:: https://badge.fury.io/py/pycantonese.svg
   :target: https://pypi.python.org/pypi/pycantonese
   :alt: PyPI version

.. image:: https://img.shields.io/pypi/pyversions/pycantonese.svg
   :target: https://pypi.python.org/pypi/pycantonese
   :alt: Supported Python versions

.. image:: https://circleci.com/gh/jacksonllee/pycantonese.svg?style=shield
   :target: https://circleci.com/gh/jacksonllee/pycantonese
   :alt: CircleCI Builds

|

.. start-sphinx-website-index-page

PyCantonese is a Python library for Cantonese linguistics and natural language
processing (NLP). Currently implemented features (more to come!):

- Accessing and searching corpus data
- Parsing and conversion tools for Jyutping romanization
- Parsing Cantonese text
- Stop words
- Word segmentation
- Part-of-speech tagging

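To make the Jyutping item in the list above concrete, here is a minimal sketch of what parsing a romanized syllable involves: splitting it into onset, nucleus, coda, and tone. It uses only the Python standard library and is an illustration, not PyCantonese's own implementation or API. The names ``JyutpingSyllable`` and ``parse_syllable`` are hypothetical, and the pattern is deliberately permissive: it checks the shape of a syllable, not whether that combination is attested in Cantonese.

```python
import re
from typing import NamedTuple


class JyutpingSyllable(NamedTuple):
    """Hypothetical container for the parts of one Jyutping syllable."""

    onset: str
    nucleus: str
    coda: str
    tone: str


# Optional onset, vowel nucleus, optional coda, tone digit 1-6.
# Longer alternatives come first so that "ng", "gw", "kw", and "aa"
# are preferred over their single-letter prefixes.
_SYLLABLE = re.compile(
    r"(?P<onset>ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou])"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)

# Syllabic nasals ("m4", "ng5") have no vowel; the nasal is the nucleus.
_SYLLABIC_NASAL = re.compile(r"(?P<onset>h)?(?P<nucleus>ng|m)(?P<tone>[1-6])")


def parse_syllable(syllable: str) -> JyutpingSyllable:
    """Split one Jyutping syllable such as "gwong2" into its parts."""
    match = _SYLLABLE.fullmatch(syllable)
    if match is None:
        match = _SYLLABIC_NASAL.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a well-formed Jyutping syllable: {syllable!r}")
    # Groups that did not participate in the match become empty strings.
    parts = match.groupdict(default="")
    return JyutpingSyllable(
        onset=parts["onset"],
        nucleus=parts["nucleus"],
        coda=parts.get("coda", ""),
        tone=parts["tone"],
    )


print(parse_syllable("gwong2"))
# JyutpingSyllable(onset='gw', nucleus='o', coda='ng', tone='2')
print(parse_syllable("ng5"))
# JyutpingSyllable(onset='', nucleus='ng', coda='', tone='5')
```

Keeping the tone as a string rather than an integer is a choice made for this sketch so that all four fields share one type.
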
.. _download_install:

Download and Install
--------------------

To download and install the stable, most recent version::

    $ pip install --upgrade pycantonese

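After running the install command, one way to confirm the result from Python is to ask the packaging metadata for the installed version. The following is a minimal sketch using only the standard library (Python 3.8 or later); ``installed_version`` is a hypothetical helper name, not part of PyCantonese.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "pycantonese") -> Optional[str]:
    """Return the installed version of ``package``, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pycantonese is not installed in this environment")
else:
    print(f"pycantonese {found} is installed")
```

Reading the metadata avoids importing the package itself, so the check also works in an environment where the import would fail for an unrelated reason.
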
Ready for more?
Check out the Quickstart page in the full documentation at https://pycantonese.org.

Consulting
----------

If your team would like professional assistance in using PyCantonese,
freelance consulting and training services are available for both academic and commercial groups.
Please email Jackson L. Lee.

Support
-------

If you have found PyCantonese useful and would like to offer support,
buying me a coffee would go a long way!

Links
-----

* Source code: https://github.com/jacksonllee/pycantonese
* Bug tracker: https://github.com/jacksonllee/pycantonese/issues
* Social media: Facebook and Twitter

How to Cite
-----------

PyCantonese is authored and maintained by Jackson L. Lee.

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022.
PyCantonese: Cantonese Linguistics and NLP in Python.
*Proceedings of the 13th Language Resources and Evaluation Conference*.

.. code-block:: latex

    @inproceedings{lee-etal-2022-pycantonese,
        title = "PyCantonese: Cantonese Linguistics and NLP in Python",
        author = "Lee, Jackson L. and
          Chen, Litong and
          Lam, Charles and
          Lau, Chaak Ming and
          Tsui, Tsz-Him",
        booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
        month = june,
        year = "2022",
        publisher = "European Language Resources Association",
        language = "English",
    }

License
-------

MIT License. Please see ``LICENSE.txt`` in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from
its source in terms of format. The original dataset has a CC BY license.
Please see ``pycantonese/data/hkcancor/README.md``
in the GitHub source code for details.

The rime-cantonese data (release 2021.05.16) is
incorporated into PyCantonese for word segmentation and
characters-to-Jyutping conversion.
This data has a CC BY 4.0 license.
Please see ``pycantonese/data/rime_cantonese/README.md``
in the GitHub source code for details.

Logo
----

The PyCantonese logo is the Chinese character 粵 meaning Cantonese,
with artistic design by albino.snowman (Instagram handle).

Acknowledgments
---------------

Wonderful resources with a permissive license that have been incorporated into PyCantonese:

- HKCanCor
- rime-cantonese

Individuals who have contributed pull requests, bug reports, and other feedback
(in alphabetical order of last names):

- @cathug
- Francis Bond
- Jenny Chim
- Eric Dong
- @g-traveller
- @graphemecluster
- Rachel Han
- Ryan Lai
- @laubonghaudoi
- Katrina Li
- Kevin Li
- @ZhanruiLiang
- Hill Ma
- @richielo
- @rylanchiu
- Stephan Stiller
- Robin Yuen

.. end-sphinx-website-index-page

Changelog
---------

Please see ``CHANGELOG.md``.

Setting up a Development Environment
------------------------------------

The latest code under development is available on GitHub at
`jacksonllee/pycantonese <https://github.com/jacksonllee/pycantonese>`_.
To obtain this version for experimental features or for development:

.. code-block:: bash

    $ git clone https://github.com/jacksonllee/pycantonese.git
    $ cd pycantonese
    $ pip install -e ".[dev]"

To run tests and styling checks:

.. code-block:: bash

    $ pytest
    $ flake8 src tests
    $ black --check src tests

To build the documentation website files:

.. code-block:: bash

    $ python docs/source/build_docs.py