{"id":25279875,"url":"https://github.com/canclid/cantonesedetect","last_synced_at":"2025-10-27T16:31:05.764Z","repository":{"id":225765709,"uuid":"766748671","full_name":"CanCLID/cantonesedetect","owner":"CanCLID","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-05T19:59:53.000Z","size":71,"stargazers_count":5,"open_issues_count":0,"forks_count":3,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-12-05T20:32:06.293Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.org/project/cantonesedetect/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CanCLID.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-04T03:40:16.000Z","updated_at":"2024-12-05T19:58:56.000Z","dependencies_parsed_at":"2024-03-04T07:40:37.031Z","dependency_job_id":"5074ac46-cbea-4e12-bb7d-d46b1bb39b03","html_url":"https://github.com/CanCLID/cantonesedetect","commit_stats":null,"previous_names":["canclid/feature-detector"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanCLID%2Fcantonesedetect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanCLID%2Fcantonesedetect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanCLID%2Fcantonesedetect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanCLID%2Fcantonesedetect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CanCLID","down
load_url":"https://codeload.github.com/CanCLID/cantonesedetect/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238523990,"owners_count":19486601,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-12T18:04:13.686Z","updated_at":"2025-10-27T16:31:00.405Z","avatar_url":"https://github.com/CanCLID.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CantoneseDetect 粵語特徵分類器\n\n[![license](https://img.shields.io/github/license/DAVFoundation/captain-n3m0.svg?style=for-the-badge\u0026color=)](https://github.com/DAVFoundation/captain-n3m0/blob/master/LICENSE)\n\n本項目為 [canto-filter](https://github.com/CanCLID/canto-filter) 之後續。canto-filter 得 4 個分類標籤且判斷邏輯更加快速簡單，適合在線快速篩選判別文本或者其他要求低延遲、速度快嘅應用場合。本項目採用更精細嘅判斷邏輯，有 6 個分類標籤，準確度更高，但速度亦會相對 canto-filter 更慢。\n\nThis is an extension of the [canto-filter](https://github.com/CanCLID/canto-filter) project. canto-filter has only 4 output labels; its classification logic is simpler and faster, which makes it better suited to online filtering and other use cases that require low latency or high classification speed. This package has 6 output labels and uses a more sophisticated classification logic for finer-grained classification. It is more accurate but slower than canto-filter.\n\n## 引用 Citation\n\n分類器採用嘅分類標籤及基準，參考咗對使用者嘅語言意識形態嘅研究。討論分類準則時，請引用：\n\nThe definitions and boundaries of the labels depend on the user's language ideology.\nWhen discussing the criteria adopted by this tool, please cite:\n\n\u003e Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 
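2024.\n\u003e [The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection.](https://aclanthology.org/2024.eurali-1.4/)\n\u003e In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI)\n\u003e @ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL.\n\n---\n\n## 簡介 Introduction\n\n分類方法係利用粵語同書面中文嘅特徵字詞，用 Regex 方式加以識別。分類器主要有兩個主要參數，`--split`同埋`--quotes`，兩個默認都係`False`。\n\nThe classifier uses Regex rules to detect lexical features specific to Cantonese or Written Chinese. It has two main options, `--split` and `--quotes`, both of which are `False` by default.\n\n### 分句參數 Segment-splitting option `--split`\n\n呢個參數默認關閉，如果打開，分類器會用句號、問號、感歎號等標點符號將輸入文本切成單句，對每個單句分類判斷，然後再按照下面判別標準整合嚟得到最終分類。所以呢個參數喺輸入都係單句嘅情況下唔會有區別，只會降低運行速度。喺官粵混雜比較多而且比較長嘅文本輸入下會有更多唔同。\n\nThis option is off by default. When it is on, the classifier splits the input into sentences at punctuation such as full stops, question marks and exclamation marks, classifies each sentence, and then combines the per-sentence results according to the criteria below to reach the final label. If every input is already a single sentence, the option does not change the result and only slows the classifier down. It makes the most difference on longer inputs that mix Mandarin and Cantonese heavily.\n\n目前因為整合分句判斷嘅邏輯比較嚴，所以如果打開，會相比於關閉更加容易將其他類別判斷為`mixed`。所以對於篩選純粵文嘅用途嚟講，打開呢個參數會提高 precision 但降低 recall。\n\nBecause the logic that combines per-sentence judgements is currently rather strict, turning the option on makes the classifier more likely to label inputs from other categories as `mixed`. When filtering for pure Cantonese text, turning it on therefore raises precision but lowers recall.\n\n### 分類標籤參數 Label option `--quotes`\n\n呢個參數默認關閉，分類器淨係會將輸入分為 4 類。如果打開，就會再增加兩類總共有 6 個標籤。打開後分類器會將引號內嘅文本抽出嚟，將佢哋同引號外文本分開判斷。下面一段就係介紹呢四個同六個標籤。\n\nThis option is off by default, in which case the classifier assigns inputs to only 4 categories. When it is on, two more categories are added, for 6 labels in total: the classifier extracts the text inside quotation marks and judges it separately from the text outside them. The next section describes the four and six labels.\n\n### 標籤 Labels\n\n分類器會將輸入文本分成四類（粗疏）或六類（精細），分類如下:\n\nThe classifier outputs four (coarse) or six (fine-grained) categories. The labels are:\n\n1. `Cantonese`: 純粵文，僅含有粵語特徵字詞，例如“你喺邊度” | Pure Cantonese text that contains only Cantonese-featured words, e.g. 你喺邊度\n1. `SWC`: 書面中文，係一個僅含有書面語特徵字詞，例如“你在哪裏” | Pure Standard Written Chinese (SWC) text that contains only Mandarin-featured words, e.g. 你在哪裏\n1. `Mixed`: 書粵混雜文，同時含有書面語同粵語特徵嘅字詞，例如“是咁的” | Mixed Cantonese-Mandarin text that contains both Cantonese-featured and Mandarin-featured words, e.g. 是咁的\n1. `Neutral`: 無特徵中文，唔含有官話同粵語特徵，既可以當成粵文亦可以當成官話文，例如“去學校讀書” | Featureless Chinese text that contains neither Cantonese-featured nor Mandarin-featured words. Such sentences can be used in both Cantonese and Mandarin text corpora, e.g. 去學校讀書\n1. `MixedQuotesInSWC`: 書面中文，引文入面係 `Mixed` | SWC text that quotes `Mixed` content\n1. 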
`CantoneseQuotesInSWC`: 書面中文，引文入面係純粵文 `Cantonese` | SWC text that quotes `Cantonese` content\n\n### 系統要求 Requirements\n\nPython \u003e= 3.11\n\n### 安裝 Installation\n\n```bash\npip install cantonesedetect\n```\n\n## 用法 Usage\n\n可以通過 Python 函數嚟引用，亦可以直接 CLI 調用。\n\nYou can call the Python API of this library, or run it directly from the CLI.\n\n### Python\n\n用下面嘅方法創建一個 `CantoneseDetector`，然後直接調用 `judge()`就可以得到分類結果：\n\nCreate a `CantoneseDetector` and call its `judge()` method on an input to get the classification result.\n\n```python\nfrom cantonesedetect import CantoneseDetector\n\n# 默認情況下 use_quotes=False, split_seg=False, get_analysis=False\n# Defaults: use_quotes=False, split_seg=False, get_analysis=False\ndetector = CantoneseDetector()\n\ndetector.judge('你喺邊度')  # cantonese\ndetector.judge('你在哪裏')  # swc\ndetector.judge('是咁的')  # mixed\ndetector.judge('去學校讀書')  # neutral\ndetector.judge('他説：“係噉嘅。”')  # cantonese_quotes_in_swc\ndetector.judge('那就「是咁的」')  # mixed_quotes_in_swc\n```\n\n如果想要用引號抽取判別、分句判別同埋獲得分析結果，可以：\n\nTo classify with quote extraction and segment splitting enabled, and to get the analysis results as well:\n\n```python\nfrom cantonesedetect import CantoneseDetector\n\ndetector = CantoneseDetector(use_quotes=True, split_seg=True, get_analysis=True)\n\n# With get_analysis=True, judge() returns the judgement together with the document features\njudgement, document_features = detector.judge(\"他説：「我哋今晚食飯。你想去邊度食？」\")\n\n# 打印分析結果\n# Print analysis results\nprint(document_features.get_analysis())\n\n# `document_features` 入面有每個分句嘅 `document_segments_features` 同 `document_segments_judgements`\n# `document_features` contains `document_segments_features`, a list of per-segment features\nprint(document_features.document_segments_features[0].canto_feature)\nprint(document_features.document_segments_features[0].canto_exclude)\nprint(document_features.document_segments_features[0].swc_feature)\nprint(document_features.document_segments_features[0].swc_exclude)\n# It also contains `document_segments_judgements`, a list of per-segment judgements\nprint([j.value for j in document_features.document_segments_judgements])\n```\n\n### 
CLI\n\n如果直接喺 CLI 調用嘅話，只需要指明`--input`就得。 `--quotes`、`--split`、`--print_analysis`三個參數都默認關閉，如果標明就會打開：\n\nTo run from the CLI, simply specify `--input`. The optional arguments `--quotes`, `--split` and `--print_analysis` are all `False` by default; pass a flag to turn it on.\n\n```bash\ncantonesedetect --input input.txt\n# 開啓引號抽取判別、分句判別並且打印分析結果\n# Enable quote extraction and segment splitting, and print the analysis.\ncantonesedetect --input input.txt --quotes --split --print_analysis\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanclid%2Fcantonesedetect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcanclid%2Fcantonesedetect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanclid%2Fcantonesedetect/lists"}