{"id":26149093,"url":"https://github.com/liao961120/collocation","last_synced_at":"2025-04-14T03:41:32.773Z","repository":{"id":62563654,"uuid":"386239294","full_name":"liao961120/collocation","owner":"liao961120","description":null,"archived":false,"fork":false,"pushed_at":"2021-12-23T07:07:46.000Z","size":563,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-27T17:46:25.125Z","etag":null,"topics":["collocation","linguistics","python3"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/collocation","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liao961120.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-15T09:46:16.000Z","updated_at":"2024-09-05T22:22:35.000Z","dependencies_parsed_at":"2022-11-03T16:00:24.751Z","dependency_job_id":null,"html_url":"https://github.com/liao961120/collocation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liao961120%2Fcollocation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liao961120%2Fcollocation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liao961120%2Fcollocation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liao961120%2Fcollocation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liao961120","download_url":"https://codeload.github.com/liao961120/collocation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":248510003,"owners_count":21116131,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collocation","linguistics","python3"],"created_at":"2025-03-11T05:28:42.981Z","updated_at":"2025-04-14T03:41:32.744Z","avatar_url":"https://github.com/liao961120.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Collocation\n=================\n\n\n## Installation\n\n```bash\npip install collocation\n```\n\n## Usage\n\n```python\nfrom collocation import Collocation\n\n# Prepare corpus data\n# https://yongfu.name/collocation/sampled_PTTposts.txt\ncorpus = []\nwith open(\"sampled_PTTposts.txt\", encoding=\"utf-8\") as f:\n    for sent in f.read().split(\"\\n\"):\n        if sent.strip() == \"\": continue\n        sentence = []\n        for tk in sent.split(\"\\u3000\"):\n            if tk == \"\": continue\n            sentence.append(tk)\n        corpus.append(sentence)\n\n\u003e\u003e\u003e corpus[:7]\n[['物品', '名稱', '：', '學生證'],\n ['拾獲', '地點', '：', '大一女', '前'],\n ['拾獲', '時間', '：', '6', '/', '21'],\n ['18', ':', '20', '左右'],\n ['物品', '描述', '：', '就', '一', '張', '學生證'],\n ['聯絡', '方式', '：', '站', '內', '信'],\n ['其他', '說明', '：', '請', '失主', '或', '朋友', '速速', '聯絡', '喔']]\n\n\n# Initialize\nc = Collocation(corpus, left_window=3, right_window=3)\n# Query\n\u003e\u003e\u003e c.get_topn_collocates(\"[臺台]灣\", cutoff=3, n=3, by=\"MI\", chinese_only=True)\n[('臺灣', '國立',\n  {'MI': 9.801006087045614,\n   'Xsq': 3560.8618187084653,\n   'Gsq': 46.973839463946646,\n   'Dice': 0.04519774011299435,\n   'DeltaP21': 
0.0277504097342154,\n   'DeltaP12': 0.12108001346126511,\n   'RawCount': 4}),\n ('臺灣', '聯盟',\n  {'MI': 9.064040492879409,\n   'Xsq': 2133.3656286224623,\n   'Gsq': 42.68555772916195,\n   'Dice': 0.04020100502512563,\n   'DeltaP21': 0.0277296477701336,\n   'DeltaP12': 0.0725951622338306,\n   'RawCount': 4}),\n ('臺灣', '大學',\n  {'MI': 8.428314878476366,\n   'Xsq': 3768.760643213294,\n   'Gsq': 107.96714978953916,\n   'Dice': 0.05804749340369393,\n   'DeltaP21': 0.07617749434551055,\n   'DeltaP12': 0.046682984348090525,\n   'RawCount': 11})]\n\n# Access documentation of parameters\nhelp(c.get_topn_collocates)\n```\n\n### Custom Association Measures\n\n```python\nfrom math import log2\nfrom collocation.association import FisherAttract, Dice\n\n\ndef logDice(O11, O12, O21, O22, E11, E12, E21, E22):\n    D = Dice(O11, O12, O21, O22, E11, E12, E21, E22)\n    return 14 + log2(D)\n\nc.association_measures = [FisherAttract, logDice]\n\n\u003e\u003e\u003e c.get_topn_collocates(\"[臺台]灣\", cutoff=3, n=3, by=\"logDice\", chinese_only=True)\n[('臺灣', '大學',\n  {'FisherAttract': 56.04305766162403,\n   'logDice': 9.893377580466206,\n   'RawCount': 11}),\n ('臺灣', '國立',\n  {'FisherAttract': 25.040705008265565,\n   'logDice': 9.532394449917003,\n   'RawCount': 4}),\n ('臺灣', '聯盟',\n  {'FisherAttract': 22.922605012581965,\n   'logDice': 9.36337537945635,\n   'RawCount': 4})]\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliao961120%2Fcollocation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliao961120%2Fcollocation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliao961120%2Fcollocation/lists"}