{"id":13814766,"url":"https://github.com/m1-llie/TUMCC","last_synced_at":"2025-05-15T06:32:44.392Z","repository":{"id":49387549,"uuid":"417766528","full_name":"m1-llie/TUMCC","owner":"m1-llie","description":"[IP\u0026M 2022] Telegram地下市场中文黑话识别语料集。Telegram Underground Market Chinese Corpus. Paper: Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features (IP\u0026M, 2022).","archived":false,"fork":false,"pushed_at":"2024-03-01T07:16:00.000Z","size":11200,"stargazers_count":176,"open_issues_count":0,"forks_count":19,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-08-04T04:06:21.283Z","etag":null,"topics":["chinese","corpus","dataset","telegram"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/m1-llie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-16T08:28:58.000Z","updated_at":"2024-08-04T04:06:21.582Z","dependencies_parsed_at":"2024-08-04T04:06:20.998Z","dependency_job_id":"2b64244b-697e-4eab-a609-8a19c467b74c","html_url":"https://github.com/m1-llie/TUMCC","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m1-llie%2FTUMCC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m1-llie%2FTUMCC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m1-llie%2FTUMCC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/m1-llie%2FTUMCC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/m1-llie","download_url":"https://codeload.github.com/m1-llie/TUMCC/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225335143,"owners_count":17458218,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","corpus","dataset","telegram"],"created_at":"2024-08-04T04:02:33.722Z","updated_at":"2024-11-19T10:30:25.250Z","avatar_url":"https://github.com/m1-llie.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# TUMCC (Telegram Underground Market Chinese Corpus)\n\nTUMCC is the first Chinese corpus in the jargon identification field.\n\nWhile building TUMCC, we collected **28,749** sentences (**804,971** characters) from **19,821** Telegram users across **12** Telegram groups.\n\nWe completed data screening and word segmentation before releasing this corpus, so it should be easier for you to use.\n\nAfter cleaning, TUMCC contains 3,863 sentences (100,000 characters) from 3,139 Telegram users.\n\n## Files\n\n``TUMCC-clean.txt`` contains the cleaned corpus. You can use it directly in your research.\n\n``TUMCC-raw.7z`` contains the raw data we collected from Telegram. 
You can clean the text yourself to extract more valid data and useful information.\n\nFor more details about the Telegram groups used as data sources, please refer to the paper [`Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features`](https://doi.org/10.1016/j.ipm.2022.103033) ([Information Processing and Management](https://www.sciencedirect.com/journal/information-processing-and-management), 2022).\n\n## Citation\nThanks for your interest in our dataset. Please feel free to leave a ⭐️ or cite us with:\n\n```\n@article{hou2022identification,\n  title={Identification of Chinese dark jargons in Telegram underground markets using context-oriented and linguistic features},\n  author={Hou, Yiwei and Wang, Hailin and Wang, Haizhou},\n  journal={Information Processing \\\u0026 Management},\n  volume={59},\n  number={5},\n  pages={103033,1--20},\n  year={2022},\n  publisher={Elsevier}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fm1-llie%2FTUMCC","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fm1-llie%2FTUMCC","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fm1-llie%2FTUMCC/lists"}