{"id":20240048,"url":"https://github.com/sudoskys/teledataparser","last_synced_at":"2025-10-30T05:52:43.738Z","repository":{"id":118833225,"uuid":"524301908","full_name":"sudoskys/TeleDataParser","owner":"sudoskys","description":"Batch parsing of Telegram exported Json data files and extraction of the entire/specified corpus of a user/group for AI learning","archived":false,"fork":false,"pushed_at":"2022-11-13T07:42:19.000Z","size":64,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-10T20:16:45.781Z","etag":null,"topics":["ai","nlp","nlp-machine-learning","python3","telegram"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sudoskys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-13T04:41:11.000Z","updated_at":"2024-04-16T00:23:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"8e10e0d5-4cc0-428b-b215-c5556d6b2163","html_url":"https://github.com/sudoskys/TeleDataParser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sudoskys/TeleDataParser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sudoskys%2FTeleDataParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sudoskys%2FTeleDataParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sudoskys%2FTeleDataParser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/sudoskys%2FTeleDataParser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sudoskys","download_url":"https://codeload.github.com/sudoskys/TeleDataParser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sudoskys%2FTeleDataParser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281754331,"owners_count":26555915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-30T02:00:06.501Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","nlp","nlp-machine-learning","python3","telegram"],"created_at":"2024-11-14T08:42:53.187Z","updated_at":"2025-10-30T05:52:43.698Z","avatar_url":"https://github.com/sudoskys.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Telegram Machine Learning Corpus Extraction\n\n[![Python 3.*](https://img.shields.io/badge/Python-3.*-yellow.svg)](http://www.python.org/download/)\n\nMachine learning corpus extractor.\n\nParses Json data files exported by Telegram.\n\n- Extracting specific replies.\n- Extracts specific statements.\n- Filter support.\n- Extract all.\n- Support for word limit.\n- Custom field length calculation.\n\nExtract a user's speech for AI learning, and save your loved one's chat history.\n\nAt the moment, because I'm too lazy, I've **only** done 
the part of extracting the plain text corpus.\n\n## Run\n\n- Installation\n\nRun `pip3 install -r requirements.txt` in the project directory.\n\n- Run\n\nConfigure `config.ini`, then run `python3 main.py` to generate the data.\n\n## Performance\n\nNumber of outputs vs. run time\n\n- 100w -\u003e 28s\n- 49w -\u003e 12s\n\n## Config\n\n### Constructing classes\n\n````python\nfrom Core.Tool import TeleParser\n\nParser = TeleParser(\"JsonInput\", \"DataOutput\", min_limit=5, max_limit=512)\ndicts = Parser.get_all(lable=\"GIRL\", showDate=False, ending=\"\\n\", uni_data=False)\nprint(dicts)\n# See the code comments for details\n# Returns: total number written, number skipped as non-conforming, number deleted, total number of signed messages\n````\n\n**TeleParser API**\n\n**__init__**(self, json_path: str, out_path: str, min_limit: int = 5, max_limit: int = 512, Counter: str = 'chinese', filter_mode: str = False, filter: str = 'Not_need.txt')\n\n```\n:param json_path: input directory\n:param out_path: output path\n:param min_limit: minimum count\n:param max_limit: maximum count\n:param Counter: counter\n:param filter_mode: filter type; True keeps only sentences with keywords, False keeps only sentences without keywords\n:param filter: path to the filter phrase file\n:return: dict\n```\n\n**get_all**(self, lable: str, showDate=False, ending='\\n', uni_data=False, no_id: list = None) -\u003e dict\n\n```\n:param lable: the label\n:param no_id: senders to exclude (e.g. 
messages from service bots)\n:param uni_data: whether to de-duplicate\n:param ending: the suffix\n:param showDate: whether to show the date\n:return: dict\n```\n\n**get_all_reply**(self, showDate=True, ending='\\n', uni_data=False) -\u003e dict\n\n```\n:param uni_data: whether to de-duplicate\n:param ending: the suffix\n:param showDate: whether to show the date\n:return: dict\n```\n\n**get_reply**(self, lable, target_id, showDate=True, ending='\\n', uni_data=False) -\u003e dict\n\n```\n:param showDate: whether to show the date\n:param ending: the suffix\n:param uni_data: whether to de-duplicate\n:param lable: the name tag\n:param target_id: the target ID, including the `user` prefix\n:return: dict\n```\n\n**get_speech**(self, lable, target_id, showDate=True, ending='\\n', uni_data=False) -\u003e dict\n\n```\n:param uni_data: whether to de-duplicate\n:param ending: the suffix\n:param showDate: whether to show the date\n:param lable: the name tag\n:param target_id: the target ID, including the `user` prefix\n:return: dict\n```\n\n- Helper method\n\n**write_out**(self, speech: list, path: str, Wash: bool = False)\n\n```\n:param speech: list of phrases\n:param path: the name of the output file\n:param Wash: whether to de-duplicate\n:return:\n```\n\n#### Length Gauge\n\nclass **Tester**(builtins.object)\nStatic methods defined here:\n\n```\nchinese(ask)\ndefault(ask)\n```\n\n### Config File\n\n````ini\n; Sample configuration file\n[user]\nuser = Someone\nuser_id = user114514\n\n\n[path]\ninput = JsonInput\noutput = DataOutput\n````\n\n**Sample reference format**\n\n```json\n{\n  \"name\": \"Unknown | Private\",\n  \"type\": \"private_supergroup\",\n  \"id\": 11451418180,\n  \"messages\": [\n    {\n      \"id\": 1,\n      \"type\": \"message\",\n      \"date\": \"2022-01-28T01:35:46\",\n      \"date_unixtime\": \"1643333746\",\n      \"edited\": \"2022-05-15T14:16:08\",\n      \"edited_unixtime\": \"1652624168\",\n      \"from\": \"Someone\",\n      \"from_id\": \"user2333\",\n      
\"reply_to_message_id\": 271065,\n      \"text\": \"Hi,GOOD MORNING\"\n    }\n  ]\n}\n```\n\n\n![counter](https://count.getloli.com/get/@sudoskys-github-TeleDataParser?theme=moebooru)\n\n### License\n\n```lines\nUse of this item for malicious purposes is not permitted.\nThis project is licensed under the Apache License\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsudoskys%2Fteledataparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsudoskys%2Fteledataparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsudoskys%2Fteledataparser/lists"}