{"id":19798771,"url":"https://github.com/mykhode/data_mining_py","last_synced_at":"2026-06-16T14:32:21.250Z","repository":{"id":214726049,"uuid":"737217506","full_name":"MyKhode/Data_Mining_Py","owner":"MyKhode","description":"Simple Scrabe data with Python ","archived":false,"fork":false,"pushed_at":"2023-12-30T08:16:11.000Z","size":9,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-23T14:25:44.911Z","etag":null,"topics":["ai","scrabe-data","training-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MyKhode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-12-30T08:11:05.000Z","updated_at":"2024-06-09T13:51:52.000Z","dependencies_parsed_at":"2023-12-30T09:22:18.965Z","dependency_job_id":"6b0544d1-f711-4b29-a5c6-7fc4ab979928","html_url":"https://github.com/MyKhode/Data_Mining_Py","commit_stats":null,"previous_names":["soytet/data_mining_py","ikhode-kh/data_mining_py","ikhode-arena/data_mining_py","mykhode/data_mining_py"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MyKhode/Data_Mining_Py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MyKhode%2FData_Mining_Py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MyKhode%2FData_Mining_Py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MyKhode%2FData_Mining_Py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MyKhode%2FData_Mining_Py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MyKhode","download_url":"https://codeload.github.com/MyKhode/Data_Mining_Py/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MyKhode%2FData_Mining_Py/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34410780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","scrabe-data","training-data"],"created_at":"2024-11-12T07:31:51.044Z","updated_at":"2026-06-16T14:32:21.226Z","avatar_url":"https://github.com/MyKhode.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003c!DOCTYPE html\u003e\n\u003chtml lang=\"en\"\u003e\n\n\u003cbody\u003e\n\n  \u003ch1\u003eData Mining with Python\u003c/h1\u003e\n\n  \u003ch2\u003eDescription\u003c/h2\u003e\n  \u003cp\u003e\n    This project scrapes a Q\u0026A website (Khmer language-based) to generate intents for a conversational AI system. It utilizes web scraping techniques, natural language processing, and data structuring to create a dataset of tagged intents for training language models.\n  \u003c/p\u003e\n\n  \u003ch2\u003eFeatures\u003c/h2\u003e\n  \u003cul\u003e\n    \u003cli\u003e\u003cstrong\u003eWeb Scraping:\u003c/strong\u003e Utilizes requests and BeautifulSoup for data extraction from the website.\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003eNLP Tagging:\u003c/strong\u003e Implements the KhmerNLP library for part-of-speech tagging.\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003eIntent Generation:\u003c/strong\u003e Gathers unique nouns from questions to form intent patterns and extracts corresponding answers.\u003c/li\u003e\n  \u003c/ul\u003e\n\n  \u003ch2\u003eInstallation\u003c/h2\u003e\n  \u003col\u003e\n    \u003cli\u003eClone the repository:\n      \u003ccode\u003egit clone https://github.com/your-username/repo-name.git\u003c/code\u003e\n    \u003c/li\u003e\n    \u003cli\u003eInstall dependencies:\n      \u003ccode\u003epip install -r requirements.txt\u003c/code\u003e\n    \u003c/li\u003e\n  \u003c/ol\u003e\n\n  \u003ch2\u003eUsage\u003c/h2\u003e\n  \u003col\u003e\n    \u003cli\u003eRun the Python script \u003ccode\u003egenerate_intents.py\u003c/code\u003e.\u003c/li\u003e\n    \u003cli\u003eThe script will scrape the Q\u0026A website and generate a JSON file (\u003ccode\u003edata_intents.json\u003c/code\u003e) containing intents for conversational AI systems.\u003c/li\u003e\n  \u003c/ol\u003e\n\n  \u003ch2\u003eExample\u003c/h2\u003e\n  \u003cpre\u003e\u003ccode\u003epython generate_intents.py\u003c/code\u003e\u003c/pre\u003e\n\n  \u003ch2\u003eDependencies\u003c/h2\u003e\n  \u003cul\u003e\n    \u003cli\u003e\u003ccode\u003erequests\u003c/code\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ccode\u003ebeautifulsoup4\u003c/code\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ccode\u003ekhmernltk\u003c/code\u003e\u003c/li\u003e\n  \u003c/ul\u003e\n\n  \u003ch2\u003eData Structure\u003c/h2\u003e\n  \u003cp\u003eThe generated JSON file (\u003ccode\u003edata_intents.json\u003c/code\u003e) follows the structure:\u003c/p\u003e\n  \u003cpre\u003e\n    \u003ccode\u003e\n{\n  \"intents\": [\n    {\n      \"tag\": \"id_1\",\n      \"patterns\": [\"Question Pattern 1\", \"Noun Pattern 1\"],\n      \"responses\": [\"Answer 1\"]\n    },\n    // Other intents follow the same structure\n  ]\n}\n    \u003c/code\u003e\n  \u003c/pre\u003e\n\n  \u003ch2\u003eContribution\u003c/h2\u003e\n  \u003col\u003e\n    \u003cli\u003eFork the repository.\u003c/li\u003e\n    \u003cli\u003eCreate a new branch (\u003ccode\u003egit checkout -b feature/new-feature\u003c/code\u003e).\u003c/li\u003e\n    \u003cli\u003eMake your changes and commit (\u003ccode\u003egit commit -am 'Add new feature'\u003c/code\u003e).\u003c/li\u003e\n    \u003cli\u003ePush to the branch (\u003ccode\u003egit push origin feature/new-feature\u003c/code\u003e).\u003c/li\u003e\n    \u003cli\u003eCreate a new Pull Request.\u003c/li\u003e\n  \u003c/ol\u003e\n\n  \u003ch2\u003eLicense\u003c/h2\u003e\n  \u003cp\u003eThis project is licensed under the \u003ca href=\"LICENSE\"\u003eMIT License\u003c/a\u003e.\u003c/p\u003e\n\n  \u003ch2\u003eCredits\u003c/h2\u003e\n  \u003cul\u003e\n    \u003cli\u003eDeveloped by \u003ca href=\"https://github.com/soytet\"\u003eSOY TET\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003eKhmerNLP Library: \u003ca href=\"https://github.com/KhmerNLP/khmer-nltk\"\u003eLink\u003c/a\u003e\u003c/li\u003e\n  \u003c/ul\u003e\n\n\u003c/body\u003e\n\u003c/html\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmykhode%2Fdata_mining_py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmykhode%2Fdata_mining_py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmykhode%2Fdata_mining_py/lists"}