{"id":26730939,"url":"https://github.com/daedalus/sharegpt_vicuna","last_synced_at":"2025-03-27T23:34:18.798Z","repository":{"id":153934771,"uuid":"629608313","full_name":"daedalus/sharegpt_vicuna","owner":"daedalus","description":null,"archived":false,"fork":false,"pushed_at":"2024-05-03T21:03:43.000Z","size":11,"stargazers_count":3,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-05-03T22:24:01.601Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daedalus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-18T16:50:16.000Z","updated_at":"2024-05-03T22:24:03.803Z","dependencies_parsed_at":null,"dependency_job_id":"d61cd1c7-3c97-45e5-b600-ca674623b991","html_url":"https://github.com/daedalus/sharegpt_vicuna","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daedalus%2Fsharegpt_vicuna","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daedalus%2Fsharegpt_vicuna/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daedalus%2Fsharegpt_vicuna/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daedalus%2Fsharegpt_vicuna/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daedalus","download_url":"https://codeload.github.com/daedalus/sharegpt_vicuna/tar.gz/refs/heads/m
ain","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245944020,"owners_count":20697945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-27T23:33:09.717Z","updated_at":"2025-03-27T23:34:18.760Z","avatar_url":"https://github.com/daedalus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\nlicense: other\n---\n\n## Preparation\n\n```\npip3 install -r requirements.txt\n```\n\n## Data Cleaning\n\n1. Merge the two raw JSON files and pretty-print the merged file\n\n```\npython merge.py sharegpt_90k_raw_dataset/sg_90k_part1.json sharegpt_90k_raw_dataset/sg_90k_part2.json sharegpt_20230401_html_unformatted.json\npython pretty_json.py --in sharegpt_20230401_html_unformatted.json --out sharegpt_20230401_html.json\n```\n\n2. (Optional) Verify the JSON file\n\n```\nif jq empty sharegpt_20230401_html.json 2\u003e/dev/null; then\n  echo \"JSON is valid\"\nelse\n  echo \"JSON is invalid\"\nfi\n\njq length sharegpt_90k_raw_dataset/sg_90k_part1.json\njq length sharegpt_90k_raw_dataset/sg_90k_part2.json\njq length sharegpt_20230401_html.json\n```\n\n3. Clean the data (remove HTML tags, etc.)\n\n```\npython3 clean_sharegpt.py --in sharegpt_20230401_html.json --out sharegpt_20230401_clean.json\n....\n100%|███████████████████████████████████████████████████████████████████| 90665/90665 [06:32\u003c00:00, 230.98it/s]\ntotal: 90665, skip: 13745, new: 76920\n```\n\n4. 
Filter the dataset by language\n\n```\npython3 optional_clean.py --in sharegpt_20230401_clean.json --out sharegpt_20230401_clean_lang_zh.json --lang zh\n....\nreturn 6240 out of 76920, start dump ...\n\npython3 optional_clean.py --in sharegpt_20230401_clean.json --out sharegpt_20230401_clean_lang_en.json --lang en\n...\nreturn 55413 out of 76920, start dump ...\n```\n\n\u003e Note: the script does not accept a list of languages, and I did not change it to add that. Instead, I ran the filter once for each of the two languages I need and merged `sharegpt_20230401_clean_lang_zh.json` and `sharegpt_20230401_clean_lang_en.json` into `sharegpt_20230401_clean_lang.json`. You can modify the script to support more languages.\n\n\n5. Split long conversations\n\n```\npython3 split_long_conversation.py --in sharegpt_20230401_clean_lang.json --out sharegpt_20230401_clean_lang_split.json --model-name /home/ubuntu/llama-13b-hf/\n...\ntotal: 61653, new: 126032\n```\n\nWe now have the cleaned dataset `sharegpt_20230401_clean_lang_split.json`, which is the file to use for finetuning.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaedalus%2Fsharegpt_vicuna","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaedalus%2Fsharegpt_vicuna","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaedalus%2Fsharegpt_vicuna/lists"}