{"id":15497048,"url":"https://github.com/zaneh/dataset-tools","last_synced_at":"2026-06-01T02:31:10.995Z","repository":{"id":240888235,"uuid":"803683552","full_name":"ZaneH/dataset-tools","owner":"ZaneH","description":"Small collection of scripts to build datasets for LLMs.","archived":false,"fork":false,"pushed_at":"2024-07-03T20:04:29.000Z","size":4,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-01-11T15:17:18.310Z","etag":null,"topics":["csv","dataset","fine-tuning","jsonl","llm","system-prompt","training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZaneH.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-21T07:37:31.000Z","updated_at":"2025-04-11T03:50:45.000Z","dependencies_parsed_at":"2024-05-21T09:29:49.169Z","dependency_job_id":"e890837e-7d90-42dc-9fb6-33a16110c180","html_url":"https://github.com/ZaneH/dataset-tools","commit_stats":null,"previous_names":["zaneh/dataset-tools"],"tags_count":0,"template":true,"template_full_name":null,"purl":"pkg:github/ZaneH/dataset-tools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZaneH%2Fdataset-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZaneH%2Fdataset-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZaneH%2Fdataset-tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZaneH%2Fdataset-tools/m
anifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZaneH","download_url":"https://codeload.github.com/ZaneH/dataset-tools/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZaneH%2Fdataset-tools/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33757790,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","dataset","fine-tuning","jsonl","llm","system-prompt","training"],"created_at":"2024-10-02T08:30:19.890Z","updated_at":"2026-06-01T02:31:10.990Z","avatar_url":"https://github.com/ZaneH.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Dataset Tools\n\n[Blog post](https://www.zaaane.com/posts/fine-tuning-llama3-on-1-rtx-3060/)\n\nI use this collection of scripts for creating new datasets to train LLM models.\n\n- **stage1**: Store raw CSV file. 
Clean up the base data manually\n- **stage2**: Create a JSONL file from each CSV file\n- **stage3**: Combine all JSONL files into one\n\n### Example Usage:\n\n- **Required:** Be sure to install the dependency in `requirements.txt`\n\n```bash\n$ python ./data/stage2/create-jsonl.py\nusage: create-jsonl.py [-h] input_file output_dir output_file\ncreate-jsonl.py: error: the following arguments are required: input_file, output_dir, output_file\n\n$ python ./data/stage3/combine-jsonl.py\nusage: combine-jsonl.py [-h] directory_path output_file\ncombine-jsonl.py: error: the following arguments are required: directory_path, output_file\n```\n\n```bash\n$ python ./data/stage2/create-jsonl.py ./data/stage1/scrape-results1.csv ./data/stage2 scrape-results1.jsonl\n2024-05-21 03:33:50 [info     ] CSV processing complete        output_file=data/stage2/scrape-results1.jsonl\n\n$ python ./data/stage3/combine-jsonl.py ./data/stage2 ./data/stage3/final.jsonl\n2024-05-21 03:36:42 [info     ] Merged JSONL files             output_file=./data/stage3/final.jsonl\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzaneh%2Fdataset-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzaneh%2Fdataset-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzaneh%2Fdataset-tools/lists"}