{"id":27105237,"url":"https://github.com/farithadnan/datasetforge","last_synced_at":"2025-07-07T16:35:28.620Z","repository":{"id":204064530,"uuid":"711032875","full_name":"farithadnan/DatasetForge","owner":"farithadnan","description":"Extracts Google Sheets to JSONL for fine-tuning, estimates task costs with tiktoken.","archived":false,"fork":false,"pushed_at":"2024-04-18T02:36:50.000Z","size":48,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-04-18T03:53:47.372Z","etag":null,"topics":["fine-tuning","googlesheetsapi","openai","python3","tiktoken"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/farithadnan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-10-28T02:45:22.000Z","updated_at":"2023-11-02T07:49:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"4a09d7d6-4d1c-41c9-b2fd-82378c2f4e8e","html_url":"https://github.com/farithadnan/DatasetForge","commit_stats":null,"previous_names":["farithadnan/datasetforge"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farithadnan%2FDatasetForge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farithadnan%2FDatasetForge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farithadnan%2FDatasetForge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farithadnan%2FDatasetForge/manifests","owner_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/owners/farithadnan","download_url":"https://codeload.github.com/farithadnan/DatasetForge/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247534672,"owners_count":20954565,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fine-tuning","googlesheetsapi","openai","python3","tiktoken"],"created_at":"2025-04-06T18:37:39.174Z","updated_at":"2025-04-06T18:37:39.836Z","avatar_url":"https://github.com/farithadnan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DatasetForge ⚒️\n\nDatasetForge is a Python project that extracts data from Google Sheets and converts it into a JSONL-formatted dataset suitable for OpenAI fine-tuning tasks (`davinci-002` model). 
This tool also uses the [tiktoken](https://pypi.org/project/tiktoken/) library to estimate the cost of fine-tuning tasks for the `davinci-002` model.\n\n## Requirements ⭐\n\n- You must have Google Sheets data that is represented in a prompt-completion (legacy) structure.\n  \u003e Refer to `sheets_sample.ods` for details\n- You must [create a Google Service Account in Google Cloud Platform](https://www.howtogeek.com/devops/how-to-create-and-use-service-accounts-in-google-cloud-platform/).\n- You must [enable the Google Sheets API for that Google Service Account](https://support.google.com/googleapi/answer/6158841?hl=en).\n- You must have the credentials for that Google Service Account.\n\n\n## How to Run the Project 🏃🏽‍♂️\n\n**Step 1: Clone the repo**\n\nOpen Git Bash and run:\n```bash\n  git clone https://github.com/farithadnan/DatasetForge.git\n```\n\n**Step 2: Installation** \n\nInstall the required Python packages by running the command below in your terminal:\n  ```bash\n    pip install -r requirements.txt\n  ```\n\n**Step 3: Set Up Google Sheets Config**\n\nEnsure that the configuration file (e.g., `config.yaml`) contains essential settings such as:\n- Path to Google Sheets credentials file (private keys).\n- URL of the Google Sheet to extract data from.\n- Index of the specific sheet within the Google Sheet.\n- Name for the output JSONL file.\n\u003e Refer to `config.yaml.sample` for more info.\n\n\n**Step 4: Set Up the Model Encoding**\n\nTo estimate the cost of fine-tuning on your dataset, configure the encoding in `config.yaml`. 
By default, it is set to the `r50k_base` encoding, which tiktoken associates with the original GPT-3 models such as `davinci`. tiktoken maps `davinci-002` to `cl100k_base`, so change the setting if you want token counts that match that model.\n\u003e For more details, refer to [How to count tokens with tiktoken](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)\n\n**Step 5: Run the Project**\n\nActivate your virtual environment, then run the main Python script:\n\n```bash\npython app.py\n```\n\nThis will authenticate with Google Sheets, extract the specified data, and convert it into JSONL format, creating a dataset ready for fine-tuning tasks.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarithadnan%2Fdatasetforge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffarithadnan%2Fdatasetforge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarithadnan%2Fdatasetforge/lists"}