{"id":18234050,"url":"https://github.com/ajithvcoder/dvc-gdrive-workflow-setup","last_synced_at":"2025-04-08T12:30:37.477Z","repository":{"id":258812185,"uuid":"875734740","full_name":"ajithvcoder/dvc-gdrive-workflow-setup","owner":"ajithvcoder","description":"tutorial to connect dvc and gdrive and run github actions","archived":false,"fork":false,"pushed_at":"2024-10-20T20:40:26.000Z","size":1795,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-14T08:36:38.881Z","etag":null,"topics":["data-version-control","dvc-pipeline","github-actions","github-workflows","google-drive","google-drive-api"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ajithvcoder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-20T17:57:39.000Z","updated_at":"2024-10-21T03:07:57.000Z","dependencies_parsed_at":"2024-10-20T22:27:36.798Z","dependency_job_id":null,"html_url":"https://github.com/ajithvcoder/dvc-gdrive-workflow-setup","commit_stats":null,"previous_names":["ajithvcoder/dvc-gdrive-workflow-setup"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gdrive-workflow-setup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gdrive-workflow-setup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gdrive-workflow-setup/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gdrive-workflow-setup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ajithvcoder","download_url":"https://codeload.github.com/ajithvcoder/dvc-gdrive-workflow-setup/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247842100,"owners_count":21005233,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-version-control","dvc-pipeline","github-actions","github-workflows","google-drive","google-drive-api"],"created_at":"2024-11-04T17:03:17.313Z","updated_at":"2025-04-08T12:30:37.412Z","avatar_url":"https://github.com/ajithvcoder.png","language":null,"readme":"### Using Service Account Method with DVC in GitHub CI/CD Pipeline\n\n**Content**\n\n1. [Setup Service account and Google drive folder](#setup-service-account-and-google-drive-folder)\n2. [Local Setup](#local-setup)\n3. [Github Repo setup](#github-repo-setup)\n4. 
[Github Actions](#github-actions)\n\n\n### Setup Service account and Google drive folder\n\nGo to the Google Cloud console -\u003e Click APIs \u0026 Services -\u003e Click Enable APIs and Services\n\n![services](./assets/snap_api_services.png)\n\nEnable the Drive Labels API, Google Drive API, and Google Drive Activity API.\n\n![service_api](./assets/snap_api_service_1.png)\n\n**Setup Service account and get json key**\n\nIn this method, we store data in Google Drive and fetch it using service account authentication.\n\nTo create a service account, navigate to IAM \u0026 Admin in the left sidebar, and select Service Accounts.\n\n![service_account_icon](./assets/snap_tutorial_1.png)\n\n\nClick + CREATE SERVICE ACCOUNT and enter a service account name (e.g., \"My DVC Project\"). If you are new and don't know which permissions to choose, it is easiest to grant owner permissions.\n\n![owner-permission](./assets/snap_tutorial_2.png)\n\nAdd all user accounts to which you need to grant access.\n\n![email-access](./assets/snap_tutorial_3.png)\n\nThen click CREATE AND CONTINUE. Click DONE, and you will be returned to the overview page.\n\nNow you can see your service account; click on it and go to the Keys tab.\n\n![serivce-mail-id](./assets/snap_tutorial_4.png)\n \nUnder Add Key, select Create New Key, choose JSON, and click CREATE.\n\n![key-creation](./assets/snap_tutorial_5.png)\n \nDownload the generated projectname-xxxxxx.json key file to a safe location.\n\nImportant: Store the key file in a local folder as credentials.json, but do not commit it to GitHub. If you do, GitHub will raise a warning, and Google will be notified and will revoke the credentials.\n\n**Google drive folder**\n\nCreate a folder in your Google Drive. I created a folder named \"dvc-storage-test\".\n\n![folder](./assets/snap_dvc_storage_test.png)\n\nImportant: Give the folder `anyone with the link` permission with **editor** access. 
Also share the folder with your service account and give it editor access; for example, my service account mail id is \"ajithvcodernew@devcmanager.iam.gserviceaccount.com\". The folder should be shared with specific users (or groups) so they can use it with DVC. \"Anyone with a link\" alone is not guaranteed to work.\n\n![permission](./assets/snap_permission_gdrive.png)\n\nNow get the id of the folder. For example, this is my folder URL: `https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`. So my gdrive folder id is `1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`.\n\nThis is the configuration URL I need to add to the dvc config later: `gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`, i.e. `gdrive://\u003cyour_gdrive_folder_id\u003e`\n\nReference: [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format)\n\n\n### Local Setup\n\nFrom here on, in your local setup, you need to handle two things: the JSON key file you downloaded (dvcmanager-38xxxxxxx.json, or any file named like projectname-xxx.json) and the Google Drive URL (`gdrive://\u003cyour_gdrive_folder_id\u003e`).\n\n- Create a folder named data locally.\n\n- Download this Kaggle dataset (https://www.kaggle.com/datasets/khushikhushikhushi/dog-breed-image-dataset) into the data folder and unzip it. Remove all files that are not needed (e.g., archive.zip is not needed after unzipping).\n\n**Tree example**\n\n|- data\n\n|----dataset\n\n|--------Beagle\n\n|--------Boxer\n\n|-------- etc folders\n\n|-------- etc folders\n\n- Install dvc and dvc-gdrive\n\n```pip install dvc dvc-gdrive```\n\n- Run `git init` (if you are not in a git folder already)\n\n- Run `dvc init`\n\n\n- Now run the `dvc remote add -d myremote gdrive://\u003cyour_gdrive_folder_id\u003e` command. 
Reference [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format)\n\ne.g.: ```dvc remote add -d myremote gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG```\n\nYou will see that \"myremote\" has been added to the `.dvc/config` file.\n\n- Run `dvc remote modify myremote gdrive_use_service_account true`\n\n- Run `dvc remote modify myremote gdrive_acknowledge_abuse true`\n\n- Run ```dvc remote modify myremote --local gdrive_service_account_json_file_path path/to/file.json```. For example: ```dvc remote modify --local myremote gdrive_service_account_json_file_path devcmanager-385390fe7f4f.json```. Because of `--local`, this setting is written to `.dvc/config.local`, which Git ignores.\n\nYou should see a similar config in your `.dvc/config` file:\n\n![dvc config](./assets/snap_config_file.png)\n\n\n- Run ```dvc add data```, i.e. `dvc add \u003cdata_folder_name\u003e`\n\n- Run ```dvc config core.autostage true``` (optional)\n\n- Run ```dvc push -r myremote -v```\n\n- Pushing around 800 MB of data takes about 10 minutes; in GitHub Actions, allow about 15 minutes.\n\nNow you can check your Google Drive folder; you should see a folder named \"files\" like this:\n\n![config_file](./assets/snap_files_store.png)\n\n### Github Repo setup\n\n- Now push this to your GitHub repository. Note that in the .dvc folder, by default, you can only push the \"config\" and \".gitignore\" files. Don't change this; let it remain as is.\n\n- Important: Never push the project-xxx.json file. 
If you do, Google will identify it and revoke the token; you'll need to set the key again.\n\n- Add only the .dvc/config, .gitignore, and data.dvc files.\n\n- After pushing, open your repository's Settings on GitHub, click \"Secrets and variables\" -\u003e \"Actions\" -\u003e \"Repository secret\", and create a secret named \"GDRIVE_CREDENTIALS_DATA\". Copy the content of your project-xxx.json file (credentials.json file) into the content field.\n\n![secrets](./assets/snap_add_secret.png)\n\n### Github Actions\n\n- Create a .github/workflows folder locally for setting up your GitHub Actions workflow.\n\n- You can refer to the `dvc-pipeline.yml` file for the complete content. \n\nBelow is the code used to set up authentication and pull data from Google Drive inside GitHub CI/CD:\n\n\n```\n      # Note: you can also use \"GDRIVE_CREDENTIALS_DATA\" directly as an env variable and pull with it\n      - name: Create credentials.json\n        env:\n          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n        run: |\n          # Quote the variable so the shell does not split or glob the JSON content\n          echo \"$GDRIVE_CREDENTIALS_DATA\" \u003e credentials_1.json\n\n      - name: Modify DVC Remote\n        run: |\n          uv run dvc remote modify --local myremote gdrive_service_account_json_file_path credentials_1.json\n\n      - name: DVC Pull Data\n        run: |\n          uv run dvc pull -v\n```\n\n- Now you can trigger the workflow by clicking \"Run workflow\" in GitHub Actions.\n\nNote: I have used the `uv` [package](https://pypi.org/project/uv/) in the GitHub workflow to set up a virtual environment, as `dvc-gdrive` was causing some issues on the GitHub server instance. 
You can also run the dvc commands without the `uv run` prefix.\n\n![workflow-trigger](./assets/snap_run_workflow.png)\n\n- You might see an error like this, but it is not a problem; wait a while, as the files are still being downloaded internally.\n\n![default_error](./assets/snap_error_gdrive.png)\n\n- After about 5 minutes (depending on the data size), you should see a successful run.\n\n![run_success](./assets/snap_gdrive_success_run.png)\n\n**Reference**\n\n- Referred to only the 1st point under \"Using service account\" in https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-service-accounts\n- URL config setup - https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format\n- Using Service Account - GDrive - https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-service-accounts\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Fdvc-gdrive-workflow-setup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fajithvcoder%2Fdvc-gdrive-workflow-setup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Fdvc-gdrive-workflow-setup/lists"}