{"id":18234052,"url":"https://github.com/ajithvcoder/dvc-gcs-bucket-workflow-setup","last_synced_at":"2025-10-19T01:46:29.310Z","repository":{"id":258781257,"uuid":"875669506","full_name":"ajithvcoder/dvc-gcs-bucket-workflow-setup","owner":"ajithvcoder","description":"Tutorial on connecting dvc tool with google cloud storage bucket service and setting up workflow","archived":false,"fork":false,"pushed_at":"2024-10-20T18:37:30.000Z","size":2051,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-08T12:42:36.665Z","etag":null,"topics":["bucket","ci-cd","dvc-pipeline","gcs","github-actions","google-cloud-storage","workflows"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ajithvcoder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-20T15:33:42.000Z","updated_at":"2024-10-20T18:37:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"3dfbdfd5-f1c6-4db6-94be-f154046394a5","html_url":"https://github.com/ajithvcoder/dvc-gcs-bucket-workflow-setup","commit_stats":null,"previous_names":["ajithvcoder/dvc-gcs-bucket-workflow-setup"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ajithvcoder/dvc-gcs-bucket-workflow-setup","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gcs-bucket-workflow-setup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gcs-bucket-workflow-setup/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gcs-bucket-workflow-setup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gcs-bucket-workflow-setup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ajithvcoder","download_url":"https://codeload.github.com/ajithvcoder/dvc-gcs-bucket-workflow-setup/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Fdvc-gcs-bucket-workflow-setup/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261816311,"owners_count":23213863,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bucket","ci-cd","dvc-pipeline","gcs","github-actions","google-cloud-storage","workflows"],"created_at":"2024-11-04T17:03:17.992Z","updated_at":"2025-10-19T01:46:24.285Z","avatar_url":"https://github.com/ajithvcoder.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"### Using Service Account Method with DVC in GitHub CI/CD Pipeline\n\n**Content**\n\n1. [Setup GCS bucket storage](#setup-gcs-bucket-storage)\n2. [Local Setup](#local-setup)\n3. [Github Repo setup](#github-repo-setup)\n4. 
[Github Actions](#github-actions)\n\n\n### Setup GCS bucket storage\n\nIn this method, we store data in a Google Cloud Storage (GCS) bucket and fetch it using service account authentication.\n\nTo create a service account, navigate to IAM \u0026 Admin in the left sidebar of the Google Cloud Console, and select Service Accounts.\n\n![service_account_icon](./assets/snap_tutorial_1.png)\n\n\nClick + CREATE SERVICE ACCOUNT and enter a service account name (e.g., \"My DVC Project\"). If you are new and unsure which permissions to choose, granting the Owner role is the simplest option.\n\n![owner-permission](./assets/snap_tutorial_2.png)\n\nAdd all user accounts that need access.\n\n![email-access](./assets/snap_tutorial_3.png)\n\nThen click CREATE AND CONTINUE. Click DONE, and you will be returned to the overview page.\n\nNow you can see your service account; click on it and go to the Keys tab.\n\n![service-mail-id](./assets/snap_tutorial_4.png)\n\nUnder Add Key, select Create New Key, choose JSON, and click CREATE.\n\n![key-creation](./assets/snap_tutorial_5.png)\n\nDownload the generated projectname-xxxxxx.json key file to a safe location.\n\nImportant: Store the service account key in a local folder as credentials.json, but do not commit it to GitHub. 
If you do, GitHub will raise a warning, and Google will be notified and will revoke the credentials.\n\nIn the Google Console search bar, type \"Google Cloud Storage\" and open it.\n\nClick the \"Create\" button -\u003e Give the bucket a name -\u003e In \"Choose where to store data,\" select \"Region\" -\u003e \"asia-south1\" (this can be any region; I just used it as an example) -\u003e Accept the defaults for the remaining options -\u003e Finally, click \"Create\".\n\n![](./assets/snap_create_button.png)\n\n![](./assets/snap_create_2.png)\n\n\n- Click Create a folder and name it \"storage\" (you can choose any name).\n\n![](./assets/snap_create_folder.png)\n\nNow you should see a folder similar to the screenshot below:\n\n![](./assets/snap_create_folder_2.png)\n\nThe URL or location of this folder is `gs://\u003cbucket_name\u003e/\u003cfolder_name\u003e`. For the above folder, it is `gs://dvctestbucket/storage`, where `dvctestbucket` is the bucket name and `storage` is the folder name.\n\n\n### Local Setup\n\nFrom here on, in your local setup, you need two things: the JSON key file (dvcmanager-38xxxxxxx.json, or whichever projectname-xxx.json file you downloaded as a key) and the Google Storage URL (gs://dvctestbucket/storage).\n\n- Create a folder named data locally.\n\n- Copy the contents of this Kaggle dataset (https://www.kaggle.com/datasets/khushikhushikhushi/dog-breed-image-dataset) into the data folder and unzip it. Remove all files that are not needed (e.g., archive.zip is not needed after unzipping).\n\n**Tree example**\n\n|- data\n\n|----dataset\n\n|--------Beagle\n\n|--------Boxer\n\n|-------- etc folders\n\n- Install dvc and dvc-gs\n\n```pip install dvc dvc-gs```\n\n- Run `git init` (if you are not in a git folder already)\n\n- Run `dvc init`\n\n- Now run the `dvc remote add -d myremote gs://\u003cmybucket\u003e/\u003cpath\u003e` command. 
Reference [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-cloud-storage)\n\ne.g. ```dvc remote add -d myremote gs://dvctestbucket/storage```\n\nYou will see that \"myremote\" has been added to the .dvc/config file.\n\n![dvc config](./assets/snap_myremote.png)\n\n- Run ```dvc remote modify --local myremote credentialpath 'path/to/project-XXX.json'```, for example ```dvc remote modify --local myremote credentialpath devcmanager-385390fe7f4f.json```\n\n- Run ```dvc add data```, i.e. `dvc add \u003cdata_folder_name\u003e`\n\n- Run ```dvc config core.autostage true``` (optional)\n\n- Run ```dvc push -r myremote -v```\n\n- Pushing around 800 MB of data takes about 10 minutes locally; in GitHub Actions, allow about 15 minutes.\n\nNow you can check your Google Cloud bucket; you should see a folder named \"files\" like this:\n\n![](./assets/snap_files_folder.png)\n\n### Github Repo setup\n\n- Now push this to your GitHub repository. Note that in the .dvc folder, by default, you can only push the \"config\" and \".gitignore\" files. Don't change this; let it remain as is.\n\n- Important: Never push the project-xxx.json file. If you do, Google will identify it and revoke the token, and you will need to set up the key again.\n\n- Add only the `.dvc/config`, `.gitignore`, and `data.dvc` files.\n\n- After pushing, open your repository's settings on GitHub, go to \"Secrets and variables\" -\u003e \"Actions\" -\u003e \"Repository secret\", and create a secret named \"GDRIVE_CREDENTIALS_DATA\". Copy the content of your project-xxx.json file (credentials.json file) into the content field.\n\n![secrets](./assets/snap_add_secret.png)\n\n### Github Actions\n\n- Create a `.github/workflows` folder locally for setting up your GitHub Actions workflow.\n\n- You can refer to the `dvc-pipeline.yml` file for the complete content.\n\nBelow is the code used to set up authentication and pull data from the Google Cloud Storage bucket inside GitHub CI/CD:\n\n\n```\n      # Note you can also directly use \"GDRIVE_CREDENTIALS_DATA\" as env variable and pull it\n      - name: Create credentials.json\n        env:\n          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n        run: |\n          # Quote the variable so the shell writes the JSON content unchanged\n          echo \"$GDRIVE_CREDENTIALS_DATA\" \u003e credentials_1.json\n\n      - name: Modify DVC Remote\n        run: |\n          dvc remote modify --local myremote credentialpath credentials_1.json\n\n      - name: DVC Pull Data\n        run: |\n          dvc pull -v\n```\n\n- Now you can trigger the workflow by clicking \"Run workflow\" in GitHub Actions.\n\n![workflow-trigger](./assets/snap_workflow_trigger.png)\n\n- You might see an error like this, but it is not a problem; wait for some time while the files are downloaded internally.\n\n![default_error](./assets/snap_error_default.png)\n\n- After about 5 minutes (depending on the data size), you should see a successful run.\n\n![run_success](./assets/snap_run_success.png)\n\n**Reference**\n\n- Referred only to the first point under \"Using service accounts\" in https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-service-accounts\n- Google Cloud Storage - https://dvc.org/doc/user-guide/data-management/remote-storage/google-cloud-storage#google-cloud-storage\n- Custom authentication for Google 
cloud storage - https://dvc.org/doc/user-guide/data-management/remote-storage/google-cloud-storage#custom-authentication\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Fdvc-gcs-bucket-workflow-setup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fajithvcoder%2Fdvc-gcs-bucket-workflow-setup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Fdvc-gcs-bucket-workflow-setup/lists"}