https://github.com/ajithvcoder/dvc-gdrive-workflow-setup
tutorial to connect dvc and gdrive and run github actions
data-version-control dvc-pipeline github-actions github-workflows google-drive google-drive-api
- Host: GitHub
- URL: https://github.com/ajithvcoder/dvc-gdrive-workflow-setup
- Owner: ajithvcoder
- Created: 2024-10-20T17:57:39.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-20T20:40:26.000Z (4 months ago)
- Last Synced: 2024-12-22T01:13:29.507Z (2 months ago)
- Topics: data-version-control, dvc-pipeline, github-actions, github-workflows, google-drive, google-drive-api
- Size: 1.71 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
### Using Service Account Method with DVC in GitHub CI/CD Pipeline
**Content**
1. [Setup Service account and Google drive folder](#setup-service-account-and-google-drive-folder)
2. [Local Setup](#local-setup)
3. [Github Repo setup](#github-repo-setup)
4. [Github Actions](#github-actions)

### Setup Service account and Google drive folder
In the Google Cloud console, go to APIs & Services and click Enable APIs and Services.
[Image: services]
Enable the Drive Labels API, the Google Drive API, and the Google Drive Activity API.
[Image: service_api]
**Setup Service account and get json key**
In this method, data is stored in Google Drive and fetched using service account authentication.
To create a service account, navigate to IAM & Admin in the left sidebar and select Service Accounts.
[Image: service_account_icon]
Click + CREATE SERVICE ACCOUNT and enter a service account name (e.g., "My DVC Project"). If you are new and don't know which permissions to choose, the Owner role works. Note that Owner is a very broad role, so prefer a narrower one if you know what you need.
[Image: owner-permission]
Add all user accounts that need access.
[Image: email-access]
Then click CREATE AND CONTINUE. Click DONE, and you will be returned to the overview page.
Now you can see your service account; click on it and go to the Keys tab.
[Image: serivce-mail-id]
Under Add Key, select Create New Key, choose JSON, and click CREATE.
[Image: key-creation]
Download the generated projectname-xxxxxx.json key file to a safe location.
Important: Store the key in a local folder as credentials.json, but do not commit it to GitHub. If you do, GitHub will raise a warning and Google will be notified and revoke the credentials.
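Before wiring the key into DVC, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch, not part of the original tutorial: the helper name `check_service_account_key` and the set of fields it checks are illustrative assumptions, based on the fields a Google service account key file normally contains.

```python
import json

# Fields a Google service account key file normally carries. The exact set
# checked here is an illustrative choice, not an official schema.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text):
    """Return a list of problems found in the key JSON (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)]
    # A key downloaded for a service account has type "service_account".
    if data.get("type") not in (None, "service_account"):
        problems.append(f"unexpected type: {data['type']!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical file name; use the name of the key you downloaded.
    with open("credentials.json", encoding="utf-8") as handle:
        for problem in check_service_account_key(handle.read()):
            print(problem)
```

The `client_email` value in the key is the address the Drive folder has to be shared with in the next step.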
**Google drive folder**
Create a folder in your Google Drive. I created a folder named "dvc-storage-test".
[Image: folder]
Important: Share the folder with your service account's email address (the `client_email` value in the JSON key) and give it **editor** access. I also set the folder's general access to `anyone with the link` with **editor** access, but the explicit share is the step to rely on: the folder should be shared with specific users (or groups) so they can use it with DVC, and "Anyone with a link" is not guaranteed to work.
[Image: permission]
Now get the ID of the folder. For example, my folder URL is `https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`, so my Google Drive folder ID is `1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`.
This is the configuration URL I need to add to the DVC config later: `gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`, i.e. `gdrive://<folder_id>`.
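The folder-link-to-remote-URL step above can be sketched in Python. This is only an illustration: the function name `gdrive_remote_url` is made up for this example, and the pattern assumes the folder link has the `/folders/<id>` shape shown above.

```python
import re

# Matches the ID segment of a Drive folder link such as
# https://drive.google.com/drive/folders/<id>
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def gdrive_remote_url(folder_link):
    """Turn a Drive folder link into the gdrive:// URL used as the DVC remote."""
    match = _FOLDER_RE.search(folder_link)
    if match is None:
        raise ValueError(f"no folder ID found in {folder_link!r}")
    return f"gdrive://{match.group(1)}"


if __name__ == "__main__":
    link = "https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG"
    print(gdrive_remote_url(link))
```

Query strings that Drive sometimes appends to a shared link (for example `?usp=sharing`) are ignored, because the pattern stops at the first character that cannot be part of an ID.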
Reference: [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format)

### Local Setup
From here on, in your local setup, you need two things: the JSON key file you downloaded (dvcmanager-38xxxxxxx.json, or whatever projectname-xxx.json name your key has) and the Google Drive URL (`gdrive://<folder_id>`).
- Create a folder named data locally.
- Copy the contents from this Kaggle dataset (https://www.kaggle.com/datasets/khushikhushikhushi/dog-breed-image-dataset) into the data folder and unzip it. Remove all files that are not needed (e.g., archive.zip is not needed after unzipping).
**Tree example**
|- data
|----dataset
|--------Beagle
|--------Boxer
|-------- etc folders
|-------- etc folders
- Install dvc and dvc-gdrive
```pip install dvc dvc-gdrive```
- Run git init (if you are not in a git folder already)
- Run dvc init
- Now run the `dvc remote add -d myremote gdrive://<folder_id>` command. Reference [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format)
eg: ```dvc remote add -d myremote gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG```
You will see that "myremote" has been added in the `.dvc/config` file.
- Run `dvc remote modify myremote gdrive_use_service_account true`
- Run `dvc remote modify myremote gdrive_acknowledge_abuse true`
- Run ```dvc remote modify myremote --local gdrive_service_account_json_file_path path/to/file.json```. For example: ```dvc remote modify --local myremote gdrive_service_account_json_file_path devcmanager-385390fe7f4f.json```
You can see a similar config in your `.dvc/config` file. The key path set with `--local` is written to `.dvc/config.local` instead, which Git ignores.
[Image: dvc config]
- Run ```dvc add data```, i.e. `dvc add <folder_name>`
- Run ```dvc config core.autostage true``` (optional)
- Run ```dvc push -r myremote -v```
- Pushing takes time: allow about 10 minutes for around 800 MB of data, and about 15 minutes when running in GitHub Actions.
Now check your Google Drive folder; you should see a folder named "files".
[Image: config_file]
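The DVC steps above can be collected into one small script. This is a sketch under assumptions rather than the author's tooling: the function names `dvc_setup_commands` and `run_all` are hypothetical, `git init` and the optional `core.autostage` setting are left out, and the commands are only printed unless `dry_run=False` is passed.

```python
import subprocess


def dvc_setup_commands(folder_id, key_path, remote="myremote", data_dir="data"):
    """Return the DVC commands from the steps above as argv lists."""
    modify = ["dvc", "remote", "modify"]
    return [
        ["dvc", "init"],
        ["dvc", "remote", "add", "-d", remote, f"gdrive://{folder_id}"],
        modify + [remote, "gdrive_use_service_account", "true"],
        modify + [remote, "gdrive_acknowledge_abuse", "true"],
        # --local keeps the key path in .dvc/config.local, outside Git
        modify + ["--local", remote, "gdrive_service_account_json_file_path", key_path],
        ["dvc", "add", data_dir],
        ["dvc", "push", "-r", remote, "-v"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(dvc_setup_commands("1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG", "path/to/file.json"))
```

Running the script as-is only prints the seven commands, which makes it easy to compare them with the list above before executing anything.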
### Github Repo setup
- Now push this to your GitHub repository. Inside the .dvc folder, only the "config" and ".gitignore" files are tracked by Git by default; everything else is excluded by .dvc/.gitignore. Leave this as it is.
- Important: Never push the project-xxx.json file. If you do, Google will identify it and revoke the key, and you will need to create and configure a new one.
- Add only .dvc/config, .gitignore, and data.dvc files.
- After pushing, open your repository's Settings on GitHub, go to "Secrets and variables" -> "Actions", and under "Repository secrets" create a secret named "GDRIVE_CREDENTIALS_DATA". Copy the content of your project-xxx.json file (the credentials.json file) into the value field.
[Image: secrets]
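Since the key file must never be pushed, a quick check before committing can catch it. The sketch below is an optional addition, not part of the original workflow: the helper name `find_service_account_keys` is hypothetical, and it simply looks for JSON files whose `type` field is `service_account`.

```python
import json
from pathlib import Path


def find_service_account_keys(root):
    """Return JSON files under root whose "type" is "service_account"."""
    hits = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or not JSON, so not a key file
        if isinstance(data, dict) and data.get("type") == "service_account":
            hits.append(path)
    return hits


if __name__ == "__main__":
    for key_file in find_service_account_keys("."):
        print(f"do not commit: {key_file}")
```

Any file it reports should be listed in `.gitignore` or moved outside the repository before running `git add`.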
### Github Actions
- Create a .github/workflows folder locally for setting up your GitHub Actions workflow.
- You can refer to the `dvc-pipeline.yml` file for complete content.
Below is the code used to set up authentication and pull data from Google Drive inside GitHub CI/CD:
```
# Note: you can also use "GDRIVE_CREDENTIALS_DATA" directly as an env variable and pull with it
- name: Create credentials.json
  env:
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
  run: |
    # Quote the variable so the shell writes the JSON unchanged
    echo "$GDRIVE_CREDENTIALS_DATA" > credentials_1.json

- name: Modify DVC Remote
  run: |
    uv run dvc remote modify --local myremote gdrive_service_account_json_file_path credentials_1.json

- name: DVC Pull Data
  run: |
    uv run dvc pull -v
```
- Now you can trigger the workflow by clicking "Run workflow" in GitHub Actions.
Note: I have used the `uv` [package](https://pypi.org/project/uv/) in the GitHub workflow to set up a virtual environment, because `dvc-gdrive` was causing some issues on the GitHub server instance. If you do not use `uv`, run the dvc commands without the `uv run` prefix.
[Image: workflow-trigger]
- You might see an error in the log at first (see the image below). It is not a problem; wait for some time, because the files are still being downloaded internally.
[Image: default_error]
- After about 5 minutes (depending on the data size) you should see a successful run.
[Image: run_success]
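The "Create credentials.json" step in the workflow can also be done from Python, which makes it possible to validate the secret before DVC uses it. This is a minimal sketch under assumptions: the function name `write_credentials` is hypothetical, and it expects the `GDRIVE_CREDENTIALS_DATA` environment variable to hold the full key JSON, as in the workflow above.

```python
import json
import os


def write_credentials(path="credentials_1.json", env_var="GDRIVE_CREDENTIALS_DATA"):
    """Write the key JSON held in env_var to path and return its client_email."""
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set or empty")
    data = json.loads(raw)  # fails early if the secret was pasted incorrectly
    if not isinstance(data, dict):
        raise ValueError(f"{env_var} does not contain a JSON object")
    # Create the file readable and writable by the owner only (mode 0o600).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data.get("client_email")


if __name__ == "__main__":
    print("wrote key for", write_credentials())
```

A failure here points to a badly pasted secret, which is easier to diagnose than an authentication error in the middle of `dvc pull`.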
**Reference**
- Using service accounts with Google Drive (only the first point of "Using service accounts" was followed) - https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-service-accounts
- URL config setup - https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format