https://github.com/ajithvcoder/dvc-gdrive-workflow-setup
tutorial to connect dvc and gdrive and run github actions
- Host: GitHub
- URL: https://github.com/ajithvcoder/dvc-gdrive-workflow-setup
- Owner: ajithvcoder
- Created: 2024-10-20T17:57:39.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-10-20T20:40:26.000Z (8 months ago)
- Last Synced: 2025-02-14T08:36:38.881Z (5 months ago)
- Topics: data-version-control, dvc-pipeline, github-actions, github-workflows, google-drive, google-drive-api
### Using Service Account Method with DVC in GitHub CI/CD Pipeline
**Content**
1. [Setup Service account and Google drive folder](#setup-service-account-and-google-drive-folder)
2. [Local Setup](#local-setup)
3. [Github Repo setup](#github-repo-setup)
4. [Github Actions](#github-actions)

### Setup Service account and Google drive folder
Go to Google cloud console -> Click APIs & Services -> Click Enable APIs and Services

Enable Drive Labels API, Google Drive API, Google Drive Activity API

**Setup Service account and get json key**
In this method, we store data in Google Drive and fetch it using service account authentication.
To create a service account, navigate to IAM & Admin in the left sidebar, and select Service Accounts.

Click + CREATE SERVICE ACCOUNT, enter a service account name (e.g., "My DVC Project"). If you are new and don't know what permissions to choose, it's better to give owner permissions.

Add all user accounts for which you need to grant access.

Then click CREATE AND CONTINUE. Click DONE, and you will be returned to the overview page.
Now you can see your service account; click on it and go to the Keys tab.

Under Add Key, select Create New Key, choose JSON, and click CREATE.
Download the generated projectname-xxxxxx.json key file to a safe location.

Important: Store the API key in a local folder as credentials.json, but do not commit it to GitHub. If you do so, GitHub will raise a warning, and Google will be notified, revoking the credentials.
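As a minimal sketch of keeping the key out of version control, the snippet below adds the key file to `.gitignore` and asks git to confirm it is ignored. It assumes the key was saved as `credentials.json` in the project root; adjust `KEY_FILE` (a name used here only for illustration) if yours differs.

```shell
# Assumption: the downloaded key was saved as credentials.json in the project root.
KEY_FILE="credentials.json"

# Add the key file to .gitignore unless it is already listed there.
touch .gitignore
grep -qxF "$KEY_FILE" .gitignore || echo "$KEY_FILE" >> .gitignore

# Inside a git repository, confirm that git really ignores the file.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  git check-ignore -q "$KEY_FILE" && echo "$KEY_FILE is ignored by git"
fi
```

Running it twice is harmless, since the entry is only appended when it is missing.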
**Google drive folder**
Create a folder in your Google Drive. I have created a folder named "dvc-storage-test".

Important: Give the folder `anyone with the link` permission with **editor** access. Also share it with your service account and give it editor access; for example, this is my service account mail id: "[email protected]". The folder should be shared with specific users (or groups) so they can use it with DVC. "Anyone with a link" is not guaranteed to work.

Now get the ID of the folder. For example, this is my folder URL: `https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`. So this is my gdrive folder ID: `1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`
This is the configuration URL I need to add to the dvc config later: `gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG`, i.e. `gdrive://` followed by the folder ID.
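The folder ID can also be pulled out of the URL with plain shell parameter expansion. This is only a sketch; the variable names `FOLDER_URL`, `FOLDER_ID` and `DVC_REMOTE_URL` are made up for illustration, and the URL is the example from above.

```shell
# Folder URL copied from the browser (the example from this README).
FOLDER_URL="https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG"

# Drop any query string (e.g. "?usp=sharing"), then keep what follows the last "/".
FOLDER_ID="${FOLDER_URL%%\?*}"
FOLDER_ID="${FOLDER_ID##*/}"

# Build the remote URL in the gdrive:// form used by the dvc config.
DVC_REMOTE_URL="gdrive://${FOLDER_ID}"
echo "$DVC_REMOTE_URL"
```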
Reference: [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format)

### Local Setup
From here on, in your local setup, you need two things: the JSON key file (dvcmanager-38xxxxxxx.json, or whatever projectname-xxx.json file you downloaded as a key) and the Google Drive URL (`gdrive://` followed by the folder ID).
- Create a folder named data locally.
- Copy the contents from this Kaggle dataset (https://www.kaggle.com/datasets/khushikhushikhushi/dog-breed-image-dataset) into the data folder and unzip it. Remove all files that are not needed (e.g., archive.zip is not needed after unzipping).
**Tree example**
|- data
|----dataset
|--------Beagle
|--------Boxer
|-------- etc folders
|-------- etc folders
- Install dvc and dvc-gdrive
```pip install dvc dvc-gdrive```
- Run git init (if you are not in a git folder already)
- Run dvc init
- Now run the `dvc remote add -d myremote gdrive://` command with your folder ID appended to the URL. Reference [here](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format)
eg: ```dvc remote add -d myremote gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG```
You will see that "myremote" has been added to the `.dvc/config` file.
- Run `dvc remote modify myremote gdrive_use_service_account true`
- Run `dvc remote modify myremote gdrive_acknowledge_abuse true`
- Run ```dvc remote modify myremote --local gdrive_service_account_json_file_path path/to/file.json```. For example: ```dvc remote modify --local myremote gdrive_service_account_json_file_path devcmanager-385390fe7f4f.json```
You can see a similar config in your `.dvc/config` file; the `--local` setting is written to `.dvc/config.local`, which is not committed to git.

- Run ```dvc add data```, i.e. `dvc add` followed by the name of your data folder
- Run ```dvc config core.autostage true``` (optional)
- Run ```dvc push -r myremote -v```
- Wait for about 10 minutes if it's around 800 MB of data for pushing; if it's in GitHub Actions, wait for 15 minutes.
Now you can check your Google Drive folder; you should see a folder named "files".
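The local setup steps above can be collected into one script. The sketch below is not part of the repository: the `run` helper and the `DRY_RUN` switch are hypothetical names added so the sequence can be previewed safely, and `KEY_FILE` is a placeholder path. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Sketch only: collects the local setup commands from the list above in order.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
FOLDER_ID="1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG"   # example folder ID from this README
KEY_FILE="path/to/file.json"                     # placeholder path to your key file

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run pip install dvc dvc-gdrive
run git init
run dvc init
run dvc remote add -d myremote "gdrive://${FOLDER_ID}"
run dvc remote modify myremote gdrive_use_service_account true
run dvc remote modify myremote gdrive_acknowledge_abuse true
run dvc remote modify --local myremote gdrive_service_account_json_file_path "$KEY_FILE"
run dvc add data
run dvc push -r myremote -v
```

The optional `dvc config core.autostage true` step is left out; add it before `dvc add data` if you want DVC to stage its files automatically.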

### Github Repo setup
- Now push this to your GitHub repository. Note that in the .dvc folder, by default, you can only push the "config" and ".gitignore" files. Don't change this; let it remain as is.
- Important: Never push the project-xxx.json file. If you do, Google will identify it and revoke the token; you'll need to set the key again.
- Add only .dvc/config, .gitignore, and data.dvc files.
- After pushing, open your repository on GitHub, go to "Settings" -> "Secrets and variables" -> "Actions" -> "Repository secret", and create a secret named "GDRIVE_CREDENTIALS_DATA". Copy the content of your project-xxx.json file (credentials.json file) into the content field.
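If you prefer the command line, the same secret can probably be created with the GitHub CLI instead of the web UI. This is an alternative sketch, not what the repository itself uses: it assumes `gh` is installed and authenticated, that you run it inside the cloned repository, and that the key is stored as `credentials.json`.

```shell
SECRET_NAME="GDRIVE_CREDENTIALS_DATA"
KEY_FILE="credentials.json"   # assumption: local copy of the service account key

# gh reads the secret value from standard input when it is redirected from a file.
if command -v gh >/dev/null 2>&1 && [ -f "$KEY_FILE" ]; then
  gh secret set "$SECRET_NAME" < "$KEY_FILE"
else
  echo "Skipping: install the GitHub CLI and place the key at $KEY_FILE first."
fi
```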

### Github Actions
- Create a .github/workflows folder locally for setting up your GitHub Actions workflow.
- You can refer to the `dvc-pipeline.yml` file for complete content.
Below is the code used to set up authentication and pull data inside GitHub CI/CD from Google Drive:
```
# Note: you can also use "GDRIVE_CREDENTIALS_DATA" directly as an env variable and pull with it
- name: Create credentials.json
  env:
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
  run: |
    # Quote the variable so the JSON is written to the file unchanged
    echo "$GDRIVE_CREDENTIALS_DATA" > credentials_1.json

- name: Modify DVC Remote
  run: |
    uv run dvc remote modify --local myremote gdrive_service_account_json_file_path credentials_1.json

- name: DVC Pull Data
  run: |
    uv run dvc pull -v
```
- Now you can trigger the workflow by clicking "Run workflow" in GitHub Actions.
Note: I have used the `uv` [package](https://pypi.org/project/uv/) in the GitHub workflow to set up a virtual environment, as `dvc-gdrive` was causing some issues on the GitHub server instance. You can also run the dvc commands without the `uv run` prefix.

- You might see an error at this point, but it is not a problem; wait for some time, as the files are being downloaded internally.

- After about 5 minutes (depending on the data size) you can see a successful run.
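The body of the "Create credentials.json" step can be made a little more defensive. The sketch below is a suggestion rather than the workflow's actual content: it writes the secret with `printf`, restricts the file permissions, and checks for the `client_email` field that service account key files contain. The dummy fallback value exists only so the snippet can be tried outside GitHub Actions.

```shell
# In GitHub Actions this variable comes from the repository secret.
# Assumption for local trials only: fall back to a dummy value when it is unset.
if [ -z "${GDRIVE_CREDENTIALS_DATA:-}" ]; then
  GDRIVE_CREDENTIALS_DATA='{"type": "service_account", "client_email": "dummy@example.invalid"}'
fi

# Quote the variable so whitespace and special characters are written unchanged.
printf '%s\n' "$GDRIVE_CREDENTIALS_DATA" > credentials_1.json
chmod 600 credentials_1.json

# Fail early if the content does not look like a service account key.
if grep -q '"client_email"' credentials_1.json; then
  echo "credentials_1.json written"
else
  echo "credentials_1.json does not look like a service account key" >&2
  exit 1
fi
```

Failing in this step gives a clearer message than an authentication error later during `dvc pull`.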

**Reference**
- Referred to only the first point in "Using service account" in https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-service-accounts
- URL config setup - https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#url-format
- Using Service Account - GDrive - https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive#using-service-accounts