# MLOps-1: Get Started with DVC

DVC is "Git for data": it lets us version datasets and models alongside the code in a Git repository.

## Initializing a project
Imagine we want to build an ML project from scratch. Let's start by creating a Git repository:

```
git init
```

We will use our current working directory as a DVC project. Let's initialize it by running `dvc init` inside a Git project:

```
dvc init
```

A few internal files are created that should be added to Git:

```
git status
git add .
git commit -m "Initialize DVC"
```
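
For reference, the internal files that `git status` reports at this point are DVC's configuration and ignore files, typically (the exact set may vary slightly between DVC versions):

```
.dvc/.gitignore
.dvc/config
.dvcignore
```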

**_Now we're ready to use DVC!_**

## Tracking data

Working inside an initialized project directory, let's pick a piece of data to work with. We'll use an example **data.xml** file, though any text or binary file (or directory) will do. Start by running:
```
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
```

*Note:* `dvc get` can download any file or directory tracked in a DVC repository, which effectively turns any such Git repo into a "data registry".
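
If a particular version of the file is ever needed, `dvc get` also accepts a `--rev` option that selects a Git revision of the source repository (the revision below is just a placeholder):

```
dvc get --rev <branch-or-tag> https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
```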

Use `dvc add` to start tracking the dataset file:
```
dvc add data/data.xml
```

*Note:* DVC stores information about the added file in a special `.dvc` file named `data/data.xml.dvc`. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking.
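
As an illustration, `data/data.xml.dvc` looks roughly like this (the exact fields depend on the DVC version; the values here are placeholders):

```
outs:
- md5: <content hash of data.xml>
  size: <file size in bytes>
  path: data.xml
```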

Next, run the following commands to track changes in Git:
```
git add data/data.xml.dvc data/.gitignore
git commit -m "Add raw data"
```

*Note:* Now the **metadata about our data** is versioned alongside our source code, while the original data file was added to `data/.gitignore` so Git ignores it.
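
For reference, the `data/.gitignore` generated by `dvc add` should contain a single entry for the data file (recent DVC versions prefix it with a slash):

```
/data.xml
```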

## Storing and sharing

We can upload DVC-tracked data to a variety of storage systems (remote or local) referred to as "remotes". For simplicity, we will use a "local remote" for practice, which is just a directory in the local file system.

### Configuring a remote

Before pushing data to a remote we need to set it up using the `dvc remote add` command:

```
mkdir D:\workspace-mlops\dvcstore
dvc remote add -d myremote D:\workspace-mlops\dvcstore
```
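
`dvc remote add` writes this configuration to `.dvc/config`, so it is worth committing that change to Git so collaborators get the same remote (the commit message is just an example):

```
git add .dvc/config
git commit -m "Configure local remote"
```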

*Note:* DVC supports many remote **storage types**, including *Amazon S3, NFS, SSH, Google Drive, Azure Blob Storage, and HDFS*.

For example, a common real-world setup is an *Amazon S3* remote:
```
dvc remote add -d storage s3://mybucket/dvcstore
```
For this to work, we'll need an AWS account and credentials set up to allow access.
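
Either way, we can list the configured remotes and their URLs to verify the setup:

```
dvc remote list
```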

### Uploading data
Now that a storage remote is configured, run `dvc push` to upload the data:
```
dvc push
```

*Note:* `dvc push` copied the data cached locally to the remote storage we set up earlier.

Usually, we would also want to track the corresponding code and metadata changes in Git (`git add`, `git commit` and `git push`).
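
In this walkthrough that boils down to something like the following (the commit message is only an example):

```
git add .
git commit -m "Track dataset and remote config"
git push
```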

### Retrieving data

Once DVC-tracked data and models are stored remotely, they can be downloaded with `dvc pull` when needed (e.g. in other copies of this project). Usually, we run it after `git pull` or `git clone`.

Let's try this now:

```
dvc pull
```

*Note:* Because we ran `dvc push` above, this `dvc pull` was short-circuited by DVC for efficiency: the project's `data/data.xml` file, our cache, and the remote storage were already in sync. If we want DVC to actually move data around, we need to empty the cache and delete `data/data.xml` from the project.

Let's do that now:
```
rm -rf .dvc/cache
rm -f data/data.xml
```

Now we can run `dvc pull` to retrieve the data from the remote (`-v` turns on verbose output):
```
dvc pull -v
```
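
If the pull succeeded, the dataset file should be back in the workspace:

```
ls data/
```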

## Making local changes

Next, let's say we obtained more data from some external source. We will simulate this by doubling the dataset contents:
```
cp data/data.xml data/dup-data.xml
cat data/dup-data.xml >> data/data.xml
rm -f data/dup-data.xml
```
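
Before re-adding, we can confirm that DVC notices the modification; `dvc status` compares the workspace against the `.dvc` files and should report `data/data.xml` as changed:

```
dvc status
```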

After modifying the data, run `dvc add` again to track the latest version:
```
dvc add data/data.xml
```

Now we can run `dvc push` to upload the changes to the remote storage, followed by a `git commit` to track them:

```
dvc push
git add . && git commit -m "Dataset updated"
```

## Switching between versions
A common workflow is to use `git checkout` to switch to a branch or check out a specific revision of a `.dvc` file, followed by `dvc checkout` to sync the data into your workspace:

```
git checkout <...>
dvc checkout
```

## Returning to a previous version of the dataset
Let's go back to the original version of the data:

```
git checkout HEAD~1 data/data.xml.dvc
dvc checkout
```
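
Since `git checkout <commit> <path>` updates both the index and the working tree, we can inspect what is about to be committed; the diff should show only the hash (and size) in the `.dvc` file reverting to their previous values:

```
git diff --staged data/data.xml.dvc
```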

Let's commit it (no need to do `dvc push` this time since this original version of the dataset was already saved):

```
git commit data/data.xml.dvc -m "Revert dataset updates"
```

**Note:** As we can see, DVC is technically not a version control system by itself! It manipulates `.dvc` files, whose contents define the data file versions. Git is already used to version our code, and now it can also version our data alongside it.