# MLOps-1: Get Started with DVC
DVC is "Git for data"!
## Initializing a project
Imagine we want to build an ML project from scratch. Let's start by creating a Git repository:

```
git init
```

We will use our current working directory as a DVC project. Let's initialize it by running `dvc init` inside a Git project:

```
dvc init
```

A few internal files are created that should be added to Git:

```
git status
git add .
git commit -m "Initialize DVC"
```
**_Now we're ready to use DVC!_**
## Tracking data
Working inside an initialized project directory, let's pick a piece of data to work with. We'll use an example **data.xml** file, though any text or binary file (or directory) will do. Start by running:
```
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
```

*Note:* We used `dvc get` above to turn any Git repo into a "data registry". `dvc get` can download any file or directory tracked in a DVC repository.
Use `dvc add` to start tracking the dataset file:
```
dvc add data/data.xml
```

*Note:* DVC stores information about the added file in a special `.dvc` file named `data/data.xml.dvc`. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking.
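The `.dvc` file is plain YAML. Its contents look roughly like this (a sketch; the hash, size, and exact fields depend on the file and on your DVC version):

```
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml
```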
Next, run the following commands to track changes in Git:
```
git add data/data.xml.dvc data/.gitignore
git add .
git commit -m "Add raw data"
```

*Note:* Now the **metadata about our data** is versioned alongside our source code, while the original data file was added to `.gitignore`.
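The `data/.gitignore` file that `dvc add` wrote should contain little more than the raw file's path (a sketch):

```
/data.xml
```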
## Storing and sharing
We can upload DVC-tracked data to a variety of storage systems (remote or local) referred to as "remotes". For simplicity, we will use a "local remote" for practice, which is just a directory in the local file system.
### Configuring a remote
Before pushing data to a remote we need to set it up using the `dvc remote add` command:
```
mkdir D:\workspace-mlops\dvcstore
dvc remote add -d myremote D:\workspace-mlops\dvcstore
```

*Note:* DVC supports many remote **storage types**, including *Amazon S3, NFS, SSH, Google Drive, Azure Blob Storage, and HDFS*.
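Behind the scenes, `dvc remote add` records the remote in `.dvc/config` (the `-d` flag marks it as the default). After the command above, the file looks roughly like this (a sketch; the URL is whatever path or bucket you passed):

```
[core]
    remote = myremote
['remote "myremote"']
    url = D:\workspace-mlops\dvcstore
```

Since `.dvc/config` is tracked by Git, it's worth committing this change too (e.g. `git add .dvc/config && git commit -m "Configure default remote"`).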
An example for a common use case is configuring an *Amazon S3* remote:
```
dvc remote add -d storage s3://mybucket/dvcstore
```
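Credentials for an S3 remote can be supplied in the usual AWS ways (environment variables, `aws configure`, IAM roles). DVC can also store them per remote with `dvc remote modify --local`, which writes to `.dvc/config.local` so secrets stay out of Git (a sketch, assuming the remote is named `storage` as above and `mykey`/`mysecret` are placeholders):

```
dvc remote modify --local storage access_key_id 'mykey'
dvc remote modify --local storage secret_access_key 'mysecret'
```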
Either way, we'll need an AWS account and credentials that allow access to the bucket.

### Uploading data
Now that a storage remote is configured, run `dvc push` to upload data:

```
dvc push
```

*Note:* `dvc push` copied the data cached locally to the remote storage we set up earlier.

Usually, we would also want Git to track any code changes that led to the data change (`git add`, `git commit`, and `git push`).
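If you're curious, you can peek inside the local remote after the push. DVC stores data content-addressed by hash, so the layout looks roughly like this (a sketch; the exact directory structure varies between DVC versions):

```
ls -R D:/workspace-mlops/dvcstore
# e.g. files/md5/22/a1a2931c8370d3aeedd7183606fd7f  (the cached copy of data.xml)
```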
### Retrieving data
Once DVC-tracked data and models are stored remotely, they can be downloaded with `dvc pull` when needed (e.g. in other copies of this project). Usually, we run it after `git pull` or `git clone`.
Let's try this now:
```
dvc pull
```

*Note:* Because we just ran `dvc push`, this `dvc pull` is short-circuited by DVC for efficiency: the project's `data/data.xml` file, the local cache, and the remote storage are already in sync. To see DVC actually move data around, we need to empty the cache and delete `data/data.xml` from the project.
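You can confirm that the cache and the remote are in sync without transferring anything by comparing them with `dvc status` (the `-c`/`--cloud` flag compares the local cache against the default remote; a sketch):

```
dvc status -c
```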
Let's empty the cache and delete the file now:
```
rm -rf .dvc/cache
rm -f data/data.xml
```

Now we can run `dvc pull` to retrieve the data from the remote:

```
dvc pull -v
```

## Making local changes
Next, let's say we obtained more data from some external source. We will simulate this by doubling the dataset contents:
```
cp data/data.xml data/dup-data.xml
cat data/dup-data.xml >> data/data.xml
rm -f data/dup-data.xml
```

After modifying the data, run `dvc add` again to track the latest version:
```
dvc add data/data.xml
```

Now we can run `dvc push` to upload the changes to the remote storage, followed by a `git commit` to track them:
```
dvc push
git add . && git commit -m "Dataset updated"
```

## Switching between versions
A commonly used workflow is to use `git checkout` to switch to a branch or check out a specific `.dvc` file revision, followed by `dvc checkout` to sync data into your workspace:

```
git checkout <...>
dvc checkout
```
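If you find yourself running `dvc checkout` after every `git checkout`, DVC can automate it: `dvc install` sets up Git hooks (post-checkout, pre-commit, pre-push) so that, among other things, `dvc checkout` runs automatically after `git checkout` (a sketch):

```
dvc install
```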
## Return to a previous version of the dataset

Let's go back to the original version of the data:

```
git checkout HEAD~1 data/data.xml.dvc
dvc checkout
```

Let's commit it (no need to run `dvc push` this time since this original version of the dataset was already saved):
```
git commit data/data.xml.dvc -m "Revert dataset updates"
```

**Note:** As we can see, DVC is technically not a version control system by itself! It manipulates `.dvc` files, whose contents define the data file versions. Git is already used to version our code, and now it can also version our data alongside it.
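Because each version of the dataset is captured by a commit to the `.dvc` file, its history can be inspected with ordinary Git tooling (a sketch):

```
git log --oneline -- data/data.xml.dvc
```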