# Rye + DVC: Cats vs Dogs Keras Tutorial

This repository guides you through building a machine learning pipeline for classifying images from the **Kaggle Cats vs Dogs** dataset using a custom CNN model built from scratch with **Keras**. The project uses **DVC** (Data Version Control) to manage the pipeline stages, datasets, and model metrics, while **Rye** is used to manage the Python environment. Follow the steps in this `README.md` to create and execute the pipeline stages manually.

## Project Overview

The pipeline includes the following stages:
1. **Download**: Download the dataset.
2. **Prepare**: Filter out corrupted images and save good ones into a TensorFlow dataset.
3. **Train**: Train a custom CNN model using Keras.
4. **Validate**: Evaluate the model on the validation data.

Each stage is created manually, and the results are tracked with DVC. Additionally, the **Tracking Metrics with Git Branches** section explains how to use Git branches and DVC metrics to track and compare model performance across different experiments.
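
As a preview of that workflow, and assuming your default branch is named `main` (adjust to your setup), comparing an experiment against it might look like this sketch once the full pipeline below is in place:

```bash
# Hypothetical experiment branch; any hyperparameter change works
git checkout -b exp-higher-lr
# ...edit train.py, then re-run the pipeline...
dvc repro
git add . && git commit -m "Experiment: higher learning rate"

# Compare metrics.json between this branch and main
dvc metrics diff main
```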

## Requirements

- [Git](https://git-scm.com/)
- [Rye](https://rye.astral.sh/) (for environment management)
- [DVC](https://dvc.org/doc/install)
- **Curl** for downloading the dataset
- **Keras**
- **Matplotlib** for visualization
- **Dvclive** for tracking metrics
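
Before starting, you can confirm the command-line tools are on your `PATH` (Keras, Matplotlib, and Dvclive are installed into the project environment in step 3 below):

```bash
git --version
rye --version
curl --version
dvc --version   # optional at this point; DVC is also added via Rye in step 3
```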

## Setting Up the Project

### 1. Fork the Repository

First, fork this repository into your own GitHub account. After forking, clone the repository:

```bash
git clone https://github.com/your-username/cats-vs-dogs-dvc.git
cd cats-vs-dogs-dvc
```
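
Optionally, add the original repository as an `upstream` remote so you can pull future updates into your fork:

```bash
git remote add upstream https://github.com/ei-grad/cats-vs-dogs-dvc.git
git fetch upstream
```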

### 2. Install Rye

If you don't have Rye installed, follow the installation instructions in the [official documentation](https://rye.astral.sh/guide/installation/).

### 3. Initialize the Python project and virtual environment

Rye [workspaces](https://rye.astral.sh/guide/workspaces/) and [virtual projects](https://rye.astral.sh/guide/virtual/) simplify dependency management for non-package projects. In this tutorial, since we don't have an actual Python package or library, running `rye init --virtual` ensures the project only syncs its dependencies (e.g., `keras`, `dvc`, `dvclive`). This setup is ideal for managing tools without creating a package. Additionally, packages or libraries can be added later in this workspace, sharing the same virtual environment.

Run the following command to initialize a virtual environment and install the required dependencies:

```bash
rye init --virtual
rye add dvc keras tensorflow matplotlib dvclive
```

This command creates a Python virtual environment and installs the necessary dependencies for the project.
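
For reference, `rye init --virtual` marks the project with a `virtual = true` flag under `[tool.rye]` in `pyproject.toml`, and `rye sync` re-resolves the environment if you later edit the dependency list by hand:

```bash
cat pyproject.toml   # note `virtual = true` under [tool.rye]
rye sync             # re-sync the environment after manual edits
```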

After initializing the environment and installing the dependencies, commit the initial state of the project:

```bash
git add . && git commit -m "Initialized python project via Rye"
```

### 4. Activate the Virtual Environment

Activate the virtual environment with:

```bash
source .venv/bin/activate
```

This command modifies the shell's environment variables to use the Python interpreter and packages installed in the `.venv` directory, isolating project dependencies from the global Python environment. If you start a new shell session, run this command again to ensure the correct environment is used.
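
If you prefer not to activate the environment, Rye can also run commands through it directly, for example:

```bash
rye run dvc --version
rye run python -c "import keras; print(keras.__version__)"
```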

If DVC is not used directly in your scripts, you can instead install it as a standalone tool in its own dedicated virtual environment with:

```bash
rye install dvc
```

### 5. Initialize DVC

Next, initialize DVC in the project directory:

```bash
dvc init

# Enable automatic `git add` after DVC stages
dvc config core.autostage true

git add . && git commit -m "Initialized DVC"
```
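
As a quick sanity check, confirm the autostage setting took effect:

```bash
dvc config core.autostage   # should print "true"
```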

## Pipeline Stages

### 1. Download the Dataset

Create a script `download.sh` to download the dataset archive:

```bash
#!/bin/bash
set -e  # fail the stage if any command (e.g. the download) fails

mkdir -p data && cd data
curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip
```

Register and run it as a DVC stage named `download`:

```bash
dvc stage add --run -n download \
    -d download.sh \
    -o data/kagglecatsanddogs_5340.zip \
    bash download.sh
```
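
Because `--run` executes the stage immediately, the archive should now be on disk; the stage definition itself lives in `dvc.yaml`, with checksums recorded in `dvc.lock`:

```bash
ls -lh data/   # the downloaded zip (several hundred MB)
cat dvc.yaml   # stage command, dependencies, and outputs
cat dvc.lock   # checksums captured when the stage ran
```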

Commit the changes:

```bash
git add . && git commit -m "Added download stage"
```

### 2. Prepare the Data

Create a script `prepare.py` to extract the archive, filter out corrupted images, and build training and validation splits with Keras' `image_dataset_from_directory` utility (data augmentation and prefetching are applied later, in the train stage):

```python
import shutil
import zipfile

import keras

with zipfile.ZipFile("data/kagglecatsanddogs_5340.zip", "r") as zip_ref:
    zip_ref.extractall("data")

# Create training and validation datasets
train_ds, val_ds = keras.utils.image_dataset_from_directory(
    "data/PetImages",
    validation_split=0.2,
    subset="both",
    seed=1337,
    image_size=(180, 180),
    batch_size=None,
)

# Save the processed datasets
# Skip corrupt images via `.ignore_errors()` (otherwise the
# `Dataset.save()` will hang)
print("Saving the train dataset...")
train_ds.ignore_errors().save("data/train_ds")
print("Saving the validation dataset...")
val_ds.ignore_errors().save("data/val_ds")

# Remove PetImages directory, as it's no longer needed
shutil.rmtree("data/PetImages")
```

Run the `prepare` stage:

```bash
dvc stage add --run -n prepare \
    -d prepare.py \
    -d data/kagglecatsanddogs_5340.zip \
    -o data/train_ds \
    -o data/val_ds \
    python prepare.py
```

Commit the changes:

```bash
git add . && git commit -m "Added prepare stage"
```
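
The pipeline now has two chained stages, since `prepare` depends on the output of `download`. You can visualize the dependency graph with:

```bash
dvc dag
```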

### 3. Train the Model

Create a script `train.py` to build and train a CNN using Keras and track metrics with Dvclive:

```python
import os

import keras
from keras import layers
from dvclive import Live
from tensorflow.data import Dataset, AUTOTUNE


# Model creation
def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    x = layers.Rescaling(1.0 / 255)(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    previous_block_activation = x

    for size in [256, 512, 728]:
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
        residual = layers.Conv2D(size, 1, strides=2, padding="same")(
            previous_block_activation
        )
        x = layers.add([x, residual])
        previous_block_activation = x

    x = layers.SeparableConv2D(1024, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.25)(x)

    # A single logit is enough for binary classification
    if num_classes == 2:
        units = 1
    else:
        units = num_classes
    outputs = layers.Dense(units, activation=None)(x)

    return keras.Model(inputs, outputs)


# Data augmentation
def data_augmentation(images):
    for layer in [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
    ]:
        images = layer(images)
    return images


batch_size = 128

# Load datasets
train_ds = Dataset.load("data/train_ds").batch(batch_size, drop_remainder=True)
val_ds = Dataset.load("data/val_ds").batch(batch_size, drop_remainder=True)

# Apply data augmentation and prefetching
train_ds = train_ds.map(
    lambda img, label: (data_augmentation(img), label),
    num_parallel_calls=AUTOTUNE,
)
train_ds = train_ds.prefetch(AUTOTUNE)
val_ds = val_ds.prefetch(AUTOTUNE)

learning_rate = 3e-4
epochs = 25

model = make_model(input_shape=(180, 180, 3), num_classes=2)

# Compile the model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Track training metrics with Dvclive
with Live() as live:
    live.log_param("learning_rate", learning_rate)
    live.log_param("optimizer", model.optimizer.__class__.__name__)
    live.log_param("epochs", epochs)

    # Train one epoch at a time so each epoch's metrics land in its own step
    for epoch in range(epochs):
        history = model.fit(train_ds, validation_data=val_ds, epochs=1)

        acc = history.history["accuracy"][-1]
        val_acc = history.history["val_accuracy"][-1]
        loss = history.history["loss"][-1]
        val_loss = history.history["val_loss"][-1]

        live.log_metric("train_accuracy", acc)
        live.log_metric("val_accuracy", val_acc)
        live.log_metric("train_loss", loss)
        live.log_metric("val_loss", val_loss)
        live.next_step()

# Make sure the output directory exists before saving
os.makedirs("model", exist_ok=True)
model.save("model/cats_vs_dogs.keras")
```

Run the `train` stage:

```bash
dvc stage add --run -n train \
    -d train.py \
    -d data/train_ds \
    -d data/val_ds \
    -o model/cats_vs_dogs.keras \
    python train.py
```

Commit the changes (the trained model itself is stored in the DVC cache, not in Git):

```bash
git add . && git commit -m "Added train stage"
```
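
Dvclive also writes its step-by-step logs to a `dvclive/` directory by default (the exact layout varies between Dvclive versions); as a sketch, you can inspect the logs and, depending on your DVC version, render the training curves:

```bash
ls dvclive       # params, metrics, and plot data logged during training
dvc plots show   # renders an HTML report of the logged curves
```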

### 4. Validate the Model

Create a script `validate.py` to evaluate the model on the validation data and save metrics:

```python
import json

import keras
from tensorflow.data import Dataset

# Load model and validation dataset
model = keras.models.load_model("model/cats_vs_dogs.keras")
val_ds = Dataset.load("data/val_ds").batch(128, drop_remainder=True)

# Evaluate the model
loss, acc = model.evaluate(val_ds)

# Save metrics for DVC to track
metrics = {"accuracy": acc}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)

print(f"Validation accuracy: {acc:.2f}")
```

Run the `validate` stage:

```bash
dvc stage add --run -n validate \
    -d validate.py \
    -d model/cats_vs_dogs.keras \
    -d data/val_ds \
    -M metrics.json \
    python validate.py
```

Verify the whole pipeline is up to date, then commit the changes:

```bash
dvc repro
git add . && git commit -m "Added validate stage"
```
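
At this point the full four-stage pipeline is reproducible from scratch, and the tracked metric can be printed at any time:

```bash
dvc metrics show   # accuracy recorded in metrics.json
dvc status         # should report everything as up to date
```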