
# Domino AI Training Flow

This repo demonstrates a sample AI data preparation and training pipeline using Domino Flows. The flow:

1. Loads two datasets from different sources
2. Merges the data together
3. Performs some data preprocessing
4. Trains a model using the processed data

To run the flow, execute the following command in a workspace using the Domino Standard Environment:

```
pyflyte run --remote flow.py model_training --data_path_a /mnt/code/data/datasetA.csv --data_path_b /mnt/code/data/datasetB.csv
```

Once you run the command, navigate to `Flows page > Runs pivot > Run name` to monitor the flow execution:

![Execution Link](https://github.com/ddl-jwu/domino-ai-flows/blob/00f667768b18e28985aeda8721ca8e1b8c92a9c4/screenshots/run.png?raw=true)

Clicking the `View graph` button at the top right of the runs view takes you to the Flyte UI, where you can see a graph view of the flow:

![Monitor](https://github.com/ddl-jwu/domino-ai-flows/blob/00f667768b18e28985aeda8721ca8e1b8c92a9c4/screenshots/graph.png?raw=true)

# Flow Definition

The flow is defined in the `flow.py` file. The sample flow contains five tasks: two for loading data, one for merging it, one for preprocessing, and one for model training. Each task ultimately triggers a Domino Job and returns its outputs. We'll go through each task in detail.

**Data Loading**

The data loading tasks load the data into the flow and snapshot it, capturing the data at its current point in time. The code snippet below shows the definitions of the two data loading tasks:

```
task1 = run_domino_job_task(
    flyte_task_name='Load Data A',
    command='python /mnt/code/scripts/load-data-A.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='data_path', type=str, value=data_path_A)
    ],
    output_specs=[
        Output(name='datasetA', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)

task2 = run_domino_job_task(
    flyte_task_name='Load Data B',
    command='python /mnt/code/scripts/load-data-B.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='data_path', type=str, value=data_path_B)
    ],
    output_specs=[
        Output(name='datasetB', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)
```

As you can see, we use the `run_domino_job_task` helper method to create each task with the appropriate parameters. The parameters in detail:

- `flyte_task_name`: The name for the task.
- `command`: The command that will be executed in the Domino Job.
- `environment`: The name of the environment to use for the task. If not specified, the project's default environment is used.
- `hardware_tier_name`: The name of the hardware tier to use for the task. If not specified, the project's default hardware tier is used.
- `inputs`: A list of inputs that the task depends on. Note how the `data_path` inputs are set to the arguments provided on the command line when starting the flow.
- `output_specs`: A list of outputs that the task will produce. Supported input and output types are documented [here](https://docs.flyte.org/en/latest/user_guide/data_types_and_io/index.html).
- `use_project_defaults_for_omitted`: When `True`, any parameters left unspecified fall back to the project's defaults.

Inputs can be accessed within Domino Jobs at `/workflow/inputs/`. In a similar fashion, outputs must be written to `/workflow/outputs/` inside the Domino Job for them to be tracked and returned by the task. See the scripts inside the `scripts` folder for more details, and the sketch below for the general shape.
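For illustration, here is a minimal sketch of what a loading script following this convention might look like. This is a hypothetical stand-in for `load-data-A.py`, not its actual contents, and it assumes scalar inputs are materialized as plain-text files named after the input:

```
import shutil

# Hypothetical sketch: the 'data_path' input declared on the task appears
# as a file at /workflow/inputs/data_path containing the string value.
with open('/workflow/inputs/data_path') as f:
    data_path = f.read().strip()

# Writing the CSV to /workflow/outputs/datasetA (matching the declared
# output name) lets the flow track and snapshot it as the task's output.
shutil.copy(data_path, '/workflow/outputs/datasetA')
```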

**Data Merging**

With both datasets loaded, the next task merges them into a single dataset. The code snippet below shows the definition of the data merging task:

```
task3 = run_domino_job_task(
    flyte_task_name='Merge Data',
    command='python /mnt/code/scripts/merge-data.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='datasetA', type=FlyteFile[TypeVar('csv')], value=task1['datasetA']),
        Input(name='datasetB', type=FlyteFile[TypeVar('csv')], value=task2['datasetB'])
    ],
    output_specs=[
        Output(name='merged_data', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)
```

As you can see, the same `run_domino_job_task` function is used again. Note how the outputs of the data loading tasks are referenced via `task1['datasetA']` and `task2['datasetB']` and passed as inputs to the data merging task; this is what links the tasks together in the flow.

**Data Preparation**

The next task takes the merged dataset and performs any necessary data cleaning. The code snippet below shows the definition of the data preprocessing task:

```
task4 = run_domino_job_task(
    flyte_task_name='Process Data',
    command='python /mnt/code/scripts/process-data.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='data_path', type=FlyteFile[TypeVar('csv')], value=task3['merged_data'])
    ],
    output_specs=[
        Output(name='processed_data', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)
```

As you can see, the structure is very similar to that of the previous tasks; only the input (the merged dataset from `task3`) and the output change.

**Model Training**

The final task trains a model using the processed data. The code snippet below shows the definition for the model training task:

```
task5 = run_domino_job_task(
    flyte_task_name='Train Model',
    command='python /mnt/code/scripts/train-model.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='processed_data', type=FlyteFile[TypeVar('csv')], value=task4['processed_data']),
        Input(name='epochs', type=int, value=10),
        Input(name='batch_size', type=int, value=32)
    ],
    output_specs=[
        Output(name='model', type=FlyteFile[TypeVar('pkl')])
    ],
    use_project_defaults_for_omitted=True
)
```
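Inside the training job, the scalar inputs follow the same `/workflow/inputs` file convention described earlier. As a rough, hypothetical sketch (the actual `train-model.py` may differ, and the training logic is omitted):

```
import pickle

# Hypothetical sketch: scalar inputs arrive as files containing their values.
with open('/workflow/inputs/epochs') as f:
    epochs = int(f.read())
with open('/workflow/inputs/batch_size') as f:
    batch_size = int(f.read())

# The upstream file input is available at /workflow/inputs/processed_data.
# ... train a model on the processed data here ...
model = {'epochs': epochs, 'batch_size': batch_size}  # placeholder object

# Write the serialized model to /workflow/outputs/model so the flow
# tracks it as the task's 'model' output.
with open('/workflow/outputs/model', 'wb') as f:
    pickle.dump(model, f)
```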

The output of this last task is then returned as the final output of the flow.
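To make the wiring concrete, the sketch below shows how `flow.py` plausibly composes these tasks into a single Flyte workflow. The import path for the Domino helpers, the exact function signature, and the lowercase parameter names (matching the `pyflyte run` flags above) are assumptions; the task definitions are the ones shown earlier:

```
from typing import TypeVar

from flytekit import workflow
from flytekit.types.file import FlyteFile

# Assumed import path for the Domino flows helpers used in the snippets above:
from flytekitplugins.domino.helpers import Input, Output, run_domino_job_task


@workflow
def model_training(data_path_a: str, data_path_b: str) -> FlyteFile[TypeVar('pkl')]:
    # Load both datasets from their respective sources.
    task1 = run_domino_job_task(
        flyte_task_name='Load Data A',
        command='python /mnt/code/scripts/load-data-A.py',
        hardware_tier_name='Small',
        inputs=[Input(name='data_path', type=str, value=data_path_a)],
        output_specs=[Output(name='datasetA', type=FlyteFile[TypeVar('csv')])],
        use_project_defaults_for_omitted=True
    )
    task2 = run_domino_job_task(
        flyte_task_name='Load Data B',
        command='python /mnt/code/scripts/load-data-B.py',
        hardware_tier_name='Small',
        inputs=[Input(name='data_path', type=str, value=data_path_b)],
        output_specs=[Output(name='datasetB', type=FlyteFile[TypeVar('csv')])],
        use_project_defaults_for_omitted=True
    )
    # Merge the two datasets; referencing task outputs creates the dependencies.
    task3 = run_domino_job_task(
        flyte_task_name='Merge Data',
        command='python /mnt/code/scripts/merge-data.py',
        hardware_tier_name='Small',
        inputs=[
            Input(name='datasetA', type=FlyteFile[TypeVar('csv')], value=task1['datasetA']),
            Input(name='datasetB', type=FlyteFile[TypeVar('csv')], value=task2['datasetB'])
        ],
        output_specs=[Output(name='merged_data', type=FlyteFile[TypeVar('csv')])],
        use_project_defaults_for_omitted=True
    )
    # Clean and preprocess the merged dataset.
    task4 = run_domino_job_task(
        flyte_task_name='Process Data',
        command='python /mnt/code/scripts/process-data.py',
        hardware_tier_name='Small',
        inputs=[Input(name='data_path', type=FlyteFile[TypeVar('csv')], value=task3['merged_data'])],
        output_specs=[Output(name='processed_data', type=FlyteFile[TypeVar('csv')])],
        use_project_defaults_for_omitted=True
    )
    # Train the model on the processed data.
    task5 = run_domino_job_task(
        flyte_task_name='Train Model',
        command='python /mnt/code/scripts/train-model.py',
        hardware_tier_name='Small',
        inputs=[
            Input(name='processed_data', type=FlyteFile[TypeVar('csv')], value=task4['processed_data']),
            Input(name='epochs', type=int, value=10),
            Input(name='batch_size', type=int, value=32)
        ],
        output_specs=[Output(name='model', type=FlyteFile[TypeVar('pkl')])],
        use_project_defaults_for_omitted=True
    )

    # The model from the final task becomes the flow's output.
    return task5['model']
```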

# Environment Requirements

The project requires only the `Domino Standard Environment`, which comes included with all Domino deployments.

# Hardware Requirements

This project works with the default `small-k8s` hardware tier that comes with all Domino deployments.

# License
This template is licensed under Apache 2.0 and contains the following open source components:

* Flytekit [Apache 2.0](https://github.com/flyteorg/flytekit/blob/master/LICENSE)