https://github.com/getindata/kedro-starters

Kedro starters by GetInData

Host: GitHub
URL: https://github.com/getindata/kedro-starters
Owner: getindata
License: apache-2.0
Created: 2022-10-14T12:16:10.000Z (almost 4 years ago)
Default Branch: develop
Last Pushed: 2023-12-05T18:26:25.000Z (over 2 years ago)
Last Synced: 2025-04-09T20:11:32.896Z (over 1 year ago)
Language: Python
Size: 169 KB
Stars: 3
Watchers: 11
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE


# Pipeline

> *Note:* This is a `README.md` boilerplate generated using `Kedro {{ cookiecutter.kedro_version }}`.

## Overview

[Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.
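As a rough illustration of the transcoding convention, the sketch below lists two catalog entries that point at the same file but declare different dataset types. It is written as a Python dict rather than YAML. The entry names, the file path, and the `pandas.ParquetDataSet` type are assumptions for illustration and are not taken from this project's catalog; `base_name` is a hypothetical helper that only shows how the `@` suffix separates the dataset name from its format.

```python
# Hypothetical catalog entries, written as a Python dict instead of YAML.
# Both entries share one file path; only the dataset type differs.
catalog_config = {
    "X_train@pyspark": {
        "type": "spark.SparkDataSet",
        "filepath": "data/02_intermediate/X_train.parquet",  # assumed path
        "save_args": {"mode": "overwrite"},
    },
    "X_train@pandas": {
        "type": "pandas.ParquetDataSet",  # assumed dataset type
        "filepath": "data/02_intermediate/X_train.parquet",  # same file
    },
}


def base_name(entry_name: str) -> str:
    """Return the part of a catalog entry name before the '@' suffix."""
    return entry_name.split("@", 1)[0]


# Entries that share a base name refer to the same underlying data.
shared = {base_name(name) for name in catalog_config}
```

Under these assumptions, a node that writes `X_train@pyspark` and a node that reads `X_train@pandas` are linked through the shared file, which is where the Spark-to-pandas conversion happens.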

This pipeline:

1. splits the data into training and testing datasets using a configurable ratio set in `conf/base/parameters.yml`

2. runs a simple 1-nearest neighbour model (`make_prediction` node) and produces a prediction dataset

3. reports the model accuracy on the test set (`report_accuracy` node)
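The three steps above can be sketched in plain Python. This is a minimal, library-free approximation and not the project's node code, which works on Spark and pandas DataFrames. The `split_data` name, the function signatures, and the toy dataset are assumptions made for illustration. `report_accuracy` returns the value here so that it can be inspected.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], str]  # (features, label)


def split_data(rows: List[Row], train_fraction: float, random_state: int):
    """Shuffle deterministically, then cut the rows at train_fraction."""
    shuffled = rows[:]
    random.Random(random_state).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def make_prediction(train: List[Row], test: List[Row]) -> List[str]:
    """Label each test row with the label of its closest training row."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(train, key=lambda tr: sq_dist(tr[0], features))[1]
        for features, _ in test
    ]


def report_accuracy(y_pred: List[str], test: List[Row]) -> float:
    """Return the fraction of predictions that match the true labels."""
    correct = sum(pred == label for pred, (_, label) in zip(y_pred, test))
    return correct / len(test)


# Made-up data: two well-separated clusters with five rows each.
rows: List[Row] = [((i * 0.1, 0.0), "a") for i in range(5)] + [
    ((10.0 + i * 0.1, 10.0), "b") for i in range(5)
]
train, test = split_data(rows, train_fraction=0.8, random_state=3)
y_pred = make_prediction(train, test)
accuracy = report_accuracy(y_pred, test)
```

Seeding the shuffle with `random_state` makes the split reproducible, which is the role that parameter plays in the pipeline.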

## Pipeline inputs

### `example_iris_data`

|      |                    |
| ---- | ------------------ |
| Type | `spark.SparkDataSet` |
| Description | Example iris data containing columns |

### `parameters`

|      |                    |
| ---- | ------------------ |
| Type | `dict` |
| Description | Project parameter dictionary that must contain the following keys: `train_fraction` (the ratio used to determine the train-test split), `random_state` (random generator seed that makes the train-test split deterministic) and `target_column` (the name of the target column in the dataset) |
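A small check for the required keys might look like the sketch below. The example values, including the `species` column name, are assumptions and not the project's actual settings. `missing_keys` is a hypothetical helper, not part of Kedro.

```python
REQUIRED_KEYS = ("train_fraction", "random_state", "target_column")

# Example values only; the real ones live in conf/base/parameters.yml.
parameters = {
    "train_fraction": 0.8,
    "random_state": 3,
    "target_column": "species",  # assumed column name
}


def missing_keys(params: dict) -> list:
    """Return the required keys that are absent from params, in order."""
    return [key for key in REQUIRED_KEYS if key not in params]
```

Calling `missing_keys(parameters)` on a complete dictionary returns an empty list, so a node could fail early with a clear message when the result is non-empty.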

## Pipeline intermediate outputs

### `X_train`

|      |                    |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing train set features |

### `y_train`

|      |                    |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing train set target |

### `X_test`

|      |                    |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing test set features |

### `y_test`

|      |                    |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing test set target |

### `y_pred`

|      |                    |
| ---- | ------------------ |
| Type | `pandas.Series` |
| Description | Predictions from the 1-nearest neighbour model |

## Pipeline outputs

### `None`
