https://github.com/getindata/kedro-starters
Kedro starters by GetInData
- Host: GitHub
- URL: https://github.com/getindata/kedro-starters
- Owner: getindata
- License: apache-2.0
- Created: 2022-10-14T12:16:10.000Z (over 3 years ago)
- Default Branch: develop
- Last Pushed: 2023-12-05T18:26:25.000Z (over 2 years ago)
- Last Synced: 2025-04-09T20:11:32.896Z (about 1 year ago)
- Language: Python
- Size: 169 KB
- Stars: 3
- Watchers: 11
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# Pipeline
> *Note:* This is a `README.md` boilerplate generated using `Kedro {{ cookiecutter.kedro_version }}`.
## Overview
[Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.
This pipeline:
1. splits the data into a training dataset and a testing dataset using a configurable ratio found in `conf/base/parameters.yml`
2. runs a simple 1-nearest neighbour model (`make_prediction` node) to produce a dataset of predictions
3. reports the model accuracy on the test set (`report_accuracy` node)
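The three steps above can be sketched in plain Python. This is only an illustration of the technique, not the starter's actual node code: the function signatures, the `split_data` name, and the toy data are assumptions made for this example, and the real pipeline operates on Spark and pandas DataFrames rather than lists of tuples.

```python
import math
import random


def split_data(rows, train_fraction, random_state):
    """Shuffle rows with a fixed seed and split them into train and test lists.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def make_prediction(train, test_features):
    """Give each test point the label of its closest training point (1-NN)."""
    predictions = []
    for point in test_features:
        # Each training row is a (features, label) pair; compare by Euclidean distance.
        nearest = min(train, key=lambda row: math.dist(row[0], point))
        predictions.append(nearest[1])
    return predictions


def report_accuracy(y_pred, y_test):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(pred == true for pred, true in zip(y_pred, y_test))
    return correct / len(y_test)


# Toy data (not the iris dataset): two well-separated clusters, four points each.
toy_rows = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.3, 0.3), "a"),
    ((5.0, 5.0), "b"), ((5.1, 5.2), "b"), ((5.2, 5.1), "b"), ((5.3, 5.3), "b"),
]

train, test = split_data(toy_rows, train_fraction=0.75, random_state=3)
y_pred = make_prediction(train, [features for features, _ in test])
accuracy = report_accuracy(y_pred, [label for _, label in test])
```

With a fixed `random_state`, repeated runs produce the same split, which is what makes the reported accuracy reproducible.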
## Pipeline inputs
### `example_iris_data`
| | |
| ---- | ------------------ |
| Type | `spark.SparkDataSet` |
| Description | Example iris data containing columns |
### `parameters`
| | |
| ---- | ------------------ |
| Type | `dict` |
| Description | Project parameter dictionary that must contain the following keys: `train_fraction` (the fraction of the data assigned to the training set in the train-test split), `random_state` (the seed for the random generator, which makes the train-test split deterministic) and `target_column` (the name of the target column in the dataset) |
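
As a rough illustration, the parameter dictionary could look like the one below. The three key names come from the table above; the values and the `check_parameters` helper are hypothetical and are not taken from the starter's `conf/base/parameters.yml`.

```python
REQUIRED_KEYS = {"train_fraction", "random_state", "target_column"}


def check_parameters(parameters):
    """Validate the parameter dictionary (hypothetical helper, not part of Kedro)."""
    missing = REQUIRED_KEYS - set(parameters)
    if missing:
        raise KeyError(f"missing parameter keys: {sorted(missing)}")
    if not 0.0 < parameters["train_fraction"] < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    return parameters


# Illustrative values only; the real ones live in conf/base/parameters.yml.
example_parameters = check_parameters(
    {"train_fraction": 0.8, "random_state": 3, "target_column": "species"}
)
```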
## Pipeline intermediate outputs
### `X_train`
| | |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing train set features |
### `y_train`
| | |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing train set target |
### `X_test`
| | |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing test set features |
### `y_test`
| | |
| ---- | ------------------ |
| Type | `pyspark.sql.DataFrame` |
| Description | DataFrame containing test set target |
### `y_pred`
| | |
| ---- | ------------------ |
| Type | `pandas.Series` |
| Description | Predictions from the 1-nearest neighbour model |
## Pipeline outputs
### `None`