{"id":21514822,"url":"https://github.com/getindata/kedro-starters","last_synced_at":"2025-04-09T20:11:38.148Z","repository":{"id":62957880,"uuid":"551445495","full_name":"getindata/kedro-starters","owner":"getindata","description":"Kedro starters by GetInData","archived":false,"fork":false,"pushed_at":"2023-12-05T18:26:25.000Z","size":173,"stargazers_count":3,"open_issues_count":1,"forks_count":3,"subscribers_count":11,"default_branch":"develop","last_synced_at":"2025-04-09T20:11:32.896Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getindata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-10-14T12:16:10.000Z","updated_at":"2024-02-18T11:38:24.000Z","dependencies_parsed_at":"2023-01-29T21:55:20.798Z","dependency_job_id":null,"html_url":"https://github.com/getindata/kedro-starters","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fkedro-starters","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fkedro-starters/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fkedro-starters/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fkedro-starters/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/getindata","download_url":"https://codeload.github.com/getindata/kedro-starters/tar.gz/refs/heads/develop","host":{"name":"GitHub","url"
:"https://github.com","kind":"github","repositories_count":248103872,"owners_count":21048245,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T23:53:05.790Z","updated_at":"2025-04-09T20:11:38.140Z","avatar_url":"https://github.com/getindata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pipeline\n\n\u003e *Note:* This is a `README.md` boilerplate generated using `Kedro {{ cookiecutter.kedro_version }}`.\n\n## Overview\n\n[Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.\n\nThis pipeline:\n1. splits the data into a training dataset and a testing dataset using a configurable ratio found in `conf/base/parameters.yml`\n2. runs a simple 1-nearest neighbour model (`make_prediction` node) and produces a prediction dataset\n3. 
reports the model accuracy on the test set (`report_accuracy` node)\n\n## Pipeline inputs\n\n### `example_iris_data`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `spark.SparkDataSet` |\n| Description | Example iris data containing columns |\n\n\n### `parameters`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `dict` |\n| Description | Project parameter dictionary that must contain the following keys: `train_fraction` (the ratio used to determine the train-test split), `random_state` (random seed that makes the train-test split deterministic) and `target_column` (identifies the target column in the dataset) |\n\n\n## Pipeline intermediate outputs\n\n### `X_train`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `pyspark.sql.DataFrame` |\n| Description | DataFrame containing train set features |\n\n### `y_train`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `pyspark.sql.DataFrame` |\n| Description | DataFrame containing train set target |\n\n### `X_test`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `pyspark.sql.DataFrame` |\n| Description | DataFrame containing test set features |\n\n### `y_test`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `pyspark.sql.DataFrame` |\n| Description | DataFrame containing test set target |\n\n### `y_pred`\n\n|      |                    |\n| ---- | ------------------ |\n| Type | `pandas.Series` |\n| Description | Predictions from the 1-nearest neighbour model |\n\n\n## Pipeline outputs\n\n### `None`","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fkedro-starters","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetindata%2Fkedro-starters","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fkedro-starters/lists"}