https://github.com/waylonwalker/kedro-auto-catalog
Kedro catalog create with default configuration
- Host: GitHub
- URL: https://github.com/waylonwalker/kedro-auto-catalog
- Owner: WaylonWalker
- License: mit
- Created: 2023-02-15T20:02:13.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-14T19:34:25.000Z (over 1 year ago)
- Last Synced: 2024-10-29T05:34:50.278Z (about 2 months ago)
- Topics: data, data-science, kedro, kedro-catalog, kedro-hook, kedro-plugin
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 6
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# Kedro Auto Catalog
A configurable version of the built-in `kedro catalog create` CLI. Default
dataset types can be configured in the project's `settings.py`, so that generated
catalog entries use these types rather than `MemoryDataSets`.

[![PyPI - Version](https://img.shields.io/pypi/v/kedro-auto-catalog.svg)](https://pypi.org/project/kedro-auto-catalog)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/kedro-auto-catalog.svg)](https://pypi.org/project/kedro-auto-catalog)

---

**Table of Contents**
- [Installation](#installation)
- [License](#license)

## Installation

```console
pip install kedro-auto-catalog
```

## Configuration

Configure the project defaults in `src//settings.py` with this
dict.

```python
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
```

## Usage

To auto create catalog entries for the `__default__` pipeline, run this from the command line.
```bash
kedro auto-catalog -p __default__
```

If you want a reminder of the available options, use `--help`.

```bash
❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]

  Create Data Catalog YAML configuration with missing datasets.

  Add configurable datasets to Data Catalog YAML configuration file for each
  dataset in a registered pipeline if it is missing from the `DataCatalog`.

  The catalog configuration will be saved to
  `//catalog/.yml` file.

  Configure the project defaults in `src//settings.py` with this
  dict.

Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.
```

## Example

Using the
[kedro-spaceflights](https://github.com/quantumblacklabs/kedro-starter-spaceflights)
example, running `kedro auto-catalog -p __default__` yields the following
catalog in `conf/base/catalog/__default__.yml`:

```yaml
X_test:
  filepath: data/X_test.pq
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.pq
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet
```

## subdirs and layers

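As a rough illustration of how a dataset name could map to a catalog entry, here is a minimal Python sketch. It is not the plugin's actual implementation: `catalog_entry` is a hypothetical helper, and the rule that a prefix is everything before the first underscore is an assumption inferred from the `raw_y_test` example in this README. A name whose prefix matches a configured subdir is written into that directory, a prefix that matches a configured layer sets `layer`, and every other name goes directly under `directory`.

```python
from typing import Any

# Mirrors the AUTO_CATALOG example from the Configuration section.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def catalog_entry(name: str, config: dict[str, Any] = AUTO_CATALOG) -> dict[str, str]:
    """Hypothetical helper: build one catalog entry from a dataset name."""
    # Assumption: the prefix is everything before the first underscore.
    prefix, sep, rest = name.partition("_")
    has_prefix = bool(sep and rest)
    subdir = prefix if has_prefix and prefix in config["subdirs"] else None
    layer = prefix if has_prefix and prefix in config["layers"] else None

    # Strip the prefix from the file name only when it became a directory.
    stem = rest if subdir else name
    parts = [config["directory"], subdir, f"{stem}.{config['default_extension']}"]
    entry = {
        "filepath": "/".join(part for part in parts if part),
        "type": config["default_type"],
    }
    if layer:
        entry["layer"] = layer
    return entry


print(catalog_entry("y_test"))
# {'filepath': 'data/y_test.parquet', 'type': 'pandas.ParquetDataSet'}
print(catalog_entry("raw_y_test"))
# {'filepath': 'data/raw/y_test.parquet', 'type': 'pandas.ParquetDataSet', 'layer': 'raw'}
```

The sketch keeps `subdirs` and `layers` as separate lookups because the configuration lists them separately, so a prefix could in principle appear in one list and not the other.
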
With the example configuration, `"subdirs": ["raw", "intermediate", "primary"]`
and `"layers": ["raw", "intermediate", "primary"]`, the plugin converts any
leading subdir/layer in a dataset name into a directory. If we rename `y_test`
to `raw_y_test`, it puts `y_test.parquet` in the `raw` directory and assigns the
entry to the `raw` layer.

```yml
raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
```

## License

`kedro-auto-catalog` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.