https://github.com/waylonwalker/kedro-auto-catalog

Kedro catalog create with default configuration

# Kedro Auto Catalog

A configurable version of the built-in `kedro catalog create` CLI command.
Default dataset types can be configured in the project's `settings.py`, so that
new catalog entries use these types rather than `MemoryDataSets`.

[![PyPI - Version](https://img.shields.io/pypi/v/kedro-auto-catalog.svg)](https://pypi.org/project/kedro-auto-catalog)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/kedro-auto-catalog.svg)](https://pypi.org/project/kedro-auto-catalog)

---

**Table of Contents**

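- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Example](#example)
- [subdirs and layers](#subdirs-and-layers)
- [License](#license)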

## Installation

```console
pip install kedro-auto-catalog
```

## Configuration

Configure the project defaults by adding an `AUTO_CATALOG` dict to
`src//settings.py`.

```python
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
```
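As a variation, the same dict could point at a different default dataset type. The sketch below assumes the plugin accepts any Kedro dataset type string in `default_type`; the CSV values are illustrative and do not come from the project's documentation.

```python
# Hypothetical alternative defaults: CSV files instead of Parquet.
# Assumes "default_type" may be any Kedro dataset type string.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "csv",
    "default_type": "pandas.CSVDataSet",
}
```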

## Usage

To automatically create catalog entries for the `__default__` pipeline, run this from the command line.

```bash
kedro auto-catalog -p __default__
```

If you want a reminder of the available options, use the `--help` flag.

```bash
❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]

  Create Data Catalog YAML configuration with missing datasets.

  Add configurable datasets to Data Catalog YAML configuration file for each
  dataset in a registered pipeline if it is missing from the `DataCatalog`.

  The catalog configuration will be saved to
  `//catalog/.yml` file.

  Configure the project defaults in `src//settings.py` with this
  dict.

Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.
```

## Example

Using the
[kedro-spaceflights](https://github.com/quantumblacklabs/kedro-starter-spaceflights)
example, running `kedro auto-catalog -p __default__` yields the following
catalog in `conf/base/catalog/__default__.yml`:

```yaml
X_test:
  filepath: data/X_test.pq
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.pq
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet
```

## subdirs and layers

With the example configuration, `"subdirs": ["raw", "intermediate",
"primary"]` and `"layers": ["raw", "intermediate", "primary"]`, any leading
subdir or layer name in a dataset name is converted into a directory and a layer.
For example, if we rename `y_test` to `raw_y_test`, the plugin puts
`y_test.parquet` in the `raw` directory and assigns the entry to the `raw` layer.

```yml
raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
```
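The name-to-entry mapping described above can be sketched in plain Python. This is a minimal illustration, not the plugin's actual implementation: the helper `sketch_catalog_entry` is hypothetical, and it assumes that prefixes are separated from the rest of the name by an underscore and that the first matching prefix wins.

```python
from pathlib import PurePosixPath

# Same values as the example configuration in this README.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def sketch_catalog_entry(name, config=AUTO_CATALOG):
    """Hypothetical helper: build one catalog entry from a dataset name."""
    entry = {}
    stem = name
    path = PurePosixPath(config["directory"])

    # A leading "<subdir>_" prefix becomes a directory and is stripped
    # from the file name (assumed behaviour, based on the example above).
    for subdir in config["subdirs"]:
        if name.startswith(subdir + "_"):
            path = path / subdir
            stem = name[len(subdir) + 1:]
            break

    # A leading "<layer>_" prefix sets the entry's layer.
    for layer in config["layers"]:
        if name.startswith(layer + "_"):
            entry["layer"] = layer
            break

    entry["filepath"] = str(path / f"{stem}.{config['default_extension']}")
    entry["type"] = config["default_type"]
    return entry


print(sketch_catalog_entry("raw_y_test"))
print(sketch_catalog_entry("y_test"))
```

Under these assumptions, `raw_y_test` maps to `data/raw/y_test.parquet` in the `raw` layer, while `y_test` has no matching prefix and stays directly under `data` with no layer.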

## License

`kedro-auto-catalog` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.