https://github.com/zillow/zdatasets

Dataset SDK for consistent read/write [batch, online, streaming] data.

Host: GitHub
URL: https://github.com/zillow/zdatasets
Owner: zillow
License: apache-2.0
Created: 2021-10-12T20:07:24.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2024-05-07T23:31:08.000Z (almost 2 years ago)
Last Synced: 2025-03-30T03:11:49.745Z (11 months ago)
Topics: data, metaflow, ml
Language: Python
Homepage:
Size: 494 KB
Stars: 6
Watchers: 6
Forks: 3
Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

![Tests](https://github.com/zillow/datasets/actions/workflows/test.yml/badge.svg)

[![Coverage Status](https://coveralls.io/repos/github/zillow/datasets/badge.svg)](https://coveralls.io/github/zillow/datasets)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/zillow/datasets/main?urlpath=lab/tree/datasets/tutorials)

# Welcome to zdatasets

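zdatasets is a Dataset SDK for consistent reading and writing of batch, online, and streaming data.

The example below defines a Metaflow flow that writes a pandas DataFrame to a dataset partitioned by `region`, then reads back only the `region="A"` partition.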

```python
import pandas as pd
from metaflow import FlowSpec, step

from zdatasets import Dataset, Mode
from zdatasets.metaflow import DatasetParameter
from zdatasets.plugins import BatchOptions


# Can also invoke from CLI:
#  > python zdatasets/tutorials/0_hello_dataset_flow.py run \
#    --hello_dataset '{"name": "HelloDataset", "mode": "READ_WRITE", \
#    "options": {"type": "BatchOptions", "partition_by": "region"}}'
class HelloDatasetFlow(FlowSpec):
    hello_dataset = DatasetParameter(
        "hello_dataset",
        default=Dataset("HelloDataset", mode=Mode.READ_WRITE, options=BatchOptions(partition_by="region")),
    )

    @step
    def start(self):
        df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})
        print("saving data_frame: \n", df.to_string(index=False))

        # Example of writing to a dataset
        self.hello_dataset.write(df)

        # save this as an output dataset
        self.output_dataset = self.hello_dataset

        self.next(self.end)

    @step
    def end(self):
        print(f"I have dataset \n{self.output_dataset=}")

        # output_dataset to_pandas(partitions=dict(region="A")) only
        df: pd.DataFrame = self.output_dataset.to_pandas(partitions=dict(region="A"))
        print('self.output_dataset.to_pandas(partitions=dict(region="A")):')
        print(df.to_string(index=False))


if __name__ == "__main__":
    HelloDatasetFlow()
```
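
To make the partitioned read in the `end` step easier to picture, here is a minimal pandas-only sketch of what selecting `partitions=dict(region="A")` is assumed to return for the frame written above. It does not use zdatasets and is not its implementation: `select_partitions` is a hypothetical helper, and the assumption is that partition selection behaves like an equality filter on the partition column.

```python
import pandas as pd

# Same frame the flow writes in its `start` step.
df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})


def select_partitions(frame: pd.DataFrame, partitions: dict) -> pd.DataFrame:
    """Hypothetical helper (not part of zdatasets): keep rows whose
    partition columns equal the requested values."""
    mask = pd.Series(True, index=frame.index)
    for column, value in partitions.items():
        mask &= frame[column] == value
    return frame[mask].reset_index(drop=True)


# Assumed equivalent of reading only the region="A" partition.
region_a = select_partitions(df, dict(region="A"))
print(region_a.to_string(index=False))
```

In an actual batch store the filter would typically be applied by reading only the matching partition rather than loading everything and masking rows, which is the practical reason to set `partition_by` when writing.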
