https://github.com/zillow/zdatasets
Dataset SDK for consistent read/write [batch, online, streaming] data.
- Host: GitHub
- URL: https://github.com/zillow/zdatasets
- Owner: zillow
- License: apache-2.0
- Created: 2021-10-12T20:07:24.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-05-07T23:31:08.000Z (about 1 year ago)
- Last Synced: 2025-03-30T03:11:49.745Z (2 months ago)
- Topics: data, metaflow, ml
- Language: Python
- Homepage:
- Size: 494 KB
- Stars: 6
- Watchers: 6
- Forks: 3
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README

[Coverage Status](https://coveralls.io/github/zillow/datasets)
[Binder](https://mybinder.org/v2/gh/zillow/datasets/main?urlpath=lab/tree/datasets/tutorials)

# Welcome to zdatasets

TODO

```python
import pandas as pd
from metaflow import FlowSpec, step

from zdatasets import Dataset, Mode
from zdatasets.metaflow import DatasetParameter
from zdatasets.plugins import BatchOptions


# Can also invoke from CLI:
#  > python zdatasets/tutorials/0_hello_dataset_flow.py run \
#    --hello_dataset '{"name": "HelloDataset", "mode": "READ_WRITE", \
#    "options": {"type": "BatchOptions", "partition_by": "region"}}'
class HelloDatasetFlow(FlowSpec):
    hello_dataset = DatasetParameter(
        "hello_dataset",
        default=Dataset("HelloDataset", mode=Mode.READ_WRITE, options=BatchOptions(partition_by="region")),
    )

    @step
    def start(self):
        df = pd.DataFrame({"region": ["A", "A", "A", "B", "B", "B"], "zpid": [1, 2, 3, 4, 5, 6]})
        print("saving data_frame: \n", df.to_string(index=False))

        # Example of writing to a dataset
        self.hello_dataset.write(df)

        # save this as an output dataset
        self.output_dataset = self.hello_dataset

        self.next(self.end)

    @step
    def end(self):
        print(f"I have dataset \n{self.output_dataset=}")

        # output_dataset to_pandas(partitions=dict(region="A")) only
        df: pd.DataFrame = self.output_dataset.to_pandas(partitions=dict(region="A"))
        print('self.output_dataset.to_pandas(partitions=dict(region="A")):')
        print(df.to_string(index=False))


if __name__ == "__main__":
    HelloDatasetFlow()
```
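
The CLI comment in the example above passes the dataset to `--hello_dataset` as a JSON string. Hand-quoting that JSON in a shell is easy to get wrong, so the following is a minimal sketch of building the same command with only the Python standard library. The field names and values are copied from that comment. `build_command` is a hypothetical helper written for this illustration, not part of zdatasets, and the sketch does not import or call zdatasets itself.

```python
import json
import shlex

# Field names and values copied from the CLI comment in the README example.
hello_dataset_arg = {
    "name": "HelloDataset",
    "mode": "READ_WRITE",
    "options": {"type": "BatchOptions", "partition_by": "region"},
}


def build_command(script_path: str, dataset_arg: dict) -> str:
    """Return a shell-safe `run` command string.

    Hypothetical helper for illustration only; not part of zdatasets.
    """
    # Serialize the dataset description to the JSON string the flag expects.
    payload = json.dumps(dataset_arg)
    # shlex.quote wraps the JSON so the shell passes it as a single argument.
    parts = ["python", shlex.quote(script_path), "run", "--hello_dataset", shlex.quote(payload)]
    return " ".join(parts)


command = build_command("zdatasets/tutorials/0_hello_dataset_flow.py", hello_dataset_arg)
print(command)
```

Serializing with `json.dumps` and quoting with `shlex.quote` keeps the JSON valid and avoids the backslash line continuations inside the quoted string.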