https://github.com/google-parfait/dataset_grouper

Libraries for efficient and scalable group-structured dataset pipelines.
https://github.com/google-parfait/dataset_grouper

apache-beam datasets federated-learning jax pytorch tensorflow tensorflow-datasets

Last synced: 11 months ago
JSON representation

Libraries for efficient and scalable group-structured dataset pipelines.

Host: GitHub
URL: https://github.com/google-parfait/dataset_grouper
Owner: google-parfait
License: apache-2.0
Created: 2023-05-26T18:20:41.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2025-06-18T21:23:48.000Z (about 1 year ago)
Last Synced: 2025-07-08T03:29:28.528Z (about 1 year ago)
Topics: apache-beam, datasets, federated-learning, jax, pytorch, tensorflow, tensorflow-datasets
Language: Python
Homepage:
Size: 60.5 KB
Stars: 26
Watchers: 1
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          # Dataset Grouper - Scalable Dataset Pipelines for Group-Structured Learning

![PyPI version](https://img.shields.io/pypi/v/dataset_grouper)

Dataset Grouper is a library for creating, writing, and iterating over datasets

with group-level structure. It is primarily intended for creating large-scale

datasets for federated learning research.

## Installation

We recommend installing via PyPI. Please check the [PyPI](https://pypi.org/project/dataset-grouper/) page for up-to-date version requirements, including python version requirements.

```

pip install --upgrade pip

pip install dataset-grouper

```

## Getting Started

Below is a simple starting example, that partitions MNIST across 10 clients by

label.

First we import some necessary packages, including [Apache Beam](https://beam.apache.org/), which will be used to run the Dataset Grouper's pipelines.

```python

import apache_beam as beam

import dataset_grouper as dsgp

import tensorflow_datasets as tfds

```

Next, we download and prepare the MNIST dataset.

```python

dataset_builder = tfds.builder('mnist')

dataset_builder.download_and_prepare(...)

```

We now write a function that assigns each MNIST example a client identifier

(generally a `bytes` object). In this case, we will partition examples according

to their label, but you can use much more interesting partition functions as

well.

```python

def label_partition(x):

  label = x['label'].numpy()

  return str(label).encode('utf-8')

```

Finally, we build a pipeline that will partition MNIST according to this

function, and run it using Beam's

[Direct Runner](https://beam.apache.org/documentation/runners/direct/).

```python

mnist_pipeline = dsgp.tfds_to_tfrecords(

    dataset_builder=dataset_builder,

    split='test',

    get_key_fn=label_partition,

    file_path_prefix=...

)

with beam.Pipeline() as root:

  mnist_pipeline(root)

```

This will save a version of MNIST that has been partitioned according to labels

to a [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) format.

We can also load it to iterate over client datasets.

```python

partitioned_dataset = dsgp.PartitionedDataset(

  file_pattern=...,

  tfds_features='mnist')

for group_dataset in partitioned_dataset.build_group_stream():

  pass

```

Generally, `PartitionedDataset.build_group_stream()` is a `tf.data.Dataset` that

yields datasets, each of which is contains all the examples held by one group.

If you'd like to use these datasets with NumPy, you can simply do:

```python

group_dataset_numpy = group_dataset.as_numpy_iterator()

```

## What Else?

The example above is primarily for educational purposes. MNIST is a relatively

small dataset, and can generally fit entirely into memory. For more interesting

examples, check out the examples folder.

Dataset Grouper is intended more for large-scale datasets, especially those

datasets that do not fit into memory. For these datasets, we recommend using

more sophisticated [Beam runners](https://beam.apache.org/documentation/runners/capability-matrix/), in order to partition the data in a distributed fashion.

## Disclaimers

This is not an officially supported Google product.

This is a utility library that downloads and prepares public datasets. We do

not host or distribute these datasets, vouch for their quality or fairness, or

claim that you have license to use the dataset. It is your responsibility to

determine whether you have permission to use the dataset under the dataset's

license.

If you're interested in learning more about responsible AI practices, please

see Google AI's [Responsible AI Practices](https://ai.google/education/responsible-ai-practices).

Dataset Grouper is Apache 2.0 licensed. See the [`LICENSE`](LICENSE) file.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/google-parfait/dataset_grouper

Awesome Lists containing this project

README