{"id":28753534,"url":"https://github.com/google-deepmind/multi_object_datasets","last_synced_at":"2025-06-17T00:40:57.170Z","repository":{"id":65808225,"uuid":"205347347","full_name":"google-deepmind/multi_object_datasets","owner":"google-deepmind","description":"Multi-object image datasets with ground-truth segmentation masks and generative factors.","archived":false,"fork":false,"pushed_at":"2021-12-17T16:36:02.000Z","size":3067,"stargazers_count":248,"open_issues_count":10,"forks_count":24,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-04-16T04:53:39.851Z","etag":null,"topics":["datasets","deepmind","representation-learning","segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-deepmind.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-08-30T09:12:57.000Z","updated_at":"2024-04-12T09:19:13.000Z","dependencies_parsed_at":"2023-02-11T09:30:48.701Z","dependency_job_id":null,"html_url":"https://github.com/google-deepmind/multi_object_datasets","commit_stats":null,"previous_names":["google-deepmind/multi_object_datasets"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/google-deepmind/multi_object_datasets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fmulti_object_datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fmulti_object_datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fmulti_object_datasets/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fmulti_object_datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-deepmind","download_url":"https://codeload.github.com/google-deepmind/multi_object_datasets/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-deepmind%2Fmulti_object_datasets/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260268635,"owners_count":22983601,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","deepmind","representation-learning","segmentation"],"created_at":"2025-06-17T00:40:39.711Z","updated_at":"2025-06-17T00:40:57.148Z","avatar_url":"https://github.com/google-deepmind.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multi-Object Datasets\n\nThis repository contains datasets for multi-object representation learning, used\nin developing scene decomposition methods like\n[MONet](https://arxiv.org/abs/1901.11390) [1],\n[IODINE](http://proceedings.mlr.press/v97/greff19a.html) [2], and [SIMONe](https://papers.nips.cc/paper/2021/hash/a860a7886d7c7e2a8d3eaac96f76dc0d-Abstract.html)\n[3]. The datasets we provide are:\n\n1.  [Multi-dSprites](#multi-dsprites)\n2.  [Objects Room](#objects-room)\n3.  [CLEVR (with masks)](#clevr-with-masks)\n4.  [Tetrominoes](#tetrominoes)\n5.  [CATER (with masks)](#cater-with-masks)\n\n![preview](preview.gif)\n\nThe datasets consist of multi-object scenes. 
Each image or video is accompanied by\nground-truth segmentation masks for all objects in the scene. For some datasets\n(excluding Objects Room and CATER), we also provide\nper-object generative factors to facilitate\nrepresentation learning. The generative factors include all necessary and\nsufficient features (size, color, position, etc.) to describe and render the\nobjects present in a scene.\n\nLastly, the `segmentation_metrics` module contains a TensorFlow implementation\nof the\n[adjusted Rand index](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index)\n[4], which can be used to compare inferred object segmentations with\nground-truth segmentation masks. All code has been tested to work with\nTensorFlow r1.14.\n\n## Bibtex\n\nIf you use one of these datasets in your work, please cite it as follows:\n\n```\n@misc{multiobjectdatasets19,\n  title={Multi-Object Datasets},\n  author={Kabra, Rishabh and Burgess, Chris and Matthey, Loic and\n          Kaufman, Raphael Lopez and Greff, Klaus and Reynolds, Malcolm and\n          Lerchner, Alexander},\n  howpublished={https://github.com/deepmind/multi-object-datasets/},\n  year={2019}\n}\n```\n\n## Descriptions\n\n### Multi-dSprites\n\nThis is a dataset based on\n[dSprites](https://github.com/deepmind/dsprites-dataset). Each image consists of\nmultiple oval, heart, or square-shaped sprites (with some occlusions) set\nagainst a uniformly colored background.\n\nWe're releasing three versions of this dataset containing 1M datapoints each:\n\n1.1 Binarized: each image has 2-3 white sprites on a black background.\n\n1.2 Colored sprites on grayscale: each scene has 2-5 randomly colored HSV\nsprites on a randomly sampled grayscale background.\n\n1.3 Colored sprites and background: each scene has 1-4 sprites. 
All colors are\nrandomly sampled RGB values.\n\nEach datapoint contains an image, a number of background and object masks, and\nthe following ground-truth features per object: `x` and `y` positions, `shape`,\n`color` (rgb values), `orientation`, and `scale`. Lastly, `visibility` is a\nbinary feature indicating which objects are not null.\n\n### Objects Room\n\nThis dataset is based on the [MuJoCo](http://www.mujoco.org/) environment used\nby the Generative Query Network [5] and is a multi-object extension of the\n[3d-shapes dataset](https://github.com/deepmind/3d-shapes). The training set\ncontains 1M scenes with up to three objects. We also provide ~1K test examples\nfor the following variants:\n\n2.1 Empty room: scenes consist of the sky, walls, and floor only.\n\n2.2 Six objects: exactly 6 objects are visible in each image.\n\n2.3 Identical color: 4-6 objects are placed in the room and have an identical,\nrandomly sampled color.\n\nDatapoints consist of an image and fixed number of masks. The first four masks\ncorrespond to the sky, floor, and two halves of the wall respectively. The\nremaining masks correspond to the foreground objects.\n\n### CLEVR (with masks)\n\nWe adapted the\n[open-source script](https://github.com/facebookresearch/clevr-dataset-gen)\nprovided by Johnson et al. to produce ground-truth segmentation masks for CLEVR\n[6] scenes. These were generated afresh, so images in this dataset are not\nidentical to those in the original CLEVR dataset. We ignore the original\nquestion-answering task.\n\nThe images and masks in the dataset are of size 320x240. We also provide all\nground-truth factors included in the original dataset (namely `x`, `y`, and `z`\nposition, `pixel_coords`, and `rotation`, which are real-valued; plus `size`,\n`material`, `shape`, and `color`, which are encoded as integers) along with a\n`visibility` vector to indicate which objects are not null.\n\n### Tetrominoes\n\nThis is a dataset of Tetris-like shapes (aka tetrominoes). 
Each 35x35 image\ncontains three tetrominoes, sampled from 17 unique shapes/orientations. Each\ntetromino has one of six possible colors (red, green, blue, yellow, magenta,\ncyan). We provide `x` and `y` position, `shape`, and `color` (integer-coded) as\nground-truth features. Datapoints also include a `visibility` vector.\n\n### CATER (with masks)\n\nWe adapted the\n[open-source script](https://github.com/rohitgirdhar/CATER)\nprovided by Girdhar et al. to produce ground-truth segmentation masks for CATER\n[7] videos. We use the same settings as the `max2action_cameramotion` version\nof the dataset, which has a moving camera and up to two moving objects at any\ntime. We ignore the original tasks.\n\nThe videos and masks we provide are of size 64x64, obtained by taking a central\ncrop and downscaling the original 320x240 images. Each video contains 33 frames.\nFor each frame we also provide a 4x4 `camera_matrix` containing the orientation\nand position of the camera, and `object_positions` containing the 3D allocentric\npositions of all objects in the scene.\n\nNote that each split (train and test) of this dataset is sharded across\n100 TFRecord files. To load either split fully, pass all corresponding filenames\ninto the dataset loader.\n\n## Download\n\nThe datasets can be downloaded from\n[Google Cloud Storage](https://console.cloud.google.com/storage/browser/multi-object-datasets).\nExcept for CATER, which is sharded as noted above, each dataset is a single\n[TFRecords](https://www.tensorflow.org/tutorials/load_data/tf_records) file. To\ndownload a particular dataset, use the web interface, or run `wget` with the\nappropriate filename as follows:\n\n```shell\n  wget https://storage.googleapis.com/multi-object-datasets/multi_dsprites/multi_dsprites_colored_on_colored.tfrecords\n```\n\nTo download all datasets, you'll need the `gsutil` tool, which comes with the\n[Google Cloud SDK](https://cloud.google.com/sdk/docs/). 
Simply run:\n\n```shell\n  gsutil cp -r gs://multi-object-datasets .\n```\n\nThe approximate download sizes are:\n\n1.  Multi-dSprites: between 500 MB and 1 GB.\n2.  Objects Room: the training set is 7 GB. The test sets are 6-8 MB.\n3.  CLEVR (with masks): 10.5 GB.\n4.  Tetrominoes: 300 MB.\n5.  CATER (with masks): the training set is 8 GB. The test set is 4 GB.\n\n## Usage\n\nAfter downloading the dataset files, you can read them as\n[`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)\ninstances with the readers provided. The example below shows how to read the\ncolored-sprites-and-background version of Multi-dSprites:\n\n```python\n  from multi_object_datasets import multi_dsprites\n  import tensorflow as tf\n\n  tf_records_path = 'path/to/multi_dsprites_colored_on_colored.tfrecords'\n  batch_size = 32\n\n  dataset = multi_dsprites.dataset(tf_records_path, 'colored_on_colored')\n  batched_dataset = dataset.batch(batch_size)  # optional batching\n  iterator = batched_dataset.make_one_shot_iterator()\n  data = iterator.get_next()\n\n  with tf.train.SingularMonitoredSession() as sess:\n    d = sess.run(data)\n```\n\nAll dataset readers return images and segmentation masks in the following\ncanonical format (assuming the dataset is batched as above):\n\n-   'image': `Tensor` of shape [batch_size, height, width, channels] and type\n    uint8. For video datasets, the shape is [batch_size, sequence_length,\n    height, width, channels].\n\n-   'mask': `Tensor` of shape [batch_size, max_num_entities, height, width,\n    channels] and type uint8.  For video datasets, the shape is [batch_size,\n    sequence_length, max_num_entities, height, width, channels]. 
The tensor\n    takes on values of 255 or 0, denoting whether a pixel belongs to a\n    particular entity or not.\n\nYou can compare predicted object segmentation masks with the ground-truth masks\nusing `segmentation_metrics.adjusted_rand_index` as below:\n\n```python\n  from multi_object_datasets import segmentation_metrics\n\n  max_num_entities = multi_dsprites.MAX_NUM_ENTITIES['colored_on_colored']\n  # Ground-truth segmentation masks are always returned in the canonical\n  # [batch_size, max_num_entities, height, width, channels] format. To use these\n  # as an input for `segmentation_metrics.adjusted_rand_index`, we need them in\n  # the [batch_size, n_points, n_true_groups] format,\n  # where n_true_groups == max_num_entities. We implement this reshape below.\n  # Note that 'oh' denotes 'one-hot'.\n  desired_shape = [batch_size,\n                   multi_dsprites.IMAGE_SIZE[0] * multi_dsprites.IMAGE_SIZE[1],\n                   max_num_entities]\n  true_groups_oh = tf.transpose(data['mask'], [0, 2, 3, 4, 1])\n  true_groups_oh = tf.reshape(true_groups_oh, desired_shape)\n\n  random_prediction = tf.random_uniform(desired_shape[:-1],\n                                        minval=0, maxval=max_num_entities,\n                                        dtype=tf.int32)\n  random_prediction_oh = tf.one_hot(random_prediction, depth=max_num_entities)\n\n  ari = segmentation_metrics.adjusted_rand_index(true_groups_oh,\n                                                 random_prediction_oh)\n```\n\nTo exclude all background pixels from the ARI score (as in [2]), you can compute\nit as follows instead. This assumes the first true group contains all background\npixels:\n\n```python\n  ari_nobg = segmentation_metrics.adjusted_rand_index(true_groups_oh[..., 1:],\n                                                      random_prediction_oh)\n```\n\n## References\n\n[1] Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick,\nM., \u0026 Lerchner, A. (2019). Monet: Unsupervised scene decomposition and\nrepresentation. 
arXiv preprint arXiv:1901.11390.\n\n[2] Greff, K., Kaufman, R. L., Kabra, R., Watters, N., Burgess, C., Zoran, D.,\nMatthey, L., Botvinick, M., \u0026 Lerchner, A. (2019). Multi-Object Representation\nLearning with Iterative Variational Inference. Proceedings of the 36th\nInternational Conference on Machine Learning, in PMLR 97:2424-2433.\n\n[3] Kabra, R., Zoran, D., Erdogan, G., Matthey, L., Creswell, A., Botvinick, M.,\nLerchner, A., \u0026 Burgess, C. P. (2021). SIMONe: View-Invariant,\nTemporally-Abstracted Object Representations via Unsupervised Video\nDecomposition. Advances in Neural Information Processing Systems.\n\n[4] Rand, W. M. (1971). Objective criteria for the evaluation of clustering\nmethods. Journal of the American Statistical association, 66(336), 846-850.\n\n[5] Eslami, S., Rezende, D. J., Besse, F., Viola, F., Morcos, A., Garnelo, M.,\nRuderman, A., Rusu, A., Danihelka, I., Gregor, K., Reichert, D., Buesing, L.,\nWeber, T., Vinyals, O., Rosenbaum, D., Rabinowitz, N., King, H., Hillier, C.,\nBotvinick, M., Wierstra, D., Kavukcuoglu, K., \u0026 Hassabis, D. (2018). Neural\nscene representation and rendering. Science, 360(6394), 1204-1210.\n\n[6] Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence\nZitnick, C., \u0026 Girshick, R. (2017). Clevr: A diagnostic dataset for\ncompositional language and elementary visual reasoning. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition (pp. 2901-2910).\n\n[7] Girdhar, R., \u0026 Ramanan, D. (2019, September). CATER: A diagnostic dataset\nfor Compositional Actions \u0026 TEmporal Reasoning. 
In International Conference on\nLearning Representations.\n\n\n## Disclaimers\n\nThis is not an official Google product.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-deepmind%2Fmulti_object_datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-deepmind%2Fmulti_object_datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-deepmind%2Fmulti_object_datasets/lists"}