{"id":20823334,"url":"https://github.com/google/space","last_synced_at":"2026-01-14T08:31:58.052Z","repository":{"id":215399029,"uuid":"732200709","full_name":"google/space","owner":"google","description":"Unified storage framework for the entire machine learning lifecycle","archived":true,"fork":false,"pushed_at":"2024-03-03T02:12:23.000Z","size":845,"stargazers_count":155,"open_issues_count":1,"forks_count":8,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-12-21T00:41:38.877Z","etag":null,"topics":["apache-arrow","apache-parquet","data-warehouse","dataops","dataset","dml","lakehouse","machine-learning","mlops","multimodal","multimodal-data","olap","ray","tensorflow","tensorflow-dataset"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-15T22:49:08.000Z","updated_at":"2025-10-26T18:08:45.000Z","dependencies_parsed_at":"2024-02-16T02:27:59.003Z","dependency_job_id":"ba563125-e39c-42ec-bed3-4ac9bffe11a4","html_url":"https://github.com/google/space","commit_stats":null,"previous_names":["google/space"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/google/space","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fspace","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fspace/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fspace/releases","ma
nifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fspace/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google","download_url":"https://codeload.github.com/google/space/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fspace/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:31:27.429Z","status":"ssl_error","status_checked_at":"2026-01-14T08:31:19.098Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-arrow","apache-parquet","data-warehouse","dataops","dataset","dml","lakehouse","machine-learning","mlops","multimodal","multimodal-data","olap","ray","tensorflow","tensorflow-dataset"],"created_at":"2024-11-17T22:18:08.823Z","updated_at":"2026-01-14T08:31:58.029Z","avatar_url":"https://github.com/google.png","language":"Python","readme":"# Space: Unified Storage for Machine Learning\n\n[![Python CI](https://github.com/google/space/actions/workflows/python-ci.yml/badge.svg?branch=main)](https://github.com/google/space/actions/workflows/python-ci.yml)\n\n\u003chr/\u003e\n\nUnify data in your entire machine learning lifecycle with **Space**, a comprehensive storage solution that seamlessly handles data from ingestion to training.\n\n**Key 
Features:**\n- **Ground Truth Database**\n  - Store and manage multimodal data in open source file formats, row or columnar, locally or in the cloud.\n  - Ingest from various sources, including ML datasets, files, and labeling tools.\n  - Support data manipulation (append, insert, update, delete) and version control.\n- **OLAP Database and Lakehouse**\n  - [Iceberg](https://github.com/apache/iceberg) style [open table format](/docs/design.md#metadata-design).\n  - Optimized for unstructured data via [reference](./docs/design.md#data-files) operations.\n  - Quickly analyze data using SQL engines like [DuckDB](https://github.com/duckdb/duckdb).\n- **Distributed Data Processing Pipelines**\n  - Integrate with processing frameworks like [Ray](https://github.com/ray-project/ray) for efficient data transformation.\n  - Store processed results as Materialized Views (MVs); incrementally update MVs when the source changes.\n- **Seamless Training Framework Integration**\n  - Access Space datasets and MVs directly via random access interfaces.\n  - Convert to popular ML dataset formats (e.g., [TFDS](https://github.com/tensorflow/datasets), [HuggingFace](https://github.com/huggingface/datasets), [Ray](https://github.com/ray-project/ray)).\n\n\u003cimg src=\"docs/pics/overview.png\" width=\"800\" /\u003e\n\n## Onboarding Examples\n\n- [Manage the TensorFlow COCO dataset](notebooks/tfds_coco_tutorial.ipynb)\n- [Ground truth database of LabelStudio](notebooks/label_studio_tutorial.ipynb)\n- [Transforms and materialized views: Segment Anything as an example](notebooks/segment_anything_tutorial.ipynb)\n- [Incrementally build embedding vector indexes](notebooks/incremental_embedding_index.ipynb)\n- [Parallel ingestion from WebDataset](notebooks/webdataset_ingestion.ipynb)\n- [Convert from/to HuggingFace datasets](notebooks/huggingface_conversion.ipynb)\n\n## Space 101\n\n- Space uses [Arrow](https://arrow.apache.org/docs/python/index.html) in the API surface, e.g., schema, filter, data IO.\n- 
All file paths in Space are [relative](./docs/design.md#relative-paths); datasets are immediately usable after downloading or moving.\n- Space stores data itself, or a reference to data, in Parquet files. The reference can be the address of a row in an ArrayRecord file, or the path of a standalone file (limited support, see `space.core.schema.types.files`).\n- `space.TfFeatures` is a built-in field type providing serializers for nested dicts of numpy arrays, based on [TFDS FeaturesDict](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeaturesDict).\n- Please find more information in the [design](docs/design.md) and [performance](docs/performance.md) docs.\n\n## Quick Start\n\n- [Install](#install)\n- [Cluster Setup and Performance Tuning](#cluster-setup-and-performance-tuning)\n- [Create and Load Datasets](#create-and-load-datasets)\n- [Write and Read](#write-and-read)\n- [Transform and Materialized Views](#transform-and-materialized-views)\n- [ML Frameworks Integration](#ml-frameworks-integration)\n- [Inspect Metadata](#inspect-metadata)\n\n### Install\n\nInstall:\n```bash\npip install space-datasets\n```\n\nOr install from source:\n```bash\ncd python\npip install .[dev]\n```\n\n### Cluster Setup and Performance Tuning\n\nSee the [setup and performance doc](/docs/performance.md#ray-runner-setup).\n\n### Create and Load Datasets\n\nCreate a Space dataset with two index fields (`id`, `image_name`) (stored in Parquet) and a record field (`feature`) (stored in ArrayRecord).\n\nThis example uses the plain `binary` type for the record field. Space supports a type `space.TfFeatures` that integrates with the [TFDS feature serializer](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeaturesDict). 
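With the plain `binary` type, converting record values to and from bytes is left to the caller. A minimal sketch of such a round trip, using plain Python `pickle` (the helper names are hypothetical and not part of the Space API):\n\n```py\nimport pickle\n\n# Hypothetical helpers (not a Space API): with a plain `binary` record field,\n# the caller supplies bytes and decodes them after reading.\ndef encode_feature(value) -\u003e bytes:\n  # pickle handles nested dicts/lists; swap in your own codec as needed.\n  return pickle.dumps(value)\n\ndef decode_feature(payload: bytes):\n  return pickle.loads(payload)\n\npayload = encode_feature({\"bbox\": [1, 2, 3, 4], \"label\": \"cat\"})\nassert decode_feature(payload) == {\"bbox\": [1, 2, 3, 4], \"label\": \"cat\"}\n```\n\n`space.TfFeatures` packages this kind of serialization for you. 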
See more details in a [TFDS example](/notebooks/tfds_coco_tutorial.ipynb).\n\n```py\nimport pyarrow as pa\nfrom space import Dataset\n\nschema = pa.schema([\n  (\"id\", pa.int64()),\n  (\"image_name\", pa.string()),\n  (\"feature\", pa.binary())])\n\nds = Dataset.create(\n  \"/path/to/\u003cmybucket\u003e/example_ds\",\n  schema,\n  primary_keys=[\"id\"],\n  record_fields=[\"feature\"])  # Store this field in ArrayRecord files\n\n# Load the dataset from files later:\nds = Dataset.load(\"/path/to/\u003cmybucket\u003e/example_ds\")\n```\n\nOptionally, you can use `catalogs` to manage datasets by name instead of location:\n\n```py\nfrom space import DirCatalog\n\n# DirCatalog manages datasets in a directory.\ncatalog = DirCatalog(\"/path/to/\u003cmybucket\u003e\")\n\n# Same as the creation above.\nds = catalog.create_dataset(\"example_ds\", schema,\n  primary_keys=[\"id\"], record_fields=[\"feature\"])\n\n# Same as the load above.\nds = catalog.dataset(\"example_ds\")\n\n# List all datasets and materialized views.\nprint(catalog.datasets())\n```\n\n### Write and Read\n\nAppend and delete some data. Each mutation generates a new version of data, represented by an increasing integer ID. 
Users can add tags to version IDs as aliases.\n```py\nimport pyarrow.compute as pc\nfrom space import RayOptions\n\n# Create a local runner:\nrunner = ds.local()\n\n# Or create a Ray runner:\nrunner = ds.ray(ray_options=RayOptions(max_parallelism=8))\n\n# To avoid https://github.com/ray-project/ray/issues/41333, wrap the runner \n# with @ray.remote when running in a remote Ray cluster.\n#\n# @ray.remote\n# def run():\n#   return runner.read_all()\n#\n\n# Appending data generates a new dataset version `snapshot_id=1`\n# Write methods:\n# - append(...): no primary key check.\n# - insert(...): fail if primary key exists.\n# - upsert(...): overwrite if primary key exists.\nids = range(100)\nrunner.append({\n  \"id\": ids,\n  \"image_name\": [f\"{i}.jpg\" for i in ids],\n  \"feature\": [f\"somedata{i}\".encode(\"utf-8\") for i in ids]\n})\nds.add_tag(\"after_append\")  # Version management: add tag to snapshot\n\n# Deletion generates a new version `snapshot_id=2`\nrunner.delete(pc.field(\"id\") == 1)\nds.add_tag(\"after_delete\")\n\n# Show all versions\nds.versions().to_pandas()\n# \u003e\u003e\u003e\n#    snapshot_id               create_time tag_or_branch\n# 0            2 2024-01-12 20:23:57+00:00  after_delete\n# 1            1 2024-01-12 20:23:38+00:00  after_append\n# 2            0 2024-01-12 20:22:51+00:00          None\n\n# Read options:\n# - filter_: optional, apply a filter (push down to reader).\n# - fields: optional, field selection.\n# - version: optional, snapshot_id or tag, time travel back to an old version.\n# - batch_size: optional, output size.\nrunner.read_all(\n  filter_=pc.field(\"image_name\")==\"2.jpg\",\n  fields=[\"feature\"],\n  version=\"after_append\"  # or snapshot ID `1`\n)\n\n# Read the changes between version 0 and 2.\nfor change in runner.diff(0, \"after_delete\"):\n  print(change.change_type)\n  print(change.data)\n  print(\"===============\")\n```\n\nCreate a new branch and make changes in the new branch:\n\n```py\n# The default branch is 
\"main\"\nds.add_branch(\"dev\")\nds.set_current_branch(\"dev\")\n# Make changes in the new branch, the main branch is not updated.\n# Switch back to the main branch.\nds.set_current_branch(\"main\")\n```\n\n### Transform and Materialized Views\n\nSpace supports transforming a dataset to a view, and materializing the view to files. The transforms include:\n\n- Mapping batches using a user defined function (UDF).\n- Filter using a UDF.\n- Joining two views/datasets.\n\nWhen the source dataset is modified, refreshing the materialized view incrementally synchronizes changes, which saves compute and IO cost. See more details in a [Segment Anything example](/notebooks/segment_anything_tutorial.ipynb). Reading or refreshing views must be the `Ray` runner, because they are implemented based on [Ray transform](https://docs.ray.io/en/latest/data/transforming-data.html).\n\nA materialized view `mv` can be used as a view `mv.view` or a dataset `mv.dataset`. The former always reads data from the source dataset's files and processes all data on-the-fly. 
The latter directly reads processed data from the MV's files and skips data processing.\n\n#### Example of map_batches\n\n```py\n# A sample transform UDF.\n# Input is {\"field_name\": [values, ...], ...}\ndef modify_feature_udf(batch):\n  batch[\"feature\"] = [d + b\"123\" for d in batch[\"feature\"]]\n  return batch\n\n# Create a view and materialize it.\nview = ds.map_batches(\n  fn=modify_feature_udf,\n  output_schema=ds.schema,\n  output_record_fields=[\"feature\"]\n)\n\nview_runner = view.ray()\n# Reading a view will read the source dataset and apply transforms on it.\n# It processes all data using `modify_feature_udf` on the fly.\nfor d in view_runner.read():\n  print(d)\n\nmv = view.materialize(\"/path/to/\u003cmybucket\u003e/example_mv\")\n# Or use a catalog:\n# mv = catalog.materialize(\"example_mv\", view)\n\nmv_runner = mv.ray()\n# Refresh the MV up to version tag `after_append` of the source.\nmv_runner.refresh(\"after_append\", batch_size=64)  # Reading batch size\n# Or call mv_runner.refresh() to refresh to the latest version.\n\n# Use the MV runner instead of the view runner to read directly from\n# materialized view files, with no further data processing.\nmv_runner.read_all()\n```\n\n#### Example of join\n\nSee a full example in the [Segment Anything example](/notebooks/segment_anything_tutorial.ipynb). Creating a materialized view of a join result is not supported yet.\n\n```py\n# If an input is a materialized view, use `mv.dataset` instead of `mv.view`.\n# Only one join key is supported; it must be the primary key of both the\n# left and right sides.\njoined_view = mv_left.dataset.join(mv_right.dataset, keys=[\"id\"])\n```\n\n### ML Frameworks Integration\n\nThere are several ways to integrate Space storage with ML frameworks. 
Space provides a random access data source for reading data in ArrayRecord files:\n\n```py\nfrom space import RandomAccessDataSource\n\ndatasource = RandomAccessDataSource(\n  # \u003cfield-name\u003e: \u003cstorage-location\u003e, for reading data from ArrayRecord files.\n  {\n    \"feature\": \"/path/to/\u003cmybucket\u003e/example_mv\",\n  },\n  # Don't auto-deserialize data, because it is stored as plain bytes.\n  deserialize=False)\n\nlen(datasource)\ndatasource[2]\n```\n\nA dataset or view can also be read as a Ray dataset:\n```py\nray_ds = ds.ray_dataset()\nray_ds.take(2)\n```\n\nData in Parquet files can be read as a HuggingFace dataset:\n```py\nfrom datasets import load_dataset\n\nhuggingface_ds = load_dataset(\"parquet\", data_files={\"train\": ds.index_files()})\n```\n\n### Inspect Metadata\n\nList the file paths of all index (Parquet) files:\n```python\nds.index_files()\n# Or show more detailed statistics of the Parquet files.\nds.storage.index_manifest()  # Accepts filter and snapshot_id\n```\n\nShow statistics of all ArrayRecord files:\n```python\nds.storage.record_manifest()  # Accepts filter and snapshot_id\n```\n\n## Status\nSpace is a new project under active development.\n\n:construction: Ongoing tasks:\n- Performance benchmarking and improvement.\n\n## Disclaimer\nThis is not an officially supported Google product.\n","funding_links":[],"categories":["Table of Contents"],"sub_categories":["Machine Learning"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fspace","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle%2Fspace","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fspace/lists"}