Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/google/xarray-beam
Distributed Xarray with Apache Beam
beam dask xarray zarr
- Host: GitHub
- URL: https://github.com/google/xarray-beam
- Owner: google
- License: apache-2.0
- Created: 2021-05-11T20:48:07.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-23T18:12:08.000Z (3 months ago)
- Last Synced: 2024-10-24T01:35:46.912Z (3 months ago)
- Topics: beam, dask, xarray, zarr
- Language: Python
- Homepage: https://xarray-beam.readthedocs.io
- Size: 271 KB
- Stars: 135
- Watchers: 7
- Forks: 7
- Open Issues: 20
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-Earth-Artificial-Intelligence - Xarray-Beam - Python library for building Apache Beam pipelines with Xarray datasets. (Tools)
README
# Xarray-Beam
Xarray-Beam is a Python library for building
[Apache Beam](https://beam.apache.org/) pipelines with
[Xarray](http://xarray.pydata.org/en/stable/) datasets.

The project aims to facilitate data transformations and analysis on large-scale
multi-dimensional labeled arrays, such as:

- Ad-hoc computation on Xarray data, by dividing an `xarray.Dataset` into many
smaller pieces ("chunks").
- Adjusting array chunks, using the
[Rechunker algorithm](https://rechunker.readthedocs.io/en/latest/algorithm.html).
- Ingesting large, multi-dimensional array datasets into an analysis-ready,
cloud-optimized format, namely [Zarr](https://zarr.readthedocs.io/) (see
also [Pangeo Forge](https://github.com/pangeo-forge/pangeo-forge-recipes)).
- Calculating statistics (e.g., "climatology") across distributed datasets
with arbitrary groups.

For more about our approach and how to get started,
**[read the documentation](https://xarray-beam.readthedocs.io/)**!

**Warning: Xarray-Beam is a sharp tool 🔪**

Xarray-Beam is relatively new, and focused on expert users:
- We use it extensively at Google for processing large-scale weather datasets,
but there is not yet a vibrant external community.
- It provides low-level abstractions that facilitate writing very large
scale data pipelines (e.g., 100+ TB), but by design it requires explicitly
thinking about how every operation is parallelized.

## Installation

Xarray-Beam requires recent versions of immutabledict, Xarray, Dask, Rechunker,
Zarr, and Apache Beam. For best performance when writing Zarr files, use Xarray
0.19.0 or later.

## Disclaimer

Xarray-Beam is an experiment that we are sharing with the outside world in the
hope that it will be useful. It is not a supported Google product. We welcome
feedback, bug reports and code contributions, but cannot guarantee they will be
addressed.

See the "Contribution guidelines" for more.

## Credits
Contributors:
- Stephan Hoyer
- Jason Hickey
- Cenk Gazen
- Alex Merose
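
As a rough illustration of the chunk-based model described in the README overview (dividing a dataset into smaller pieces and computing statistics across them), the sketch below uses only the Python standard library. It is not Xarray-Beam's API: `iter_chunks`, `partial_stats`, and `merge_stats` are hypothetical names invented for this example, and a plain list stands in for one variable along a `time` dimension. The library's actual key and transform types are covered in its documentation.

```python
from itertools import product


def iter_chunks(sizes, chunks):
    """Yield (offsets, slices) for each chunk of an array with named dimensions.

    sizes maps dimension name -> full length; chunks maps dimension name ->
    chunk length. Dimensions missing from `chunks` are kept whole.
    (Hypothetical helper, not part of Xarray-Beam.)
    """
    dims = list(sizes)
    steps = {d: chunks.get(d, sizes[d]) for d in dims}
    starts = [range(0, sizes[d], steps[d]) for d in dims]
    for combo in product(*starts):
        offsets = dict(zip(dims, combo))
        slices = {
            d: slice(o, min(o + steps[d], sizes[d])) for d, o in offsets.items()
        }
        yield offsets, slices


def partial_stats(values):
    """Summarize one chunk as (sum, count) so chunks can be merged later."""
    return sum(values), len(values)


def merge_stats(a, b):
    """Combine two (sum, count) summaries; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1]


# A toy "dataset": ten values along a single "time" dimension.
series = list(range(10))

# Each piece is identified by the offset at which it starts, so it can be
# processed independently and still be placed back in the full array.
keyed_chunks = [
    (offsets, series[slices["time"]])
    for offsets, slices in iter_chunks({"time": len(series)}, {"time": 4})
]

# Compute a mean from per-chunk summaries instead of from the full series.
total = (0, 0)
for _offsets, piece in keyed_chunks:
    total = merge_stats(total, partial_stats(piece))
mean = total[0] / total[1]

print([offsets for offsets, _ in keyed_chunks])
print(mean)
```

Keeping each summary as a `(sum, count)` pair rather than a per-chunk mean is what lets chunks of unequal size (the last chunk here holds only two values) be merged in any order. This is the kind of explicit reasoning about parallelization that the warning above refers to.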