https://github.com/deepseek-ai/smallpond
A lightweight data processing framework built on DuckDB and 3FS.
- Host: GitHub
- URL: https://github.com/deepseek-ai/smallpond
- Owner: deepseek-ai
- License: MIT
- Created: 2025-02-24T09:28:17.000Z
- Default Branch: main
- Last Pushed: 2025-03-05T18:23:54.000Z
- Last Synced: 2025-06-04T01:56:30.837Z
- Topics: data-processing, duckdb
- Language: Python
- Homepage:
- Size: 1.77 MB
- Stars: 4,668
- Watchers: 48
- Forks: 415
- Open Issues: 28
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - smallpond - A distributed data processing framework by DeepSeek built on DuckDB and 3FS. (Libraries Powered by DuckDB)
README
# smallpond
[CI](https://github.com/deepseek-ai/smallpond/actions/workflows/ci.yml)
[PyPI](https://pypi.org/project/smallpond/)
[Documentation](https://deepseek-ai.github.io/smallpond/)
[License](LICENSE)
A lightweight data processing framework built on [DuckDB] and [3FS].
## Features
- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🛠️ Easy operations with no long-running services
## Installation
Python 3.8 to 3.12 is supported.
```bash
pip install smallpond
```
## Quick Start
```bash
# Download example data
wget https://duckdb.org/data/prices.parquet
```
```python
import smallpond
# Initialize session
sp = smallpond.init()
# Load data
df = sp.read_parquet("prices.parquet")
# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
```
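Conceptually, `repartition(3, hash_by="ticker")` routes each row to a partition determined by a hash of its `ticker` value, so every ticker's rows land in a single partition and the per-partition `GROUP BY` produces correct global results. A minimal stdlib sketch of that idea (the `rows` data, `partition_id` helper, and `num_partitions` are illustrative, not smallpond internals):

```python
import zlib

def partition_id(key: str, num_partitions: int) -> int:
    # Deterministic hash so the same ticker always maps to the same partition
    return zlib.crc32(key.encode()) % num_partitions

rows = [
    {"ticker": "AAPL", "price": 190.0},
    {"ticker": "MSFT", "price": 410.0},
    {"ticker": "AAPL", "price": 188.5},
]

num_partitions = 3
partitions = {i: [] for i in range(num_partitions)}
for row in rows:
    partitions[partition_id(row["ticker"], num_partitions)].append(row)

# All rows sharing a ticker sit in one partition, so each partition can be
# aggregated independently and the results simply concatenated.
```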
## Documentation
For detailed guides and API reference:
- [Getting Started](docs/source/getstarted.rst)
- [API Reference](docs/source/api.rst)
## Performance
We evaluated smallpond using the [GraySort benchmark] ([script]) on a cluster of 50 compute nodes and 25 storage nodes running [3FS]. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.
Details can be found in [3FS - Gray Sort].
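As a quick sanity check, the quoted average throughput follows (to rounding) directly from the sorted volume and elapsed time:

```python
# 110.5 TiB sorted in 30 minutes and 14 seconds
data_tib = 110.5
elapsed_min = 30 + 14 / 60

throughput = data_tib / elapsed_min  # TiB per minute
# ≈ 3.65–3.66 TiB/min, consistent with the reported figure
print(f"{throughput:.2f} TiB/min")
```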
[DuckDB]: https://duckdb.org/
[3FS]: https://github.com/deepseek-ai/3FS
[GraySort benchmark]: https://sortbenchmark.org/
[script]: benchmarks/gray_sort_benchmark.py
[3FS - Gray Sort]: https://github.com/deepseek-ai/3FS?tab=readme-ov-file#2-graysort
## Development
```bash
# install with development dependencies (quoted so the extras
# syntax also works in zsh)
pip install '.[dev]'
# run unit tests
pytest -v tests/test*.py

# build the documentation
pip install '.[docs]'
cd docs
make html
# serve the built docs locally
python -m http.server --directory build/html
```
## License
This project is licensed under the [MIT License](LICENSE).