https://github.com/deepseek-ai/smallpond
A lightweight data processing framework built on DuckDB and 3FS.
- Host: GitHub
- URL: https://github.com/deepseek-ai/smallpond
- Owner: deepseek-ai
- License: MIT
- Created: 2025-02-24T09:28:17.000Z
- Default Branch: main
- Last Pushed: 2025-03-05T18:23:54.000Z
- Last Synced: 2025-06-04T01:56:30.837Z
- Topics: data-processing, duckdb
- Language: Python
- Homepage:
- Size: 1.77 MB
- Stars: 4,668
- Watchers: 48
- Forks: 415
- Open Issues: 28
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - smallpond - A distributed data processing framework by DeepSeek built on DuckDB and 3FS. (Libraries Powered by DuckDB)
README
# smallpond
[CI](https://github.com/deepseek-ai/smallpond/actions/workflows/ci.yml)
[PyPI](https://pypi.org/project/smallpond/)
[Documentation](https://deepseek-ai.github.io/smallpond/)
[License](LICENSE)
A lightweight data processing framework built on [DuckDB] and [3FS].
## Features
- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🛠️ Easy operations with no long-running services
## Installation
Python 3.8 to 3.12 is supported.
```bash
pip install smallpond
```
## Quick Start
```bash
# Download example data
wget https://duckdb.org/data/prices.parquet
```
```python
import smallpond
# Initialize session
sp = smallpond.init()
# Load data
df = sp.read_parquet("prices.parquet")
# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
```
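Conceptually, `repartition(3, hash_by="ticker")` routes each row to a partition determined by a hash of its `ticker` value, so every ticker's rows land in a single partition and the per-partition `GROUP BY` produces correct global results. A minimal stdlib sketch of that idea (the `rows` data, `partition_id` helper, and `num_partitions` are illustrative, not smallpond internals):

```python
import zlib

def partition_id(key: str, num_partitions: int) -> int:
    # Deterministic hash so the same ticker always maps to the same partition
    return zlib.crc32(key.encode()) % num_partitions

rows = [
    {"ticker": "AAPL", "price": 190.0},
    {"ticker": "MSFT", "price": 410.0},
    {"ticker": "AAPL", "price": 188.5},
]

num_partitions = 3
partitions = {i: [] for i in range(num_partitions)}
for row in rows:
    partitions[partition_id(row["ticker"], num_partitions)].append(row)

# All rows sharing a ticker sit in one partition, so each partition can be
# aggregated independently and the results simply concatenated.
```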
## Documentation
For detailed guides and API reference:
- [Getting Started](docs/source/getstarted.rst)
- [API Reference](docs/source/api.rst)
## Performance
We evaluated smallpond using the [GraySort benchmark] ([script]) on a cluster of 50 compute nodes and 25 storage nodes running [3FS]. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.
Details can be found in [3FS - Gray Sort].
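As a quick sanity check, the quoted average throughput follows (to rounding) directly from the sorted volume and elapsed time:

```python
# 110.5 TiB sorted in 30 minutes and 14 seconds
data_tib = 110.5
elapsed_min = 30 + 14 / 60

throughput = data_tib / elapsed_min  # TiB per minute
# ≈ 3.65–3.66 TiB/min, consistent with the reported figure
print(f"{throughput:.2f} TiB/min")
```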
[DuckDB]: https://duckdb.org/
[3FS]: https://github.com/deepseek-ai/3FS
[GraySort benchmark]: https://sortbenchmark.org/
[script]: benchmarks/gray_sort_benchmark.py
[3FS - Gray Sort]: https://github.com/deepseek-ai/3FS?tab=readme-ov-file#2-graysort
## Development
```bash
# install with development dependencies (quoted so the extras
# syntax also works in zsh)
pip install '.[dev]'
# run unit tests
pytest -v tests/test*.py

# build the documentation
pip install '.[docs]'
cd docs
make html
# serve the built docs locally
python -m http.server --directory build/html
```
## License
This project is licensed under the [MIT License](LICENSE).