https://github.com/rapidsai/legate-dataframe
Distributed cuDF on Legate
- Host: GitHub
- URL: https://github.com/rapidsai/legate-dataframe
- Owner: rapidsai
- License: apache-2.0
- Created: 2024-11-22T15:02:25.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-24T14:29:59.000Z (7 months ago)
- Last Synced: 2025-06-24T15:38:50.877Z (7 months ago)
- Language: C++
- Homepage: https://rapidsai.github.io/legate-dataframe/
- Size: 364 KB
- Stars: 3
- Watchers: 5
- Forks: 8
- Open Issues: 15
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Legate-dataframe: a scalable dataframe library
A prototype of a legate-enabled version of [libcudf](https://docs.rapids.ai/api/libcudf/stable/).
This is **not** a drop-in replacement for [Pandas](https://pandas.pydata.org/); instead, it follows the lower-level API of libcudf.
In the future, we plan to introduce a high-level pure Python package that implements the nice-to-have features known from Pandas on top of the low-level API's primitives.
[Python API and further documentation](https://rapidsai.github.io/legate-dataframe/).
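To give a feel for what "low-level" means in practice, here is a minimal sketch built only from the pieces shown in the example further down: logical tables and columns are constructed explicitly and handed to task functions or other Legate libraries, rather than chaining Pandas-style methods on a dataframe.

```python
# Minimal sketch of the low-level flavour, using only constructs from the
# example below: logical tables/columns are built explicitly and handed to
# other Legate libraries (here cuPyNumeric) instead of chaining DataFrame methods.
import cudf
import cupynumeric
from legate_dataframe import LogicalColumn, LogicalTable

df = cudf.DataFrame({"a": [1, 2, 3]})
tbl = LogicalTable.from_cudf(df)        # distribute the local cuDF dataframe
col = tbl["a"]                          # a LogicalColumn, not a pandas Series
doubled = LogicalColumn(cupynumeric.multiply(col, 2))  # operate via the Legate data interface
```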
## Install
You can install `legate-dataframe` packages from the [conda legate channel](https://anaconda.org/legate/) using:
```bash
conda install -c legate -c rapidsai -c conda-forge legate-dataframe
```
To include development releases, add the `legate/label/experimental` channel.
## Build
Legate-dataframe uses the Legate C++ API from Legate-core, as well as cuPyNumeric.
cuPyNumeric is only used in the Python tests and examples, so it isn't strictly necessary.
The currently tested versions are the legate and cuPyNumeric 24.11 releases, available from
the [conda legate channel](https://anaconda.org/legate/).
### Legate-dataframe
First we clone `legate-dataframe` and install the dependencies:
```bash
git clone https://github.com/rapidsai/legate-dataframe.git
cd legate-dataframe
mamba env update --name legate-dev --file conda/environments/all_cuda-124_arch-x86_64.yaml
```
Then we can build, install, and test the project:
```bash
./build.sh
./build.sh test
```
## Feature Status
| Feature                              | Status                 | Limitations                      |
|--------------------------------------|:----------------------:|----------------------------------|
| Copy to/from cuDF DataFrame | :white_check_mark: | |
| Parquet read & write | :white_check_mark: | |
| CSV read & write | :white_check_mark: | |
| Zero-copy to/from cuPyNumeric arrays | :white_check_mark: | |
| Hash-based inner join                | :white_check_mark:     |                                  |
| Hash-based left join                 | :white_check_mark:     |                                  |
| Hash-based full/outer join           | :white_check_mark:     |                                  |
| GroupBy Aggregation                  | :white_check_mark:     | Basic aggs. like SUM and NUNIQUE; see the sketch below |
| Numeric data types | :white_check_mark: | |
| Datetime data types | :white_check_mark: | |
| String data types | :white_check_mark: | |
| Null masked columns | :white_check_mark: | |
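The join and group-by rows above are exposed as task-style functions in the Python package. The following is only a rough sketch of how such calls could look; the module paths, function names, and keyword arguments here are assumptions for illustration, so consult the [Python API documentation](https://rapidsai.github.io/legate-dataframe/) for the exact entry points.

```python
# Rough sketch only: `lib.join.join`, `JoinType`, and `groupby_aggregation`
# below are assumed names for illustration, not verified API -- check the
# Python API documentation for the real entry points and signatures.
import cudf
from legate_dataframe import LogicalTable
from legate_dataframe.lib.join import join, JoinType                       # assumed module path
from legate_dataframe.lib.groupby_aggregation import groupby_aggregation   # assumed module path

left = LogicalTable.from_cudf(cudf.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = LogicalTable.from_cudf(cudf.DataFrame({"key": [2, 3, 4], "y": [1.0, 2.0, 3.0]}))

# Hash-based inner join on "key" (left and full/outer joins work analogously).
joined = join(left, right, lhs_keys=["key"], rhs_keys=["key"], join_type=JoinType.INNER)

# Basic group-by aggregation, e.g. the SUM of "x" per "key".
summed = groupby_aggregation(joined, keys=["key"], column_aggregations=[("x", "sum", "x_sum")])
```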
## Example
### Python
```python
import tempfile
import cudf
import cupynumeric
from legate.core import get_legate_runtime
from legate_dataframe import LogicalColumn, LogicalTable
from legate_dataframe.lib.parquet import parquet_read, parquet_write

def main(tmpdir):
    # Let's start by creating a logical table from a cuDF dataframe.
    # This takes a local dataframe and distributes it between Legate nodes.
    df = cudf.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4]})
    tbl1 = LogicalTable.from_cudf(df)

    # We can write the logical table to disk using the Parquet file format.
    # The table is written into multiple files, one file per partition:
    #   /tmpdir/
    #   ├── part-0.parquet
    #   ├── part-1.parquet
    #   ├── part-2.parquet
    #   └── ...
    parquet_write(tbl1, path=tmpdir)

    # NB: since Legate executes tasks lazily, we issue a blocking fence
    # in order to wait until all files have been written to disk.
    get_legate_runtime().issue_execution_fence(block=True)

    # Then we can read the parquet files back into a logical table. We
    # provide a glob string that references all the parquet files that
    # should go into the logical table.
    tbl2 = parquet_read(glob_string=f"{tmpdir}/*.parquet")

    # LogicalColumn implements the `__legate_data_interface__` interface,
    # which makes it possible for other Legate libraries, such as cuPyNumeric,
    # to operate on columns seamlessly.
    ary = cupynumeric.add(tbl1["a"], tbl2["b"])
    assert ary.sum() == 0
    ary[:] = [4, 3, 2, 1]

    # We can create a new logical column from any 1-D array-like object that
    # exposes the `__legate_data_interface__` interface.
    col = LogicalColumn(ary)

    # We can create a new logical table from existing logical columns.
    LogicalTable(columns=(col, tbl2["b"]), column_names=["a", "b"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        main(tmpdir)

        # Since Legate executes tasks lazily, we issue a blocking fence here
        # to make sure all tasks have finished before `tmpdir` is removed.
        get_legate_runtime().issue_execution_fence(block=True)
```
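To collect results back on a single node, the "Copy to/from cuDF DataFrame" feature also goes in the other direction. The sketch below assumes the method is called `to_cudf()`, mirroring `LogicalTable.from_cudf`; check the Python API documentation for the exact name.

```python
# Sketch: copy a distributed LogicalTable back into a local cuDF dataframe.
# NOTE: `to_cudf()` is an assumed name, chosen by symmetry with
# `LogicalTable.from_cudf`; verify it against the Python API documentation.
import cudf
from legate_dataframe import LogicalTable

tbl = LogicalTable.from_cudf(cudf.DataFrame({"a": [1, 2, 3]}))
local_df = tbl.to_cudf()  # gather the partitions into a single cudf.DataFrame
print(local_df)
```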
### C++
```c++
#include <filesystem>

// NOTE: the exact header paths below are assumed; adjust them to match your
// legate-dataframe installation if they differ.
#include <cudf/unary.hpp>
#include <legate.h>
#include <legate_dataframe/core/column.hpp>
#include <legate_dataframe/core/table.hpp>
#include <legate_dataframe/parquet.hpp>
#include <legate_dataframe/unaryop.hpp>

int main(int argc, char** argv)
{
  // First we initialize Legate; use either the `legate` launcher or
  // `LEGATE_CONFIG` to customize the launch.
  legate::start();

  // Then let's create a new logical column.
  legate::dataframe::LogicalColumn col_a = legate::dataframe::sequence(20, -10);

  // Compute the absolute value of each row in `col_a`.
  legate::dataframe::LogicalColumn col_b = unary_operation(col_a, cudf::unary_operator::ABS);

  // Create a new logical table that contains the two existing columns (zero-copy).
  legate::dataframe::LogicalTable tbl_a{{col_a, col_b}};

  // We can write the logical table to disk using the Parquet file format.
  // The table is written into multiple files, one file per partition:
  //   /my_parquet_file/
  //   ├── part-0.parquet
  //   ├── part-1.parquet
  //   ├── part-2.parquet
  //   └── ...
  legate::dataframe::parquet_write(tbl_a, "./my_parquet_file");

  // NB: since Legate executes tasks lazily, we issue a blocking fence
  // in order to wait until all files have been written to disk.
  legate::Runtime::get_runtime()->issue_execution_fence(true);

  // Then we can read the parquet files back into a logical table. We
  // provide a glob string that references all the parquet files that
  // should go into the logical table.
  auto tbl_b = legate::dataframe::parquet_read("./my_parquet_file/*.parquet");

  // Clean up.
  std::filesystem::remove_all("./my_parquet_file");
  return 0;
}
```
## Contributing
Please see [our contributing guide](CONTRIBUTING.md) and the [developer guide](DEVELOPER_GUIDE.md).