https://github.com/sbl-sdsc/df-parallel
Comparison of Dataframe libraries for parallel processing of large tabular files on CPU and GPU.
- Host: GitHub
- URL: https://github.com/sbl-sdsc/df-parallel
- Owner: sbl-sdsc
- License: bsd-3-clause
- Created: 2022-05-21T02:07:44.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-27T01:26:43.000Z (6 months ago)
- Last Synced: 2024-06-27T04:17:34.765Z (6 months ago)
- Topics: cuda-toolkit, dask, dask-cudf, dask-dataframes, dataframes, gpu-computing, parallel-processing, pyspark-dataframes, rapidsai
- Language: Jupyter Notebook
- Homepage:
- Size: 3.33 MB
- Stars: 5
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# df-parallel
This repo demonstrates how to set up CONDA environments for popular Dataframe libraries and process large tabular data files.
It compares parallel and out-of-core reading and processing of large datasets on CPU and GPU. Out-of-core processing handles data that are too large to fit into the computer's memory.

| Dataframe Library | Parallel | Out-of-core | CPU/GPU | Evaluation |
| ------------------| -------- | ----------- | ------- | ---------- |
| Pandas | no | no [1] | CPU | eager |
| Dask | yes | yes | CPU | lazy |
| Spark | yes | yes | CPU | lazy |
| cuDF | yes | no | GPU | eager |
| Dask-cuDF | yes | yes | GPU | lazy |

[1] Pandas can read data in chunks, but they have to be processed independently.
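
Footnote [1] can be illustrated with a minimal sketch of chunked reading in Pandas. The in-memory table and its column names (`gene_id`, `type`) are hypothetical stand-ins for a large file; only `pd.read_csv(..., chunksize=...)` is the actual Pandas feature being shown. Each chunk is reduced on its own, and the partial results are combined by hand at the end.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large tab-separated file (column names are assumptions).
tsv = io.StringIO(
    "gene_id\ttype\n"
    "1\tprotein-coding\n"
    "2\tpseudo\n"
    "3\tprotein-coding\n"
    "4\tncRNA\n"
    "5\tprotein-coding\n"
)


def count_by_type(source, chunksize):
    """Count rows per type, holding only one chunk in memory at a time."""
    partial_counts = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        # Each chunk is processed independently of the others.
        partial_counts.append(chunk["type"].value_counts())
    # The partial results have to be merged explicitly.
    return pd.concat(partial_counts).groupby(level=0).sum()


counts = count_by_type(tsv, chunksize=2)
print(counts)
```

This pattern works for reductions such as counts and sums. Operations that need all rows at once, such as a global sort, do not decompose this simply, which is where the out-of-core libraries in the table come in.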
## Running Jupyter Lab locally (CPU only)
------

Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3 and Mamba

* Install [Miniconda3](https://docs.conda.io/en/latest/miniconda.html)
* Install Mamba: ```conda install mamba -n base -c conda-forge```

------

1. Clone this git repository
```
git clone https://github.com/sbl-sdsc/df-parallel.git
```
2. Create CONDA environment
```
mamba env create -f df-parallel/environment.yml
```
3. Activate the CONDA environment
```
conda activate df-parallel
```
4. Launch Jupyter Lab
```
jupyter lab
```
5. Deactivate the CONDA environment
```
conda deactivate
```

------
> To remove the CONDA environment, run ```conda env remove -n df-parallel```
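
After activating the environment, a quick check from a notebook cell or a Python prompt can show which of the compared libraries are importable. This is a minimal sketch: the import names below are assumptions based on the libraries listed in the table, not something read from `environment.yml`, and the GPU libraries are expected to be missing in the CPU-only environment.

```python
import importlib.util

# Assumed import names for the libraries compared in this repo.
CPU_LIBRARIES = ["pandas", "dask", "pyspark"]
GPU_LIBRARIES = ["cudf", "dask_cudf"]


def check_libraries(names):
    """Map each top-level module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report availability without actually importing the (potentially heavy) packages.
for name, found in check_libraries(CPU_LIBRARIES + GPU_LIBRARIES).items():
    print(f"{name:10s} {'available' if found else 'missing'}")
```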
------

## Running Jupyter Lab on SDSC Expanse
To launch Jupyter Lab on [Expanse](https://www.sdsc.edu/services/hpc/expanse/), use the [galyleo](https://github.com/mkandes/galyleo#galyleo) script. Specify your ACCESS account number with the `--account` option. If you do not have an ACCESS account and allocation on Expanse, you can apply through NSF’s [ACCESS program](https://allocations.access-ci.org/get-your-first-project) or, for a trial allocation, contact .

1. Clone this git repository
```
git clone https://github.com/sbl-sdsc/df-parallel.git
```

2a. Run on CPU (Pandas, Dask, and Spark dataframes):
```
galyleo launch --account --partition shared --cpus 10 --memory 20 --time-limit 00:30:00 --conda-env df-parallel --conda-yml "${HOME}/df-parallel/environment.yml" --mamba
```

2b. Run on GPU (required for cuDF and Dask-cuDF dataframes):
```
galyleo launch --account --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 00:30:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba
```

## Running the example notebooks
After Jupyter Lab has been launched, run the notebook [DownloadData.ipynb](DownloadData.ipynb) to create a dataset. In this notebook, specify the number of copies (`ncopies`) to be made of the original dataset to increase its size. By default, a single copy is created. After the dataset has been created, run the dataframe-specific notebooks. Note that the cuDF and Dask-cuDF dataframe libraries require a GPU.

## Test results (not representative)
Results from running on an SDSC [Expanse GPU node](https://www.sdsc.edu/support/user_guides/expanse.html) with 10 CPU cores (Intel Xeon Gold 6248 2.5 GHz), 1 GPU (NVIDIA V100 SXM2, 32 GB), 92 GB of memory (DDR4 DRAM), and local storage (1.6 TB Samsung PM1745b NVMe PCIe SSD).

Datafile size (gene_info.tsv as of June 2022):
* Dataset 1: 5.4 GB (18 GB in Pandas)
* Dataset 2: 21.4 GB (4 x Dataset 1) (62.4 GB in Pandas)
* Dataset 3: 43.7 GB (8 x Dataset 1)

| Dataframe Library | time(5.4 GB) (s) | time(21.4 GB) (s) | time(43.7 GB) (s) | Parallel | Out-of-core | CPU/GPU |
| ----------------- | ----------------: | ----------------: | ----------------: | -------- | ----------- | ------- |
| Pandas | 56.3 | 222.4 | -- [2] | no | no | CPU |
| Dask | 15.7 | 42.1 | 121.8 | yes | yes | CPU |
| Spark | 14.2 | 31.2 | 56.5 | yes | yes | CPU |
| cuDF | 3.2 | -- [2] | -- [2] | yes | no | GPU |
| Dask-cuDF | 7.3 | 11.9 | 19.0 | yes | yes | GPU |

[2] out of memory
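
To make the timings easier to compare, the speedup of each library relative to Pandas can be computed directly from the table above. This is a small sketch using only the numbers reported there; runs that ran out of memory are recorded as `None`, so no speedup is reported where either the library or the Pandas baseline has no timing.

```python
# Wall-clock times in seconds, copied from the results table; None marks an out-of-memory run.
SIZES_GB = [5.4, 21.4, 43.7]
TIMES = {
    "Pandas": [56.3, 222.4, None],
    "Dask": [15.7, 42.1, 121.8],
    "Spark": [14.2, 31.2, 56.5],
    "cuDF": [3.2, None, None],
    "Dask-cuDF": [7.3, 11.9, 19.0],
}


def speedups(times, baseline="Pandas"):
    """Return baseline_time / library_time per dataset, or None if either run is missing."""
    base = times[baseline]
    return {
        lib: [b / t if b is not None and t is not None else None for b, t in zip(base, runs)]
        for lib, runs in times.items()
        if lib != baseline
    }


result = speedups(TIMES)
for lib, values in result.items():
    cells = [
        f"{size} GB: " + (f"{v:.1f}x" if v is not None else "n/a")
        for size, v in zip(SIZES_GB, values)
    ]
    print(f"{lib:10s} " + ", ".join(cells))
```

As the section heading notes, these results are not representative, so the ratios describe this one configuration only.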