# df-parallel

This repo demonstrates how to set up CONDA environments for popular Dataframe libraries and process large tabular data files.

It compares parallel and out-of-core (data that are too large to fit into the computer's memory) reading and processing of large datasets on CPU and GPU.

| Dataframe Library | Parallel | Out-of-core | CPU/GPU | Evaluation |
| ------------------| -------- | ----------- | ------- | ---------- |
| Pandas      | no      | no [1]  | CPU | eager |
| Dask        | yes     | yes | CPU | lazy |
| Spark       | yes     | yes | CPU | lazy |
| cuDF        | yes     | no  | GPU | eager |
| Dask-cuDF   | yes     | yes | GPU | lazy |

[1] Pandas can read data in chunks, but the chunks have to be processed independently.

## Running Jupyter Lab locally (CPU only)
------
Prerequisites: Miniconda3 (light-weight, preferred) or
Anaconda3 and Mamba

* Install [Miniconda3](https://docs.conda.io/en/latest/miniconda.html)
* Install Mamba: ```conda install mamba -n base -c conda-forge```
------

1. Clone this git repository

```
git clone https://github.com/sbl-sdsc/df-parallel.git
```
2. Create the CONDA environment

```
mamba env create -f df-parallel/environment.yml
```
3. Activate the CONDA environment

```
conda activate df-parallel
```
4. Launch Jupyter Lab

```
jupyter lab
```

5. Deactivate the CONDA environment

```
conda deactivate
```

------
> To remove the CONDA environment, run ```conda env remove -n df-parallel```
------


## Running Jupyter Lab on SDSC Expanse
To launch Jupyter Lab on [Expanse](https://www.sdsc.edu/services/hpc/expanse/), use the [galyleo](https://github.com/mkandes/galyleo#galyleo) script. Specify your ACCESS account number with the --account option. If you do not have an ACCESS account and allocation on Expanse, you can apply through NSF's [ACCESS program](https://allocations.access-ci.org/get-your-first-project) or, for a trial allocation, contact <consult@sdsc.edu>.

1. Clone this git repository

```
git clone https://github.com/sbl-sdsc/df-parallel.git
```

2a. Run on CPU (Pandas, Dask, and Spark dataframes):
```
galyleo launch --account <account_number> --partition shared --cpus 10 --memory 20 --time-limit 00:30:00 --conda-env df-parallel --conda-yml "${HOME}/df-parallel/environment.yml" --mamba
```

2b. Run on GPU (required for cuDF and Dask-cuDF dataframes):
```
galyleo launch --account <account_number> --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 00:30:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba
```

## Running the example notebooks
After Jupyter Lab has been launched, run the notebook [DownloadData.ipynb](DownloadData.ipynb) to create a dataset.
In this notebook, specify the number of copies (`ncopies`) to be made of the original dataset to increase its size. By default, a single copy is created. After the dataset has been created, run the dataframe-specific notebooks. Note that the cuDF and Dask-cuDF dataframe libraries require a GPU.

## Test results (not representative)
Results for running on an SDSC [Expanse GPU node](https://www.sdsc.edu/support/user_guides/expanse.html) with 10 CPU cores (Intel Xeon Gold 6248, 2.5 GHz), 1 GPU (NVIDIA V100 SXM2, 32 GB), 92 GB of memory (DDR4 DRAM), and local storage (1.6 TB Samsung PM1745b NVMe PCIe SSD).

Datafile size (gene_info.tsv as of June 2022):

* Dataset 1: 5.4 GB (18 GB in Pandas)
* Dataset 2: 21.4 GB (4 x Dataset 1) (62.4 GB in Pandas)
* Dataset 3: 43.7 GB (8 x Dataset 1)

| Dataframe Library | time(5.4 GB) (s) | time(21.4 GB) (s) | time(43.7 GB) (s) | Parallel | Out-of-core | CPU/GPU |
| ----------------- | ---------------: | ----------------: | ----------------: | -------- | ----------- | ------- |
| Pandas            | 56.3 | 222.4  | -- [2] | no       | no  | CPU |
| Dask              | 15.7 |  42.1  | 121.8  | yes      | yes | CPU |
| Spark             | 14.2 |  31.2  |  56.5  | yes      | yes | CPU |
| cuDF              |  3.2 | -- [2] | -- [2] | yes      | no  | GPU |
| Dask-cuDF         |  7.3 |  11.9  |  19.0  | yes      | yes | GPU |

[2] out of memory
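As footnote [1] above notes, Pandas is not truly out-of-core, but it can read a large file in chunks that are then processed independently, with the partial results combined by hand. A minimal sketch of that pattern (not from this repo; the synthetic file, column names, and chunk size are illustrative stand-ins for gene_info.tsv):

```python
# Illustrates footnote [1]: chunked reading in Pandas. Each chunk is
# aggregated on its own, then the per-chunk partial results are merged.
import os
import tempfile
import pandas as pd

# Create a tiny synthetic TSV standing in for gene_info.tsv.
rows = ["gene\ttax_id"] + [f"g{i}\t{i % 3}" for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "gene_info.tsv")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

# Count rows per tax_id chunk by chunk (chunksize=4 -> chunks of 4, 4, 2),
# then combine the partial counts across chunks.
partials = [
    chunk.groupby("tax_id").size()
    for chunk in pd.read_csv(path, sep="\t", chunksize=4)
]
counts = pd.concat(partials).groupby(level=0).sum()
print(counts.to_dict())  # {0: 4, 1: 3, 2: 3}
```

The combine step is the part Dask and Dask-cuDF automate: their lazy task graphs perform the same partition-wise aggregation and merge without the user writing it explicitly.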