{"id":13459569,"url":"https://github.com/mzjp2/kedro-dataframe-dropin","last_synced_at":"2026-03-09T10:33:38.163Z","repository":{"id":54718685,"uuid":"329766326","full_name":"mzjp2/kedro-dataframe-dropin","owner":"mzjp2","description":"A Kedro plugin that provides pandas dropin replacements for the pandas datasets (e.g modin and cuDF)","archived":false,"fork":false,"pushed_at":"2021-02-02T17:41:05.000Z","size":528,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-02-16T23:58:35.722Z","etag":null,"topics":["data","gpu-acceleration","kedro-catalog","kedro-plugin","modin","rapidsai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mzjp2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-15T00:17:39.000Z","updated_at":"2025-02-09T03:03:32.000Z","dependencies_parsed_at":"2022-08-14T00:40:54.930Z","dependency_job_id":null,"html_url":"https://github.com/mzjp2/kedro-dataframe-dropin","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/mzjp2/kedro-dataframe-dropin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzjp2%2Fkedro-dataframe-dropin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzjp2%2Fkedro-dataframe-dropin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzjp2%2Fkedro-dataframe-dropin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzjp2%2Fkedro-dataframe-dropin/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mzjp2","download_url":"https://codeload.github.com/mzjp2/kedro-dataframe-dropin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mzjp2%2Fkedro-dataframe-dropin/sbom","scorecard":{"id":671776,"data":{"date":"2025-08-11","repo":{"name":"github.com/mzjp2/kedro-dataframe-dropin","commit":"f8a329c0cbd5c58e73e7d6fbd6be6104cc856053"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.8,"checks":[{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/main.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least 
privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":0,"reason":"Found 0/11 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/main.yml:13: update your workflow using https://app.stepsecurity.io/secureworkflow/mzjp2/kedro-dataframe-dropin/main.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/main.yml:15: update your workflow using https://app.stepsecurity.io/secureworkflow/mzjp2/kedro-dataframe-dropin/main.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/main.yml:20","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project 
has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.md:0","Info: FSF or OSI recognized license: MIT License: LICENSE.md:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 9 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":0,"reason":"39 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6","Warn: Project is vulnerable to: PYSEC-2022-42986 / GHSA-43fp-rhv2-5gv8","Warn: Project is vulnerable to: PYSEC-2023-135 / GHSA-xqr8-7jwr-rhp7","Warn: Project is vulnerable to: PYSEC-2022-204 / GHSA-f4q6-9qm4-h8j4","Warn: Project is vulnerable to: PYSEC-2024-4 / GHSA-2mqj-m65w-jghx","Warn: Project is vulnerable to: PYSEC-2023-165 / GHSA-cwvm-v4w8-q58c","Warn: Project is vulnerable to: PYSEC-2022-42992 / GHSA-hcpj-qp55-gfph","Warn: Project is vulnerable to: PYSEC-2023-137 / GHSA-pr76-5cm5-w9cj","Warn: Project is vulnerable to: PYSEC-2023-161 / GHSA-wfm5-v35h-vwf4","Warn: Project is vulnerable to: PYSEC-2024-60 / GHSA-jjg7-2v4v-x38h","Warn: Project is vulnerable to: GHSA-cpwx-vrp4-4pq7","Warn: Project is vulnerable to: GHSA-h5c8-rqwp-cp95","Warn: Project is vulnerable to: GHSA-h75v-3vvj-5mfj","Warn: Project is vulnerable to: GHSA-q2x7-8rv6-6q7h","Warn: Project is vulnerable to: GHSA-33p9-3p43-82vq","Warn: Project is vulnerable to: PYSEC-2022-42974 / GHSA-m678-f26j-3hrp","Warn: Project is vulnerable to: GHSA-747f-ww56-4q4h","Warn: Project is vulnerable to: GHSA-rm69-wvpv-r2w7","Warn: Project is vulnerable to: GHSA-6p56-wp2h-9hxr","Warn: Project is vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2022-42969","Warn: Project is vulnerable to: PYSEC-2021-112 / GHSA-hwfp-hg2m-9vr2","Warn: Project is 
vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: GHSA-753j-mpmx-qq6g","Warn: Project is vulnerable to: GHSA-7cx3-6m66-7c5m","Warn: Project is vulnerable to: GHSA-8w49-h785-mj3c","Warn: Project is vulnerable to: PYSEC-2023-75 / GHSA-hj3f-6gcp-jg8j","Warn: Project is vulnerable to: GHSA-qppv-j76h-2rpx","Warn: Project is vulnerable to: GHSA-w235-7p84-xx57","Warn: Project is vulnerable to: GHSA-34jh-p97f-mpxf","Warn: Project is vulnerable to: PYSEC-2021-59 / GHSA-5phf-pp7p-vc2r","Warn: Project is vulnerable to: PYSEC-2023-212 / GHSA-g4mx-q9vg-27p4","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v","Warn: Project is vulnerable to: PYSEC-2021-108 / GHSA-q2q7-5pp4-w6pg","Warn: Project is vulnerable to: PYSEC-2023-192 / GHSA-v845-jxx5-vc9f","Warn: Project is vulnerable to: PYSEC-2024-187 / GHSA-rqc4-2hc7-8c8v","Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-21T20:20:10.756Z","repository_id":54718685,"created_at":"2025-08-21T20:20:10.756Z","updated_at":"2025-08-21T20:20:10.756Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30291807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T02:57:19.223Z","status":"ssl_error","status_checked_at":"2026-03-09T02:56:26.373Z","response_time":61,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","gpu-acceleration","kedro-catalog","kedro-plugin","modin","rapidsai"],"created_at":"2024-07-31T10:00:19.800Z","updated_at":"2026-03-09T10:33:38.138Z","avatar_url":"https://github.com/mzjp2.png","language":"Python","funding_links":[],"categories":["[Kedro plugins](https://docs.kedro.org/en/stable/extend_kedro/plugins.html)"],"sub_categories":[],"readme":"![logo](static/logo.png)\n\n# kedro-dataframe-dropin\n\n![github-action](https://github.com/mzjp2/kedro-dataframe-dropin/workflows/Lint%20and%20test/badge.svg)\n![code-style](https://img.shields.io/badge/code%20style-black-000000.svg)\n![license](https://img.shields.io/badge/License-MIT-green.svg)\n\n## How do I get started?\n\n```bash\n$ pip install kedro-dataframe-dropin --upgrade\n```\n\n## Then what?\n\nReplace your `pandas.*DataSet` in your `catalog.yml` with\n\n```\nkedro_dataframe_dropin.[rapids|modin].*DataSet\n```\n\nand reap the benefits, as long as your node and pipeline code is compatible with the `cudf`/`modin` API (that tries to replicate `pandas` as much as possible) and your data format is supported by the respective libraries (for example, `cudf` doesn't support the `read_excel` method)\n## What is kedro-dataframe-dropin?\n\nkedro-dataframe-dropin is a Kedro plugin that provides modified versions of the `pandas.*` dataset definitions (e.g `pandas.CSVDataSet`) from Kedro, where each dataset has been replaced with one of `pandas` drop-in replacements.\n\nFor example `kedro_dataframe_dropin.modin.CSVDataSet` replicates 
`pandas.CSVDataSet` but with the `modin.pandas` package replacing `pandas`. Likewise, `kedro_dataframe_dropin.rapids.CSVDataSet` provides a `cuDF`-backed version of the `CSVDataSet`.\n\n## Why does this exist?\n\nThere might be several reasons why you'd want to consider a drop-in replacement for Pandas. The use-cases are outlined in various places, such as the [modin documentation](http://modin.readthedocs.io) or [the RAPIDS website](https://rapids.ai).\n\nHowever, the only dataframe-backed datasets that Kedro has out of the box are the `pandas` and `pyspark` ones. If you wanted to use, say, a `modin` dataframe backed by `Dask` or `Ray`, you'd need to write a [custom dataset](https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html) for each file format (`.csv`, `.xls`, etc...).\n\nThis plugin lets you swap out your `catalog.yml` from:\n\n```yaml\n# conf/base/catalog.yml [before]\nrockets:\n    type: pandas.CSVDataSet\n    filepath: data/01_raw/rockets.csv\n\nreviews:\n    type: pandas.ExcelDataSet\n    filepath: data/01_raw/reviews.xlsx\n```\n\nto:\n\n```yaml\n# conf/base/catalog.yml [after]\nrockets:\n    type: kedro_dataframe_dropin.rapids.CSVDataSet\n    filepath: data/01_raw/rockets.csv\n\nreviews:\n    type: kedro_dataframe_dropin.modin.ExcelDataSet\n    filepath: data/01_raw/reviews.xlsx\n```\n\nand as long as the code within your nodes fits within `modin` or `cudf`'s implementation of a subset of the `pandas` API, you'll be done!\n\n## What dropins are currently supported?\n\n| dropin       | supported |\n| ------------ | --------- |\n| modin[ray]   | ✅        |\n| modin[dask]  | ✅        |\n| cudf         | ✅        |\n| dask         | 🟠        |\n| dask-cudf    | 🟠        |\n\n✅: compatible\n\n🟠: No Kedro versioning, and some datasets (like `SQLTableDataSet`) don't work despite being available on both `kedro` and the drop-in.\n\n## What happens when Kedro adds or changes a `pandas` dataset?\n\nThe beauty of it is that this will stay in 
complete sync with Kedro's `pandas.*` library without any code changes or releases required. It's implemented through hot-swapping the `pandas` module with one of the replacements you specified.\n\n## Examples\n\nAs an example of why you might want to use this, here are the results of some very rough and preliminary benchmarking. These were conducted on a Google Colaboratory notebook (thanks Google!) with a Tesla T4 GPU and a 2-core CPU. The data used was a 5 million row CSV, weighing in at around 100 MB, downloaded from [here](http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/).\n\n```\n# conf/base/catalog.yml\ncudf:\n  type: kedro_dataframe_dropin.rapids.CSVDataSet\n  filepath: data/01_raw/data.csv\n\npandas:\n  type: pandas.CSVDataSet\n  filepath: data/01_raw/data.csv\n```\n\nUsing the two datasets within the `kedro ipython` console shows a world of difference: in the timings below, reading the file in is about 12x faster, creating a groupby is about 6x faster and taking the mean is about 4x faster.\n\nThis helps shorten:\n\n* The feedback loop when prototyping and exploring your data within a `kedro ipython` or a `kedro jupyter` session\n* The feedback loop when running your pipelines in development and debugging/experimenting with various different methodologies\n* Your production runtime\n\n```\nIn [1]: %timeit gdf = catalog.load(\"cudf\")\n702 ms ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\nIn [2]: %timeit df = catalog.load(\"pandas\")\n8.22 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\nIn [3]: %timeit gdf.groupby(\"Region\")\n4.75 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n\nIn [4]: %timeit df.groupby(\"Region\")\n26.7 µs ± 397 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n\nIn [5]: %timeit df[\"Total Revenue\"].mean()\n11.8 ms ± 87.7 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n\nIn [6]: %timeit gdf[\"Total Revenue\"].mean()\n2.71 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n```\n\nAny additional benchmarks you do and want to share back would be much appreciated. Feel free to open an issue!\n\n## Some special notes on RAPIDS\n\n### The rest of the `cu*` ecosystem\n\nYour data processing step gets faster (assuming you have the right conditions) by plugging in the `cudf` module from RAPIDS in place of `pandas`, but it doesn't end there.\n\nYou can continue to make use of your GPU speedup in the rest of your pipeline lifecycle (predictions, ML, graph, etc...) by using the rest of the `cu*` ecosystem of tools (`cuML` and its ilk) in place of tools like `sklearn`.\n\n### Why are some data formats missing?\n\nWith the way this plugin was designed, it only hot swaps in `cudf` in place of `pandas` where the Kedro pandas dataset exists.\n\nSo as it stands today, with the Kedro codebase not having an `ORCDataSet` for example, this plugin won't have it either. You'll need to build your own custom one.\n\nOr better yet, head over to the [Kedro](https://github.com/quantumblacklabs/kedro) codebase and contribute the `pandas` version of it to their codebase. This plugin will then automatically pick it up and provide a `cudf`-equivalent.\n\n## Some special notes on `dask-cuDF` and `dask`\n\nNote that `dask` and `dask-cuDF` delay computation: operations across nodes only build up a computation graph. The work is parallelised across your CPU/GPU when you trigger computation, for example by calling `.compute()` or `len`, or by saving the result to disk by having its output be a non-memory dataset in the catalog.\n\nNote that Kedro versioning won't be possible with these datasets, since Kedro completely owns the I/O and simply passes the file handle down to `dask`/`dask-cuDF`, which don't accept it, since file handles can't be shared across (CPU or GPU) workers. 
Instead, what we do is extract the filepath and pass it to `dask`, which also uses `fsspec`, so you still have full remote-layer interoperability with the benefit of parallelised compute.\n\nConsider giving Matthew Rocklin's [blog post on `dask-cuDF`](http://matthewrocklin.com/blog/2019/01/13/dask-cudf-first-steps) and the philosophy of it simply being a different \"engine\" for `dask.DataFrame` a read.\n\n## Caveats\n\nKeep in mind that in order to remain consistent with the adage of not copying memory, when passing these dataframes between nodes, they _will not_ be copied - but simply passed through as the same underlying Python object, so if you're doing mutable operations on them across different nodes, keep that in mind.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmzjp2%2Fkedro-dataframe-dropin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmzjp2%2Fkedro-dataframe-dropin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmzjp2%2Fkedro-dataframe-dropin/lists"}