{"id":13728325,"url":"https://github.com/capeprivacy/cape-dataframes","last_synced_at":"2026-05-11T17:01:22.999Z","repository":{"id":38029038,"uuid":"258512322","full_name":"capeprivacy/cape-dataframes","owner":"capeprivacy","description":"Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.","archived":false,"fork":false,"pushed_at":"2023-07-25T21:25:55.000Z","size":446,"stargazers_count":173,"open_issues_count":12,"forks_count":20,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-30T07:07:49.269Z","etag":null,"topics":["collaboration","data-science","hacktoberfest","machine-learning","pandas","policy","privacy","python","spark"],"latest_commit_sha":null,"homepage":"https://docs.capeprivacy.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/capeprivacy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-04-24T12:55:37.000Z","updated_at":"2025-03-17T23:19:38.000Z","dependencies_parsed_at":"2024-01-07T16:22:59.643Z","dependency_job_id":null,"html_url":"https://github.com/capeprivacy/cape-dataframes","commit_stats":{"total_commits":157,"total_committers":11,"mean_commits":"14.272727272727273","dds":0.5859872611464968,"last_synced_commit":"ed65cece5caebcce1ac549573514834effab5ecd"},"previous_names":["capeprivacy/cape-python"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capeprivacy%2Fcape-dataframes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capepr
ivacy%2Fcape-dataframes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capeprivacy%2Fcape-dataframes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capeprivacy%2Fcape-dataframes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/capeprivacy","download_url":"https://codeload.github.com/capeprivacy/cape-dataframes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247451652,"owners_count":20940939,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collaboration","data-science","hacktoberfest","machine-learning","pandas","policy","privacy","python","spark"],"created_at":"2024-08-03T02:00:40.534Z","updated_at":"2026-05-11T17:01:17.963Z","avatar_url":"https://github.com/capeprivacy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cape Dataframes\n\n[![](https://github.com/capeprivacy/cape-dataframes/workflows/Main/badge.svg)](https://github.com/capeprivacy/cape-dataframes/actions/workflows/main.yml)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) \n[![codecov](https://codecov.io/gh/capeprivacy/cape-python/branch/master/graph/badge.svg?token=L9A8HFAJK5)](https://codecov.io/gh/capeprivacy/cape-python)\n[![PyPI version](https://badge.fury.io/py/cape-privacy.svg)](https://badge.fury.io/py/cape-privacy)\n[![Cape Community 
Discord](https://img.shields.io/discord/1027271440061435975)](https://discord.gg/nQW7YxUYjh)\n\nA Python library supporting data transformations and collaborative privacy policies, for data science projects in Pandas and Apache Spark.\n\nSee below for instructions on how to get started, or visit the [documentation](https://github.com/capeprivacy/cape-dataframes/tree/master/docs/).\n\n## Getting started\n\n### Prerequisites\n\n* Python 3.6 or above, and pip\n* Pandas 1.0+\n* PySpark 3.0+ (if using Spark)\n* [Make](https://www.gnu.org/software/make/) (if installing from source)\n\n### Install with pip\n\nCape Dataframes is available through PyPI.\n\n```sh\npip install cape-dataframes\n```\n\nSupport for Apache Spark is optional. If you plan to use the library with Apache Spark, we suggest the following instead:\n\n```sh\npip install cape-dataframes[spark]\n```\n\nWe recommend installing it in a virtual environment, such as [venv](https://docs.python.org/3/library/venv.html).\n\n### Install from source\n\nIt is possible to install the library from source. 
This installs all dependencies, including Apache Spark:\n\n```sh\ngit clone https://github.com/capeprivacy/cape-dataframes.git\ncd cape-dataframes\nmake bootstrap\n```\n### Usage example\n\n*This example is an abridged version of the tutorial found [here](https://github.com/capeprivacy/cape-dataframes/tree/master/examples/tutorials)*\n\n\n```python\ndf = pd.DataFrame({\n    \"name\": [\"alice\", \"bob\"],\n    \"age\": [34, 55],\n    \"birthdate\": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],\n})\n\ntokenize = Tokenizer(max_token_len=10, key=b\"my secret\")\nperturb_numeric = NumericPerturbation(dtype=dtypes.Integer, min=-10, max=10)\n\ndf[\"name\"] = tokenize(df[\"name\"])\ndf[\"age\"] = perturb_numeric(df[\"age\"])\n\nprint(df.head())\n# \u003e\u003e\n#          name  age  birthdate\n# 0  f42c2f1964   34 1985-02-23\n# 1  2e586494b2   63 1963-05-10\n```\n\nThese steps can be saved in policy files so you can share them and collaborate with your team:\n\n```yaml\n# my-policy.yaml\nlabel: my-policy\nversion: 1\nrules:\n  - match:\n      name: age\n    actions:\n      - transform:\n          type: numeric-perturbation\n          dtype: Integer\n          min: -10\n          max: 10\n          seed: 4984\n  - match:\n      name: name\n    actions:\n      - transform:\n          type: tokenizer\n          max_token_len: 10\n          key: my secret\n``` \n\nYou can then load this policy and apply it to your data frame:\n\n```python\n# df can be a Pandas or Spark data frame \npolicy = cape.parse_policy(\"my-policy.yaml\")\ndf = cape.apply_policy(policy, df)\n\nprint(df.head())\n# \u003e\u003e\n#          name  age  birthdate\n# 0  f42c2f1964   34 1985-02-23\n# 1  2e586494b2   63 1963-05-10\n```\n\nYou can see more [examples and usage](https://github.com/capeprivacy/cape-dataframes/tree/master/examples/) or read our [documentation](https://github.com/capeprivacy/cape-dataframes/tree/master/docs/).\n\n## About Cape Privacy and Cape Dataframes\n\n[Cape 
Privacy](https://capeprivacy.com) empowers developers to easily encrypt data and process it confidentially. No cryptography or key management required. Learn more at [capeprivacy.com](https://capeprivacy.com).\n\nCape Dataframes brings Cape's policy language to Pandas and Apache Spark. The supported techniques include tokenization with linkability as well as perturbation and rounding. You can experiment with these techniques programmatically in Python or through human-readable policy files.\n\n### Project status and roadmap\n\nCape Python 0.1.1 was released on 24 June 2020. It is actively maintained and developed, alongside other elements of the Cape ecosystem.\n\n**Upcoming features:**\n\n* Reversible tokenization: allow a token to be reversed to reveal the raw value.\n* Expand pipeline integrations: add support for Apache Beam, Apache Flink, Apache Arrow Flight, or Dask, either as part of Cape Dataframes or as a separate project.\n\n## Help and resources\n\nIf you need help using Cape Dataframes, you can:\n\n* View the [documentation](https://github.com/capeprivacy/cape-dataframes/tree/master/docs/).\n* Submit an issue.\n* Talk to us on the [Cape Community Discord](https://discord.gg/nQW7YxUYjh) [![Cape Community Discord](https://img.shields.io/discord/1027271440061435975)](https://discord.gg/nQW7YxUYjh)\n\nPlease file [feature requests](https://github.com/capeprivacy/cape-dataframes/issues/new?template=feature_request.md) and\n[bug reports](https://github.com/capeprivacy/cape-dataframes/issues/new?template=bug_report.md) as GitHub issues.\n\n### Contributing\n\nView our [contributing](CONTRIBUTING.md) guide for more information.\n\n### Code of conduct\n\nOur [code of conduct](https://capeprivacy.com/conduct/) is included on the Cape Privacy website. All community members are expected to follow it. 
Please refer to that page for information on how to report problems.\n\n## License\n\nLicensed under Apache License, Version 2.0 (see [LICENSE](https://github.com/capeprivacy/cape-python/blob/master/LICENSE) or http://www.apache.org/licenses/LICENSE-2.0). Copyright as specified in [NOTICE](https://github.com/capeprivacy/cape-python/blob/master/NOTICE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapeprivacy%2Fcape-dataframes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcapeprivacy%2Fcape-dataframes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapeprivacy%2Fcape-dataframes/lists"}