{"id":13689137,"url":"https://github.com/sfu-db/dataprep","last_synced_at":"2025-05-14T06:12:38.529Z","repository":{"id":37470970,"uuid":"186311346","full_name":"sfu-db/dataprep","owner":"sfu-db","description":"Open-source low code data preparation library in python. Collect, clean and visualization your data in python with a few lines of code.","archived":false,"fork":false,"pushed_at":"2024-06-27T16:57:45.000Z","size":224360,"stargazers_count":2143,"open_issues_count":165,"forks_count":212,"subscribers_count":25,"default_branch":"develop","last_synced_at":"2025-04-09T02:15:53.831Z","etag":null,"topics":["apis","apiwrapper","cleaning","connector","data-exploration","data-science","datacleaning","dataconnector","dataprep","datapreparation","eda","exploratory-data-analysis","webconnector"],"latest_commit_sha":null,"homepage":"http://dataprep.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sfu-db.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-12T22:37:24.000Z","updated_at":"2025-04-08T00:18:03.000Z","dependencies_parsed_at":"2024-10-14T15:21:09.451Z","dependency_job_id":"48369d3c-0aa2-4bef-afe1-16c6d27d017f","html_url":"https://github.com/sfu-db/dataprep","commit_stats":{"total_commits":692,"total_committers":47,"mean_commits":14.72340425531915,"dds":0.7947976878612717,"last_synced_commit":"55c3932a8cdc8f9ae69660a4083f3519fe159dc5"},"previous_names":[],"tags_count":24,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sfu-db%2Fdatapr
ep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sfu-db%2Fdataprep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sfu-db%2Fdataprep/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sfu-db%2Fdataprep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sfu-db","download_url":"https://codeload.github.com/sfu-db/dataprep/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254080576,"owners_count":22011446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apis","apiwrapper","cleaning","connector","data-exploration","data-science","datacleaning","dataconnector","dataprep","datapreparation","eda","exploratory-data-analysis","webconnector"],"created_at":"2024-08-02T15:01:35.078Z","updated_at":"2025-05-14T06:12:38.464Z","avatar_url":"https://github.com/sfu-db.png","language":"Python","funding_links":[],"categories":["Data Manipulation","Python","📊 Data Profiling","其他_机器学习与深度学习","Data Exploration"],"sub_categories":["Data-centric AI"],"readme":"\u003cdiv align=\"center\"\u003e\u003cimg width=\"100%\" src=\"https://github.com/sfu-db/dataprep/raw/develop/assets/logo.png\"/\u003e\u003c/div\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/pypi/l/dataprep?style=flat-square\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://sfu-db.github.io/dataprep/\"\u003e\u003cimg 
src=\"https://img.shields.io/badge/dynamic/json?color=blue\u0026label=docs\u0026prefix=v\u0026query=%24.info.version\u0026url=https%3A%2F%2Fpypi.org%2Fpypi%2Fdataprep%2Fjson\u0026style=flat-square\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/dataprep/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/dataprep?style=flat-square\"/\u003e\u003c/a\u003e\n  \u003c!-- \u003ca href=\"https://www.codacy.com/gh/sfu-db/dataprep?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=sfu-db/dataprep\u0026amp;utm_campaign=Badge_Coverage\"\u003e\u003cimg src=\"https://app.codacy.com/project/badge/Coverage/ed658f08dcce4f088c850253475540ba\"/\u003e\u003c/a\u003e --\u003e\n\u003c!--   \u003ca href=\"https://codecov.io/gh/sfu-db/dataprep\"\u003e\u003cimg src=\"https://img.shields.io/codecov/c/github/sfu-db/dataprep?style=flat-square\"/\u003e\u003c/a\u003e --\u003e\n  \u003c!-- \u003ca href=\"https://www.codacy.com/gh/sfu-db/dataprep?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=sfu-db/dataprep\u0026amp;utm_campaign=Badge_Grade\"\u003e\u003cimg src=\"https://app.codacy.com/project/badge/Grade/ed658f08dcce4f088c850253475540ba\"/\u003e\u003c/a\u003e --\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://sfu-db.github.io/dataprep/\"\u003eDocumentation\u003c/a\u003e\n  | \n  \u003ca href=\"https://github.com/sfu-db/dataprep/discussions\"\u003eForum\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cb\u003eLow code data preparation\u003c/b\u003e\u003c/p\u003e\n\nCurrently, you can use DataPrep to:\n\n- Collect data from common data sources (through [`dataprep.connector`](#connector))\n- Do your exploratory data analysis (through [`dataprep.eda`](#eda))\n- Clean and standardize data (through [`dataprep.clean`](#clean))\n- ...more modules are coming\n\n## Releases\n\n\u003cdiv align=\"center\"\u003e\n  \u003ctable\u003e\n    \u003ctr\u003e\n      
\u003cth\u003eRepo\u003c/th\u003e\n      \u003cth\u003eVersion\u003c/th\u003e\n      \u003cth\u003eDownloads\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003ctd\u003ePyPI\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://pypi.org/project/dataprep/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/dataprep?style=flat-square\"/\u003e\u003c/a\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://pepy.tech/project/dataprep\"\u003e\u003cimg src=\"https://pepy.tech/badge/dataprep\"/\u003e\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e \n      \u003ctd\u003econda-forge\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://anaconda.org/conda-forge/dataprep\"\u003e\u003cimg src=\"https://img.shields.io/conda/vn/conda-forge/dataprep.svg\"/\u003e\u003c/a\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003ca href=\"https://anaconda.org/conda-forge/dataprep\"\u003e\u003cimg src=\"https://img.shields.io/conda/dn/conda-forge/dataprep.svg\"/\u003e\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/table\u003e\n\u003c/div\u003e\n\n## Installation\n\n```bash\npip install -U dataprep\n```\n\n## EDA\n\nDataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.\n\n#### Create Profile Reports, Fast\n\nYou can create a beautiful profile report from a Pandas/Dask DataFrame with the `create_report` function. 
DataPrep.EDA has the following advantages compared to other tools:\n\n- **[10X Faster](https://arxiv.org/abs/2104.00841)**: DataPrep.EDA can be 10X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.\n- **Interactive Visualization**: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.\n- **Big Data Support**: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.\n\nThe following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.\n\n```python\nfrom dataprep.datasets import load_dataset\nfrom dataprep.eda import create_report\ndf = load_dataset(\"titanic\")\ncreate_report(df).show_browser()\n```\n\nClick [here](https://docs.dataprep.ai/_downloads/1a61c6aebb3ecbe9dc9742bd6ca78ddb/titanic_dp.html) to see the generated report of the above code.\n\nClick [here](https://docs.dataprep.ai/dev/bench/index.html) to see the benchmark result.\n\n#### Try DataPrep.EDA Online: [DataPrep.EDA Demo in Colab](https://colab.research.google.com/drive/1U_-pAMcne3hK1HbMB3kuEt-093Np_7Uk?usp=sharing)\n\n#### Innovative System Design\n\nDataPrep.EDA is the **_only_** task-centric EDA system in Python. It is carefully designed to improve usability.\n\n- **Task-Centric API Design**: You can declaratively specify a wide range of EDA tasks in different granularity with a single function call. All needed visualizations will be automatically and intelligently generated for you.\n- **Auto-Insights**: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.\n- **How-to Guide**: A how-to guide is provided to show the configuration of each plot function. 
With this feature, you can easily customize the generated visualizations.\n\n#### Learn DataPrep.EDA in 2 minutes:\n\n\u003ca href=\"https://youtu.be/nSkQy3ew3EI\"\u003e\u003cimg src=\"assets/eda_video_cover.png\"/\u003e\u003c/a\u003e\n\nClick [here](https://sfu-db.github.io/dataprep/user_guide/eda/introduction.html) to check all the supported tasks.\n\nCheck [plot](https://sfu-db.github.io/dataprep/user_guide/eda/plot.html), [plot_correlation](https://sfu-db.github.io/dataprep/user_guide/eda/plot_correlation.html), [plot_missing](https://sfu-db.github.io/dataprep/user_guide/eda/plot_missing.html) and [create_report](https://sfu-db.github.io/dataprep/user_guide/eda/create_report.html) to see how each function works.\n\n## Clean\n\nDataPrep.Clean contains about **140+** functions designed for cleaning and validating data in a DataFrame. It provides\n\n- **A Convenient GUI**: incorporated into Jupyter Notebook, users can clean their own DataFrame without any coding (see the video below).\n- **A Unified API**: each function follows the syntax `clean_{type}(df, 'column name')` (see an example below).\n- **Speed**: the computations are parallelized using Dask. 
It can clean **50K rows per second** on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).\n- **Transparency**: a report is generated that summarizes the alterations to the data that occurred during cleaning.\n\nThe following video shows how to use the GUI of DataPrep.Clean:\n\u003ca href=\"https://youtu.be/WtJaVBIVoxQ\"\u003e\u003cimg src=\"assets/clean_video_cover.png\"/\u003e\u003c/a\u003e\n\nThe following example shows how to clean and standardize a column of country names.\n\n```python\nfrom dataprep.clean import clean_country\nimport pandas as pd\ndf = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})\ndf2 = clean_country(df, 'country')\ndf2\n# Output:\n#            country  country_clean\n# 0              USA  United States\n# 1  country: Canada         Canada\n# 2              233        Estonia\n# 3              tr          Turkey\n# 4               NA            NaN\n```\n\nType validation is also supported:\n\n```python\nfrom dataprep.clean import validate_country\nseries = validate_country(df['country'])\nseries\n# Output:\n# 0     True\n# 1    False\n# 2     True\n# 3     True\n# 4    False\n# Name: country, dtype: bool\n```\n\nCheck the [documentation of DataPrep.Clean](https://docs.dataprep.ai/user_guide/clean/introduction.html) to see how each function works.\n\n## Connector\n\nConnector now supports loading data from both web APIs and databases.\n\n### Web API\n\nConnector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.\n\nConnector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.\n\nDo you want to leverage the growing number of websites that are opening their data through public APIs? 
Connector is for you!\n\nHere are several benefits that Connector offers:\n\n- **A unified API:** You can fetch data from [tens of popular websites](https://github.com/sfu-db/DataConnectorConfigs) with one or two lines of code.\n- **Auto Pagination:** Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument `_count`) without getting into unnecessary detail about a specific pagination scheme.\n- **Speed:** Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the `_concurrency` argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.\n\n#### How to fetch all publications of Andrew Y. Ng?\n\n```python\nfrom dataprep.connector import connect\n\n# Allow up to 5 concurrent requests to the DBLP API.\nconn_dblp = connect(\"dblp\", _concurrency=5)\n# Top-level `await` works in Jupyter/IPython; in a plain script, wrap this call\n# in an async function and run it with asyncio.run().\ndf = await conn_dblp.query(\"publication\", author=\"Andrew Y. Ng\", _count=2000)\n```\n\nYou can find detailed examples [here](https://github.com/sfu-db/dataprep/tree/develop/examples).\n\nConnector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple [configuration file](https://github.com/sfu-db/DataConnectorConfigs/blob/develop/CONTRIBUTING.md#add-configuration-files) to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.\n\n### Database\n\nConnector has now adopted [connectorx](https://github.com/sfu-db/connector-x) to enable loading data from databases (Postgres, MySQL, SQL Server, etc.) into Python dataframes (pandas, dask, modin, arrow, polars) in the fastest and most memory-efficient way. 
[[Benchmark]](https://github.com/sfu-db/connector-x/blob/main/Benchmark.md#benchmark-result-on-aws-r54xlarge)\n\nAll you need to do is install `connectorx` (`pip install -U connectorx`) and run one line of code:\n\n```python\nfrom dataprep.connector import read_sql\nread_sql(\"postgresql://username:password@server:port/database\", \"SELECT * FROM lineitem\")\n```\n\nCheck out [here](https://github.com/sfu-db/connector-x#supported-sources--destinations) for supported databases and dataframes and more usage examples.\n\n\n## Lineage\nA column-level lineage graph for SQL. This tool creates an interactive graph on a webpage so that you can explore the column-level lineage of your SQL.\n\n### The lineage module offers:\nA general introduction to the project can be found in this [blog post](https://medium.com/@shz1/lineagex-the-python-library-for-your-lineage-needs-5e51b77a0032).\n- **Automatic dependency creation**: When there are dependencies among the SQL files and those tables are not yet in the database, the lineage module automatically tries to find the dependency table and create it.\n- **Clean and simple but very interactive user interface**: The user interface is very simple to use, with minimal clutter on the page, while showing all of the necessary information.\n- **Variety of SQL statements**: The lineage module supports a variety of SQL statements. Aside from the typical `SELECT` statement, it also supports the `CREATE TABLE/VIEW [IF NOT EXISTS]` statement as well as the `INSERT` and `DELETE` statements.\n- **[dbt](https://docs.getdbt.com/) support**: The lineage module is also implemented in [dbt-LineageX](https://github.com/sfu-db/dbt-lineagex). It is added to a dbt project and, by using the dbt library [fal](https://github.com/fal-ai/fal), it is able to reuse the Python core and create similar output from the dbt project.\n\n### Uses and Demo\nThe interactive graph looks like this: \n\u003cimg 
src=\"https://raw.githubusercontent.com/sfu-db/lineagex/main/docs/example.gif\"/\u003e\nHere is a [live demo](https://zshandy.github.io/lineagex-demo/) with the [mimic-iv concepts_postgres](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iv/concepts_postgres) files ([navigation instructions](https://sfu-db.github.io/lineagex/output.html#how-to-navigate-the-webpage)), created with one line of code:\n```python\nfrom dataprep.lineage import lineagex\nlineagex(sql=\"path/to/sql\", target_schema=\"schema1\", conn_string=\"postgresql://username:password@server:port/database\", search_path_schema=\"schema1, public\")\n```\nCheck out more detailed usage and examples [here](https://sfu-db.github.io/lineagex/intro.html). \n\n## Documentation\n\nThe following documentation can give you an impression of what DataPrep can do:\n\n- [Connector](https://docs.dataprep.ai/user_guide/connector/introduction.html)\n- [EDA](https://docs.dataprep.ai/user_guide/eda/introduction.html)\n- [Clean](https://docs.dataprep.ai/user_guide/clean/introduction.html)\n- [Lineage](https://sfu-db.github.io/lineagex/intro.html)\n\n## Contribute\n\nThere are many ways to contribute to DataPrep.\n\n- Submit bugs and help us verify fixes as they are checked in.\n- Review the source code changes.\n- Engage with other DataPrep users and developers on StackOverflow.\n- Ask questions \u0026 propose new ideas in our [Forum].\n- [![Twitter]](https://twitter.com/dataprepai)\n- Contribute bug fixes.\n- Provide use cases and write down your user experience.\n\nPlease take a look at our [wiki] for development documentation!\n\n[build status]: https://img.shields.io/circleci/build/github/sfu-db/dataprep/master?style=flat-square\u0026token=f68e38757f5c98771f46d1c7e700f285a0b9784d\n[forum]: https://github.com/sfu-db/dataprep/discussions\n[wiki]: https://github.com/sfu-db/dataprep/wiki\n[examples]: https://github.com/sfu-db/dataprep/tree/master/examples\n[twitter]: 
https://img.shields.io/twitter/follow/dataprepai?style=social\n\n## Acknowledgement\n\nSome functionalities of DataPrep are inspired by the following packages.\n\n- [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling)\n\n  Inspired the report functionality and insights provided in `dataprep.eda`.\n\n- [missingno](https://github.com/ResidentMario/missingno)\n\n  Inspired the missing value analysis in `dataprep.eda`.\n\n## Citing DataPrep\n\nIf you use DataPrep, please consider citing the following paper:\n\nJinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. [DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical\nModeling in Python.](https://arxiv.org/abs/2104.00841) _SIGMOD 2021_.\n\nBibTeX entry:\n\n```bibtex\n@inproceedings{dataprepeda2021,\n  author    = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},\n  title     = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},\n  booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},\n  year      = {2021}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsfu-db%2Fdataprep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsfu-db%2Fdataprep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsfu-db%2Fdataprep/lists"}