{"id":28690982,"url":"https://github.com/tiagohcalves/datapipe","last_synced_at":"2025-06-28T16:41:01.918Z","repository":{"id":62566695,"uuid":"126476654","full_name":"tiagohcalves/datapipe","owner":"tiagohcalves","description":"Pipeline API for manipulating dataframes","archived":false,"fork":false,"pushed_at":"2018-11-22T15:51:32.000Z","size":11746,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-25T11:06:54.378Z","etag":null,"topics":["data-pipe","data-science","etl","fluid","machine-learning","manipulating-dataframes","pandas-dataframe","pipeline","pipeline-api"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tiagohcalves.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-23T11:34:17.000Z","updated_at":"2018-11-22T15:51:34.000Z","dependencies_parsed_at":"2022-11-03T16:16:04.606Z","dependency_job_id":null,"html_url":"https://github.com/tiagohcalves/datapipe","commit_stats":null,"previous_names":["tiagohcalves/data-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tiagohcalves/datapipe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiagohcalves%2Fdatapipe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiagohcalves%2Fdatapipe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiagohcalves%2Fdatapipe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiagohcalves%2Fdatapipe/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/tiagohcalves","download_url":"https://codeload.github.com/tiagohcalves/datapipe/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tiagohcalves%2Fdatapipe/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261897065,"owners_count":23226653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-pipe","data-science","etl","fluid","machine-learning","manipulating-dataframes","pandas-dataframe","pipeline","pipeline-api"],"created_at":"2025-06-14T06:33:35.973Z","updated_at":"2025-06-28T16:41:01.881Z","avatar_url":"https://github.com/tiagohcalves.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Pipe ML\nPipeline API to manipulate dataframes for machine learning.\n\nData Pipe is a framework that wraps Pandas Data Frames to provide a more fluid method to manipulate data. \n\nBasic concepts:\n- Every operation is performed in place. The Data Pipe object keeps one and only one reference to a pandas Data Frame that is constantly updated. \n- ‎Every operation returns a reference to self, which allows chaining methods fluidly. \n- Every method called is recorded internally to provide improved reproducibility and understanding of the preparation pipeline. The exception is the \"load\" method.\n- ‎Data Pipe calls of unimplemented methods default to the internal Data Frame object. 
This allows quickly accessing some methods, such as shape and head, but please be aware that those calls are not recorded and do not return Data Pipe objects. If it's necessary to use an unimplemented function, please use the Update method to keep manipulating the Data Pipe. \n\n## Installation\n\nYou can install DataPipeML directly from PyPI:\n\n`pip install datapipeml`\n\nOr from source:\n\n```\ngit clone https://github.com/tiagohcalves/datapipe.git\ncd datapipe\npip install .\n```\n### Dependencies\n\nDataPipeML has the following requirements:\n\n* [Pandas](https://github.com/pandas-dev/pandas): 0.22 or higher\n* [Sklearn](http://scikit-learn.org/stable/): 0.19.1 or higher\n\nOlder versions might work but are untested.\n\n### Testing\n\nTo run the unit tests, we recommend [Nose](http://nose.readthedocs.io/en/latest/). Just run:\n\n```\ncd datapipe/datapipeml/tests/\nnosetests test_pipeline.py\n\n..........................\n----------------------------------------------------------------------\nRan 26 tests in 0.237s\n\nOK\n```\n## Example\n\n### Full pipeline with time split\n```\n\u003e\u003e\u003e from datapipeml import DataPipe\n\n\u003e\u003e\u003e train_dp, test_dp = (\n\u003e\u003e\u003e     DataPipe.load(\"data/kiva_loans_sample.csv.gz\")\n\u003e\u003e\u003e             .anonymize(\"id\")\n\u003e\u003e\u003e             .set_index(\"id\")\n\u003e\u003e\u003e             .drop(\"tags\")\n\u003e\u003e\u003e             .drop_sparse()\n\u003e\u003e\u003e             .drop_duplicates()\n\u003e\u003e\u003e             .fill_null()\n\u003e\u003e\u003e             .remove_outliers()\n\u003e\u003e\u003e             .normalize()\n\u003e\u003e\u003e             .set_one_hot()\n\u003e\u003e\u003e             .split_train_test(by=\"date\")\n\u003e\u003e\u003e     )\n\nAnonymizing id\nNo sparse columns to drop\nFound 0 duplicated rows\nFillings columns ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nRemoving outliers from 
['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nNormalizing ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nEncoding columns ['activity', 'sector', 'country_code', 'country', 'currency', 'repayment_interval']\n        \n\u003e\u003e\u003e train_dp.keep_numerics()\n\u003e\u003e\u003e test_dp.keep_numerics()\n\nDropping columns {'region', 'posted_time', 'date', 'funded_time', 'borrower_genders', 'disbursed_time', 'use'}\nDropping columns {'region', 'posted_time', 'date', 'funded_time', 'borrower_genders', 'disbursed_time', 'use'}\n\n\u003e\u003e\u003e print(train_dp.summary())\n___________________________________________________________|\nMethod Name        |Args               |Kwargs             |\n___________________________________________________________|\nanonymize          |('id',)            |{}                 |\nset_index          |('id',)            |{}                 |\ndrop               |('tags',)          |{}                 |\ndrop_sparse        |()                 |{}                 |\ndrop_duplicates    |()                 |{}                 |\nfill_null          |()                 |{}                 |\nremove_outliers    |()                 |{}                 |\nnormalize          |()                 |{}                 |\nset_one_hot        |()                 |{}                 |\nsplit_train_test   |()                 |{'by': 'date'}     |\nkeep_numerics      |()                 |{}                 |\n___________________________________________________________|\n```\n\n### Create target column and stratified folds\n```\n\u003e\u003e\u003e folds = (\n\u003e\u003e\u003e     DataPipe.load(\"data/kiva_loans_sample.csv.gz\")\n\u003e\u003e\u003e             .set_index(\"id\")\n\u003e\u003e\u003e             .drop_duplicates()\n\u003e\u003e\u003e             .fill_null()\n\u003e\u003e\u003e             .remove_outliers()\n\u003e\u003e\u003e             
.normalize()\n\u003e\u003e\u003e             .set_one_hot()\n\u003e\u003e\u003e             .create_column(\"high_loan\", lambda x: 1 if x[\"loan_amount\"] \u003e 2000 else 0)\n\u003e\u003e\u003e             .keep_numerics()\n\u003e\u003e\u003e             .create_folds(stratify_by=\"high_loan\")\n\u003e\u003e\u003e     )\n        \nFound 0 duplicated rows\nFillings columns ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nRemoving outliers from ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nNormalizing ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nOne-hot encoding columns ['activity', 'sector', 'country_code', 'country', 'currency', 'borrower_genders', 'repayment_interval']\nCreating column high_loan\nDropping columns {'tags', 'funded_time', 'disbursed_time', 'region', 'use', 'posted_time', 'date'}\n```\n\n## Additional Features\n\n### Checkpoint\n\nWhen instantiating a new DataPipe object, or through the method `enable_checkpoint`, one can specify a path where a copy of the data is saved before each function call. This makes it possible to inspect the data at each step of the pipeline and to create backups, but keep in mind that this is a costly feature, both in execution time (dumping to disk is slow) and in space required (a new file is created for each function call). However, through the methods `enable_checkpoint` and `disable_checkpoint`, this feature can be activated only for critical steps in the pipeline.\n\n### Access to underlying DataFrame\n\nEvery function not implemented in the DataPipe class will be forwarded to the underlying DataFrame object. This means that one can use all methods present on the pandas DataFrame, but beware of the return type, since it will probably break the pipeline execution. For example, `dp.head(100)` will execute the `head` function on the DataFrame and display the corresponding result. 
For functions that return new DataFrame instances (e.g., the `transpose` function), we recommend the `transform` function, as it is applied to the underlying DataFrame and keeps the DataPipe reference. If any need is not fulfilled by the methods above, it is also possible to access the DataFrame through the `._df` property. \n\n## List of methods\n\nAll implemented methods are listed here. Full documentation is available in the source script of DataPipe.\n\n```\n__init__(self, data=None, verbose: bool = True, parent_pipe = None, force_types: bool = True, checkpoint: str = None, **kwargs)\n\nload(filename, **kwargs)\nsave(self, filename)\ntransform(self, func)\ncast_types(self, type_map: dict)\nset_index(self, columns: list)\nselect(self, query: str)\nsample(self, size: float = 0.1, seed: int = 0, inplace=False)\ndrop(self, columns: list)\nkeep(self, columns: list)\nkeep_numerics(self)\ndrop_sparse(self, threshold: float = 0.05)\ndrop_duplicates(self, key: str = \"\",  keep='first')\nfill_null(self, columns=None, value=\"mean\")\nremove_outliers(self, columns=None, threshold: float = 2.0, fill_value = \"mean\")\nnormalize(self, columns=None, axis: int = 0, norm: str = \"l2\")\nanonymize(self, columns, keys=None, update=True, missing=-1)\nset_one_hot(self, columns=None, limit: int = 100, with_frequency: bool = True, keep_columns: bool = False, update=True)\ncreate_column(self, column_name: str, func)\nsplit_train_test(self, by: str = \"\", size: float = 0.8, seed: int = 0)\ncreate_folds(self, n_folds: int = 5, stratify_by: str = \"\", seed: int = 0, return_iterator: bool = True)\nsummary(self, line_width = 60, with_args:bool = True)\ndisable_checkpoint(self)\nenable_checkpoint(self, path: str = 
\"\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiagohcalves%2Fdatapipe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiagohcalves%2Fdatapipe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiagohcalves%2Fdatapipe/lists"}