{"id":21844077,"url":"https://github.com/altescy/pdpcli","last_synced_at":"2025-04-14T12:11:00.571Z","repository":{"id":45116192,"uuid":"340586589","full_name":"altescy/pdpcli","owner":"altescy","description":"🐾 PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.","archived":false,"fork":false,"pushed_at":"2023-10-13T14:49:06.000Z","size":624,"stargazers_count":15,"open_issues_count":2,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-28T01:14:29.276Z","etag":null,"topics":["cli","csv","pandas","python"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/pdpcli/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/altescy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-02-20T07:02:39.000Z","updated_at":"2024-01-23T16:02:04.000Z","dependencies_parsed_at":"2022-08-26T10:41:27.870Z","dependency_job_id":null,"html_url":"https://github.com/altescy/pdpcli","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altescy%2Fpdpcli","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altescy%2Fpdpcli/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altescy%2Fpdpcli/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altescy%2Fpdpcli/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/altescy","download_url":"https://codeload.github.com/altescy/pdpcli/tar.gz/refs/heads/mai
n","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248877958,"owners_count":21176244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","csv","pandas","python"],"created_at":"2024-11-27T22:18:23.208Z","updated_at":"2025-04-14T12:11:00.548Z","avatar_url":"https://github.com/altescy.png","language":"Python","readme":"PdpCLI\n======\n\n[![Actions Status](https://github.com/altescy/pdpcli/workflows/CI/badge.svg)](https://github.com/altescy/pdpcli/actions?query=workflow%3ACI)\n[![Python version](https://img.shields.io/pypi/pyversions/pdpcli)](https://github.com/altescy/pdpcli)\n[![PyPI version](https://img.shields.io/pypi/v/pdpcli)](https://pypi.org/project/pdpcli/)\n[![License](https://img.shields.io/github/license/altescy/pdpcli)](https://github.com/altescy/pdpcli/blob/master/LICENSE)\n\n### Quick Links\n\n- [Introduction](#Introduction)\n- [Installation](#Installation)\n- [Tutorial](#Tutorial)\n  - [Basic Usage](#basic-usage)\n  - [Data Reader / Writer](#data-reader--writer)\n  - [Plugins](#plugins)\n\n\n## Introduction\n\nPdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by [pdpipe](https://pdpipe.github.io/pdpipe/) from a configuration file. 
You can also extend pipeline stages and data readers / writers by using your own Python scripts.\n\n### Features\n  - Process pandas DataFrames from the CLI without writing Python scripts\n  - Support multiple configuration file formats: YAML, JSON, Jsonnet\n  - Read / write data files in the following formats: CSV, TSV, JSON, JSONL, pickled DataFrame\n  - Import / export data with multiple protocols: S3 / Database (MySQL, Postgres, SQLite, ...) / HTTP(S)\n  - Extensible pipeline and data readers / writers\n\n\n## Installation\n\nInstalling the library is simple using pip.\n```\n$ pip install \"pdpcli[all]\"\n```\n\n\n## Tutorial\n\n### Basic Usage\n\n1. Write a pipeline config file `config.yml` like below. The `type` fields under `pipeline` correspond to the snake-cased class names of the [`PdpipelineStages`](https://pdpipe.github.io/pdpipe/doc/pdpipe/#types-of-pipeline-stages). Other fields such as `stage` and `columns` are the parameters of the `__init__` methods of the corresponding classes. Internally, this configuration file is converted to Python objects by [`colt`](https://github.com/altescy/colt).\n\n```yaml\npipeline:\n  type: pipeline\n  stages:\n    drop_columns:\n      type: col_drop\n      columns:\n        - name\n        - job\n\n    encode:\n      type: one_hot_encode\n      columns: sex\n\n    tokenize:\n      type: tokenize_text\n      columns: content\n\n    vectorize:\n      type: tfidf_vectorize_token_lists\n      column: content\n      max_features: 10\n```\n\n2. Build a pipeline by training on `train.csv`. The following command generates a pickled pipeline file `pipeline.pkl` after training. If you specify a URL instead of a file path, the file will be automatically downloaded and cached.\n```\n$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv\n```\n\n3. Apply the fitted pipeline to `test.csv` and write the processed output to `processed_test.jsonl` with the following command. 
PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame will be exported in the JSON Lines format.\n```\n$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl\n```\n\n4. You can also run the pipeline directly from a config file without fitting the pipeline.\n```\n$ pdp apply config.yml test.csv --output-file processed_test.jsonl\n```\n\n5. It is possible to override or add parameters by adding command line arguments:\n```\npdp apply config.yml test.csv pipeline.stages.drop_columns.column=name\n```\n\n### Data Reader / Writer\n\nPdpCLI automatically detects a suitable data reader / writer based on a given file name.\nIf you need to use a different data reader / writer, add a `reader` or `writer` config to `config.yml`.\nThe following config is an example that uses the SQL data reader.\nThe SQL reader fetches records from the specified database and converts them into a pandas DataFrame.\n```yaml\nreader:\n    type: sql\n    dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.postgres.server/your_database\n```\nConfig files are interpreted by [OmegaConf](https://omegaconf.readthedocs.io/), so `${env:...}` is interpolated from environment variables.\n\nPrepare your SQL file `query.sql` to fetch data from the database:\n```sql\nselect * from your_table limit 1000\n```\n\nYou can execute the pipeline with the SQL data reader via:\n```\n$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql\n```\n\n\n### Plugins\n\nPlugins let you extend PdpCLI with your own pipeline stages, data readers / writers, and commands.\n\n#### Add a new stage\n\n1. Write your plugin script `mypdp.py` like below. 
`Stage.register(\"\u003cstage-name\u003e\")` registers your pipeline stages, and you can specify these stages by writing `type: \u003cstage-name\u003e` in your config file.\n```python\nimport pdpcli\n\n@pdpcli.Stage.register(\"print\")\nclass PrintStage(pdpcli.Stage):\n    def _prec(self, df):\n        return True\n\n    def _transform(self, df, verbose):\n        print(df.to_string(index=False))\n        return df\n```\n\n2. Update `config.yml` to use your plugin.\n```yml\npipeline:\n    type: pipeline\n    stages:\n        drop_columns:\n        ...\n\n        print:\n            type: print\n\n        encode:\n        ...\n```\n\n2. Execute command with `--module mypdp` and you can see the processed DataFrame after running `drop_columns`.\n```\n$ pdp apply config.yml test.csv --module mypdp\n```\n\n#### Add a new command\n\nYou can also add new commands not only stages.\n\n1. Add the following script to `mypdp.py`. This `greet` command prints out a greeting message with your name.\n```python\n@pdpcli.Subcommand.register(\n    name=\"greet\",\n    description=\"say hello\",\n    help=\"say hello\",\n)\nclass GreetCommand(pdpcli.Subcommand):\n    requires_plugins = False\n\n    def set_arguments(self):\n        self.parser.add_argument(\"--name\", default=\"world\")\n\n    def run(self, args):\n        print(f\"Hello, {args.name}!\")\n\n```\n\n2. To register this command, you need to create the `.pdpcli_plugins` file in which module names are listed for each line. Due to module importing order, the `--module` option is unavailable for command registration.\n```\n$ echo \"mypdp\" \u003e .pdpcli_plugins\n```\n\n3. Run the following command and get a message like below. 
By using the `.pdpcli_plugins` file, you do not need to add the `--module` option to the command line for each execution.\n```\n$ pdp greet --name altescy\nHello, altescy!\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faltescy%2Fpdpcli","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faltescy%2Fpdpcli","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faltescy%2Fpdpcli/lists"}