{"id":21958097,"url":"https://github.com/lisad/phaser","last_synced_at":"2026-02-27T06:19:12.920Z","repository":{"id":221894024,"uuid":"733685844","full_name":"lisad/phaser","owner":"lisad","description":"The missing layer for complex data batch integration pipelines","archived":false,"fork":false,"pushed_at":"2025-02-25T22:05:45.000Z","size":561,"stargazers_count":13,"open_issues_count":29,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-23T16:24:07.008Z","etag":null,"topics":["data","data-integration","etl","etl-pipeline"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lisad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-19T22:28:10.000Z","updated_at":"2025-03-13T03:48:36.000Z","dependencies_parsed_at":"2024-04-13T18:47:25.167Z","dependency_job_id":"edae2fd5-cd6d-40f2-b224-b0080e55a1d2","html_url":"https://github.com/lisad/phaser","commit_stats":null,"previous_names":["lisad/phaser"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lisad%2Fphaser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lisad","download_url":"ht
tps://codeload.github.com/lisad/phaser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250468582,"owners_count":21435511,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-integration","etl","etl-pipeline"],"created_at":"2024-11-29T08:59:40.997Z","updated_at":"2026-02-27T06:19:12.871Z","avatar_url":"https://github.com/lisad.png","language":"Python","readme":"# phaser\n\nA library to simplify automated batch-oriented complex data integration pipelines, by \norganizing steps and column definitions into phases, and offering utilities for \ntransforms, sorting, validating and viewing changes to data. \n\n[![0 dependencies!](https://0dependencies.dev/0dependencies.svg)](https://0dependencies.dev)\n\n## Goals and Scope\n\nThis library is designed to help developers run a series of steps on _batch-oriented_,\n_record-oriented_, un-indexed data.  A batch of record-oriented data is a set of records\nthat are intended to be processed together, in which each record has more or less the same\nfields and those fields are the same type across records.  Often record-oriented data can\nbe expressed in CSV files, where the first line contains the column names.\n\nRecord-oriented data can be stored or expressed in various formats and objects including:\n\n* CSV files\n* Excel files\n* Pandas dataframes\n* JSON files, provided the JSON format is in record format (a list of dicts)\n\nIn this project, record consistency is somewhat forgiving.  
The library does not insist that\neach record must have a value for every column.  Some records may not have some fields, i.e. 'sparse' data.\nSparse data may sometimes be represented in a format that isn't columnar\n(a JSON file in 'record' format might easily contain records in which only fields with values are listed).  Sparse\nrecord-oriented data should be trivial to handle in this library, although by default checkpoint\ndata will be saved in a columnar CSV that shows all the null values.\n\nThe goals of Phaser are to offer an opinionated framework for complex data pipelines with a structure that\n\n* shortens the loop on debugging where a record has the wrong data or a step is failing\n* empowers teams to work on the same code rather than only one assigned owner/expert\n* makes refactoring and extending data integration code easier\n* reduces error rates\n\nThe mechanisms we think will help phaser meet these goals are to:\n\n* make it easy to start using phaser without changing everything\n* provide defaults and tools that support shortened-loop debugging\n* encourage code organized in very testable steps and testable phases, via sample code and useful features\n* make it easy to add complexity over time and move gracefully from default to custom behaviour\n* make high-level code readable in one place, as when a Phase lists all of its steps declaratively\n* provide tools that support visibility and control over warnings and data changes\n\n## Simple example\n\nLogic for transforming or testing data is broken into steps. A developer-defined step can operate on a \nsingle record at a time and employ simple logic to combine values, transform values or add values to the record:\n\n```python\nfrom phaser import row_step\n\n@row_step\ndef combine_full_name(row, **kwargs):\n    row[\"Full name\"] = f\"{row['First name']} {row['Last name']}\"\n    return row\n```\n\nSteps are then combined into phases and pipelines.  
Developer-defined steps can be used along with\nsteps built into phaser like 'check_unique', 'sort_by' and 'filter_rows'.  Single-column logic and typing can be \nachieved with built-in Column definitions like IntColumn, FloatColumn and DateColumn.  Common tasks like \nrenaming columns and reformatting values are made particularly easy.\n\nThe following example Pipeline combines columns and steps organized into two Phases. It includes\nseveral renamed columns and values that are validated (blank=False, required=True, min_value=0.01...), four\ndeveloper-defined steps and one built-in step ('check_unique').\n\n```python\nfrom phaser import Phase, Column, FloatColumn, Pipeline, check_unique\n\nclass Validator(Phase):\n    columns = [\n        Column(name=\"Employee ID\", rename=\"employeeNumber\"),\n        Column(name=\"First name\", rename=\"firstName\"),\n        Column(name=\"Last name\", rename=\"lastName\", blank=False),\n        FloatColumn(name=\"Pay rate\", min_value=0.01, rename=\"payRate\", required=True),\n        Column(name=\"Pay type\",\n               rename=\"payType\",\n               allowed_values=[\"hourly\", \"salary\", \"exception hourly\", \"monthly\", \"weekly\", \"daily\"],\n               on_error=Pipeline.ON_ERROR_DROP_ROW,\n               save=False),\n        Column(name=\"Pay period\", rename=\"paidPer\")\n    ]\n    steps = [\n        drop_rows_with_no_id_and_not_employed,\n        check_unique(\"Employee ID\")\n    ]\n\n\nclass Transformer(Phase):\n    columns = [\n        FloatColumn(name='Pay rate'),\n        FloatColumn(name=\"bonusAmount\")\n    ]\n    steps = [\n        combine_full_name,\n        calculate_annual_salary,\n        calculate_bonus_percent\n    ]\n\n\nclass EmployeeReviewPipeline(Pipeline):\n\n    phases = [Validator, Transformer]\n```\n\nThe full example (including steps not shown here) can be found in the tests directory of the project, \nalong with sample data.\n\nThe benefit of even such a simple pipeline expressed as two 
phases is that each phase can be debugged, tested and\nrun separately. A developer can run the Validator phase once, then work on adding features to the Transformer phase,\nor narrow down an error in production by comparing the checkpoint output of each phase.  In addition, the code\nis readable and supports team collaboration.\n\nPhaser comes with table-sensitive diff tooling to make it very easy to develop and debug phases.  The output\nof the diff tool looks like this\nwhen viewing the pipeline results above operating on one of phaser's test fixture files:\n\n![Diff in table format with colored highlighting](https://github.com/lisad/phaser/blob/main/docs/diff-example.png?raw=true)\n\n## Advanced Example\n\nFor a real, working advanced example, see the [phaser-example](https://github.com/lisad/phaser-example) repository on GitHub.\nYou should be able to clone that repository, fetch the Boston and Seattle bike trail bike sensor data,\nand run the pipelines on the source data to get the data in a consistent format.\n\nThe pipelines in the phaser-example project demonstrate these features:\n\n* Columns that get renamed\n* Columns with allowed_values (enumerated types)\n* Dropping columns\n* Dropping many rows (without creating many warnings)\n* Sorting\n* Adding columns\n* Using pandas 'aggregate' method to sum values distributed across rows, within a phaser step\n* Pivoting the data by timestamp column into long row-per-timestamp data format\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flisad%2Fphaser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flisad%2Fphaser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flisad%2Fphaser/lists"}