{"id":44906734,"url":"https://github.com/philbudden/ingest-python","last_synced_at":"2026-05-16T17:07:03.222Z","repository":{"id":312259654,"uuid":"1040166935","full_name":"philbudden/ingest-python","owner":"philbudden","description":"A Pipeline template for ingesting datasets using Pandas","archived":false,"fork":false,"pushed_at":"2025-11-27T12:01:15.000Z","size":43,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-18T01:39:19.542Z","etag":null,"topics":["automation","ci-cd","data-engineering","data-pipeline","declarative-workflows","elt","elt-pipeline","modular-pipelines","open-source","pandas","pytest","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/philbudden.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-18T14:56:05.000Z","updated_at":"2026-01-20T09:48:50.000Z","dependencies_parsed_at":"2025-08-29T14:58:12.600Z","dependency_job_id":"5d67ea92-4950-4c27-b41f-8ad432d026f7","html_url":"https://github.com/philbudden/ingest-python","commit_stats":null,"previous_names":["data-savvy-solutions/ingest-small","n3ddu8/ingest-python","philipbudden/ingest-python","datasavvysol/ingest-python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/philbudden/ingest-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phi
lbudden%2Fingest-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philbudden%2Fingest-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philbudden%2Fingest-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philbudden%2Fingest-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/philbudden","download_url":"https://codeload.github.com/philbudden/ingest-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philbudden%2Fingest-python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33111499,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","ci-cd","data-engineering","data-pipeline","declarative-workflows","elt","elt-pipeline","modular-pipelines","open-source","pandas","pytest","python","sql"],"created_at":"2026-02-17T22:37:57.900Z","updated_at":"2026-05-16T17:07:03.218Z","avatar_url":"https://github.com/philbudden.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Contributors][contributors-shield]][contributors-url]\n[![Forks][forks-shield]][forks-url]\n[![Stargazers][stars-shield]][stars-url]\n[![Issues][issues-shield]][issues-url]\n[![License][license-shield]][license-url]\n\n# ingest-python\n\n**ingest-python** is an ingestion tool written in Python (relying only on locally installable packages such as Pandas) designed to extract data from source systems and persist it into SQL Server. It is intended for small datasets, where \"small\" refers to data volumes that do not require big data frameworks.\n\nThe tool supports ingesting datasets larger than available memory through **Pandas chunking**. It forms the **Extract and Load** portion of an ELT pipeline.\n\n---\n\n## How it Works\n\n`ingest-python` is structured around a **Base Class** that encapsulates core SQL Server operations.\n\nKey methods include:\n- **read_params**: Reads an `entity_params` table and parses parameters into a dictionary. 
See [Entity Params](#entity-params) for details.\n- **read_history**: Retrieves the maximum value of a defined `modified` field from a history table to support incremental loads.\n- **transform_data**: Aligns the source DataFrame to the target schema.\n  - Drops extra fields, adds missing fields as NULL.\n  - Adds `current_record` and `ingest_datetime` fields.\n- **write_data**: Inserts data into the target table. For incremental loads, previously active records are marked `current_record = False` when updated.\n- **write_to_history**: Logs metadata about each ingestion into a history table.\n\nEach supported source system has its own class inheriting from the Base Class.\nThese subclasses expose a single `read_data` method, which can:\n- Read all records in bulk.\n- Perform incremental loads.\n- Use chunking for large tables.\n\n---\n\n## Setup\n\n### Table Definitions\n\nEach instance requires:\n1. A **definitions** file (DDL for schemas and tables).\n2. An **entity_params** file (metadata and parameters for ingestion).\n\nAll tables must include:\n- `ingest_datetime`\n- `current_record`\n\nExample definition file (`definitions/adventureworks.py`):\n\n```python\ndef get_ddl() -\u003e dict:\n    schema = \"ods_adventureworks\"\n    definitions = {}\n    definitions[f\"{schema}_schema\"] = f\"CREATE SCHEMA {schema};\"\n    definitions[f\"{schema}_history\"] = f\"\"\"\n        CREATE TABLE [{schema}].[history](\n            [id] BIGINT NOT NULL IDENTITY(1,1) PRIMARY KEY,\n            [run_id] BIGINT NOT NULL,\n            [table_name] NVARCHAR(100) NOT NULL,\n            [start_time] DATETIME NOT NULL,\n            [end_time] DATETIME NOT NULL,\n            [time_taken] INT NOT NULL,\n            [rows_processed] INT NOT NULL,\n            [modifieddate] DATETIME NULL\n        );\n    \"\"\"\n    definitions[f\"{schema}_entity_params\"] = f\"\"\"\n        CREATE TABLE [{schema}].[entity_params](\n            [table_name] NVARCHAR(75) NOT NULL PRIMARY KEY,\n          
  [entity_name] NVARCHAR(75) NOT NULL,\n            [business_key] NVARCHAR(75) NOT NULL,\n            [modified_field] NVARCHAR(75) NULL,\n            [load_method] NVARCHAR(75) NOT NULL,\n            [chunksize] INT NULL,\n            [active] BIT NOT NULL\n        );\n    \"\"\"\n    definitions[f\"{schema}_Department\"] = f\"\"\"\n        CREATE TABLE [{schema}].[Department](\n            [DepartmentID] SMALLINT NOT NULL,\n            [Name] VARCHAR(256) NOT NULL,\n            [GroupName] VARCHAR(256) NOT NULL,\n            [ModifiedDate] DATETIME NOT NULL,\n            [ingest_datetime] DATETIME NOT NULL,\n            [current_record] BIT NOT NULL\n        );\n    \"\"\"\n    return definitions\n```\n### Entity Params\nEntity parameters drive ingestion behavior. Example (`entity_params/adventureworks_params.py`):\n```python\n# entity_params/adventureworks_params.py\ndef populate_entity_list() -\u003e dict:\n    entity_params = {\n        \"adventureworks\": \"\"\"\n            INSERT INTO [ods_adventureworks].[entity_params] (\n                table_name\n                ,entity_name\n                ,business_key\n                ,modified_field\n                ,load_method\n                ,chunksize\n                ,active\n            )\n            VALUES (\n                'Department'\n                ,'HumanResources.Department'\n                ,'DepartmentID'\n                ,'ModifiedDate'\n                ,'incremental'\n                ,NULL\n                ,1\n            )\n        ;\"\"\",\n    }\n    return entity_params\n```\n#### Parameter notes:\n- **table_name**: Target table name.\n- **entity_name**: Source system entity (e.g., schema.table).\n- **business_key**: Unique identifier (for incremental loads).\n- **modified_field**: Incrementing/change-tracking field (for incremental loads).\n- **load_method**:\n  - **incremental**: Updates only changed rows.\n  - **truncate**: Reloads the full table each run.\n- **chunksize**: Rows per 
batch (NULL = default 1M rows).\n- **active**: Enables/disables ingestion for this entity.\n\n### Adding Instances to `main.py`\nEach instance must be registered in the main.py run function so that the correct class is instantiated with the appropriate configuration values.\n```python\n# main.py\nif \"adventureworks\" in instances:\n    cls_instances[\"adventureworks\"] = cls_dict[\"DBMSClass\"](\n        {\n            \"source\": cnxns[\"adventureworks\"],\n            \"target\": cnxns[\"ods\"],\n        },\n        config[\"ods\"][\"adventureworks\"],\n    )\n```\nThis ensures that when the adventureworks instance is invoked, the job uses the correct class type along with the source and target connection details from the configuration.\n\n### Configuration File\nThe config.yaml file contains all connection and job parameters. Example:\n```yaml\n# config.yaml\nparameters:\n  log_path: \"./\"\n  sql_driver: \"ODBC Driver 18 for SQL Server\"\n\nmdh:\n  database: \"mdh\"\n  orchestration: \"orchestration\"\n\nods:\n  type: \"mssql\"\n  server: sql-server\n  port: 1433\n  uid: sa\n  pwd: YourStrong@Passw0rd\n  database: \"ods\"\n  trust_cert: True\n\n  # output schemas:\n  adventureworks: \"ods_adventureworks\"\n\ndbms:\n  adventureworks_type: \"mssql\"\n  adventureworks: sql-server\n  adventureworks_port: 1433\n  adventureworks_uid: sa\n  adventureworks_pwd: YourStrong@Passw0rd\n  adventureworks_database: \"AdventureWorks2022\"\n  adventureworks_trust_cert: True\n```\n#### Parameter Explanations\n- **parameters**: general job settings\n  - **log_path**: directory for the log file.\n  - **sql_driver**: installed ODBC driver for SQL Server.\n- **mdh**: metadata hub settings\n  - **database**: name of the metadata database (must exist in SQL Server).\n  - **orchestration**: schema within the mdh database where the orchestration history table resides. 
Both the schema and the history table must be created; see [MDH History Table](#mdh-history-table).\n- **ods**: operational data store (raw data) settings\n  - **type**: target server type (assumed to be mssql).\n  - **server**: SQL Server hostname.\n  - **port**: SQL Server port.\n  - **uid**: username.\n  - **pwd**: password.\n  - **database**: ODS database name.\n  - **trust_cert**: whether to trust the server certificate (False recommended in production).\n  - **:instance**: schema for each registered instance (e.g., adventureworks).\n- **:system-type**: source system connection parameters (repeat for each type). Example for dbms:\n  - **:instance_type**: server type (e.g., mssql), must be supported by cnxns.\n  - **:instance**: source system hostname.\n  - **:instance_port**: port of the source system.\n  - **:instance_uid**: username.\n  - **:instance_pwd**: password.\n  - **:instance_database**: source database.\n  - **:instance_trust_cert**: certificate trust flag (use False in production).\n\n### MDH History Table\nThe history table tracks each run:\n```sql\nCREATE TABLE mdh.dbo.history (\n       [run_id] [BIGINT] NOT NULL\n       ,[job] [VARCHAR](255) NOT NULL\n       ,[parent_id] [BIGINT] NULL\n       ,[dttm_started] [DATETIME] NOT NULL\n       ,[dttm_finished] [DATETIME] NULL\n       ,[time_taken] [INT] NULL\n       ,[run_status] [VARCHAR](9) NOT NULL\n);\n```\n#### Columns\n- **run_id**: unique identifier for the run (also logged in files).\n- **job**: name of the job (e.g., ingest).\n- **parent_id**: run ID of the orchestrating job; NULL indicates an orchestrator.\n- **dttm_started** / **dttm_finished**: start and finish times.\n- **time_taken**: duration in seconds.\n- **run_status**: \"succeeded\" or \"failed\" (failed if any entity fails).\n\n## Usage\n\n### Deploy\nOnce setup is complete, run `deploy` to set up the requisite tables in the target system. This only needs to be run once before the first run, or again when additional 
entities have been added to an existing source. Changes to entity_params will be overwritten each time `deploy` is run. If the definition of any other existing table is amended, that table will need to be dropped before running `deploy`:\n```shell\npython deploy.py -i *\u003cinstance\u003e\n```\nYou can pass as many instances as you require. Each instance is named after its definition file; for example, to deploy definitions/adventureworks.py, you'd run:\n```shell\npython deploy.py -i adventureworks\n```\n\n### Main\nRun main.py to ingest data:\n```shell\npython main.py -i *\u003cinstance\u003e\n```\nAs with `deploy`, you can pass as many instances as you like. The instance name is the one registered in main.py. For example, given:\n```python\nif \"adventureworks\" in instances:\n    cls_instances[\"adventureworks\"] = cls_dict[\"DBMSClass\"](\n        {\n            \"source\": cnxns[\"adventureworks\"],\n            \"target\": cnxns[\"ods\"],\n        },\n        config[\"ods\"][\"adventureworks\"],\n    )\n```\nyou'd call:\n```shell\npython main.py -i adventureworks\n```\nFor consistency, it's suggested you use the same name as the definition.\n\n## After each run\n- The mdh history table logs each run.\n- Instance-level history tables log changes only when records are ingested.\n\n## Error Handling\n- A run may fail gracefully if an entity or instance cannot be processed.\n  - Other entities continue running, and logs capture the failure.\n  - This prevents one bad entity from stopping the full job.\n- A catastrophic failure occurs if main.py itself raises an unhandled error.\n  - In this case, the mdh history table may still list the job as “running”.\n  - Manual correction may be required.\n\n## Logging\n- Logs are written to the configured log_path.\n- After each run, review the logs if data is missing or unexpected.\n\n\u003c!-- MARKDOWN LINKS \u0026 IMAGES --\u003e\n\u003c!-- https://www.markdownguide.org/basic-syntax/#reference-style-links --\u003e\n[contributors-shield]: 
https://img.shields.io/github/contributors/philipbudden/ingest-python.svg?style=for-the-badge\n[contributors-url]: https://github.com/philipbudden/ingest-python/graphs/contributors\n[forks-shield]: https://img.shields.io/github/forks/philipbudden/ingest-python.svg?style=for-the-badge\n[forks-url]: https://github.com/philipbudden/ingest-python/network/members\n[stars-shield]: https://img.shields.io/github/stars/philipbudden/ingest-python.svg?style=for-the-badge\n[stars-url]: https://github.com/philipbudden/ingest-python/stargazers\n[issues-shield]: https://img.shields.io/github/issues/philipbudden/ingest-python.svg?style=for-the-badge\n[issues-url]: https://github.com/philipbudden/ingest-python/issues\n[license-shield]: https://img.shields.io/github/license/philipbudden/ingest-python.svg?style=for-the-badge\n[license-url]: https://github.com/philipbudden/ingest-python/blob/main/LICENSE\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilbudden%2Fingest-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphilbudden%2Fingest-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilbudden%2Fingest-python/lists"}