{"id":25359555,"url":"https://github.com/monarch-initiative/koza","last_synced_at":"2026-05-07T01:13:39.976Z","repository":{"id":37159635,"uuid":"322350487","full_name":"monarch-initiative/koza","owner":"monarch-initiative","description":"Data transformation framework for LinkML data models","archived":false,"fork":false,"pushed_at":"2025-02-28T19:05:05.000Z","size":4699,"stargazers_count":50,"open_issues_count":43,"forks_count":5,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-03-30T02:03:43.919Z","etag":null,"topics":["etl","knowledge-graph","koza","linkml","monarchinitiative","obofoundry","ontology"],"latest_commit_sha":null,"homepage":"https://koza.monarchinitiative.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/monarch-initiative.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-17T16:25:47.000Z","updated_at":"2025-03-22T07:28:41.000Z","dependencies_parsed_at":"2023-01-30T16:35:04.859Z","dependency_job_id":"a48c8e5b-1bf8-4fc9-95ba-ba5f4cbeb186","html_url":"https://github.com/monarch-initiative/koza","commit_stats":{"total_commits":224,"total_committers":7,"mean_commits":32.0,"dds":0.5357142857142857,"last_synced_commit":"d081fff53594243a0f39b84fc702d38c493553e7"},"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fkoza","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fkoza/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fkoza/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monarch-initiative%2Fkoza/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/monarch-initiative","download_url":"https://codeload.github.com/monarch-initiative/koza/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247427005,"owners_count":20937200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etl","knowledge-graph","koza","linkml","monarchinitiative","obofoundry","ontology"],"created_at":"2025-02-14T21:06:27.603Z","updated_at":"2026-05-07T01:13:39.962Z","avatar_url":"https://github.com/monarch-initiative.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Koza - Knowledge Graph Transformation and Operations Toolkit\n\n[![Pyversions](https://img.shields.io/pypi/pyversions/koza.svg)](https://pypi.python.org/pypi/koza)\n[![PyPi](https://img.shields.io/pypi/v/koza.svg)](https://pypi.python.org/pypi/koza)\n![Github Action](https://github.com/monarch-initiative/koza/actions/workflows/test.yaml/badge.svg)\n\n![pupa](docs/img/pupa.png)  \n\n[**Documentation**](https://koza.monarchinitiative.org/)\n\n_Disclaimer_: Koza is in beta - we are looking for testers!\n\n## Overview\n\nKoza is a Python library and CLI tool for transforming biomedical data and performing graph operations on Knowledge Graph Exchange (KGX) files. 
It provides two main capabilities:\n\n### 📊 **Graph Operations** (New!)\nPowerful DuckDB-based operations for KGX knowledge graphs:\n- **Join** multiple KGX files with schema harmonization\n- **Split** files by field values with format conversion\n- **Prune** dangling edges and handle singleton nodes\n- **Append** new data to existing databases with schema evolution\n- **Multi-format support** for TSV, JSONL, and Parquet files\n\n### 🔄 **Data Transformation** (Core)\nTransform biomedical data sources into KGX format:\n- Transform CSV, JSON, YAML, JSONL, and XML to target formats\n- Output in [KGX format](https://github.com/biolink/kgx/blob/master/specification/kgx-format.md#kgx-format-as-tsv)\n- Write data transforms in semi-declarative Python\n- Configure source files, columns/properties, and metadata in YAML\n- Create mapping files and translation tables between vocabularies\n\n## Installation\nKoza is available on PyPI and can be installed via pip/pipx:\n```\n[pip|pipx] install koza\n```\n\n## Usage\n\n### Quick Start with Graph Operations\n\nKoza's graph operations work seamlessly across multiple KGX formats (TSV, JSONL, Parquet):\n\n```bash\n# Join multiple KGX files into a unified database\nkoza join --nodes genes.tsv pathways.jsonl --edges interactions.parquet --output merged_graph.duckdb\n\n# Prune dangling edges and handle singleton nodes\nkoza prune --database merged_graph.duckdb --keep-singletons\n\n# Append new data to existing database with schema evolution\nkoza append --database merged_graph.duckdb --nodes new_genes.tsv --edges new_interactions.jsonl\n\n# Split database by source with format conversion\nkoza split --database merged_graph.duckdb --split-on provided_by --output-format parquet\n```\n\n**NOTE: As of version 0.2.0, there is a new method for getting your ingest's `KozaApp` instance. 
Please see the [updated documentation](https://koza.monarchinitiative.org/Usage/configuring_ingests/#transform-code) for details.**\n\nSee the [Koza documentation](https://koza.monarchinitiative.org/) for complete usage information.\n\n### Examples\n\n#### Validate\n\nGive Koza a local or remote CSV file to get some basic information about it (headers, number of rows):\n\n```bash\nkoza validate \\\n  --file https://raw.githubusercontent.com/monarch-initiative/koza/main/examples/data/string.tsv \\\n  --delimiter ' '\n```\n\nPassing a JSON or JSONL file confirms whether the file is valid JSON or JSONL:\n\n```bash\nkoza validate \\\n  --file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \\\n  --format jsonl\n```\n\n```bash\nkoza validate \\\n  --file ./examples/data/ddpheno.json.gz \\\n  --format json\n```\n\n#### Transform\n\nRun the example ingest, \"string/protein-links-detailed\":\n```bash\nkoza transform \\\n  --source examples/string/protein-links-detailed.yaml \\\n  --global-table examples/translation_table.yaml\n\nkoza transform \\\n  --source examples/string-declarative/protein-links-detailed.yaml \\\n  --global-table examples/translation_table.yaml\n```\n\n**Note**: \n  Koza expects a directory structure as described in the above example  \n  with the source config file and transform code in the same directory: \n  ```\n  .\n  ├── ...\n  │   ├── your_source\n  │   │   ├── your_ingest.yaml\n  │   │   └── your_ingest.py\n  │   └── some_translation_table.yaml\n  └── ...\n  ```\n\n#### Graph Operations\n\nCreate and manipulate knowledge graphs from existing KGX files:\n\n```bash\n# Join heterogeneous KGX files with automatic schema harmonization\nkoza join \\\n  --nodes genes.tsv proteins.jsonl pathways.parquet \\\n  --edges gene_protein.tsv protein_pathway.jsonl \\\n  --output unified_graph.duckdb \\\n  --schema-report\n\n# Clean up graph integrity issues\nkoza prune \\\n  --database unified_graph.duckdb \\\n  --keep-singletons \\\n  --dry-run  # Preview changes 
before applying\n\n# Incrementally add new data with schema evolution\nkoza append \\\n  --database unified_graph.duckdb \\\n  --nodes new_genes.tsv updated_pathways.jsonl \\\n  --deduplicate \\\n  --show-progress\n\n# Export subsets with format conversion\nkoza split \\\n  --database unified_graph.duckdb \\\n  --split-on provided_by \\\n  --output-format parquet \\\n  --output-dir ./split_graphs\n```\n\n## Key Features\n\n### 🔧 **Multi-Format Support**\n- Native support for TSV, JSONL, and Parquet KGX files\n- Automatic format detection and conversion\n- Mixed-format operations in single commands\n\n### 🛡️ **Schema Flexibility**\n- Automatic schema harmonization across heterogeneous files\n- Schema evolution with backward compatibility  \n- Comprehensive schema reporting and validation\n\n### ⚡ **High Performance**\n- DuckDB-powered operations for fast bulk processing\n- Memory-efficient handling of large knowledge graphs\n- Parallel processing and streaming where possible\n\n### 🔍 **Rich CLI Experience**\n- Progress indicators for long-running operations\n- Detailed statistics and operation summaries\n- Dry-run modes for safe operation preview\n\n### 🧹 **Data Integrity**\n- Dangling edge detection and preservation\n- Duplicate detection and removal strategies\n- Non-destructive operations with data archiving","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmonarch-initiative%2Fkoza","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmonarch-initiative%2Fkoza","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmonarch-initiative%2Fkoza/lists"}