{"id":45021162,"url":"https://github.com/godalida/koala-diff","last_synced_at":"2026-02-25T22:07:13.636Z","repository":{"id":338264926,"uuid":"1157219021","full_name":"godalida/koala-diff","owner":"godalida","description":"Blazingly fast data comparison tool for Python, powered by Rust. Compare massive CSV/Parquet datasets instantly.","archived":false,"fork":false,"pushed_at":"2026-02-16T14:23:20.000Z","size":2055,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-20T04:54:37.101Z","etag":null,"topics":["csv","data-engineering","data-quality","diff","high-performance","parquet","polars","python","rust","simd"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/godalida.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-13T15:13:49.000Z","updated_at":"2026-02-16T14:22:31.000Z","dependencies_parsed_at":"2026-02-20T03:01:03.242Z","dependency_job_id":null,"html_url":"https://github.com/godalida/koala-diff","commit_stats":null,"previous_names":["godalida/koala-diff"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/godalida/koala-diff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/godalida%2Fkoala-diff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/godalida%2Fkoala-diff/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/godalida%2Fkoala-diff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/godalida%2Fkoala-diff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/godalida","download_url":"https://codeload.github.com/godalida/koala-diff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/godalida%2Fkoala-diff/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29842997,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-25T21:18:31.832Z","status":"ssl_error","status_checked_at":"2026-02-25T21:18:29.265Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","data-engineering","data-quality","diff","high-performance","parquet","polars","python","rust","simd"],"created_at":"2026-02-19T02:30:20.433Z","updated_at":"2026-02-25T22:07:13.615Z","avatar_url":"https://github.com/godalida.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/godalida/koala-diff/main/assets/logo.png\" alt=\"Koala Diff Logo\" width=\"200\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eKoala Diff\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eBlazingly Fast Data Comparison for the 
Modern Stack.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/godalida/koala-diff/main/assets/report_hero.png\" alt=\"Koala Diff Report Hero\" width=\"800\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/koala-diff/\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/v/koala-diff?color=007FFF\" alt=\"PyPI\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/projects/koala-diff\"\u003e\n    \u003cimg src=\"https://static.pepy.tech/personalized-badge/koala-diff?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=grey\u0026right_color=BLUE\u0026left_text=downloads\" alt=\"PyPI Downloads\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/godalida/koala-diff/actions\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/actions/workflow/status/godalida/koala-diff/CI.yml?branch=main\" alt=\"Tests\"\u003e\n  \u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/pypi/pyversions/koala-diff?color=6e42c1\" alt=\"Python Versions\"\u003e\n  \u003ca href=\"https://github.com/godalida/koala-diff/blob/main/LICENSE\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/godalida/koala-diff?color=white\" alt=\"License\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#-quick-start\"\u003e🚀 Quickstart\u003c/a\u003e |\n  \u003ca href=\"https://github.com/godalida/koala-diff/issues\"\u003e🚩 Issues\u003c/a\u003e |\n  \u003ca href=\"#-the-magic-benchmark\"\u003e📊 Benchmarks\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n**Koala Diff** is the \"git diff\" for your data lake. It compares massive datasets (CSV, Parquet, JSON) instantly to find added, removed, and modified rows.\n\nBuilt in **Rust** 🦀 for speed, wrapped in **Python** 🐍 for ease-of-use. 
It streams data to compare datasets larger than RAM and generates beautiful HTML reports.\n\n### 🚀 Why Koala Diff?\n\n*   **Zero-Copy Streaming:** Compare 100GB files on a laptop without exhausting RAM.\n*   **Rust-Powered Analytics:** Go beyond row counts. Track **Value Variance**, **Null Drift**, and **Match Integrity** per column.\n*   **Professional Dashboards:** Auto-generates premium, stakeholder-ready HTML reports with status badges and join attribution.\n*   **Deep-Dive API:** Extract mismatched records as Polars DataFrames for instant remediation.\n\n---\n\n## 📈 The \"Magic\" Benchmark\n\n\u003e **\"Process 100M rows on a laptop in seconds, not minutes.\"**\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/godalida/koala-diff/main/assets/benchmark_100m.png\" alt=\"Koala Diff Benchmarks\" width=\"800\"\u003e\n\u003c/p\u003e\n\n### ⚡ Performance at a Glance\n*   **Time:** 🟦🟦 **1x** (Koala Diff) vs 🟦🟦🟦🟦🟦 **3x** (Polars) vs 🟦🟦...🟦 **30x+** (Pandas)\n*   **RAM:** 🟩 **0.4GB** (Koala Diff) vs 🟩🟩🟩🟩🟩🟩🟩🟩 **12GB+** (Polars)\n*   **Edge:** Native Rust `XXHash64` handles massive joins locally without cluster overhead.\n\n---\n\n### 🧐 Why not just use Polars/Spark?\n\nWhile Polars and Spark are incredible for general data processing, **Koala Diff** is a specialized tool for **Data Quality \u0026 Regression**:\n\n| Feature | Polars / Spark | 🚀 Koala Diff |\n| :--- | :--- | :--- |\n| **Specialization** | General Purpose ETL | **Data Quality \u0026 Diffing** |\n| **Memory** | High (Join-heavy) | **Ultra-Low (Streaming)** |\n| **Output** | Raw DataFrames | **Pro Dashboards + Metrics** |\n| **Logic** | Manual Join/Filter code | **Out-of-the-box Analytics** |\n| **Stakeholders** | Engineer-facing | **Business-Ready Reports** |\n\n*Koala Diff doesn't replace your processing engine; it verifies that its output is correct.*\n\n---\n\n*\u003e Benchmarks run on a MacBook Pro M3 Max.*\n\n---\n\n## 🎯 Common Use Cases\n\n*   **ETL Regression 
Testing:** Automatically verify that your daily pipeline didn't accidentally mutate 1 million rows after a code change.\n*   **Data Migration Validation:** Ensure 100% parity when moving data between systems (e.g., Hive to Snowflake or S3 to BigQuery).\n*   **Environment Drift Detection:** Compare **Production** vs. **Staging** datasets to find out why your model is behaving differently.\n*   **Compliance Auditing:** Generate unalterable HTML snapshots of data changes for regulatory or financial reviews.\n*   **CI/CD for Data:** Run `koala-diff` in your CI pipeline to block PRs that introduce unexpected data quality regressions.\n\n---\n\n## 📦 Installation\n\n```bash\npip install koala-diff\n```\n\n## ⚡ Quick Start\n\n### 1. Generate a \"Pro\" Report\n\n```python\nfrom koala_diff import DataDiff, HtmlReporter\n\n# Initialize with primary keys\ndiffer = DataDiff(key_columns=[\"user_id\"])\n\n# Run comparison\nresult = differ.compare(\"source.parquet\", \"target.parquet\")\n\n# Generate a professional dashboard\nreporter = HtmlReporter(\"data_quality_report.html\")\nreporter.generate(result)\n```\n\n### 2. Mismatch Deep-Dive\n\nNeed to fix the data? Pull the exact differences directly into Python:\n\n```python\n# Get a Polars DataFrame of ONLY mismatched rows\nmismatch_df = differ.get_mismatch_df()\n\n# Analyze variance or push to a remediation pipeline\nprint(mismatch_df.head())\n```\n\n### 3. CLI Usage (Coming Soon)\n\n```bash\nkoala-diff production.csv staging.csv --key user_id --output report.html\n```\n\n## 🏗 Architecture\n\nKoala Diff uses a streaming hash-join algorithm implemented in Rust:\n\n1.  **Reader:** Lazy Polars scan of both datasets.\n2.  **Hasher:** XXHash64 computation of row values (SIMD optimized).\n3.  **Differ:** Fast set operations to classify rows as `Added`, `Removed`, or `Modified`.\n4.  **Reporter:** Jinja2 rendering of results.\n\n## 🤝 Contributing\n\nWe welcome contributions! 
Whether it's a new file format reader, a performance optimization, or a documentation fix.\n\n1.  Check the [Issues](https://github.com/godalida/koala-diff/issues).\n2.  Read our [Contribution Guide](CONTRIBUTING.md).\n\n## 📄 License\n\nMIT © 2026 [godalida](https://github.com/godalida) - [KoalaDataLab](https://koaladatalab.com)\n","funding_links":[],"categories":["Data Comparison"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgodalida%2Fkoala-diff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgodalida%2Fkoala-diff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgodalida%2Fkoala-diff/lists"}