{"id":37595739,"url":"https://github.com/vupdivup/diffhouse","last_synced_at":"2026-01-20T17:03:08.124Z","repository":{"id":314887676,"uuid":"1052651155","full_name":"vupdivup/diffhouse","owner":"vupdivup","description":"diffhouse is a repository mining tool for structuring Git metadata at scale","archived":false,"fork":false,"pushed_at":"2025-11-29T12:03:08.000Z","size":1619,"stargazers_count":0,"open_issues_count":12,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-29T13:38:40.885Z","etag":null,"topics":["git","open-source","python","repository-mining","software-analysis"],"latest_commit_sha":null,"homepage":"https://vupdivup.github.io/diffhouse/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vupdivup.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-08T11:03:11.000Z","updated_at":"2025-11-26T16:26:18.000Z","dependencies_parsed_at":"2025-09-15T13:12:25.978Z","dependency_job_id":"1f59c4db-ef85-47e2-9772-047f8657c0b0","html_url":"https://github.com/vupdivup/diffhouse","commit_stats":null,"previous_names":["vupdivup/diffhouse"],"tags_count":20,"template":false,"template_full_name":null,"purl":"pkg:github/vupdivup/diffhouse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vupdivup%2Fdiffhouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vupdivup%2Fdiffhouse/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/vupdivup%2Fdiffhouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vupdivup%2Fdiffhouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vupdivup","download_url":"https://codeload.github.com/vupdivup/diffhouse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vupdivup%2Fdiffhouse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28607624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-20T16:10:39.856Z","status":"ssl_error","status_checked_at":"2026-01-20T16:10:39.493Z","response_time":117,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["git","open-source","python","repository-mining","software-analysis"],"created_at":"2026-01-16T09:56:08.434Z","updated_at":"2026-01-20T17:03:08.108Z","avatar_url":"https://github.com/vupdivup.png","language":"Python","readme":"# diffhouse: Repository Mining at Scale\n\n[![PyPI](https://img.shields.io/pypi/v/diffhouse)](https://pypi.org/project/diffhouse/) [![DOI](https://zenodo.org/badge/1052651155.svg)](https://doi.org/10.5281/zenodo.17368264) [![Test 
status](https://img.shields.io/github/actions/workflow/status/vupdivup/diffhouse/os-test.yml?label=tests\u0026branch=main)](https://github.com/vupdivup/diffhouse/actions/workflows/os-test.yml)\n\n[Documentation](https://vupdivup.github.io/diffhouse/)\n\n\u003c!-- home-start --\u003e\n\ndiffhouse is a **Python solution for structuring Git metadata**, designed to enable\nlarge-scale codebase analysis at practical speeds.\n\nKey features are:\n\n- 🚀 Fast access to commit data, file changes and more\n- 📊 Easy integration with pandas and Polars\n- 🐍 Simple-to-use Python interface\n\n## Performance\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/vupdivup/diffhouse/assets/benchmarks/benchmark_tweenjs.png\" alt=\"tweenjs/tween.js benchmark results\" width=\"480px\"\u003e\n  \u003cbr/\u003e\n  \u003cem\u003eProcessing times for \u003ca href=\"https://github.com/tweenjs/tween.js\"\u003etween.js\u003c/a\u003e. Lower is better.\u003c/em\u003e\n\u003c/p\u003e\n\nFor more details, see [benchmarks](https://vupdivup.github.io/diffhouse/benchmarks/).\n\n## Requirements\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cstrong\u003ePython\u003c/strong\u003e\u003c/td\u003e\n        \u003ctd\u003e3.10 or higher\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cstrong\u003eGit\u003c/strong\u003e\u003c/td\u003e\n        \u003ctd\u003e2.22 or higher\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\nGit also needs to be added to the system PATH.\n\n## Limitations\n\nAt its core, diffhouse is a data *extraction* tool and therefore does not calculate software metrics like code churn or cyclomatic complexity; if this is needed, take a look at [PyDriller](https://github.com/ishepard/pydriller) instead.\n\n\u003c!-- home-end --\u003e\n\n## User Guide\n\n\u003c!-- user-guide-start --\u003e\n\nThis guide aims to cover the basic use cases of diffhouse. 
For a full list of objects, consider reading the\n[API Reference](https://vupdivup.github.io/diffhouse/reference).\n\n### Installation\n\nInstall diffhouse from PyPI:\n\n```sh\npip install diffhouse\n```\n\n#### Optional Dependencies\n\nIf you plan to combine diffhouse with pandas or Polars, install the package with their respective extras:\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003epandas\u003c/td\u003e\n        \u003ctd\u003e\u003ccode\u003epip install diffhouse[pandas]\u003c/code\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003ePolars\u003c/td\u003e\n        \u003ctd\u003e\u003ccode\u003epip install diffhouse[polars]\u003c/code\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n### Quickstart\n\n```py\nfrom diffhouse import Repo\n\nwith Repo('https://github.com/user/repo') as r:\n    for c in r.commits:\n        print(c.commit_hash[:10], c.date, c.author_email)\n\n    if len(r.branches.to_list()) \u003e 100:\n        print('🎉')\n\n    df = r.diffs.to_pandas()\n```\n\nTo start, create a [`Repo`](https://vupdivup.github.io/diffhouse/reference/repo/) instance by passing either a Git-hosting URL or a local path as its `source` argument. Next, use the `Repo` in a `with` statement to clone the source into a local, non-persistent\nlocation.\n\nInside the `with` block, you can access data through the following properties:\n\n| Property | Description | Record Type\n| --- | --- | --- |\n| [`Repo.commits`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.commits) | Commit history of the repository. | [`Commit`](https://vupdivup.github.io/diffhouse/reference/commit/) |\n| [`Repo.filemods`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.filemods) | File modifications across the commit history. 
| [`FileMod`](https://vupdivup.github.io/diffhouse/reference/filemod/) |\n| [`Repo.diffs`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.diffs) | Source code changes across the commit history. | [`Diff`](https://vupdivup.github.io/diffhouse/reference/diff/) |\n| [`Repo.branches`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.branches) | Branches of the repository. | [`Branch`](https://vupdivup.github.io/diffhouse/reference/branch/) |\n| [`Repo.tags`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.tags) | Tags of the repository. | [`Tag`](https://vupdivup.github.io/diffhouse/reference/tag/) |\n\n### Querying Results\n\nData accessors like `Repo.commits` are [`Extractor`](https://vupdivup.github.io/diffhouse/reference/extractor/) objects and can output their results in various formats:\n\n#### Looping Through Objects\n\nYou can use extractors in a `for` loop to process objects one by one. Data will be extracted on demand for memory efficiency:\n\n```py\nwith Repo('https://github.com/user/repo') as r:\n    for c in r.commits:\n        print(c.commit_hash[:10])\n        print(c.author_name)\n\n        if c.in_main:\n            break\n```\n\n`iter_dicts()` is a `for` loop alternative that yields dictionaries instead of diffhouse objects. A good use case for this is writing results into a newline-delimited JSON file:\n\n```py\nimport json\n\nwith (\n    Repo('https://github.com/user/repo') as r,\n    open('commits.jsonl', 'w') as f\n):\n    for c in r.commits.iter_dicts():\n        f.write(json.dumps(c) + '\\n')\n```\n\n#### Converting to Dataframes\n\npandas and Polars `DataFrame` APIs are supported out of the box. 
To convert result sets to dataframes, call the following methods:\n\n- `to_pandas()` or `pd()` for pandas\n- `to_polars()` or `pl()` for Polars\n\n```py\nwith Repo('https://github.com/user/repo') as r:\n    df1 = r.filemods.to_pandas()  # pandas\n    df2 = r.diffs.to_polars()  # Polars\n```\n\n### Preliminary Filtering\n\nYou can filter data along certain dimensions *before* processing takes place to reduce extraction time and/or network load.\n\n\u003e [!NOTE]\n\u003e Filters are a work-in-progress feature. Additional options like date and branch filtering are planned for future releases.\n\n#### Skipping File Downloads\n\nIf no blob-level data is needed, pass `blobs=False` when creating the `Repo` to skip file downloads during cloning. Note that the following will then not be populated:\n\n- `files_changed`, `lines_added` and `lines_deleted` fields of `Repo.commits`\n- `Repo.filemods`\n- `Repo.diffs`\n\n```py\nwith Repo('https://github.com/user/repo', blobs=False) as r:\n    for b in r.branches:\n        pass  # business as usual\n\n    r.filemods  # raises FilterError\n```\n\n\u003c!-- user-guide-end --\u003e\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvupdivup%2Fdiffhouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvupdivup%2Fdiffhouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvupdivup%2Fdiffhouse/lists"}