https://github.com/vupdivup/diffhouse

diffhouse is a repository mining tool for structuring Git metadata at scale

git open-source python repository-mining software-analysis

Last synced: 23 days ago


Host: GitHub
URL: https://github.com/vupdivup/diffhouse
Owner: vupdivup
License: mit
Created: 2025-09-08T11:03:11.000Z (5 months ago)
Default Branch: main
Last Pushed: 2025-11-29T12:03:08.000Z (3 months ago)
Last Synced: 2025-11-29T13:38:40.885Z (2 months ago)
Topics: git, open-source, python, repository-mining, software-analysis
Language: Python
Homepage: https://vupdivup.github.io/diffhouse/
Size: 1.54 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE.md
- Citation: CITATION.cff


# diffhouse: Repository Mining at Scale

[![PyPI](https://img.shields.io/pypi/v/diffhouse)](https://pypi.org/project/diffhouse/) [![DOI](https://zenodo.org/badge/1052651155.svg)](https://doi.org/10.5281/zenodo.17368264) [![Test status](https://img.shields.io/github/actions/workflow/status/vupdivup/diffhouse/os-test.yml?label=tests&branch=main)](https://github.com/vupdivup/diffhouse/actions/workflows/os-test.yml)

[Documentation](https://vupdivup.github.io/diffhouse/)

diffhouse is a **Python solution for structuring Git metadata**, designed to enable large-scale codebase analysis at practical speeds.

Key features are:

- 🚀 Fast access to commit data, file changes and more

- 📊 Easy integration with pandas and Polars

- 🐍 Simple-to-use Python interface

## Performance



  

  


*Figure: Processing times for tween.js. Lower is better.*



For more details, see [benchmarks](https://vupdivup.github.io/diffhouse/benchmarks/).

## Requirements

    

| Requirement | Version |
| --- | --- |
| Python | 3.10 or higher |
| Git | 2.22 or higher |

    

Git also needs to be added to the system PATH.
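Both requirements can be confirmed up front with a small preflight check. The sketch below uses only the standard library; the helper names (`parse_git_version`, `check_requirements`) are illustrative and not part of diffhouse.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)
MIN_GIT = (2, 22)


def parse_git_version(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the output of `git --version`."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized git version string: {output!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements() -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10 or higher is required")
    git = shutil.which("git")  # None if Git is not on the system PATH
    if git is None:
        problems.append("Git was not found on the system PATH")
    else:
        out = subprocess.run(
            [git, "--version"], capture_output=True, text=True, check=True
        ).stdout
        if parse_git_version(out) < MIN_GIT:
            problems.append("Git 2.22 or higher is required")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Running the script prints one line per unmet requirement and nothing if the environment looks fine.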

## Limitations

At its core, diffhouse is a data *extraction* tool and therefore does not calculate software metrics like code churn or cyclomatic complexity; if this is needed, take a look at [PyDriller](https://github.com/ishepard/pydriller) instead.

## User Guide

This guide covers the basic use cases of diffhouse. For a full list of objects, see the [API Reference](https://vupdivup.github.io/diffhouse/reference).

### Installation

Install diffhouse from PyPI:

```sh
pip install diffhouse
```

#### Optional Dependencies

If you plan to combine diffhouse with pandas or Polars, install the package with their respective extras:

    

| Library | Install command |
| --- | --- |
| pandas | `pip install diffhouse[pandas]` |
| Polars | `pip install diffhouse[polars]` |
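To see which of the optional dataframe libraries are already present in the current environment, a check along these lines may help. It relies only on the standard library; `is_installed` is a hypothetical helper name, not something diffhouse provides.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


# Report which optional dataframe backends are available.
available = {name: is_installed(name) for name in ("pandas", "polars")}
print(available)
```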

    

### Quickstart

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10], c.date, c.author_email)

    if len(r.branches.to_list()) > 100:
        print('🎉')

    df = r.diffs.to_pandas()
```

To start, create a [`Repo`](https://vupdivup.github.io/diffhouse/reference/repo/) instance by passing either a Git-hosting URL or a local path as its `source` argument. Next, use the `Repo` in a `with` statement to clone the source into a local, non-persistent location.

Inside the `with` block, you can access data through the following properties:

| Property | Description | Record Type |
| --- | --- | --- |
| [`Repo.commits`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.commits) | Commit history of the repository. | [`Commit`](https://vupdivup.github.io/diffhouse/reference/commit/) |
| [`Repo.filemods`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.filemods) | File modifications across the commit history. | [`FileMod`](https://vupdivup.github.io/diffhouse/reference/filemod/) |
| [`Repo.diffs`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.diffs) | Source code changes across the commit history. | [`Diff`](https://vupdivup.github.io/diffhouse/reference/diff/) |
| [`Repo.branches`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.branches) | Branches of the repository. | [`Branch`](https://vupdivup.github.io/diffhouse/reference/branch/) |
| [`Repo.tags`](https://vupdivup.github.io/diffhouse/reference/repo/#diffhouse.Repo.tags) | Tags of the repository. | [`Tag`](https://vupdivup.github.io/diffhouse/reference/tag/) |

### Querying Results

Data accessors like `Repo.commits` are [`Extractor`](https://vupdivup.github.io/diffhouse/reference/extractor/) objects and can output their results in various formats:

#### Looping Through Objects

You can use extractors in a `for` loop to process objects one by one. Data will be extracted on demand for memory efficiency:

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10])
        print(c.author_name)

        # stop at the first commit that is part of the main branch
        if c.in_main:
            break
```
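Since extraction happens on demand, standard iterator tools should work with extractors too. The sketch below uses a stand-in generator (`fake_commits`, hypothetical and not part of diffhouse) to show how `itertools.islice` takes the first few records without producing the rest. With a real repository, one would presumably pass `r.commits` in its place.

```python
from itertools import islice

produced = []  # tracks which records the generator actually yielded


def fake_commits(n):
    """Stand-in for a lazy extractor: yields one record at a time."""
    for i in range(n):
        record = {"commit_hash": f"{i:040x}"}
        produced.append(record)
        yield record


# Take only the first three records; the remaining ones are never generated.
first_three = list(islice(fake_commits(1000), 3))
print([c["commit_hash"][:10] for c in first_three])
print(len(produced))
```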

`iter_dicts()` is a `for` loop alternative that yields dictionaries instead of diffhouse objects. A good use case for this is writing results into a newline-delimited JSON file:

```py
import json

from diffhouse import Repo

with (
    Repo('https://github.com/user/repo') as r,
    open('commits.jsonl', 'w') as f
):
    # write one JSON object per line
    for c in r.commits.iter_dicts():
        f.write(json.dumps(c) + '\n')
```
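A newline-delimited JSON file can later be read back one record at a time without loading it whole. The sketch below is self-contained: it first writes a few made-up records, with field names borrowed from the examples above, and then counts commits per author. `read_jsonl` is an illustrative helper, not a diffhouse function.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Made-up sample records standing in for real commit dictionaries.
sample = [
    {"commit_hash": "a" * 40, "author_email": "ann@example.com"},
    {"commit_hash": "b" * 40, "author_email": "bob@example.com"},
    {"commit_hash": "c" * 40, "author_email": "ann@example.com"},
]

path = Path(tempfile.mkdtemp()) / "commits.jsonl"
with open(path, "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")


def read_jsonl(p):
    """Yield one parsed object per non-empty line."""
    with open(p) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


commits_per_author = Counter(c["author_email"] for c in read_jsonl(path))
print(commits_per_author.most_common())
```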

#### Converting to Dataframes

pandas and Polars `DataFrame` APIs are supported out of the box. To convert result sets to dataframes, call the following methods:

- `to_pandas()` or `pd()` for pandas

- `to_polars()` or `pl()` for Polars

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    df1 = r.filemods.to_pandas()  # pandas
    df2 = r.diffs.to_polars()  # Polars
```

### Preliminary Filtering

You can filter data along certain dimensions *before* processing takes place to reduce extraction time and/or network load.

> [!NOTE]
> Filters are a WIP feature. Additional options like date and branch filtering are planned for future releases.

#### Skipping File Downloads

If no blob-level data is needed, pass `blobs=False` when creating the `Repo` to skip file downloads during cloning. Note that this will not populate:

- `files_changed`, `lines_added` and `lines_deleted` fields of `Repo.commits`

- `Repo.filemods`

- `Repo.diffs`

```py
from diffhouse import Repo

with Repo('https://github.com/user/repo', blobs=False) as r:
    for b in r.branches:
        pass  # business as usual

    r.filemods  # raises FilterError
```
