https://github.com/erikbern/git-of-theseus
Analyze how a Git repo grows over time
author-statistics git python repository-management
- Host: GitHub
- URL: https://github.com/erikbern/git-of-theseus
- Owner: erikbern
- License: apache-2.0
- Created: 2016-09-13T03:29:40.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-11-25T17:04:17.000Z (about 1 year ago)
- Last Synced: 2024-10-29T15:39:16.135Z (2 months ago)
- Topics: author-statistics, git, python, repository-management
- Language: Python
- Homepage: https://erikbern.com/2016/12/05/the-half-life-of-code.html
- Size: 2.38 MB
- Stars: 2,536
- Watchers: 26
- Forks: 85
- Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-discoveries - git-of-theseus - analyze how a Git repo grows over time _(`Python`)_ (Git and Version Control Systems)
README
[![pypi badge](https://img.shields.io/pypi/v/git-of-theseus.svg?style=flat)](https://pypi.python.org/pypi/git-of-theseus)
Some scripts to analyze Git repos. Produces cool looking graphs like this (running it on [git](https://github.com/git/git) itself):
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-git.png)
Installing
----------

Run `pip install git-of-theseus`.
Running
-------

First, you need to run `git-of-theseus-analyze <path to repo>` (see `git-of-theseus-analyze --help` for the available configuration options). This will analyze a repository and might take quite some time.
After that, you can generate plots! Some examples:
1. Running `git-of-theseus-stack-plot cohorts.json` will create a stack plot showing the total amount of code broken down into cohorts (the year the code was added).
1. Running `git-of-theseus-line-plot authors.json --normalize` will show a plot of the percentage of code contributed by the top 20 authors.
1. Running `git-of-theseus-survival-plot survival.json` will plot how long lines of code survive in the repository.

You can run each command with `--help` to see various options.
If you want to plot multiple repositories, you have to run `git-of-theseus-analyze` separately for each project and store the data in separate directories using the `--outdir` flag. Then you can run `git-of-theseus-survival-plot` on the `survival.json` files from those directories (optionally with the `--exp-fit` flag to fit an exponential decay).
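As a rough sketch of the multi-repository workflow, the Python helper below only builds the command lines; it does not run anything. The function name `build_commands`, the `theseus-out` directory and the example repo paths are made up for illustration. It also assumes that the repository path is passed as a positional argument and that `survival.json` ends up inside each `--outdir` directory.

```python
import shlex
from pathlib import Path


def build_commands(repos, out_root="theseus-out", exp_fit=True):
    """Build one analyze command per repo plus a single survival-plot command.

    Assumption: git-of-theseus-analyze writes survival.json into --outdir.
    """
    analyze_cmds = []
    survival_files = []
    for repo in repos:
        # One output directory per project, named after the repo directory.
        outdir = Path(out_root) / Path(repo).name
        analyze_cmds.append(
            ["git-of-theseus-analyze", str(repo), "--outdir", str(outdir)]
        )
        survival_files.append(str(outdir / "survival.json"))
    plot_cmd = ["git-of-theseus-survival-plot", *survival_files]
    if exp_fit:
        plot_cmd.append("--exp-fit")
    return analyze_cmds, plot_cmd


# Hypothetical local checkouts; print the commands instead of running them.
analyze_cmds, plot_cmd = build_commands(["repos/git", "repos/linux"])
for cmd in analyze_cmds + [plot_cmd]:
    print(shlex.join(cmd))
```

To actually execute the commands, each list could be passed to `subprocess.run(cmd, check=True)`.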
Help
----

`AttributeError: Unknown property labels` – upgrade matplotlib if you are seeing this: `pip install matplotlib --upgrade`
Some pics
---------

Survival of a line of code in a set of interesting repos:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-projects-survival.png)
This curve is produced by the `git-of-theseus-survival-plot` script and shows the *percentage of lines in a commit that are still present after x years*. It aggregates it over all commits, no matter what point in time they were made. So for *x=0* it includes all commits, whereas for *x>0* not all commits are counted (because we would have to look into the future for some of them). The survival curves are estimated using [Kaplan-Meier](https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator).
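To make the Kaplan-Meier idea concrete, here is a minimal, dependency-free sketch of the estimator in Python. It is not the code used by `git-of-theseus-survival-plot`. It treats each line of code as one observation, which is a simplification, and the function name and toy data are invented for illustration. A line that still exists when the history ends is "censored": it counts as at risk up to that point but not as a deletion.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for every time t at which a deletion was observed.

    durations: how long each line was followed (for example, in years)
    observed:  True if the line was deleted at that time, False if it was
               still present when the history ended (censored)
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        leaving = 0
        # Group all observations that share the same time t.
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk lines that survived time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deleted and censored lines both leave the at-risk set.
        at_risk -= leaving
    return curve


# Toy data: five lines, two of them still alive at the end of the history.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, False, True, True, False])
print([(t, round(s, 2)) for t, s in curve])  # [(1, 0.8), (2, 0.6), (3, 0.3)]
```

The censored observations are what let the curve use commits that are too recent to have an outcome for large *x*, without having to "look into the future".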
You can also add an exponential fit:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-projects-survival-exp-fit.png)
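One simple way to fit an exponential decay to a survival curve is sketched below. It assumes the model S(t) = exp(-k·t) and fits a straight line through the origin to ln S(t) by least squares. This is an illustration only; the fitting method behind `--exp-fit` may differ, and the function names here are hypothetical. The half-life then follows as ln(2)/k.

```python
import math


def fit_exponential_decay(times, survival):
    """Fit ln S(t) = -k * t by least squares (line through the origin); return k."""
    # Skip points with S(t) <= 0, where the logarithm is undefined.
    points = [(t, s) for t, s in zip(times, survival) if s > 0]
    num = sum(t * math.log(s) for t, s in points)
    den = sum(t * t for t, _ in points)
    return -num / den


def half_life(k):
    """Time after which half of the lines are expected to be gone."""
    return math.log(2) / k


# Synthetic curve generated with a half-life of exactly 3 years.
times = [1, 2, 3, 4]
survival = [0.5 ** (t / 3) for t in times]
k = fit_exponential_decay(times, survival)
print(round(half_life(k), 3))  # 3.0
```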
Linux – stack plot:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-linux.png)
This curve is produced by the `git-of-theseus-stack-plot` script and shows the total number of lines in a repo broken down into cohorts by the year the code was added.
Node – stack plot:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-node.png)
Rails – stack plot:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-rails.png)
Tensorflow – stack plot:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-tensorflow.png)
Rust – stack plot:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-rust.png)
Plotting other stuff
--------------------

`git-of-theseus-analyze` will write `exts.json`, `cohorts.json` and `authors.json`. You can run `git-of-theseus-stack-plot authors.json` to plot author statistics as well, or `git-of-theseus-stack-plot exts.json` to plot file extension statistics.

For author statistics, you might want to create a [.mailmap](https://git-scm.com/docs/gitmailmap) file in the root directory of the repository to deduplicate authors. If you need to create a .mailmap file, the following command can list the distinct author-email combinations in a repository:
Mac / Linux
```shell
git log --pretty=format:"%an %ae" | sort | uniq
```

Windows PowerShell
```powershell
git log --pretty=format:"%an %ae" | Sort-Object | Select-Object -Unique
```

For instance, here are the author statistics for [Kubernetes](https://github.com/kubernetes/kubernetes):
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-kubernetes-authors.png)
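Coming back to the `.mailmap` step described earlier in this section: the output of the `git log --pretty=format:"%an %ae"` command can be turned into draft `.mailmap` lines with a small script. The sketch below is a hypothetical helper, not part of git-of-theseus. It only handles the case where one email address appears with several author names, and it picks the first name seen as the canonical one, so the result should be reviewed by hand.

```python
from collections import defaultdict


def suggest_mailmap(lines):
    """Suggest 'Proper Name <email>' lines for emails used with several names."""
    names_by_email = defaultdict(list)
    for line in lines:
        line = line.strip()
        if " " not in line:
            continue  # skip empty lines and entries without a name
        # The email is the last space-separated field of "%an %ae".
        name, email = line.rsplit(" ", 1)
        key = email.lower()  # group case-insensitively (a simplifying choice)
        if name not in names_by_email[key]:
            names_by_email[key].append(name)
    return [
        f"{names[0]} <{email}>"
        for email, names in sorted(names_by_email.items())
        if len(names) > 1
    ]


# Made-up authors for illustration.
pairs = [
    "Alice Example alice@example.com",
    "alice-ex alice@example.com",
    "Bob Example bob@example.com",
]
print(suggest_mailmap(pairs))  # ['Alice Example <alice@example.com>']
```

Merging one person's different email addresses needs the longer `.mailmap` forms described in the gitmailmap documentation, which this sketch does not attempt.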
You can also normalize it to 100%. Here are the author statistics for Git:
![git](https://raw.githubusercontent.com/erikbern/git-of-theseus/master/pics/git-git-authors-normalized.png)
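Normalizing to 100% means that, at each point in time, every author's line count is divided by the total across all authors. A minimal sketch of that arithmetic follows. The input layout (one list of counts per author) is an assumption made for this example and is not necessarily how `authors.json` is structured.

```python
def normalize_to_percent(series):
    """series: one list per author, all the same length (one value per point in time)."""
    # Column totals: the sum over all authors at each point in time.
    totals = [sum(col) for col in zip(*series)]
    return [
        [100.0 * v / total if total else 0.0 for v, total in zip(row, totals)]
        for row in series
    ]


# Two authors, two points in time.
print(normalize_to_percent([[1, 3], [3, 1]]))  # [[25.0, 75.0], [75.0, 25.0]]
```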
Other stuff
-----------

[Markovtsev Vadim](https://twitter.com/tmarkhor) implemented a very similar analysis that claims to be between 20% and 6x faster than Git of Theseus. It's named [Hercules](https://github.com/src-d/hercules) and there's a great [blog post](https://web.archive.org/web/20180918135417/https://blog.sourced.tech/post/hercules.v4/) about all the complexity going into the analysis of Git history.