{"id":27881834,"url":"https://github.com/src-d/identity-matching","last_synced_at":"2026-03-12T13:32:41.170Z","repository":{"id":62865964,"uuid":"188056309","full_name":"src-d/identity-matching","owner":"src-d","description":"source{d} extension to match Git signatures to real people.","archived":false,"fork":false,"pushed_at":"2019-11-12T21:33:45.000Z","size":899,"stargazers_count":17,"open_issues_count":5,"forks_count":13,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-05-05T05:05:29.801Z","etag":null,"topics":["identity-matching","personal-info"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-05-22T14:33:55.000Z","updated_at":"2021-04-13T21:15:35.000Z","dependencies_parsed_at":"2022-11-08T06:15:26.610Z","dependency_job_id":null,"html_url":"https://github.com/src-d/identity-matching","commit_stats":null,"previous_names":["src-d/eee-identity-matching"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fidentity-matching","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fidentity-matching/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fidentity-matching/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fidentity-matching/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d/i
dentity-matching/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252442486,"owners_count":21748451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["identity-matching","personal-info"],"created_at":"2025-05-05T05:05:35.335Z","updated_at":"2025-12-15T01:07:37.231Z","avatar_url":"https://github.com/src-d.png","language":"Go","readme":"# Identity Matching source{d} Extension\n\n[![Travis build status](https://travis-ci.com/src-d/identity-matching.svg?token=WzaxY77NzbmrefwxuhAh\u0026branch=master)](https://travis-ci.com/src-d/identity-matching) [![Code coverage](https://codecov.io/github/src-d/identity-matching/coverage.svg)](https://codecov.io/gh/src-d/identity-matching) [![Docker pulls](https://img.shields.io/docker/pulls/srcd/identity_matching.svg)](https://hub.docker.com/r/srcd/identity_matching) [![Go Report Card](https://goreportcard.com/badge/github.com/src-d/identity-matching?branch=master)](https://goreportcard.com/report/github.com/src-d/identity-matching) [![GPL 3.0 license](https://img.shields.io/badge/License-GPL%203.0-blue.svg)](https://opensource.org/licenses/GPL-3.0)\n\nMatch different identities of the same person using 🤖. Extension for [source{d}](https://github.com/src-d/sourced-ce).\n\n[Overview](#overview) • [How To Use](#how-to-use) • [Science](#science) • [Contributions](#contributions) • [License](#license)\n\n## Overview\n\nPeople are using different e-mails and names (aka identities) when they commit their work to git. 
\nE-mails can be corporate, personal, or special, like users.noreply.github.com, etc. \nNames may include a surname or not, contain typos, be missing entirely, etc. \nThus, to get precise information about a developer, it is necessary to gather all of their identities \nand separate them from other people's identities. That's what we call Identity Matching.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/idmatching_graph.png\" alt=\"Identity graph\"/\u003e\n\u003c/p\u003e\n\n## How To Use\n\n**Right now no pre-built binaries are available.**\nPlease refer to the [How to build](#how-to-build) section to build an executable.\n\nRun `match-identities --help` to see all the parameters that you can configure. \n\nThere are two use cases supported for `match-identities`.\n1. [With gitbase](#use-with-gitbase)\n2. [Without gitbase](#use-without-gitbase)\n\nIn both cases, the output identity table is saved as a Parquet file.\nRead more in the [Output format](#output-format) section.\n\n### Use with gitbase\n\n`match-identities` is designed to be used with [gitbase](https://github.com/src-d/gitbase). \nFirst of all, make sure you have a gitbase instance running with all the repositories you are going to analyze.\nPlease refer to the [gitbase](https://github.com/src-d/gitbase) documentation to get more information. \n\nUsage example:\n```\nmatch-identities --output matched_identities.parquet\n```\n\nThe credentials can be configured with the `--host`, `--port`, `--user` and `--password` flags. \n\nFor example, the following SQL gitbase query will return the identities of each commit author:\n```sql\nSELECT DISTINCT repository_id, commit_author_name, commit_author_email\nFROM commits;\n```\n\nIf you want to cache the gitbase output you can use the `--cache` flag. \nAfter the identities are fetched from gitbase, the matching process is run. 
\nRead the [Science](#science) section to learn more.\n\n### Use without gitbase\nIf you run `match-identities` with the `--cache` option enabled, you get a `csv` file with the cached [gitbase](https://github.com/src-d/gitbase) output.\nBesides, if you already have a list of identities, it is possible to run `match-identities` without gitbase involved.\nCreate a CSV file with the columns `repo`, `email` and `name`, then feed it to the `--cache` parameter.\n\nUsage example:\n```\nmatch-identities \\\n    --cache path/to/csv/file.csv \\\n    --output matched_identities.parquet\n```\n\n### Output format \nOnce the algorithm finishes merging identities, you get a table with 4 columns: \n1. `id` (`int64`) -- unique identifier of the person with the corresponding identity. \n2. `email` (`utf8`) -- e-mail of the identity.\n3. `name` (`utf8`) -- name of the identity.\n4. `repo` (`utf8`) -- repository of the commit.\n\n\nThe columns `email`, `name` and `repo` may contain empty values, which mean no constraint on that column.\nFor example, let's consider this output identity table:\n```\nid,email,name,repo\n1,alice@gmail.com,\"\",\"\"\n1,\"\",alice,\"\"\n2,bob@gmail.com,\"\",\"\"\n2,\"\",bob,\"\"\n2,bob@inbox.com,\"\",\"\"\n2,\"\",no-name,bob/bobs-project\n```\n\nThere are two developers. \nLet's name them Alice (with id 1) and Bob (with id 2). \nWhen we analyze a commit with `alice@gmail.com` as the author email, the author is Alice.\nThe repository and author name are ignored since the author email is the most reliable way to define an identity.\nOn the other hand, when we analyze a commit with `alice` as the author name, the author is Alice for any combination of email and repository.\nThe same goes for Bob, although he uses two different email addresses, `bob@gmail.com` and `bob@inbox.com`.\nIf we come across a commit with the `no-name` author name in the `bob/bobs-project` repository, then it is Bob's. 
\n\n### Convert parquet to CSV\n\nIt is possible to convert the output parquet file to CSV using the python script in the `research` directory:\n```bash \npython3 ./research/parquet2csv.py matched_identities.parquet\n```\nThe result will be saved as `matched_identities.csv`.\nPlease note that pyspark must be installed. \n\n### External matching option\n\nIf the organization is using GitHub, Gitlab or Bitbucket, it is possible to use their API to match identities by emails. In that case, 2 columns are added and filled for every email in the table: the `External id provider` and the `External id` itself.\n\n## How to build\n\n```bash\ngit clone https://github.com/src-d/identity-matching\ncd identity-matching\nmake build\n```\n\nYou'll see two directories with Linux and Macos binaries inside the `build` directory. \n\n## Science\n\nThere are two stages to match identities. \nThe first is the precomputation which is run once on the whole dataset and remains unchanged during the subsequent steps. \nThe second is the matching itself.\n1. Precomputation:\n    1. Gather 2 lists of the most popular names and emails (by frequencies) on the whole dataset.\n    2. Gather 2 lists of emails and names that will be ignored (aka blacklists) on the whole dataset.\n       They are non-human identities and usually related to CI, bots, etc.\n2. Analysis:\n   1. Gather the list of triplets `{email, name, repository}` from all the commits using gitbase.\n   2. Remove any triplet whose name or email belongs to the blacklists. \n   3. Merge identities with the same e-mail if it doesn't belong to the list of popular emails created in 1.1.\n   4. Merge identities with the same name if it doesn't belong to the list of popular names created in 1.1.\n      When the name belongs to this list we replace it with the following tuple `(name, repository)`. \n   5. 
Save the resulting identity table in the desired output format.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/idmatching.png\" alt=\"Identity matching diagram\"/\u003e\n\u003c/p\u003e\n\nThere is a Design Document (or a Blueprint, or whatever else you are used to calling project documentation) which goes into more detail:\n[link](https://docs.google.com/document/d/1oNo_rS5mHqEVk_yug8hbMWIpQaJeOUYZitR3jWnHJCs/edit#heading=h.qhzm4nnshexd).\n\n## Contributions\n\n...are welcome! See [CONTRIBUTING](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md).\n\n## License\n\nGPL 3.0, see [LICENSE](LICENSE). Y u no Apache/MIT? [Read here.](https://github.com/src-d/guide/blob/master/engineering/licensing.md#licence)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fidentity-matching","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fidentity-matching","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fidentity-matching/lists"}