{"id":26279676,"url":"https://github.com/alephdata/followthemoney-compare","last_synced_at":"2025-07-18T10:32:54.677Z","repository":{"id":49588522,"uuid":"361766453","full_name":"alephdata/followthemoney-compare","owner":"alephdata","description":"followthemoney-compare","archived":false,"fork":false,"pushed_at":"2023-03-20T21:52:05.000Z","size":1789,"stargazers_count":2,"open_issues_count":3,"forks_count":1,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-07T03:03:52.105Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alephdata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-04-26T13:47:00.000Z","updated_at":"2023-01-12T05:50:11.000Z","dependencies_parsed_at":"2025-05-07T03:03:55.486Z","dependency_job_id":"e22ba6ac-990b-492d-b2a5-1dbe60870c8e","html_url":"https://github.com/alephdata/followthemoney-compare","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/alephdata/followthemoney-compare","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Ffollowthemoney-compare","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Ffollowthemoney-compare/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Ffollowthemoney-compare/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Ffollowthemoney-comp
are/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alephdata","download_url":"https://codeload.github.com/alephdata/followthemoney-compare/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alephdata%2Ffollowthemoney-compare/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265505760,"owners_count":23778580,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-14T14:15:59.484Z","updated_at":"2025-07-18T10:32:54.645Z","avatar_url":"https://github.com/alephdata.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Follow The Money: Compare\n\n\u003e Tools and models for comparing followthemoney entities\n\n\n## Overview\n\nThis repo provides the tools necessary to pre-process and train models to power\na cross-reference system on top of `followthemoney`. It was built with tight\nintegration with [aleph](https://github.com/alephdata/aleph) in mind; however,\nthis repo is aleph agnostic.\n\nCurrently, there are four main components to this system:\n\n- Exporting training data\n- Creating preprocessing filters (optional)\n- Creating the training data\n- Training a model\n\nThey are explained in further detail below.\n\n\n## Installation\n\nInstallation is done through PyPI. 
To install the minimal dependencies for\nmodel evaluation, run:\n\n```\n$ pip install followthemoney-compare\n```\n\nIf you intend to train a model or do any model development, you should install\nthe development dependencies as well:\n\n```\n$ pip install followthemoney-compare[dev]\n```\n\nIn addition, a Dockerfile is provided (which defaults to a minimal\nfollowthemoney-compare installation) to simplify system dependencies.\n\n\n## Pre-built models\n\nPre-built models and word frequency objects are available on OCCRP's public\ndata site. The URLs are:\n\n- https://public.data.occrp.org/develop/models/word-frequencies/word_frequencies.zip\n- https://public.data.occrp.org/develop/models/xref/glm_bernoulli_2e_wf-v0.4.1.pkl\n\nThe word_frequencies.zip archive should be unzipped and the environment\nvariable `FTM_COMPARE_FREQUENCIES_DIR` should be set to the path of the\nunzipped data.\n\nThe model file can be loaded with pickle and used immediately. This pre-built\nmodel achieves the following accuracy-precision-recall on a dataset built from\nhttps://aleph.occrp.org/:\n\n![prebuilt evaluation](https://public.data.occrp.org/develop/models/xref/glm_bernoulli_2e_wf-v0.4.1.png)\n\n### Exporting Training Data\n\nThe initial data feeding this system comes from the aleph profile system. In\nthis system, users see proposed entity matches and decide whether the two\nentities are indeed the same or not. Using the aleph profile API endpoint\n(`/api/2/entitysets?filter:type=profile\u0026filter:collection_id=\u003ccollection_id\u003e`)\nor the aleph profile export utility (`$ aleph dump-profiles`), you can\nexport these user decisions into JSON format.\n\nThis JSON data includes a profile ID, the two entities being compared, which\ncollections they originate from and the user decision regarding their\nsimilarity. If multiple positive matches all have the same profile ID, we can\nconsider all of the entities to be the same. 
As a result, many judgements on\none profile generally give more training data than the same number of\njudgements on different profiles.\n\nIn addition to this human-labeled data, you can optionally provide a list of\nentities that can be used to create smarter pre-processing filters to clean the\ndata. This is done by exporting raw entities out of aleph and making sure that\nthe entities have a `collection_id` field (depending on your export method,\nthis may have to be added manually).\n\n\n### Creating preprocessing filters (optional)\n\nIn order to reduce noise in the entity properties, we calculate an approximate\n[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) using a [count-min\nsketch](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch). Using this\nsystem, we are able to weight each token by how \"informative\" it is and keep\nthe resulting models from focusing on very common tokens (tokens like common\nlast names, or the term \"corporation\" for companies).\n\nTo make this possible, the subcommand `$ followthemoney-compare\ncreate-word-frequency` is used. It takes in a flat file of entities (including\ntheir `collection_id`), tokenizes the `name` property and accumulates counts\nfor token frequency for all entities, token frequency per schema and number of\ncollections that token was seen in.\n\nWhen creating these structures, you can decide how large the acceptable error\nis for the approximate TF-IDF. The confidence and error rate have been tuned to\ngive reasonable results on the scale of data that OCCRP's Aleph installation\nprovides. In this case, each structure is ~8MB and gives 0.01% error 99.95% of\nthe time. The error rates and confidence level can be tuned for the amount of\ndata you intend to use in order to adjust the size of the resulting structure.\n\nThe `create-word-frequency` subcommand saves the resulting counts into a\ndirectory structure containing the count-min sketches. 
The path to this directory\nshould be saved in your `FTM_COMPARE_FREQUENCIES_DIR` environment variable (it\ndefaults to \"./data/word_frequencies/\").\n\n```\n$ cat ./data/entities.json | \\\n    followthemoney-compare create-word-frequency ./data/word-frequency/\n```\n\n\n### Creating the training data\n\nIn order to speed up training, all entity comparison features that the model\nuses are pre-computed and saved into a pandas data frame. In order to create\nthis data frame, run the `$ followthemoney-compare create-data` subcommand. This\nwill use the count-min sketch filters calculated in the previous step if they\nare available (if not, a UserWarning will be issued to make sure you know!).\n\nNote that the progress bar while doing this step can be pretty jumpy if you have\nlarge profiles. Be patient with this step as it can take upwards of an hour to\ncomplete. If you find yourself constantly rebuilding the training data (i.e. if\nyou are tuning the model features), this phase is ripe for optimization.\n\n```\n$ export FTM_COMPARE_FREQUENCIES_DIR=\"./data/word-frequency\"  # optional\n$ followthemoney-compare create-data \\\n    ./data/profiles-export/ ./data/training-data.pkl\n```\n\n\n### Training a model\n\nAll models can be trained using the same CLI. In order to see the available\nmodels, run the command `$ followthemoney-compare list-models`. 
Currently, the\n`glm_bernoulli_2e` model performs best, particularly on entities that can have\ndifferent levels of completeness.\n\n```\n$ export FTM_COMPARE_FREQUENCIES_DIR=\"./data/word-frequency\"  # optional\n$ followthemoney-compare train \\\n    --plot \"./data/models/glm_bernoulli_2e.png\" \\\n    glm_bernoulli_2e \\\n    ./data/training-data.pkl \\\n    \"./data/models/glm_bernoulli_2e.pkl\"\n```\n\nOnce training finishes, the optional `--plot` parameter creates an\naccuracy/precision/recall curve for the resulting model, which can be used for\ndiagnostics.\n\nThe resulting model can be loaded using `pickle` or the\n`followthemoney_compare.models.GLMBernouli2EEvaluate.load_pickles` method. This\nmodel file is a reduced version of the trained model which is ideal for fast\nevaluation with minimal dependencies and resource overhead. However, it also\nlacks the diagnostic and intermediary variables used for the training of the\nmodel. As a result, when creating a new model type it is probably best to train\nthe models using the python API and to only use the CLI tool when training a\nknown model.\n\nUsing the resulting evaluation object is simple and flexible. 
It\nprovides the following methods:\n\n- predict(): returns True / False representing whether the arguments are or\n  aren't matches\n- predict_proba(): returns a probability from (0, 1) that the arguments are\n  matches\n- predict_std(): returns a standard deviation, or confidence, of the prediction\n  (higher means less confidence)\n- predict_proba_std(): returns both the match probability and the standard\n  deviation faster than calling both methods individually (not all models have\n  this)\n\nThe arguments to these functions can take the following forms:\n\n- DataFrame: a DataFrame in the same format as the one returned by the\n  `create-data` command\n- dict: a dictionary from the output of\n  `followthemoney_compare.compare.scores()`\n- list of proxy pairs: a tuple of two `followthemoney.proxy.EntityProxy`\n  objects or a list of these pairs.\n\n\n## Model Descriptions\n\n### Sample Weighting\n\nIn order to help alleviate potential noise in our training data, each sample is\nweighted. The weights have two contributions: the user weight and the sample\nweight.\n\nThe user weight applies a weight to all judgements made by a user based on how\nmany judgements they submitted. This weighting prefers users who have made 100+\nsubmissions and gradually down-weights users who have made substantially fewer\n(code in `followthemoney_compare.lib.utils.user_weight()`).\n\nThe sample weight looks at the potential information content in the entity\npairing. It down-weights samples that are trivially the same or trivially\ndifferent (i.e. two entities where all properties are exactly the same or\ncompletely different). 
It does this by taking the average score from\n`compare.scores()` and down-weighting pairs whose average score is far from the\n0.25 - 0.7 range (code in `followthemoney_compare.lib.utils.pair_weight()`).\n\nThe product of these two weights is a sample's effective weight, which is\nused in the models.\n\n### GLM Bernoulli 2E\n\nThis model uses [pymc3](https://docs.pymc.io/) to fit a model using\n[MCMC](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo). As input, the\nmodel uses the output of `followthemoney_compare.compare.scores`, which\ncompares followthemoney property groups between two entities, in addition to\nthe auxiliary variables which show how many properties are shared by both\nentities and how many are just in one entity.\n\nThe following is a list of features used in the model. The value for `name`,\nfor example, is the numerical value in (0, 1) returned by\n`followthemoney_compare.compare.scores` representing the similarity of the two\nentities' \"name\" properties.\n\n- name\n- country\n- date\n- identifier\n- address\n- phone\n- email\n- iban\n- url\n- pct_share_prop: percentage of possible properties shared by the two entities\n- pct_miss_prop: percentage of possible properties that only one entity has\n- pct_share_prop^2\n- name * pct_share_prop\n- name^2\n- pct_share_prop * pct_miss_prop\n- pct_miss_prop^2\n- name * identifier\n- country * pct_share_prop\n- identifier^2\n- identifier * pct_miss_prop\n- date^2\n- address^2\n\nAll these features are fed into a logistic regression with a bias and fit using\nthe sample weights to help remove noise.\n\nWhen a model of this type is trained through the CLI, a summary of the MCMC\nprocess is displayed before exiting. Some things to look for to make sure the\nmodel performed well:\n\n- The SD (standard deviation) of the parameters should be low. 
Any variables\n  with a high standard deviation were not particularly useful for the\n  classification and should be reconsidered.\n- The `ess_bulk` field should be reasonably high. This field shows the\n  effective number of samples used to fit this parameter. If it is quite low,\n  then your data isn't well represented by the model or the training data is\n  too noisy.\n- Inspect the accuracy-precision-recall curve and make sure the model is\n  sensible.\n\n\n## Improvements\n\n- [ ] Parallelize training data creation\n- [ ] Better test/train split (stratified group sampling on collection id?\n      k-folds?)\n- [ ] Better feature engineering or deep learning models?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falephdata%2Ffollowthemoney-compare","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falephdata%2Ffollowthemoney-compare","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falephdata%2Ffollowthemoney-compare/lists"}