{"id":16598325,"url":"https://github.com/robinl/array_tf_ideas","last_synced_at":"2025-03-07T04:59:00.726Z","repository":{"id":225684080,"uuid":"766564099","full_name":"RobinL/array_tf_ideas","owner":"RobinL","description":null,"archived":false,"fork":false,"pushed_at":"2024-03-04T09:49:07.000Z","size":26,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-17T05:44:13.668Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RobinL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-03T16:03:36.000Z","updated_at":"2024-03-03T17:07:04.000Z","dependencies_parsed_at":"2024-03-03T18:25:01.596Z","dependency_job_id":"fb44ab31-f3a2-42df-b5ba-7d443184788d","html_url":"https://github.com/RobinL/array_tf_ideas","commit_stats":null,"previous_names":["robinl/array_tf_ideas"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Farray_tf_ideas","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Farray_tf_ideas/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Farray_tf_ideas/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Farray_tf_ideas/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RobinL","download_url":"https://codeload.github.com/RobinL/array_tf_idea
s/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242332574,"owners_count":20110345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T00:08:22.674Z","updated_at":"2025-03-07T04:59:00.687Z","avatar_url":"https://github.com/RobinL.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"The main Splink API allows term frequency adjustments to be applied to any column, but these adjustments are based on exact matches on the column (see [here](https://github.com/moj-analytical-services/splink/issues/2006#issuecomment-1975101233)).\n\nThey are designed to be applied to columns such as first name, so that e.g. `Robin` vs `Robin` gets a higher match weight than `John` vs `John`.\n\nIt is harder to conceive of how term frequency adjustments should work in the case of array-based columns, because we're typically looking for array intersections as opposed to exact matches. In that case, we want the term frequency adjustment to be based on token frequencies.\n\n## Proposal\n\nA fully working example of the following proposal can be found [here](https://github.com/RobinL/array_tf_ideas/blob/main/splink_with_arr.py). A script that obtains the data and then performs a step-by-step derivation of the cleaning and array reduction steps can be found [here](https://github.com/RobinL/array_tf_ideas/blob/main/arr_idea.py).\n\nThe following outlines the steps:\n\nConsider, for example, the task of matching company names. We may have:\n\n`POSEIPORT MARINA MGT. 
LIMITED`\nvs\n`POSEIPORT MARINA MANAGEMENT LTD`\n\nWe want the match score to account for the match on the highly unusual token `POSEIPORT`, and the somewhat unusual term `MARINA`. The other tokens are common and less important.\n\nWe could clean and tokenise these to an array like:\n\n```\n┌─────────────────────────────────┬──────────────────────────────────────┐\n│           CompanyName           │        company_name_tokenised        │\n│             varchar             │              varchar[]               │\n├─────────────────────────────────┼──────────────────────────────────────┤\n│ POSEIPORT MARINA MGT. LIMITED   │ [POSEIPORT, MARINA, MGT, LIMITED]    │\n│ POSEIPORT MARINA MANAGEMENT LTD │ [POSEIPORT, MARINA, MANAGEMENT, LTD] │\n└─────────────────────────────────┴──────────────────────────────────────┘\n```\n\nWe could then transform the array to include details of term frequencies like:\n\n```\n┌─────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐\n│           CompanyName           │                                                                                                                   token_relative_frequency_arr                                                                                                                    │\n│             varchar             │                                                                                                        struct(token varchar, relative_frequency double)[]                                                                                                         
│\n├─────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤\n│ POSEIPORT MARINA MGT. LIMITED   │ [{'token': POSEIPORT, 'relative_frequency': 6.122199093914534e-05}, {'token': MGT, 'relative_frequency': 3.061099546957267e-05}, {'token': MARINA, 'relative_frequency': 0.00021427696828700869}, {'token': LIMITED, 'relative_frequency': 0.20246112403575364}]  │\n│ POSEIPORT MARINA MANAGEMENT LTD │ [{'token': POSEIPORT, 'relative_frequency': 6.122199093914534e-05}, {'token': MANAGEMENT, 'relative_frequency': 0.04717154401861148}, {'token': LTD, 'relative_frequency': 0.09572058283335375}, {'token': MARINA, 'relative_frequency': 0.00021427696828700869}] │\n└─────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n\n```\n\ni.e. `POSEIPORT MARINA MGT. 
LIMITED `\n\nbecomes\n\n```\n[\n{'token': POSEIPORT, 'relative_frequency': 6.122199093914534e-05},\n{'token': MGT, 'relative_frequency': 3.061099546957267e-05},\n{'token': MARINA, 'relative_frequency': 0.00021427696828700869},\n{'token': LIMITED, 'relative_frequency': 0.20246112403575364}\n]\n```\n\n## Use in Splink\n\nThe above data manipulation would be done as preprocessing steps before bringing the data into Splink.\n\nThey ensure that Splink has the raw information it needs to account for token frequencies when comparing the values.\n\nHow can we write comparisons which take account of token frequencies?\n\nThe following is just an idea - there's probably room for improvement but it does work.\n\n**Step 1:**\n\nTake an array intersect on the `token_relative_frequency_arr` column.\n\nResult:\n\n```\n[\n{'token': POSEIPORT, 'relative_frequency': 6.122199093914534e-05},\n{'token': MARINA, 'relative_frequency': 0.00021427696828700869},\n]\n```\n\n**Step 2:**\n\nPerform an array reduce, multiplying the `relative_frequency` column:\n\nCalculation: `1 * 6.122199093914534e-05 * 0.00021427696828700869`, where 1 is the starting value for the reduce\nResult: `1.311846261093478e-08`\n\nThe comparison levels could then be set up as something like:\n\n```\n  ├─-- Comparison: CompanyName\n    │    ├─-- ComparisonLevel: Exact match on full string with term frequency adjustments\n    │    ├─-- ComparisonLevel: array reduction of intersection of token_relative_frequency_arr  \u003c 1e-10\n    │    ├─-- ComparisonLevel: array reduction of intersection of token_relative_frequency_arr  \u003c 1e-8\n    │    ├─-- ComparisonLevel: array reduction of intersection of token_relative_frequency_arr  \u003c 1e-5\n    │    ├─-- ComparisonLevel: all other\n```\n\nAn example of the full sql for the comparison is:\n\n```\nLIST_REDUCE(\n  LIST_PREPEND(\n    1.0,\n    LIST_TRANSFORM(\n      FILTER(\n        token_relative_frequency_arr_l,\n        y -\u003e ARRAY_CONTAINS(\n          
ARRAY_INTERSECT(\n            LIST_TRANSFORM(token_relative_frequency_arr_l, x -\u003e x.token),\n            LIST_TRANSFORM(token_relative_frequency_arr_r, x -\u003e x.token)\n          ),\n          y.token\n        )\n      ),\n      x -\u003e x.relative_frequency\n    )\n  ),\n  (p, q) -\u003e p * q\n) \u003c 0.000001\n```\n\nA couple of notes on this statement:\n\n- `ARRAY_INTERSECT` does not work on a `struct`, so I had to work around it by intersecting the token lists and then filtering the struct array on the result\n- `LIST_REDUCE` needs a starting value, hence `LIST_PREPEND(1.0, ...)`\n\nIf array intersect did work on structs (which it might in future DuckDB releases), this could be phrased as:\n\n```\nLIST_REDUCE(\n  LIST_PREPEND(\n    1.0,\n    LIST_TRANSFORM(\n      ARRAY_INTERSECT(\n        token_relative_frequency_arr_l,\n        token_relative_frequency_arr_r\n      ),\n      x -\u003e x.relative_frequency\n    )\n  ),\n  (p, q) -\u003e p * q\n) \u003c 0.000001\n```\n\n## Splink results\n\nA first attempt at this [here](https://github.com/RobinL/array_tf_ideas/blob/main/splink_with_arr.py), using Companies House data, seems to give sensible results:\n\n![image](https://github.com/RobinL/array_tf_ideas/assets/2608005/7428e257-03b9-404e-b501-71fe65626b2c)\n\nHere are some companies that match with `match_probability = 0.9`:\n\nThese are not true matches (partly because I'm deduping a dataset that does not contain duplicates!), but they show that this technique seems to be working pretty well.\n\n![image](https://github.com/RobinL/array_tf_ideas/assets/2608005/80175f99-f22d-46b2-baf3-b92b250e080a)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobinl%2Farray_tf_ideas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobinl%2Farray_tf_ideas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobinl%2Farray_tf_ideas/lists"}