{"id":16598316,"url":"https://github.com/robinl/clustering_in_sql","last_synced_at":"2025-03-07T04:58:46.564Z","repository":{"id":257528982,"uuid":"858540217","full_name":"RobinL/clustering_in_sql","owner":"RobinL","description":null,"archived":false,"fork":false,"pushed_at":"2024-09-17T17:45:59.000Z","size":35,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-17T05:44:14.275Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RobinL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-17T04:44:03.000Z","updated_at":"2024-09-17T17:46:02.000Z","dependencies_parsed_at":"2024-11-16T12:32:36.707Z","dependency_job_id":"7996d438-3fea-40c0-a858-6305e70e4e01","html_url":"https://github.com/RobinL/clustering_in_sql","commit_stats":null,"previous_names":["robinl/clustering_in_sql"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Fclustering_in_sql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Fclustering_in_sql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Fclustering_in_sql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Fclustering_in_sql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RobinL","download_url":"https://codeload.github.com/Ro
binL/clustering_in_sql/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242332574,"owners_count":20110345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T00:08:20.880Z","updated_at":"2025-03-07T04:58:46.541Z","avatar_url":"https://github.com/RobinL.png","language":"Python","readme":"# clustering_in_sql\n\nOur initial implementation came from the paper \"In-database connected component analysis\" by Harald Bögeholz, Michael Brand, and Radu-Alexandru Todor (https://arxiv.org/pdf/1802.09478.pdf).\n\n\u003e 'begin by choosing for each vertex (node) a representative by picking the vertex with the minimum id amongst itself and its neighbours'\n\u003e\n\u003e i.e. 
attach neighbours to nodes and find the minimum\n\u003e\n\u003e Note that, since the edges always have the lower id on the left hand side, we only need to join on unique_id_r and pick unique_id_l. That is to say, we only bother to find neighbours that are smaller than the node\n\u003e\n\u003e i.e. if we want to get the neighbours of node D, we don't need to get both D-\u003eC and D-\u003eE, we only need to bother getting D-\u003eC, because we know in advance this will have a lower minimum than D-\u003eE\n\n\nThe problem we have in Splink at the moment is that this approach needs a large number of iterations, up to about 42.\n\n[In-database connected component analysis](https://arxiv.org/pdf/1802.09478) proposes a new algorithm called randomized contraction.\n\n\u003e The key difference is that the simple breadth-first approach can have a linear number of iterations in the worst case, while the randomized contraction algorithm achieves a logarithmic expected number of iterations through clever use of randomization\n\nSo a ‘chain’ type graph of A -\u003e B -\u003e C -\u003e D -\u003e E takes 4 iterations with our current approach. More generally, if there are n links in the chain, it takes n iterations.\n\nWith the new approach, it’s much faster. For example, on a chain of 10,000 links, the new algorithm converges in 15 iterations.\n\nThis is a potential improvement to Splink.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobinl%2Fclustering_in_sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobinl%2Fclustering_in_sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobinl%2Fclustering_in_sql/lists"}