{"id":15617958,"url":"https://github.com/schemaitat/polars_sim","last_synced_at":"2025-10-13T19:31:07.270Z","repository":{"id":257808219,"uuid":"863692442","full_name":"schemaitat/polars_sim","owner":"schemaitat","description":"Fast approximate joins on string columns for polars dataframes.","archived":false,"fork":false,"pushed_at":"2024-10-23T20:58:16.000Z","size":116,"stargazers_count":13,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-30T17:38:46.734Z","etag":null,"topics":["cosine-similarity","join","polars","rust","sparse-matrices"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/schemaitat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-26T18:31:39.000Z","updated_at":"2025-01-25T22:30:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"5d4a2f30-68bd-4546-ad4a-006a52fb610f","html_url":"https://github.com/schemaitat/polars_sim","commit_stats":null,"previous_names":["schemaitat/polars_sim"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/schemaitat/polars_sim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schemaitat%2Fpolars_sim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schemaitat%2Fpolars_sim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schemaitat%2Fpolars_sim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schemaitat%2Fpolars_sim
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/schemaitat","download_url":"https://codeload.github.com/schemaitat/polars_sim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/schemaitat%2Fpolars_sim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279016920,"owners_count":26085888,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cosine-similarity","join","polars","rust","sparse-matrices"],"created_at":"2024-10-03T08:00:53.167Z","updated_at":"2025-10-13T19:31:06.928Z","avatar_url":"https://github.com/schemaitat.png","language":"Rust","readme":"# polars_sim\n\n## Description\n\nImplements an **approximate join** of two polars dataframes based on string columns.\n\n\nRight now, we use a fixed vectorization, which is applied on the fly and eventually\nused in a sparse matrix multiplication combined with a top-n selection. This produces\nthe cosine similarities of the individual string pairs.\n\nThe `join_sim` function is similar to a left join or `join_asof` but for strings instead of timestamps.\n\n## Installation\n\n```bash\npip install polars_sim\n```\n\n## Development\n\nWe use [uv](https://docs.astral.sh/uv/) for python package management. 
You also need Rust installed; see [install rust](https://www.rust-lang.org/tools/install). You never need to activate an environment yourself; uv handles this. To get started, run\n```bash\n# install python dependencies and compile the rust code\nmake install\n# run tests\nmake test\n```\n\n## Usage\n\n```python\nimport polars as pl\nimport polars_sim as ps\n\ndf_left = pl.DataFrame(\n    {\n        \"name\": [\"alice\", \"bob\", \"charlie\", \"david\"],\n    }\n)\n\ndf_right = pl.DataFrame(\n    {\n        \"name\": [\"ali\", \"alice in wonderland\", \"bobby\", \"tom\"],\n    }\n)\n\n# approximate join on the string column \"name\"\ndf = ps.join_sim(\n    df_left,\n    df_right,\n    on=\"name\",\n    top_n=4,\n)\n\nprint(df)\n\n# Output:\n# shape: (3, 3)\n# ┌───────┬──────────┬─────────────────────┐\n# │ name  ┆ sim      ┆ name_right          │\n# │ ---   ┆ ---      ┆ ---                 │\n# │ str   ┆ f32      ┆ str                 │\n# ╞═══════╪══════════╪═════════════════════╡\n# │ alice ┆ 0.57735  ┆ ali                 │\n# │ alice ┆ 0.522233 ┆ alice in wonderland │\n# │ bob   ┆ 0.57735  ┆ bobby               │\n# └───────┴──────────┴─────────────────────┘\n```\n\n# Performance\n\nA benchmark can be executed with `make run-bench`.\nIn general, the performance depends heavily on the length of the dataframes.\nBy default, the computation is parallelized over the left dataframe. However, several benchmarks\nshowed that if the right dataframe is much bigger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.\n\nIf no normalization is applied, the performance is usually better since a small uint type, e.g. u16, is used\nfor the sparse matrix multiplication. 
Otherwise, all types are 32 bits wide.\n\n# References\n\nThe implementation is based on an algorithm used in [sparse_dot_topn](https://github.com/ing-bank/sparse_dot_topn), which itself is an improvement on the SciPy sparse matrix multiplication.\n","funding_links":[],"categories":["Recently Updated","Libraries/Packages/Scripts"],"sub_categories":["[Oct 01, 2024](/content/2024/10/01/README.md)","Polars plugins"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschemaitat%2Fpolars_sim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschemaitat%2Fpolars_sim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschemaitat%2Fpolars_sim/lists"}