https://github.com/schemaitat/polars_sim

Fast approximate joins on string columns for polars dataframes.

cosine-similarity join polars rust sparse-matrices

Host: GitHub
URL: https://github.com/schemaitat/polars_sim
Owner: schemaitat
License: mit
Created: 2024-09-26T18:31:39.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-10-23T20:58:16.000Z (8 months ago)
Last Synced: 2024-10-24T08:34:34.908Z (8 months ago)
Topics: cosine-similarity, join, polars, rust, sparse-matrices
Language: Rust
Homepage:
Size: 113 KB
Stars: 6
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Awesome Lists containing this project

trackawesomelist - polars\_sim (⭐4) - Polars plugin that implements fast approximate joins on string columns for polars dataframes by [@schemaitat](https://github.com/schemaitat). (Recently Updated / [Oct 01, 2024](/content/2024/10/01/README.md))
awesome-polars - polars_sim - Polars plugin that implements fast approximate joins on string columns for polars dataframes by [@schemaitat](https://github.com/schemaitat). (Libraries/Packages/Scripts / Polars plugins)

README

# polars_sim

## Description

Implements an **approximate join** of two polars dataframes based on string columns.

Right now, we use a fixed vectorization, which is applied to the strings on the fly. The resulting vectors are then used in a sparse matrix multiplication combined with a top-n selection, which produces the cosine similarities of the best-matching string pairs.

The `join_sim` function is similar to a left join or `join_asof` but for strings instead of timestamps.
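The pipeline described above (vectorize, multiply, keep the top n) can be sketched in plain Python. This is only an illustration, not the library's Rust implementation: the source does not say which vectorization is used, so the character 3-grams taken within whitespace-separated words are an assumption, and the helper names `trigrams`, `normalize` and `top_n_matches` are hypothetical. Dictionaries stand in for sparse vectors.

```python
from collections import Counter
from math import sqrt


def trigrams(text: str) -> Counter:
    """Count character 3-grams inside each word (assumed vectorization)."""
    grams = Counter()
    for word in text.split():
        for i in range(len(word) - 2):
            grams[word[i : i + 3]] += 1
    return grams


def normalize(vec: Counter) -> dict:
    """Scale a sparse count vector to unit L2 length."""
    norm = sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def top_n_matches(left, right, top_n=2):
    """For each left string, keep up to top_n right strings with non-zero similarity."""
    right_vecs = [normalize(trigrams(s)) for s in right]
    result = {}
    for s in left:
        lv = normalize(trigrams(s))
        scores = []
        for r, rv in zip(right, right_vecs):
            # Sparse dot product of two unit vectors = cosine similarity.
            sim = sum(w * rv.get(g, 0.0) for g, w in lv.items())
            if sim > 0.0:
                scores.append((sim, r))
        scores.sort(key=lambda pair: -pair[0])  # best match first
        result[s] = scores[:top_n]
    return result


matches = top_n_matches(
    ["alice", "bob"],
    ["ali", "alice in wonderland", "bobby", "tom"],
)
for name, hits in matches.items():
    print(name, hits)
```

Under this assumed vectorization, the sketch yields the same similarity values as the example in the Usage section, but that does not confirm the library works this way internally.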

## Installation

```bash
pip install polars_sim
```

## Development

We use [uv](https://docs.astral.sh/uv/) for Python package management. Furthermore, you need Rust to be installed, see [install rust](https://www.rust-lang.org/tools/install). You won't need to activate an environment yourself at any point; this is handled by uv. To get started, run

```bash
# install python dependencies and compile the rust code
make install

# run tests
make test
```

## Usage

```python
import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

# approximate join of the two dataframes on the "name" column
df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

print(df)
# shape: (3, 3)
# ┌───────┬──────────┬─────────────────────┐
# │ name  ┆ sim      ┆ name_right          │
# │ ---   ┆ ---      ┆ ---                 │
# │ str   ┆ f32      ┆ str                 │
# ╞═══════╪══════════╪═════════════════════╡
# │ alice ┆ 0.57735  ┆ ali                 │
# │ alice ┆ 0.522233 ┆ alice in wonderland │
# │ bob   ┆ 0.57735  ┆ bobby               │
# └───────┴──────────┴─────────────────────┘
```

## Performance

A benchmark can be executed with `make run-bench`.

In general, the performance heavily depends on the length of the dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much bigger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.

If no normalization is applied, the performance is usually better, since a small unsigned integer type (e.g. `u16`) is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
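The following sketch illustrates why skipping normalization allows integer arithmetic. It assumes, without confirmation from the source, that the vectors hold character 3-gram counts and that normalization means scaling to unit L2 length. The helper names `trigram_counts`, `raw_score` and `cosine` are hypothetical. Without normalization, the score of a pair is just a sum of products of counts, so it stays an integer. The cosine similarity divides by the vector norms and therefore needs floating-point values.

```python
from collections import Counter
from math import sqrt


def trigram_counts(text: str) -> Counter:
    """Count character 3-grams inside each word (assumed vectorization)."""
    grams = Counter()
    for word in text.split():
        for i in range(len(word) - 2):
            grams[word[i : i + 3]] += 1
    return grams


def raw_score(a: Counter, b: Counter) -> int:
    """Unnormalized dot product: integer counts in, integer out."""
    return sum(count * b[gram] for gram, count in a.items())


def cosine(a: Counter, b: Counter) -> float:
    """Normalized dot product: needs floats because of the L2 norms."""
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return raw_score(a, b) / norm if norm else 0.0


left = trigram_counts("alice")
right = trigram_counts("alice in wonderland")

raw = raw_score(left, right)  # number of shared 3-grams, an int
cos = cosine(left, right)     # cosine similarity, a float
print(raw, cos)
```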

## References

The implementation is based on an algorithm used in [sparse_dot_topn](https://github.com/ing-bank/sparse_dot_topn), which itself is an improvement of the scipy sparse matrix multiplication.
