https://github.com/nickcrews/mismo
The SQL/Ibis powered sklearn of record linkage
deduplication duckdb entity-resolution ibis python record-linkage sql
- Host: GitHub
- URL: https://github.com/nickcrews/mismo
- Owner: NickCrews
- License: lgpl-3.0
- Created: 2022-06-10T07:12:41.000Z
- Default Branch: main
- Last Pushed: 2024-10-29T22:52:54.000Z
- Last Synced: 2024-10-30T00:47:36.006Z
- Topics: deduplication, duckdb, entity-resolution, ibis, python, record-linkage, sql
- Language: Python
- Homepage: https://nickcrews.github.io/mismo/
- Size: 3.67 MB
- Stars: 14
- Watchers: 1
- Forks: 3
- Open Issues: 29
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- License: LICENSE.txt
- Code of conduct: CODE_OF_CONDUCT.md
README
# Mismo
The SQL/Ibis powered sklearn of record linkage.
Mismo is still in alpha. Breaking changes will happen frequently
and without warning. Once things have stabilized, I
will come up with a stability policy. Any suggestions on
how you would like the API to look would be greatly appreciated.
I do use this in my work, so I at least do a decent job of
ensuring correctness.
-----
## Goals
Mismo tries to be the sklearn of record linkage, backed by the scalability
and power of SQL and [Ibis](https://ibis-project.org/). It is made of many small
data structures and functions, each with a well-defined and standard API
that allows them to be composed together and extended easily.
None of the other record linkage packages I have seen, such as
[Splink](https://github.com/moj-analytical-services/splink),
[Dedupe](https://www.github.com/dedupeio/dedupe), or
[Record Linkage Toolkit](https://github.com/J535D165/recordlinkage),
had all of these properties, so I decided to make my own.
See [Goals and Alternatives](https://nickcrews.github.io/mismo/concepts/goals_and_alternatives)
for a more detailed discussion of the goals of Mismo and how it compares to other
record linkage packages.
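To make the "small, composable pieces" goal concrete, here is a minimal sketch in plain Python. It is not Mismo's API: the names `block_on`, `same_text`, and `link` are hypothetical, and the records live in an in-memory list rather than an Ibis table. It only illustrates the design idea that each step has a fixed signature, so steps can be swapped independently.

```python
from itertools import combinations
from typing import Callable, Iterable

Record = dict[str, str]
Pair = tuple[Record, Record]

# Hypothetical step types (NOT Mismo's API). Each step is just a function
# with a fixed signature, so any blocker works with any comparer.
Blocker = Callable[[list[Record]], Iterable[Pair]]
Comparer = Callable[[Record, Record], float]


def block_on(key: str) -> Blocker:
    """Build a blocker that only pairs records sharing the same `key` value."""
    def blocker(records: list[Record]) -> Iterable[Pair]:
        for a, b in combinations(records, 2):
            if a[key] == b[key]:
                yield a, b
    return blocker


def same_text(field: str) -> Comparer:
    """Build a comparer scoring 1.0 on a case-insensitive match, else 0.0."""
    def comparer(a: Record, b: Record) -> float:
        return 1.0 if a[field].lower() == b[field].lower() else 0.0
    return comparer


def link(
    records: list[Record],
    blocker: Blocker,
    comparer: Comparer,
    threshold: float = 0.5,
) -> list[tuple[str, str]]:
    """Compose a blocker and a comparer into a list of matched id pairs."""
    return [
        (a["id"], b["id"])
        for a, b in blocker(records)
        if comparer(a, b) >= threshold
    ]


records = [
    {"id": "1", "zip": "99501", "name": "Ada Lovelace"},
    {"id": "2", "zip": "99501", "name": "ada lovelace"},
    {"id": "3", "zip": "99501", "name": "Alan Turing"},
    {"id": "4", "zip": "10001", "name": "Ada Lovelace"},
]
matches = link(records, block_on("zip"), same_text("name"))
```

Swapping `block_on("zip")` for a different blocker, or `same_text("name")` for a fuzzy comparer, requires no change to `link`. That is the kind of composability the paragraph above describes.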
## Features
- Supports larger-than-memory datasets, executed on powerful SQL engines.
Use DuckDB for prototyping and for jobs up to roughly 10M records,
or Spark or other distributed backends for larger tasks, without
needing to change your code!
- Use the clean, strongly typed, Pythonic DataFrame APIs of [Ibis](https://ibis-project.org/).
- Small, modular functions and data structures that are easy to plug together
and extend.
- Layered API: Use top-level APIs if your task is common enough that it is
supported out of the box.
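As a rough illustration of why pushing the work into a SQL engine helps, the sketch below expresses the candidate-pair step of record linkage as a single self-join. It uses Python's built-in `sqlite3` purely as a stand-in engine so that it runs without extra dependencies. It does not use Mismo, Ibis, or DuckDB, and the `people` table and its columns are made up for the example.

```python
import sqlite3

# In-memory SQLite database, standing in for a real analytical engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, zip TEXT)")
con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "Ada Lovelace", "99501"),
        (2, "ada lovelace", "99501"),
        (3, "Alan Turing", "99501"),
        (4, "Ada Lovelace", "10001"),
    ],
)

# Blocking as a self-join: only rows that share a zip code are paired,
# and `l.id < r.id` keeps each unordered pair exactly once.
# The comparison (case-insensitive name equality) is also computed in SQL.
candidate_pairs = con.execute(
    """
    SELECT l.id, r.id, lower(l.name) = lower(r.name) AS same_name
    FROM people AS l
    JOIN people AS r ON l.zip = r.zip AND l.id < r.id
    ORDER BY l.id, r.id
    """
).fetchall()

for left_id, right_id, same_name in candidate_pairs:
    print(left_id, right_id, bool(same_name))
```

Because the pairing and comparison are both stated as a query, the engine decides how to execute them, and the Python process never has to hold every pair in memory at once.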
## Installation
[`mismo` is available on PyPI](https://pypi.org/project/mismo/).
I try to publish semver'ed releases after most changes.
If I forget to do this, then there are also [prereleases on PyPI](https://pypi.org/project/mismo/#history).
These are published every week by a GitHub Action using the HEAD commit of this repo.
You can also install directly from a branch or a specific commit on GitHub:
```console
uv pip install "mismo[viz] @ git+https://github.com/NickCrews/mismo@"
```
## Examples
See the [example notebook](https://nickcrews.github.io/mismo/examples/patent_deduplication).
## Documentation
See the [documentation](https://nickcrews.github.io/mismo).
## Contributing
See the [contributing guide](https://nickcrews.github.io/mismo/contributing/).
## License
`mismo` is distributed under the terms of the
[LGPL-3.0-or-later](https://spdx.org/licenses/LGPL-3.0-or-later.html) license.