Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/ppw0/minhash

find similar text files quickly
https://github.com/ppw0/minhash

jaccard-similarity minhash mining-massive-datasets python

Last synced: about 2 months ago
JSON representation

find similar text files quickly

Host: GitHub
URL: https://github.com/ppw0/minhash
Owner: ppw0
License: apache-2.0
Created: 2018-09-24T08:36:18.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2021-05-09T11:28:17.000Z (about 3 years ago)
Last Synced: 2024-02-05T09:04:46.837Z (5 months ago)
Topics: jaccard-similarity, minhash, mining-massive-datasets, python
Language: Python
Homepage:
Size: 53.7 KB
Stars: 6
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

project-awesome - ppw0/minhash - find similar text files quickly (Python)

README

        # minhash

- `group.py` helper module

- `seqmatch.py` uses sequence matching to find similar text files

    - `seqmatch_m.py` is the multithreaded version

- `minhash.py` uses MinHash and Jaccard similarity estimation algorithms, based on http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/ 

    - `minhash_m.py` is the multiprocessing version, with shared memory access and simplified data structures (Unix only)

        - `minhash_m_init.py` is Windows-compatible; uses an initializer to preserve global variable states https://medium.com/@rvprasad/data-and-chunk-sizes-matter-when-using-multiprocessing-pool-map-in-python-5023c96875ef

        - `minhash_dss.py` uses delimited search space

        - `minhash_v.py` is the vectorized version; the use of an initializer avoids synchronization stalls and memory bubbles with large inputs. Unix only, as changes to the NumPy array do not get carried back to the main process in Windows

# lessons learned

- a simple heuristic can significantly decrease running time while still finding a large majority of similar files

- if the input iterator is very large, using `map` will lead to a segmentation fault

- if a function returns results faster than they can be consumed, using it with `imap` and `imap_unordered` will cause a memory bubble

    - https://stackoverflow.com/questions/40922526/memory-usage-steadily-growing-for-multiprocessing-pool-imap-unordered

    - https://stackoverflow.com/questions/9862091/iteration-over-pool-imap-unordered

- `pypy3` is significantly faster than `python3`, but it did not support Numba

- minimize your inputs before passing them off to a function (e.g. grouping results uses `connected_components` from `networkx` which had high space complexity)

- vectorize, if possible

# measurements

a random selection of non-duplicate text files of varying sizes (0-1767kb):

|        script       |  interpreter  | 1500       | 5000     |   10000   |   20000    |    50000   |

| :------------------ | ------------- | :--------- | :------- | :-------- | :--------- | :--------- |

| `seqmatch.py`       | `python3`     | `2503.13s` |          |           |            |            |

| `seqmatch_m.py`     | `python3`     | `658.70s`  |          |           |            |            |

| `minhash.py`        | `pypy3`       | `10.52s`   | `67.52s` | `266.74s` | `1060.77s` | `8278.96s` |

| `minhash_m.py`      | `pypy3`       | `7.87s`    | `51.07s` | `190.73s` | `859.66s`  | `4094.37s` |

| `minhash_m_init.py` | `pypy3`       | `7.56s`    | `52.22s` | `191.57s` | `680.1s`   | `5070.27s` |

| `minhash_dss.py`    | `pypy3`       | `5.33s`    | `22.19s` | `75.49s`  | `189.22s`  | `1019.88s` |

| `minhash_v.py`      | `python3`     | `3.06s`    | `4.19s`  | `6.19s`   | `13.51s`   | `59.42s`   |

![measurements](https://github.com/ppw0/minhash/blob/master/img/measurements.png)