https://github.com/ppw0/minhash

find similar text files quickly
jaccard-similarity minhash mining-massive-datasets python


# minhash
- `group.py` is a helper module
- `seqmatch.py` uses sequence matching to find similar text files
- `seqmatch_m.py` is the multithreaded version
- `minhash.py` uses MinHash signatures to estimate Jaccard similarity, based on http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/ (a minimal sketch of the idea follows this list)
- `minhash_m.py` is the multiprocessing version, with shared memory access and simplified data structures (Unix only)
- `minhash_m_init.py` is Windows-compatible; it uses a pool initializer to preserve global variable state in each worker, see https://medium.com/@rvprasad/data-and-chunk-sizes-matter-when-using-multiprocessing-pool-map-in-python-5023c96875ef
- `minhash_dss.py` uses delimited search space
- `minhash_v.py` is the vectorized version; using an initializer avoids synchronization stalls and memory bubbles with large inputs. Unix only, as changes to the NumPy array are not carried back to the main process on Windows
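
The MinHash scripts build on the idea sketched below: a MinHash signature is the component-wise minimum of many random hash functions applied to a document's shingles, and the fraction of matching components between two signatures estimates their Jaccard similarity. This is only a rough illustration of the McCormick tutorial linked above; the 3-word shingles, 128 hash functions, and CRC32 shingle hashing are illustrative choices, not necessarily what the scripts use:

```python
import random
import zlib

NUM_HASHES = 128
MAX_HASH = (1 << 32) - 1
NEXT_PRIME = 4294967311  # a prime just above 2**32, as used in the McCormick tutorial

# one (a, b) pair per hash function: h(x) = (a*x + b) % NEXT_PRIME
COEFFS = [(random.randint(1, MAX_HASH), random.randint(0, MAX_HASH))
          for _ in range(NUM_HASHES)]

def shingles(text, k=3):
    """Hash every k-word shingle of the text to a 32-bit integer."""
    words = text.split()
    return {zlib.crc32(" ".join(words[i:i + k]).encode())
            for i in range(max(len(words) - k + 1, 1))}

def signature(shingle_set):
    """MinHash signature: per hash function, the minimum hash over all shingles."""
    return [min((a * s + b) % NEXT_PRIME for s in shingle_set)
            for a, b in COEFFS]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching components approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    a = signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = signature(shingles("the quick brown fox leaps over the lazy dog"))
    print(estimated_jaccard(a, b))
```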

# lessons learned
- a simple heuristic can significantly decrease running time while still finding a large majority of similar files
- if the input iterator is very large, using `map` will lead to a segmentation fault (`Pool.map` converts the whole iterable to a list before dispatching any work)
- if a function returns results faster than they can be consumed, using it with `imap` or `imap_unordered` will cause a memory bubble, as unconsumed results pile up in the pool's internal buffer (see the sketch after this list):
  - https://stackoverflow.com/questions/40922526/memory-usage-steadily-growing-for-multiprocessing-pool-imap-unordered
  - https://stackoverflow.com/questions/9862091/iteration-over-pool-imap-unordered
- `pypy3` is significantly faster than `python3`, but it does not support Numba
- minimize your inputs before passing them to a function; e.g. grouping results uses `connected_components` from `networkx`, which has high space complexity (a grouping sketch follows this list)
- vectorize, if possible
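
The multiprocessing lessons above boil down to a pattern like the following sketch: a pool initializer copies the signatures into each worker once instead of pickling them with every task, and `imap_unordered` output is consumed as it arrives (with a chunksize to cut per-task overhead), so the internal result buffer stays flat. The function names and the 0.5 threshold are illustrative, not the actual interface of the scripts in this repo:

```python
from itertools import combinations
from multiprocessing import Pool

_signatures = None  # populated once per worker by the initializer

def _init(signatures):
    global _signatures
    _signatures = signatures

def _compare(pair):
    """Estimated Jaccard similarity for one pair of signature indices."""
    i, j = pair
    sig_a, sig_b = _signatures[i], _signatures[j]
    score = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    return i, j, score

def similar_pairs(signatures, threshold=0.5, workers=4):
    pairs = combinations(range(len(signatures)), 2)
    with Pool(workers, initializer=_init, initargs=(signatures,)) as pool:
        # consuming imap_unordered immediately keeps memory flat even when
        # workers produce results faster than the caller can use them
        for i, j, score in pool.imap_unordered(_compare, pairs, chunksize=256):
            if score >= threshold:
                yield i, j, score
```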
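
And a small sketch of the grouping step: only pairs that already passed the similarity threshold become edges, so the graph handed to `connected_components` grows with the number of matches rather than with all file pairs. Again, the names are illustrative:

```python
import networkx as nx

def group_similar(pairs):
    """pairs: iterable of (i, j, score) for files already deemed similar."""
    graph = nx.Graph()
    graph.add_edges_from((i, j) for i, j, _ in pairs)
    # each connected component is one group of (transitively) similar files
    return [sorted(component) for component in nx.connected_components(graph)]

# example: files 0, 1, 2 form one group; 5 and 7 form another
print(group_similar([(0, 1, 0.9), (1, 2, 0.8), (5, 7, 0.95)]))
```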

# measurements
runtimes (in seconds) on a random selection of non-duplicate text files of varying sizes (0–1767 KB), for increasing numbers of input files:

| script | interpreter | 1500 | 5000 | 10000 | 20000 | 50000 |
| :------------------ | ------------- | :--------- | :------- | :-------- | :--------- | :--------- |
| `seqmatch.py` | `python3` | `2503.13s` | | | | |
| `seqmatch_m.py` | `python3` | `658.70s` | | | | |
| `minhash.py` | `pypy3` | `10.52s` | `67.52s` | `266.74s` | `1060.77s` | `8278.96s` |
| `minhash_m.py` | `pypy3` | `7.87s` | `51.07s` | `190.73s` | `859.66s` | `4094.37s` |
| `minhash_m_init.py` | `pypy3` | `7.56s` | `52.22s` | `191.57s` | `680.1s` | `5070.27s` |
| `minhash_dss.py` | `pypy3` | `5.33s` | `22.19s` | `75.49s` | `189.22s` | `1019.88s` |
| `minhash_v.py` | `python3` | `3.06s` | `4.19s` | `6.19s` | `13.51s` | `59.42s` |

![measurements](https://github.com/ppw0/minhash/blob/master/img/measurements.png)