Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ppw0/minhash
find similar text files quickly
https://github.com/ppw0/minhash
jaccard-similarity minhash mining-massive-datasets python
Last synced: 17 days ago
JSON representation
find similar text files quickly
- Host: GitHub
- URL: https://github.com/ppw0/minhash
- Owner: ppw0
- License: apache-2.0
- Created: 2018-09-24T08:36:18.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-05-09T11:28:17.000Z (over 3 years ago)
- Last Synced: 2024-08-08T00:43:15.353Z (4 months ago)
- Topics: jaccard-similarity, minhash, mining-massive-datasets, python
- Language: Python
- Homepage:
- Size: 53.7 KB
- Stars: 6
- Watchers: 0
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- project-awesome - ppw0/minhash - find similar text files quickly (Python)
README
# minhash
- `group.py` helper module
- `seqmatch.py` uses sequence matching to find similar text files
- `seqmatch_m.py` is the multithreaded version
- `minhash.py` uses MinHash and Jaccard similarity estimation algorithms, based on http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
- `minhash_m.py` is the multiprocessing version, with shared memory access and simplified data structures (Unix only)
- `minhash_m_init.py` is Windows-compatible; uses an initializer to preserve global variable states https://medium.com/@rvprasad/data-and-chunk-sizes-matter-when-using-multiprocessing-pool-map-in-python-5023c96875ef
- `minhash_dss.py` uses delimited search space
- `minhash_v.py` is the vectorized version; the use of an initializer avoids synchronization stalls and memory bubbles with large inputs. Unix only, as changes to the NumPy array do not get carried back to the main process in Windows# lessons learned
- a simple heuristic can significantly decrease running time while still finding a large majority of similar files
- if the input iterator is very large, using `map` will lead to a segmentation fault
- if a function returns results faster than they can be consumed, using it with `imap` and `imap_unordered` will cause a memory bubble
- https://stackoverflow.com/questions/40922526/memory-usage-steadily-growing-for-multiprocessing-pool-imap-unordered
- https://stackoverflow.com/questions/9862091/iteration-over-pool-imap-unordered
- `pypy3` is significantly faster than `python3`, but it did not support Numba
- minimize your inputs before passing them off to a function (e.g. grouping results uses `connected_components` from `networkx` which had high space complexity)
- vectorize, if possible# measurements
a random selection of non-duplicate text files of varying sizes (0-1767kb):| script | interpreter | 1500 | 5000 | 10000 | 20000 | 50000 |
| :------------------ | ------------- | :--------- | :------- | :-------- | :--------- | :--------- |
| `seqmatch.py` | `python3` | `2503.13s` | | | | |
| `seqmatch_m.py` | `python3` | `658.70s` | | | | |
| `minhash.py` | `pypy3` | `10.52s` | `67.52s` | `266.74s` | `1060.77s` | `8278.96s` |
| `minhash_m.py` | `pypy3` | `7.87s` | `51.07s` | `190.73s` | `859.66s` | `4094.37s` |
| `minhash_m_init.py` | `pypy3` | `7.56s` | `52.22s` | `191.57s` | `680.1s` | `5070.27s` |
| `minhash_dss.py` | `pypy3` | `5.33s` | `22.19s` | `75.49s` | `189.22s` | `1019.88s` |
| `minhash_v.py` | `python3` | `3.06s` | `4.19s` | `6.19s` | `13.51s` | `59.42s` |![measurements](https://github.com/ppw0/minhash/blob/master/img/measurements.png)