Fast Word Mover's Distance [![Build Status](https://travis-ci.com/src-d/wmd-relax.svg?branch=master)](https://travis-ci.com/src-d/wmd-relax) [![PyPI](https://img.shields.io/pypi/v/wmd.svg)](https://pypi.python.org/pypi/wmd) [![codecov](https://codecov.io/github/src-d/wmd-relax/coverage.svg)](https://codecov.io/gh/src-d/wmd-relax)
==========================

Calculates Word Mover's Distance as described in
[From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf)
by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.

*Word Mover's Distance (illustration)*

The high-level logic is written in Python; the low-level functions related to
linear programming are offloaded to the bundled native extension. The native
extension can also be built as a generic shared library, independent of Python.
**Python 2.7 and older are not supported.** The heavy lifting is done by
[google/or-tools](https://github.com/google/or-tools).

### Installation

```
pip3 install wmd
```
Tested on Linux and macOS.
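
A quick way to check that the package installed correctly is to import the `WMD` class used in the examples below:

```
python3 -c "from wmd import WMD"
```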

### Usage

You need a numpy array of embeddings and an nbow (normalized bag-of-words) model -
that is, every sample is a weighted set of items, and every item has an embedding.

```python
import numpy
from wmd import WMD

# Two 2-dimensional item embeddings; the row number is the item identifier.
embeddings = numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32)
# Each sample: (human-readable name, item identifiers, float32 weights).
nbow = {"first": ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))
```
```
[('second', 0.10606599599123001)]
```

`embeddings` must support `__getitem__`, which returns an item by its
identifier; in particular, `numpy.ndarray` matches that interface.
`nbow` must be iterable - yielding sample identifiers - and support
`__getitem__` by those identifiers, returning tuples of length 3.
The first element is the human-readable name of the sample, the
second is an iterable with item identifiers, and the third is a `numpy.ndarray`
with the corresponding weights. All numpy arrays must be float32. The return
value is a list of tuples with sample identifiers and relevancy
indices (the lower, the better).
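
As a sketch of that contract, `embeddings` does not have to be a `numpy.ndarray`
itself; any object whose `__getitem__` behaves the same way works. The
`ArrayProxy` class below is purely illustrative - it is not part of the package -
and simply forwards indexing to an underlying float32 matrix:

```python
import numpy
from wmd import WMD


class ArrayProxy:
    """Illustrative container: forwards __getitem__ to a float32 matrix.
    The backing store could be anything with the same indexing behavior."""
    def __init__(self, matrix):
        self._matrix = matrix

    def __getitem__(self, item_id):
        return self._matrix[item_id]


embeddings = ArrayProxy(numpy.array([[0.1, 1], [1, 0.1]], dtype=numpy.float32))
nbow = {"first": ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))  # same output as with the raw array
```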

It is possible to use this package with [spaCy](https://github.com/explosion/spaCy):

```python
import spacy
import wmd

nlp = spacy.load('en_core_web_md')
# Register the hook so that doc.similarity() uses Word Mover's Distance.
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
```

See also the [spaCy example](spacy_example.py), which finds similar Wikipedia
pages.

### Building from source

Either build it as a Python package:

```
pip3 install git+https://github.com/src-d/wmd-relax
```

or use CMake:

```
git clone --recursive https://github.com/src-d/wmd-relax
cd wmd-relax
cmake -D CMAKE_BUILD_TYPE=Release .
make -j
```

Please note the `--recursive` flag for `git clone`. This project uses source{d}'s
fork of [google/or-tools](https://github.com/google/or-tools) as a git submodule.

### Tests

Tests are in `test.py` and use the stock `unittest` package.
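
Assuming the repository root is the working directory and the package is importable, one way to run them is the standard `unittest` runner:

```
python3 -m unittest test
```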

### Documentation

```
cd doc
make html
```

The generated files are placed in the `doc/doxyhtml` and `doc/html` directories.

### Contributions

...are welcome! See [CONTRIBUTING](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md).

### License
[Apache 2.0](LICENSE.md)

#### README {#ignore_this_doxygen_anchor}