https://github.com/rtmigo/gifts_py

Search for most relevant documents containing words from query. Pure Python implementation without dependencies
cosine-similarity full-text-search information-retrieval python text-mining tf-idf


Host: GitHub
URL: https://github.com/rtmigo/gifts_py
Owner: rtmigo
License: mit
Created: 2022-01-07T13:02:59.000Z (almost 4 years ago)
Default Branch: staging
Last Pushed: 2022-01-09T22:47:38.000Z (over 3 years ago)
Last Synced: 2025-01-21T15:32:19.366Z (9 months ago)
Topics: cosine-similarity, full-text-search, information-retrieval, python, text-mining, tf-idf
Language: Python
Homepage:
Size: 53.7 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# [gifts](https://github.com/rtmigo/gifts_py)

Searching for elements that have the common features with the query.

```python3
query = ['A', 'B']

elements = [
    ['N', 'A', 'M'],  # common features: 'A'
    ['C', 'B', 'A'],  # common features: 'A', 'B'
    ['X', 'Y']  # no common features
]
```

In this case, the search will return `['C', 'B', 'A']` and `['N', 'A', 'M']`, in that order.
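
To make the expected ordering concrete, here is a minimal sketch that ranks elements purely by how many features they share with the query. The `rank_by_overlap` function is a hypothetical helper written for this illustration, not part of the `gifts` API, and the real ranking also weighs rarity and frequency (see the implementation details below).

```python3
query = ['A', 'B']

elements = [
    ['N', 'A', 'M'],
    ['C', 'B', 'A'],
    ['X', 'Y'],
]


def rank_by_overlap(query, elements):
    """Return elements sharing at least one feature with the query, best first.

    Hypothetical helper for illustration only; not the library's algorithm.
    """
    wanted = set(query)
    scored = [(len(wanted & set(el)), el) for el in elements]
    # drop elements that have nothing in common with the query
    matches = [(n, el) for n, el in scored if n > 0]
    # list.sort is stable, so ties keep their original order
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [el for _, el in matches]


ranked = rank_by_overlap(query, elements)
print(ranked)  # [['C', 'B', 'A'], ['N', 'A', 'M']]
```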

## Use for full-text search

Finding documents that contain words from the query.

```python3
from gifts import SmoothFts

fts = SmoothFts()

fts.add(["wait", "mister", "postman"],
        doc_id="doc1")

fts.add(["please", "mister", "postman", "look", "and", "see"],
        doc_id="doc2")

fts.add(["oh", "yes", "wait", "a", "minute", "mister", "postman"],
        doc_id="doc3")

# print IDs of documents in which at least one word of the query occurs,
# starting with the most relevant matches
for doc_id in fts.search(['postman', 'wait']):
    print(doc_id)
```
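
The example above passes documents as ready-made word lists, so raw text has to be split into words first. One possible way to do that with the standard library is sketched below. The `tokenize` function is a hypothetical helper, not something `gifts` provides, and the lowercasing and the character set it keeps are assumptions you may want to change.

```python3
import re


def tokenize(text):
    """Lowercase the text and split it into words.

    Hypothetical helper: keeps ASCII letters, digits and apostrophes,
    and drops punctuation and whitespace.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


words = tokenize("Oh yes, wait a minute, Mister Postman!")
print(words)
# ['oh', 'yes', 'wait', 'a', 'minute', 'mister', 'postman']
```

The resulting list can be passed to `fts.add(words, doc_id=...)`. Applying the same function to the query text keeps both sides normalized in the same way.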

## Use for abstract data mining

In the examples above, the words were literal strings. But they can be any objects suitable as `dict` keys, that is, any hashable objects.

```python3
from gifts import SmoothFts

fts = SmoothFts()

fts.add([3, 1, 4, 1, 5, 9, 2], doc_id="doc1")
fts.add([6, 5, 3, 5], doc_id="doc2")
fts.add([8, 9, 7, 9, 3, 2], doc_id="doc3")

for doc_id in fts.search([5, 3, 7]):
    print(doc_id)
```
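
Since any hashable object can act as a "word", composite features work too. As an illustration, the sketch below builds word pairs as tuples. The `bigrams` function is a hypothetical helper, not part of `gifts`. Tuples are hashable and can be `dict` keys, while lists are not.

```python3
def bigrams(words):
    """Return adjacent word pairs as tuples (hypothetical helper)."""
    return list(zip(words, words[1:]))


features = bigrams(["wait", "mister", "postman"])
print(features)  # [('wait', 'mister'), ('mister', 'postman')]

# every feature is usable as a dict key
counts = {}
for feature in features:
    counts[feature] = counts.get(feature, 0) + 1
```

A list of such tuples could then be indexed in the same way as the lists of integers above.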

## Implementation details

When ranking the results, the algorithm takes into account:

- the number of matching words

- the rarity of such words in the database

- the frequency of occurrence of words in the document
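
To show what these three signals look like on the "postman" documents from the full-text example, here is a rough sketch that computes them directly. The names `document_frequency` and `signals` are hypothetical and exist only for this illustration. How the library actually combines the signals into a score is not shown here.

```python3
docs = {
    "doc1": ["wait", "mister", "postman"],
    "doc2": ["please", "mister", "postman", "look", "and", "see"],
    "doc3": ["oh", "yes", "wait", "a", "minute", "mister", "postman"],
}
query = ["postman", "wait"]


def document_frequency(word):
    """Number of documents containing the word; lower means rarer."""
    return sum(1 for words in docs.values() if word in words)


def signals(doc_id):
    """Collect the three raw ranking signals for one document."""
    words = docs[doc_id]
    matching = [w for w in set(query) if w in words]
    return {
        # how many distinct query words occur in the document
        "matching_words": len(matching),
        # rarity: in how many documents each matching word occurs
        "doc_frequency": {w: document_frequency(w) for w in matching},
        # share of the document's words taken up by each matching word
        "term_frequency": {w: words.count(w) / len(words) for w in matching},
    }


for doc_id in docs:
    print(doc_id, signals(doc_id))
```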

### SmoothFts

```python3
from gifts import SmoothFts
```

It uses logarithmic [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for weighting the words and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) for scoring the matches.
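
For readers unfamiliar with the two techniques, the sketch below implements one common textbook variant of logarithmic tf-idf with cosine similarity in plain Python. It is not the source code of `SmoothFts`: the smoothing constants and the names `idf`, `vector`, `cosine` and `search` are assumptions made for this illustration, so scores produced by the library may differ.

```python3
import math
from collections import Counter

docs = {
    "doc1": ["wait", "mister", "postman"],
    "doc2": ["please", "mister", "postman", "look", "and", "see"],
    "doc3": ["oh", "yes", "wait", "a", "minute", "mister", "postman"],
}


def idf(word):
    """Smoothed inverse document frequency: rarer words weigh more."""
    df = sum(1 for words in docs.values() if word in words)
    return math.log((1 + len(docs)) / (1 + df)) + 1


def vector(words):
    """Map each distinct word to (1 + log(count)) * idf."""
    return {w: (1 + math.log(c)) * idf(w) for w, c in Counter(words).items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[w] for w, weight in a.items() if w in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query):
    """Return IDs of documents with a non-zero score, best match first."""
    q = vector(query)
    scores = {doc_id: cosine(q, vector(words)) for doc_id, words in docs.items()}
    hits = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=scores.get, reverse=True)


print(search(["postman", "wait"]))  # ['doc1', 'doc3', 'doc2']
```

In this sketch `doc1` ranks first because both query words make up most of a short document, while `doc2` ranks last because it contains only "postman", which occurs in every document and therefore carries little weight.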

### SimpleFts

```python3
from gifts import SimpleFts
```

Minimalistic approach: weigh, multiply, compare. This object is noticeably faster than `SmoothFts`.

## Install

### pip

```bash
pip3 install git+https://github.com/rtmigo/gifts_py#egg=gifts
```

### setup.py

```python3
install_requires = [
    "gifts@ git+https://github.com/rtmigo/gifts_py"
]
```

## See also

The [skifts](https://github.com/rtmigo/skifts_py#readme) package does the same search, but uses [scikit-learn](https://scikit-learn.org) and [numpy](https://numpy.org/) for better performance. It is literally hundreds of times faster.
