https://github.com/rtmigo/gifts_py
Search for most relevant documents containing words from query. Pure Python implementation without dependencies
- Host: GitHub
- URL: https://github.com/rtmigo/gifts_py
- Owner: rtmigo
- License: MIT
- Created: 2022-01-07T13:02:59.000Z (almost 4 years ago)
- Default Branch: staging
- Last Pushed: 2022-01-09T22:47:38.000Z (over 3 years ago)
- Last Synced: 2025-01-21T15:32:19.366Z (9 months ago)
- Topics: cosine-similarity, full-text-search, information-retrieval, python, text-mining, tf-idf
- Language: Python
- Homepage:
- Size: 53.7 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# [gifts](https://github.com/rtmigo/gifts_py)
Searching for elements that share features with the query.
```python3
query = ['A', 'B']

elements = [
    ['N', 'A', 'M'],  # common features: 'A'
    ['C', 'B', 'A'],  # common features: 'A', 'B'
    ['X', 'Y']        # no common features
]
```

In this case, the search will return `['C', 'B', 'A']` and `['N', 'A', 'M']`,
in that particular order.

## Use for full-text search
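The `fts.add(...)` and `fts.search(...)` calls in this section take lists of words rather than raw strings, so raw text has to be tokenized first. A minimal tokenizer sketch for producing such lists (the regex and the `tokenize` name are illustrative, not part of the `gifts` API):

```python3
import re

def tokenize(text: str) -> list[str]:
    # Illustrative helper (not part of the gifts API): lowercase the
    # text and extract runs of letters and digits as word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

# e.g. tokenize("Wait, Mister Postman!") -> ['wait', 'mister', 'postman']
```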
Finding documents that contain words from the query.
```python3
from gifts import SmoothFts

fts = SmoothFts()

fts.add(["wait", "mister", "postman"], doc_id="doc1")
fts.add(["please", "mister", "postman", "look", "and", "see"], doc_id="doc2")
fts.add(["oh", "yes", "wait", "a", "minute", "mister", "postman"], doc_id="doc3")

# print IDs of documents in which at least one word of the query occurs,
# starting with the most relevant matches
for doc_id in fts.search(['postman', 'wait']):
    print(doc_id)
```

## Use for abstract data mining
In the examples above, the words were literally words as strings. But they can
be any objects suitable as `dict` keys.

```python3
from gifts import SmoothFts

fts = SmoothFts()
fts.add([3, 1, 4, 1, 5, 9, 2], doc_id="doc1")
fts.add([6, 5, 3, 5], doc_id="doc2")
fts.add([8, 9, 7, 9, 3, 2], doc_id="doc3")

for doc_id in fts.search([5, 3, 7]):
    print(doc_id)
```

## Implementation details
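To make the ranking factors listed below concrete, here is a hedged, self-contained sketch of logarithmic tf-idf weighting combined with cosine-similarity scoring. This is illustrative code, not the actual `gifts` implementation, and all names in it are made up:

```python3
import math
from collections import Counter

def rank(docs: dict, query: list) -> list:
    """Rank doc ids by cosine similarity of log tf-idf vectors.

    Illustrative sketch only -- not the actual gifts implementation.
    docs maps doc_id -> list of words; query is a list of words.
    """
    n = len(docs)
    # document frequency: how many documents contain each word
    df = Counter()
    for words in docs.values():
        df.update(set(words))

    def vector(words):
        # logarithmic term frequency scaled by inverse document frequency
        tf = Counter(words)
        return {w: (1 + math.log(c)) * math.log(n / df[w])
                for w, c in tf.items() if df[w]}

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    q = vector(query)
    scored = [(cosine(q, vector(ws)), doc_id) for doc_id, ws in docs.items()]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

# rank({"d1": ["wait", "postman"], "d2": ["look", "see"]}, ["postman"]) -> ["d1"]
```

In this sketch a document scores higher when it matches more query words, when those words are rare in the corpus, and when they occur often within the document.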
When ranking the results, the algorithm takes into account:
- the number of matching words
- the rarity of such words in the database
- the frequency of occurrence of words in the document

### SmoothFts

```python3
from gifts import SmoothFts
```

It uses logarithmic [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for
weighting the words
and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
for scoring the matches.

### SimpleFts

```python3
from gifts import SimpleFts
```

Minimalistic approach: weigh, multiply, compare. This object is noticeably
faster than `SmoothFts`.

## Install
### pip
```bash
pip3 install git+https://github.com/rtmigo/gifts_py#egg=gifts
```

### setup.py
```python3
install_requires = [
"gifts@ git+https://github.com/rtmigo/gifts_py"
]
```

## See also
The [skifts](https://github.com/rtmigo/skifts_py#readme) package
does the same search, but uses [scikit-learn](https://scikit-learn.org) and
[numpy](https://numpy.org/) for better performance. It is literally hundreds
of times faster.