https://github.com/rtmigo/skifts_py
Search for the most relevant documents containing words from a query. Uses Scikit-learn and Numpy
- Host: GitHub
- URL: https://github.com/rtmigo/skifts_py
- Owner: rtmigo
- License: mit
- Created: 2022-01-08T21:50:30.000Z (about 3 years ago)
- Default Branch: staging
- Last Pushed: 2022-05-06T23:14:57.000Z (over 2 years ago)
- Last Synced: 2024-11-15T10:56:37.818Z (2 months ago)
- Topics: cosine-similarity, information-retrieval, numpy, python, scikit-learn, text-mining, tf-idf
- Language: Python
- Homepage: https://pypi.org/project/skifts/
- Size: 43.9 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[![PyPI version shields.io](https://img.shields.io/pypi/v/skifts.svg)](https://pypi.python.org/pypi/skifts/)
[![Generic badge](https://img.shields.io/badge/Python-3.6+-blue.svg)](#)
[![Generic badge](https://img.shields.io/badge/OS-Linux%20|%20macOS%20|%20Windows-blue.svg)](#)

# [skifts](https://github.com/rtmigo/skifts_py#readme)

Search for the most relevant documents containing words from the query.
```python3
query = ['A', 'B']

documents = [
    ['N', 'A', 'M'],  # matching features: 'A'
    ['C', 'B', 'A'],  # matching features: 'A', 'B'
    ['X', 'Y']        # no matching features
]
```

The search will return `['C', 'B', 'A']` and `['N', 'A', 'M']`, in that
particular order.

It's not necessarily about text. Words are just any `str` instances. Documents
are unordered collections of these `str`. We search for documents considering
frequency, rarity and match accuracy.

## Install

```bash
pip3 install skifts
```

Other options

### From GitHub (staging branch)
```bash
pip3 install git+https://github.com/rtmigo/skifts_py#egg=skifts
```

## Use for full-text search

Finding documents that contain words from the query.
```python3
from skifts import SkiFts

# three documents, one per row
documents = [
    ["wait", "mister", "postman"],
    ["please", "mister", "postman", "look", "and", "see"],
    ["oh", "yes", "wait", "a", "minute", "mister", "postman"]
]

fts = SkiFts(documents)

# find and print the most relevant documents:
for doc_index in fts.search(['postman', 'wait']):
    print(documents[doc_index])
```

Words inside the `documents` list are considered ready-made feature identifiers.
If your text needs preprocessing or stemming, this should be done separately.

The ranking takes into account the frequency of words in the document and the
rarity of words in the corpus. The word order in the document and the distance
between words do not matter.

## Implementation details

The search uses the [scikit-learn](https://scikit-learn.org) library, which
ranks documents using [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and
[cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

## See also

The [gifts](https://github.com/rtmigo/gifts_py#readme) package implements the
same search, but in pure Python with no binary dependencies.
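
## Example: preparing text

Since preprocessing is left to the caller, raw text has to be turned into
lists of words before indexing. A minimal sketch, assuming lowercasing and
ASCII-only alphanumeric tokens are good enough for your data. The `tokenize`
helper is hypothetical and not part of skifts; stemming is not shown.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words.

    Hypothetical helper, not part of skifts. The pattern is ASCII-only,
    so accented or non-Latin characters are dropped.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


raw_texts = [
    "Wait, Mister Postman!",
    "Please, Mister Postman, look and see.",
]

# one list of words per document, ready to be indexed
documents = [tokenize(text) for text in raw_texts]
print(documents)
```

The resulting lists have the same shape as the `documents` in the full-text
search example. The query should go through the same function, so that its
words match the indexed ones.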
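
## Example: how tf-idf and cosine similarity rank documents

To make the ranking idea concrete, here is a small pure-Python illustration.
It is not the code of skifts, which delegates the work to scikit-learn. The
function names are hypothetical, and the idf formula is the smoothed variant
that scikit-learn documents as its default; whether skifts keeps that default
is an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one sparse tf-idf vector (a dict) per document, plus idf weights."""
    n_docs = len(documents)
    # document frequency: in how many documents each word occurs
    df = Counter(word for doc in documents for word in set(doc))
    # smoothed idf (assumed variant): rarer words get larger weights
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}
    vectors = [
        {w: count * idf[w] for w, count in Counter(doc).items()}
        for doc in documents
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(documents, query):
    """Return indices of documents sharing words with the query, best first."""
    vectors, idf = tfidf_vectors(documents)
    # query words that never occur in the corpus are ignored
    query_vec = {w: c * idf[w] for w, c in Counter(query).items() if w in idf}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    # sort by score descending, then by index; drop non-matching documents
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]


documents = [
    ['N', 'A', 'M'],
    ['C', 'B', 'A'],
    ['X', 'Y'],
]
ranked = rank(documents, ['A', 'B'])
print([documents[i] for i in ranked])
```

On the introductory example this sketch puts `['C', 'B', 'A']` ahead of
`['N', 'A', 'M']` and leaves out `['X', 'Y']`: the second document matches
both query words, and `'B'` is rarer in the corpus than `'A'`, so it carries
more weight.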