[![PyPI version shields.io](https://img.shields.io/pypi/v/skifts.svg)](https://pypi.python.org/pypi/skifts/)
[![Generic badge](https://img.shields.io/badge/Python-3.6+-blue.svg)](#)
[![Generic badge](https://img.shields.io/badge/OS-Linux%20|%20macOS%20|%20Windows-blue.svg)](#)

# [skifts](https://github.com/rtmigo/skifts_py#readme)

Search for the most relevant documents containing words from the query.

```python3
query = ['A', 'B']

documents = [
    ['N', 'A', 'M'],  # matching features: 'A'
    ['C', 'B', 'A'],  # matching features: 'A', 'B'
    ['X', 'Y'],       # no matching features
]
```

The search will return `['C', 'B', 'A']` and then `['N', 'A', 'M']`, in that
order.

It's not necessarily about text: words are just arbitrary `str` instances, and
documents are unordered collections of them. Documents are ranked by word
frequency, word rarity, and match accuracy.

## Install

```bash
pip3 install skifts
```

Other options:

### From GitHub (staging branch)
```bash
pip3 install git+https://github.com/rtmigo/skifts_py#egg=skifts
```

## Use for full-text search

Finding documents that contain words from the query.

```python3
from skifts import SkiFts

# Three documents, one per row
documents = [
    ["wait", "mister", "postman"],
    ["please", "mister", "postman", "look", "and", "see"],
    ["oh", "yes", "wait", "a", "minute", "mister", "postman"],
]

fts = SkiFts(documents)

# Find and print the most relevant documents:
for doc_index in fts.search(['postman', 'wait']):
    print(documents[doc_index])
```

Words inside the `documents` list are considered ready-made feature identifiers.
If your text needs preprocessing or stemming, this should be done separately.
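For instance, a minimal preprocessor might lowercase raw text and split it into
tokens before the lists are handed to the library (the `tokenize` helper below
is hypothetical, not part of `skifts`):

```python
import re

def tokenize(text: str) -> list:
    # Hypothetical helper: lowercase the text and split on anything
    # that is not a letter, dropping empty fragments.
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]

print(tokenize("Wait, Mister Postman!"))  # ['wait', 'mister', 'postman']
```

Stemming, stop-word removal, or any other normalization would be applied at the
same stage, before the token lists are passed on.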

The ranking takes into account the frequency of words in the document and the
rarity of words in the corpus. The word order in the document and the distance
between words do not matter.

## Implementation details

The search uses the [scikit-learn](https://scikit-learn.org) library, which
ranks documents using [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and
[cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
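Roughly the same ranking can be sketched with scikit-learn directly. This is an
illustration of the technique, not the package's actual internals, and the
`rank` helper is hypothetical; a callable `analyzer` is used so that the
pre-tokenized documents bypass scikit-learn's own tokenization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank(documents, query):
    # Each document is already a list of tokens, so the identity
    # analyzer passes the tokens through unchanged.
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    doc_matrix = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_matrix)[0]
    # Indices of documents sorted by descending relevance,
    # with non-matching documents dropped.
    order = scores.argsort()[::-1]
    return [i for i in order if scores[i] > 0]

documents = [
    ['N', 'A', 'M'],
    ['C', 'B', 'A'],
    ['X', 'Y'],
]
print(rank(documents, ['A', 'B']))  # [1, 0]
```

Here document 1 wins because it matches both query words, document 0 follows
with a single match, and document 2 is filtered out as a non-match.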

## See also

The [gifts](https://github.com/rtmigo/gifts_py#readme) package implements the
same search, but in pure Python with no binary dependencies.