https://github.com/willf/inverted_index
A simple in memory inverted index in Python
python search-engine
- Host: GitHub
- URL: https://github.com/willf/inverted_index
- Owner: willf
- License: bsd-2-clause
- Created: 2016-07-28T19:18:56.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2016-08-02T22:27:14.000Z (almost 10 years ago)
- Last Synced: 2025-04-10T20:13:01.166Z (about 1 year ago)
- Topics: python, search-engine
- Language: Python
- Size: 25.4 KB
- Stars: 15
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Inverted Index
==============
A simple in-memory inverted index system, with a modest query language.

    import inverted_index

    i = inverted_index.Index()

    # Index text under a document id; the same id can be indexed more than once.
    i.index(1, "this is the day they give babies away with half a pound of tea")
    i.index(1, "if you know any ladies who need any babies just send them round to ")
    i.index(2, "babies are born in the circle of the sun")

    results, err = i.query("babies")
    print(results)
    # {1,2}

    results, err = i.query("babies AND ladies")
    print(results)
    # {1}

    # A custom tokenizer can be passed per call; this one lowercases before splitting.
    i.index(3, "WHERE ARE THE BABIES", tokenizer=lambda s: s.lower().split())
    results, err = i.query("babies")
    print(results)
    # {1,2,3}

    # Remove a document from the index.
    i.unindex(3)
    results, err = i.query("babies")
    print(results)
    # {1,2}

Any hashable object can serve as the "document", and a tokenizer can be supplied to split the
text being indexed. The `add_token` and `add_tokens` methods index individual tokens directly,
without tokenizing.
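
To make the idea concrete, here is a minimal sketch of how an in-memory inverted index can work. It is not this library's implementation: the `TinyIndex` class, its `lookup` method, and the argument order of `add_token` and `add_tokens` are assumptions made for illustration only.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical stand-in for illustration; not inverted_index.Index."""

    def __init__(self):
        # Maps each token to the set of documents that contain it.
        self.postings = defaultdict(set)

    def add_token(self, document, token):
        # Documents are stored in a set, so they must be hashable.
        self.postings[token].add(document)

    def add_tokens(self, document, tokens):
        for token in tokens:
            self.add_token(document, token)

    def index(self, document, text, tokenizer=str.split):
        # The tokenizer turns the text into an iterable of tokens.
        self.add_tokens(document, tokenizer(text))

    def unindex(self, document):
        # Drop the document from every posting set.
        for documents in self.postings.values():
            documents.discard(document)

    def lookup(self, token):
        # Return a copy so callers cannot mutate the stored postings.
        return set(self.postings.get(token, ()))


idx = TinyIndex()
# Tuples are hashable, so they work as document keys.
idx.index(("song", 1), "this is the day they give babies away")
idx.index(("song", 2), "babies are born in the circle of the sun")
idx.add_token(("song", 2), "lullaby")
```

In this sketch a lookup is a single dictionary access, which is the main appeal of the structure; removing a document is the slow path because every posting set is visited.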
The query language is very simple: it understands `AND`, `OR`, `NOT`, and parentheses. For example:

    term OR term
    term AND term OR term
    (term AND term) OR term
    NOT term
    NOT term AND (term OR term)

`AND`, `OR`, and `NOT` have equal precedence, so use parentheses to disambiguate.
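
The following sketch shows why the parentheses matter, using plain Python sets in place of query results. The posting sets `a`, `b`, `c` and the `all_docs` universe are made-up values, and the sketch does not claim to show which grouping the library picks for an unparenthesized query.

```python
# Hypothetical postings: the documents that contain each term.
a = {1, 2}
b = {2, 3}
c = {3, 4}
all_docs = {1, 2, 3, 4}

# "a OR b AND c" can be grouped two ways, with different results.
or_first = (a | b) & c       # (a OR b) AND c
and_first = a | (b & c)      # a OR (b AND c)

# NOT is modeled here as a set difference against all documents.
# "NOT a AND (b OR c)" is ambiguous in the same way.
not_binds_term = (all_docs - a) & (b | c)     # (NOT a) AND (b OR c)
not_binds_rest = all_docs - (a & (b | c))     # NOT (a AND (b OR c))

print(or_first, and_first)
print(not_binds_term, not_binds_rest)
```

Because each grouping returns a different set of documents, writing the parentheses explicitly is the only way to be sure a query means what you intend.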
I'm pretty sure you don't want to use this in production code :)