https://github.com/elyase/locasticsearch
Serverless full text search in Python
elasticsearch full-text-search python serverless sqlite
- Host: GitHub
- URL: https://github.com/elyase/locasticsearch
- Owner: elyase
- License: mit
- Created: 2020-05-30T15:03:38.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-09-12T12:43:31.000Z (about 2 years ago)
- Last Synced: 2024-07-07T10:23:41.434Z (4 months ago)
- Topics: elasticsearch, full-text-search, python, serverless, sqlite
- Language: Python
- Size: 417 KB
- Stars: 8
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Locasticsearch
Serverless full text search in Python

> ⚠️ **dormant status**: 🚧 🚧

Locasticsearch provides serverless full text search powered by [sqlite full text search capabilities](https://www.sqlite.org/fts5.html) while trying to stay compatible with (a subset of) the Elasticsearch API.
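
To give a feel for the underlying engine, here is a minimal sketch of SQLite FTS5 used directly through Python's standard `sqlite3` module. It is not Locasticsearch's internal code: the table name `tweets`, the helper `search`, and the second sample document are made up for illustration, and it assumes your Python's SQLite build has the FTS5 extension enabled.

```python
import sqlite3

# In-memory database; passing a file path instead would persist the index.
conn = sqlite3.connect(":memory:")

# An FTS5 virtual table: every declared column is full-text indexed.
# (Assumption: the SQLite library bundled with Python was built with FTS5.)
conn.execute("CREATE VIRTUAL TABLE tweets USING fts5(author, text)")

# Illustrative documents; values are bound with `?` placeholders
# rather than being interpolated into the SQL string.
docs = [
    ("kimchy", "Elasticsearch: cool. bonsai cool."),
    ("alice", "SQLite ships with a full text search extension."),
]
conn.executemany("INSERT INTO tweets (author, text) VALUES (?, ?)", docs)


def search(query):
    """Return (author, text) rows matching an FTS5 query, best match first."""
    sql = "SELECT author, text FROM tweets WHERE tweets MATCH ? ORDER BY rank"
    return conn.execute(sql, (query,)).fetchall()


hits = search("bonsai")
print(hits)  # [('kimchy', 'Elasticsearch: cool. bonsai cool.')]
```

The `MATCH` operator and the `rank` ordering are provided by FTS5 itself; a library in this space mainly has to translate an Elasticsearch-style request into queries of this kind.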
That way you can comfortably develop your text search application without needing to set up services, and smoothly transition to Elasticsearch for scale or more features without changing your code.
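
One way such a transition could look in practice is sketched below. This is only an illustration, not something the library prescribes: the environment variable `ES_URL` and the helpers `choose_backend` and `make_client` are hypothetical names, and the Elasticsearch branch assumes the official `elasticsearch` Python client is installed.

```python
import os


def choose_backend(env=os.environ):
    """Return 'elasticsearch' when ES_URL is set, otherwise 'locasticsearch'."""
    return "elasticsearch" if env.get("ES_URL") else "locasticsearch"


def make_client(env=os.environ):
    """Build a search client; imports are lazy so only one backend is required."""
    if choose_backend(env) == "elasticsearch":
        # Official client, pointed at the (hypothetical) ES_URL setting.
        from elasticsearch import Elasticsearch

        return Elasticsearch(env["ES_URL"])

    # Local, serverless fallback for development.
    from locasticsearch import Locasticsearch

    return Locasticsearch()


print(choose_backend({}))  # locasticsearch
print(choose_backend({"ES_URL": "http://localhost:9200"}))  # elasticsearch
```

The rest of the application would then call `es = make_client()` and use `es.index(...)` and `es.search(...)` as usual, as long as it stays within the subset of the API that both backends support.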
That said, if you are only doing basic search operations within the subset supported by this library, and don't have a lot of documents (on the order of a million) that would justify going for a cluster deployment, Locasticsearch [can be a faster](benchmarks) alternative to Elasticsearch.
## Getting started
```python
from locasticsearch import Locasticsearch
from datetime import datetime

es = Locasticsearch()

doc = {
    "author": "kimchy",
    "text": "Elasticsearch: cool. bonsai cool.",
    "timestamp": datetime(2010, 10, 10, 10, 10, 10),
}
res = es.index(index="test-index", doc_type="tweet", id=1, body=doc)

res = es.get(index="test-index", doc_type="tweet", id=1)
print(res["_source"])

# this will get ignored in Locasticsearch
es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res["hits"]["total"]["value"])
for hit in res["hits"]["hits"]:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
```

We are also adding a simplified API that can be converted to Elasticsearch.

## Features
- 💯% local, no server management
- ✨ Lightweight pure python, no external dependencies
- ⚡ Super fast searches thanks to [sqlite full text search capabilities](https://www.sqlite.org/fts5.html)
- 🔗 No lock in. Thanks to the API compatibility with the official client, you can smoothly transition to Elasticsearch for scale or more features without changing your code.

## Install

```bash
pip install locasticsearch
```

## To use or not to use

You should NOT use Locasticsearch if:
- You are deploying a security-sensitive application. Locasticsearch code is very prone to SQL injection attacks. This should improve in future releases.
- Your searches are more complicated than what you would find in a 5 min Elasticsearch tutorial. Elasticsearch has a huge API and it is very unlikely that we can support even a sizable portion of it.
- You hate buggy libraries. Locasticsearch is a very young project, so bugs are guaranteed. You can check the tests to see if your needs are covered.

You should use Locasticsearch if:

- You don't want Docker or an Elasticsearch service using precious resources on your laptop
- You only need basic text search and Elasticsearch would be overkill
- You want very easy deployments that only involve pip installs
- Using Java from a Python program makes you feel dirty

## Next steps

- [ ] Add a real query DSL parsing
- [ ] Bulk indexing / scan
- [ ] Add simplified non ES compatible interface for easy JSON ingestion, querying
- [ ] Document supported vs unsupported query types

## Comparison to similar libraries

Some quick thoughts about existing tools, feel free to add/comment:
### [whoosh](https://whoosh.readthedocs.io/en/latest/intro.html)
The most full featured **pure python** text search library by far:
- 👍 Supports highlight, analyzers, query expansion, several ranking functions, ...
- 👎 Unmaintained for a long time though might see a revival at https://github.com/whoosh-community/whoosh
- 👍 Pure python so doesn't scale as well (still fast enough for small/medium datasets)

### [elasticsearch](https://www.elastic.co)

The big champion of full text search. This is what you should be using in production:
- 👍 Lots of features to accommodate any use case
- 👍 Battle tested, scalable, performant
- 👎 Non python native: more complex to deploy/integrate with a python project for simple use cases

### [tantivy](https://github.com/tantivy-search/tantivy-py)

This is a good recommendation for local full text search if you don't care about Elasticsearch API compatibility:
- 👍 Simple to set up and use: `pip install tantivy`
- 👍 Fast rust based engine
- 👎 DSL/library lock in, no Elasticsearch API

### [pyserini](https://github.com/castorini/pyserini/)

Though not pure python, pyserini is a good compromise if you want something local and scalable:
- 👍 Access to Lucene from within Python (via [pyjnius](https://github.com/kivy/pyjnius) Java bridge)
- 👍 Serverless / local deployment
- 👎 DSL/library lock in
- 👎 Extra Java runtime

### [django haystack](https://django-haystack.readthedocs.io/en/master/)

Django Haystack provides a unified API that allows you to plug in different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.) without having to modify your code:
- 👍 Many features, boosting, highlight, autocomplete (some backend dependent though)
- 👍 Possibility to switch backends
- 👎 DSL/library lock in
- 👎 Despite supporting several backends, Whoosh is the only one that is python native.

### [xapian](https://xapian.org/docs/bindings/python/)

- 👍 Very fast and full featured (C++)
- 👎 Not pip installable (needs system level compilation)
- 👎 The python bindings and the documentation are not that user friendly

### [gensim](https://radimrehurek.com/gensim/)

While gensim focuses on topic modeling, you can use `TfidfModel` and `SparseMatrixSimilarity` for text search. That said, this doesn't use an inverted index (it is a linear search), so it has limited scalability.
- 👍 Unique features such as approximate search
- 👎 Focus is on topic modeling, so no intuitive APIs for full text ingestion/search
- 👎 Doesn't support inverted index search (mostly full scan and approximate)

### [peewee](http://docs.peewee-orm.com/en/latest/)

Peewee is actually a more general ORM, but it offers abstractions for using full text search on SQLite:
- 👍 Support for full text search using several SQL backends (no elasticsearch though)
- 👍 Custom ranking and analyzer functions
- 👎 No elasticsearch compatible API