Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/netzkolchose/elastipy
python elasticsearch query module for easily accessing nested aggregations and such
- Host: GitHub
- URL: https://github.com/netzkolchose/elastipy
- Owner: netzkolchose
- License: other
- Created: 2020-12-23T03:03:27.000Z (about 4 years ago)
- Default Branch: development
- Last Pushed: 2022-05-30T11:55:13.000Z (over 2 years ago)
- Last Synced: 2024-10-10T15:12:57.226Z (3 months ago)
- Topics: analytics, backend, console, elasticsearch, elasticsearch-queries, nested-aggregations, pandas-dataframe
- Language: Python
- Homepage:
- Size: 6.79 MB
- Stars: 3
- Watchers: 4
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
## elastipy
A python wrapper to make elasticsearch queries and aggregations more fun.
Tested with python 3.6 and 3.10 and elasticsearch 7 and 8.
[![test](https://github.com/netzkolchose/elastipy/actions/workflows/tests.yml/badge.svg)](https://github.com/netzkolchose/elastipy/actions/workflows/tests.yml)
[![Coverage Status](https://coveralls.io/repos/github/netzkolchose/elastipy/badge.svg?branch=development)](https://coveralls.io/github/netzkolchose/elastipy?branch=development)
[![Documentation Status](https://readthedocs.org/projects/elastipy/badge/?version=latest)](https://elastipy.readthedocs.io/en/latest/?badge=latest)

Learn more at [elastipy.readthedocs.io](https://elastipy.readthedocs.io/en/latest/).
In comparison to [elasticsearch-dsl](https://github.com/elastic/elasticsearch-dsl-py)
this library provides:
- typing and IDE-based auto-completion for search and aggregation parameters.
- some convenient data access to responses of nested bucketed aggregations and metrics
  (also supporting [pandas](https://github.com/pandas-dev/pandas))

#### contents
- [installation](#installation)
- [requirements](#requirements)
- quickref
- [aggregations](#aggregations)
- [metrics](#nested-aggregations-and-metrics)
- [query](#queries)
- [exporting](#exporting)
- [testing](#testing)
- [development](#development)

---
### installation
To install elastipy using the elasticsearch 8+ backend:
```shell script
pip install elastipy
```

If you target the elasticsearch 7 version, do:
```shell script
pip install 'elasticsearch<8'
pip install elastipy
```

#### requirements
One thing is, of course, to [install elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).
- **elastipy** itself requires [elasticsearch-py](https://github.com/elastic/elasticsearch-py)
- the requirements for building the docs are listed in [docs/requirements.txt](docs/requirements.txt) and mainly
  consist of sphinx with the readthedocs theme.
- the requirements for generating the interface and running the tests and notebooks are listed in
  [requirements.txt](requirements.txt) and contain pyyaml and coverage as well as the
  usual stack of jupyter, scipy, matplotlib, ...

### configuration
By default an [elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html) host is expected at `localhost:9200`. There are currently two ways
to specify a different connection.

```python
from elasticsearch import Elasticsearch
from elastipy import Search

# Use an explicit Elasticsearch client (or compatible class)
client = Elasticsearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("user", "pwd")
)

# create a Search using the specified client
s = Search(index="bla", client=client)

# can also be done later
s = s.client(client)
```

Check the Elasticsearch [API reference](https://elasticsearch-py.readthedocs.io/en/v7.10.1/api.html#elasticsearch) for all the parameters.
We can also set a default client at the program start:
```python
from elastipy import connections

connections.set("default", client)

# .. or as parameters, they get converted to an Elasticsearch client
connections.set("default", {"hosts": [{"host": "localhost", "port": 9200}]})

# get a client
connections.get("default")
```
Different connections can be specified with the *alias* name:
```python
connections.set("special", {"hosts": [{"host": "special", "port": 1234}]})

s = Search(client="special")
s.get_client()
```
### aggregations
More details can be found in the [tutorial](https://elastipy.readthedocs.io/en/latest/tutorial.html).
```python
# get a search object
s = Search(index="world")

# create an Aggregation class connected to the Search
agg = s.agg_date_histogram(calendar_interval="1w")
# (for date-specific aggregations we can leave out the 'field' parameter
# it falls back to Search.timestamp_field which is "timestamp" by default)

# submit the whole request
s.execute()

# access the response
list(agg.keys())
```

    ['1999-12-27T00:00:00.000Z',
     '2000-01-03T00:00:00.000Z',
     '2000-01-10T00:00:00.000Z',
     '2000-01-17T00:00:00.000Z']

```python
list(agg.values())
```

    [21, 77, 60, 42]

Without a [metric](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics.html) these numbers are the document counts.
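
For orientation, here is a minimal sketch of where these numbers come from. It assumes the usual shape of an Elasticsearch `date_histogram` response, trimmed to the relevant fields, and an aggregation named `a0`. The dictionary is written by hand and no elastipy code is involved.

```python
# Hand-written, trimmed response of a date_histogram aggregation named "a0".
# The field names follow the Elasticsearch response format; the values are
# the ones from the example above.
response = {
    "aggregations": {
        "a0": {
            "buckets": [
                {"key_as_string": "1999-12-27T00:00:00.000Z", "doc_count": 21},
                {"key_as_string": "2000-01-03T00:00:00.000Z", "doc_count": 77},
                {"key_as_string": "2000-01-10T00:00:00.000Z", "doc_count": 60},
                {"key_as_string": "2000-01-17T00:00:00.000Z", "doc_count": 42},
            ]
        }
    }
}

buckets = response["aggregations"]["a0"]["buckets"]

# Without a metric, the value of each bucket is simply its doc_count.
keys = [bucket["key_as_string"] for bucket in buckets]
values = [bucket["doc_count"] for bucket in buckets]
as_dict = dict(zip(keys, values))

print(as_dict)
```
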
The aggregation example above as a one-liner:
```python
Search(index="world").agg_date_histogram(calendar_interval="1w").execute().to_dict()
```

    {'1999-12-27T00:00:00.000Z': 21,
     '2000-01-03T00:00:00.000Z': 77,
     '2000-01-10T00:00:00.000Z': 60,
     '2000-01-17T00:00:00.000Z': 42}

### nested aggregations and metrics
```python
s = Search(index="world")

# the first parameter is the name of the aggregation
# (if omitted it will be "a0", "a1", and so on)
agg = s \
    .agg_terms("occasion", field="occasion") \
    .agg_rare_terms("rare-excuses", field="excuse", max_doc_count=2) \
    .metric_avg("avg-length", field="conversation_length") \
    .metric_max("max-length", field="conversation_length") \
    .execute()
```

The `rare_terms` aggregation is nested into the `terms` aggregation and
the metrics are siblings nested inside `rare_terms`.

`keys()`, `values()`, `items()` and `to_dict()` all operate on the current aggregation.
For bucket aggregations they typically show the `doc_count` value.

```python
agg.to_dict()
```

    {('dinner', 'my mouth is too dry'): 1,
     ('dinner', "i can't reach the spoon"): 2}

The `rows()`, `dict_rows()` and `dump.table()` methods operate on the whole aggregation branch:
```python
list(agg.dict_rows())
```

    [{'occasion': 'dinner',
      'occasion.doc_count': 200,
      'rare-excuses': 'my mouth is too dry',
      'rare-excuses.doc_count': 1,
      'avg-length': 163.0,
      'max-length': 163.0},
     {'occasion': 'dinner',
      'occasion.doc_count': 200,
      'rare-excuses': "i can't reach the spoon",
      'rare-excuses.doc_count': 2,
      'avg-length': 109.5,
      'max-length': 133.0}]

```python
agg.dump.table(colors=False)
```

    occasion │ occasion.doc_count │ rare-excuses            │ rare-excuses.doc_count │ avg-length   │ max-length
    ─────────┼────────────────────┼─────────────────────────┼────────────────────────┼──────────────┼─────────────
    dinner   │ 200                │ my mouth is too dry     │ 1 ██████████▌          │ 163.0 ██████ │ 163.0 ██████
    dinner   │ 200                │ i can't reach the spoon │ 2 ████████████████████ │ 109.5 ████   │ 133.0 ████▉

### queries
```python
from elastipy import Search, query

s = Search(index="prog-world")

# chaining means AND
s = s \
    .term(field="category", value="programming") \
    .term("usage", "widely-used")

# also can use operators
s = s & (
    query.Term("topic", "yet-another-api")
    | query.Term("topic", "yet-another-operator-overload")
)

# .query() replaces the current query
s = s.query(query.MatchAll())

languages_per_country = s.agg_terms(field="country").agg_terms(field="language").execute()

languages_per_country.to_dict()
```

    {('IT', 'PHP'): 28,
     ('IT', 'Python'): 24,
     ('IT', 'C++'): 21,
     ('ES', 'C++'): 29,
     ('ES', 'Python'): 22,
     ('ES', 'PHP'): 18,
     ('US', 'PHP'): 23,
     ('US', 'Python'): 20,
     ('US', 'C++'): 15}

### exporting
There is a small helper to export stuff to elasticsearch.
```python
from elastipy import Exporter

class MyExporter(Exporter):
    INDEX_NAME = "my-index"

    # mapping can be defined here
    # it will be sent to elasticsearch before the first document is exported
    MAPPINGS = {
        "properties": {
            "some_field": {"type": "text"},
        }
    }

count, errors = MyExporter().export_list(a_lot_of_objects)

print(f"exported {count} objects, errors: {errors}")
```

    exported 1000 objects, errors: []

It uses bulk requests and is very fast, supports document transformation and
control over id and sub-index of documents.

```python
import datetime

class MyExporter(Exporter):
    INDEX_NAME = "my-index-*"
    MAPPINGS = {
        "properties": {
            "some_field": {"type": "text"},
            "group": {"type": "keyword"},
            "id": {"type": "keyword"},
            "timestamp": {"type": "date"},
        }
    }

    # if each document has a unique id value we can use it
    # as the elasticsearch id as well. That way we do not
    # create documents twice when exporting them again.
    # Their data just gets updated.
    def get_document_id(self, es_data):
        return es_data["id"]

    # we can bucket documents into separate indices
    def get_document_index(self, es_data):
        return self.index_name().replace("*", es_data["group"])

    # here we can adjust or add some data before it gets exported.
    # it's also possible to split the data into several documents
    # by yielding or returning a list
    def transform_document(self, data):
        data = data.copy()
        data["timestamp"] = datetime.datetime.now()
        return data

MyExporter().export_list(a_lot_of_objects)
```

    (1000, [])

If we are tired enough we can call:
```python
MyExporter().delete_index()
```

    True

This will actually delete all sub-indices because there's this wildcard `*` in the `INDEX_NAME`.
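
To make the id and sub-index handling above more concrete, here is a minimal sketch of how objects could be turned into bulk actions. It is not elastipy's implementation: the helper `to_bulk_actions` and the sample objects are made up for illustration. The only part taken from Elasticsearch is the action format (`_index`, `_id`, `_source`) that `elasticsearch.helpers.bulk` accepts.

```python
from typing import Any, Dict, Iterable, Iterator

INDEX_NAME = "my-index-*"


def to_bulk_actions(
    objects: Iterable[Dict[str, Any]],
    index_name: str = INDEX_NAME,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield one bulk action per object."""
    for obj in objects:
        yield {
            # route each document into the sub-index of its group
            "_index": index_name.replace("*", obj["group"]),
            # reuse the object's own id so a re-export updates the document
            "_id": obj["id"],
            "_source": obj,
        }


# made-up sample objects; the first and the last share group and id
objects = [
    {"id": "1", "group": "a", "some_field": "hello"},
    {"id": "2", "group": "b", "some_field": "world"},
    {"id": "1", "group": "a", "some_field": "hello again"},
]

actions = list(to_bulk_actions(objects))
for action in actions:
    print(action["_index"], action["_id"])
```

Because the first and the third object resolve to the same index and id, indexing the third would overwrite the first instead of creating a duplicate. With a live client, such actions could be sent via `elasticsearch.helpers.bulk(client, actions)`.
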
**More examples can be found [here](examples).**
### testing
To run the tests call:
```shell script
python test.py
```

To include testing against a live elasticsearch:
```shell script
python test.py --live
```

To change **localhost:9200** to something different
pass any arguments as json:
```shell script
python test.py --live --elasticsearch '{"hosts": [{"host": "127.0.0.5", "port": 1200}], "http_auth": ["user", "password"]}'
```

The live tests will create new indices and immediately destroy them afterwards.
They are prefixed with **elastipy---unittest-**

To check the coverage of the tests add `-c` or `-m` flags.
`-m` will add the missing line numbers to the summary.

### development
The methods for **queries** and **aggregations** as well as the **query
classes** are auto-generated from [yaml files](definition).
They include all parameters, default values and documentation.

#### Add a missing query or aggregation

1. Create a yaml file named after it in one of the sub-directories
   in `definition/query` or `definition/aggregation`.

   The sub-directories in `query/` are just for tidiness and
   follow the nesting in the sidebar of the official
   [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
   The three directories below `aggregation/` actually define the
   aggregation type `bucket`, `metric` or `pipeline`.

2. Create the python code via

   ```shell script
   # in project root
   python generate_interfaces.py
   ```

   This will update the files:
   - [elastipy/query/generated_classes.py](elastipy/query/generated_classes.py)
   - [elastipy/query/generated_interface.py](elastipy/query/generated_interface.py)
   - [elastipy/aggregation/generated_interface.py](elastipy/aggregation/generated_interface.py)

   The sphinx documentation will collect the respective documentation
   from these files.

#### Update the example and tutorial notebooks
1. Do some changes or add a new notebook (and keep main
   `requirements.txt` up to date).

2. Execute:

   ```shell script
   python run_doc_notebooks.py --execute
   ```

   This will convert the notebooks to `.rst` files into the [docs/](docs/) directory.
   The [docs/quickref.ipynb](docs/quickref.ipynb) notebook will even be rendered
   as markdown into this README.

   The `-e`/`--execute` flag is required for proper doc building. For debugging
   purposes it can be omitted in which case the current notebook state is
   rendered.

3. Run

   ```shell script
   cd docs/
   pip install -r requirements.txt
   make clean && make html
   ```

   and inspect the results in
   [docs/_build/html/index.html](docs/_build/html/index.html).

Before committing changes run
```shell script
pip install pre-commit
pre-commit install
```
This will install a pre-commit hook from
[.pre-commit-config.yaml](.pre-commit-config.yaml)
that clears the output of all notebooks.
Since the interesting ones are already rendered to the documentation pages, I just
think this is tidier and saves one from cleaning up the execution state
of notebooks by hand before committing.

Generally, I'm stuck with *restructuredtext* for the docstrings although,
apart from the `:param:` syntax, I find it simply repellent.
It still seems to have the best-supported toolchain.