{"id":13677482,"url":"https://github.com/elastic/eland","last_synced_at":"2025-04-14T00:51:14.430Z","repository":{"id":260746828,"uuid":"191316757","full_name":"elastic/eland","owner":"elastic","description":"Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch","archived":false,"fork":false,"pushed_at":"2025-04-03T13:56:24.000Z","size":21882,"stargazers_count":666,"open_issues_count":86,"forks_count":107,"subscribers_count":206,"default_branch":"main","last_synced_at":"2025-04-06T22:01:11.222Z","etag":null,"topics":["big-data","data-analysis","dataframe","dataframes","eland","elasticsearch","etl","lightgbm","machine-learning","pandas","python","scikit-learn","time-series-forecasting"],"latest_commit_sha":null,"homepage":"https://eland.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elastic.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.rst","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-11T07:24:06.000Z","updated_at":"2025-04-03T13:56:27.000Z","dependencies_parsed_at":"2024-11-02T10:02:58.618Z","dependency_job_id":"5249762b-1ddb-4d47-82f2-ed021b3de154","html_url":"https://github.com/elastic/eland","commit_stats":{"total_commits":497,"total_committers":42,"mean_commits":"11.833333333333334","dds":0.7967806841046278,"last_synced_commit":"f79180be4206bb40ff6ebe93d0d2d7f0707b60de"},"previous_names":["elastic/eland"],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Feland",
"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Feland/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Feland/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elastic%2Feland/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elastic","download_url":"https://codeload.github.com/elastic/eland/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248804773,"owners_count":21164131,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","data-analysis","dataframe","dataframes","eland","elasticsearch","etl","lightgbm","machine-learning","pandas","python","scikit-learn","time-series-forecasting"],"created_at":"2024-08-02T13:00:42.908Z","updated_at":"2025-04-14T00:51:14.407Z","avatar_url":"https://github.com/elastic.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://github.com/elastic/eland\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/elastic/eland/main/docs/sphinx/logo/eland.png\" width=\"30%\"\n      alt=\"Eland\" /\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\u003cbr /\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/eland\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/eland.svg\" alt=\"PyPI Version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://anaconda.org/conda-forge/eland\"\u003e\u003cimg 
src=\"https://img.shields.io/conda/vn/conda-forge/eland\"\n      alt=\"Conda Version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/project/eland\"\u003e\u003cimg src=\"https://static.pepy.tech/badge/eland\" alt=\"Downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/eland\"\u003e\u003cimg src=\"https://img.shields.io/pypi/status/eland.svg\"\n      alt=\"Package Status\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://buildkite.com/elastic/eland\"\u003e\u003cimg src=\"https://badge.buildkite.com/d92340e800bc06a7c7c02a71b8d42fcb958bd18c25f99fe2d9.svg\" alt=\"Build Status\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/elastic/eland/blob/main/LICENSE.txt\"\u003e\u003cimg src=\"https://img.shields.io/pypi/l/eland.svg\"\n      alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://eland.readthedocs.io\"\u003e\u003cimg\n      src=\"https://readthedocs.org/projects/eland/badge/?version=latest\" alt=\"Documentation Status\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n## About\n\nEland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar\nPandas-compatible API.\n\nWhere possible, the package uses existing Python APIs and data structures to make it easy to switch from NumPy,\npandas, or scikit-learn to their Elasticsearch-powered equivalents. 
In general, the data resides in Elasticsearch and\nnot in memory, which allows Eland to access large datasets stored in Elasticsearch.\n\nEland also provides tools to upload trained machine learning models from common libraries like\n[scikit-learn](https://scikit-learn.org), [XGBoost](https://xgboost.readthedocs.io), and\n[LightGBM](https://lightgbm.readthedocs.io) into Elasticsearch.\n\n## Getting Started\n\nEland can be installed from [PyPI](https://pypi.org/project/eland) with Pip:\n\n```bash\n$ python -m pip install eland\n```\n\nIf you are using Eland to upload NLP models to Elasticsearch, install the PyTorch extras:\n\n```bash\n$ python -m pip install 'eland[pytorch]'\n```\n\nEland can also be installed from [Conda Forge](https://anaconda.org/conda-forge/eland) with Conda:\n\n```bash\n$ conda install -c conda-forge eland\n```\n\n### Compatibility\n\n- Supports Python 3.9, 3.10, 3.11, and 3.12, and Pandas 1.5\n- Supports Elasticsearch 8+ clusters; 8.16 or later is recommended for all features to work.\n  If you are using the NLP with PyTorch feature, make sure your Eland minor version matches the minor\n  version of your Elasticsearch cluster. For all other features, it is sufficient for the major versions\n  to match.\n- You need to install the appropriate version of PyTorch to import an NLP model. Run `python -m pip\n  install 'eland[pytorch]'` to install that version.\n\n### Prerequisites\n\nUsers installing Eland on Debian-based distributions may need to install prerequisite packages for the transitive\ndependencies of Eland:\n\n```bash\n$ sudo apt-get install -y \\\n  build-essential pkg-config cmake \\\n  python3-dev libzip-dev libjpeg-dev\n```\n\nNote that other distributions such as CentOS, Red Hat, Arch, etc. may require using a different package manager and\nspecifying different package names. 
\n\n### Docker\n\nIf you want to use Eland without installing it just to run the available scripts, use the Docker\nimage.\nIt can be used interactively:\n\n```bash\n$ docker run -it --rm --network host docker.elastic.co/eland/eland\n```\n\nRunning installed scripts is also possible without an interactive shell, e.g.:\n\n```bash\n$ docker run -it --rm --network host \\\n    docker.elastic.co/eland/eland \\\n    eland_import_hub_model \\\n      --url http://host.docker.internal:9200/ \\\n      --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \\\n      --task-type ner\n```\n\n### Connecting to Elasticsearch \n\nEland uses the [Elasticsearch low level client](https://elasticsearch-py.readthedocs.io) to connect to Elasticsearch. \nThis client supports a range of [connection options and authentication options](https://elasticsearch-py.readthedocs.io/en/stable/api.html#elasticsearch). \n\nYou can pass either an instance of `elasticsearch.Elasticsearch` to Eland APIs\nor a string containing the host to connect to:\n\n```python\nimport eland as ed\n\n# Connecting to an Elasticsearch instance running on 'http://localhost:9200'\ndf = ed.DataFrame(\"http://localhost:9200\", es_index_pattern=\"flights\")\n\n# Connecting to an Elastic Cloud instance\nfrom elasticsearch import Elasticsearch\n\nes = Elasticsearch(\n    cloud_id=\"cluster-name:...\",\n    basic_auth=(\"elastic\", \"\u003cpassword\u003e\")\n)\ndf = ed.DataFrame(es, es_index_pattern=\"flights\")\n```\n\n## DataFrames in Eland\n\n`eland.DataFrame` wraps an Elasticsearch index in a Pandas-like API\nand defers all processing and filtering of data to Elasticsearch\ninstead of your local machine. 
This means you can process large\namounts of data within Elasticsearch from a Jupyter Notebook\nwithout overloading your machine.\n\n➤ [Eland DataFrame API documentation](https://eland.readthedocs.io/en/latest/reference/dataframe.html)\n\n➤ [Advanced examples in a Jupyter Notebook](https://eland.readthedocs.io/en/latest/examples/demo_notebook.html)\n\n```python\n\u003e\u003e\u003e import eland as ed\n\n\u003e\u003e\u003e # Connect to 'flights' index via localhost Elasticsearch node\n\u003e\u003e\u003e df = ed.DataFrame('http://localhost:9200', 'flights')\n\n# eland.DataFrame instance has the same API as pandas.DataFrame\n# except all data is in Elasticsearch. See .info() memory usage.\n\u003e\u003e\u003e df.head()\n   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp\n0      841.265642      False  ...         0 2018-01-01 00:00:00\n1      882.982662      False  ...         0 2018-01-01 18:27:00\n2      190.636904      False  ...         0 2018-01-01 17:11:14\n3      181.694216       True  ...         0 2018-01-01 10:33:28\n4      730.041778      False  ...         0 2018-01-01 05:13:00\n\n[5 rows x 27 columns]\n\n\u003e\u003e\u003e df.info()\n\u003cclass 'eland.dataframe.DataFrame'\u003e\nIndex: 13059 entries, 0 to 13058\nData columns (total 27 columns):\n #   Column              Non-Null Count  Dtype         \n---  ------              --------------  -----         \n 0   AvgTicketPrice      13059 non-null  float64       \n 1   Cancelled           13059 non-null  bool          \n 2   Carrier             13059 non-null  object        \n...      
\n 24  OriginWeather       13059 non-null  object        \n 25  dayOfWeek           13059 non-null  int64         \n 26  timestamp           13059 non-null  datetime64[ns]\ndtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)\nmemory usage: 80.0 bytes\nElasticsearch storage usage: 5.043 MB\n\n# Filtering of rows using comparisons\n\u003e\u003e\u003e df[(df.Carrier==\"Kibana Airlines\") \u0026 (df.AvgTicketPrice \u003e 900.0) \u0026 (df.Cancelled == True)].head()\n     AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp\n8        960.869736       True  ...         0 2018-01-01 12:09:35\n26       975.812632       True  ...         0 2018-01-01 15:38:32\n311      946.358410       True  ...         0 2018-01-01 11:51:12\n651      975.383864       True  ...         2 2018-01-03 21:13:17\n950      907.836523       True  ...         2 2018-01-03 05:14:51\n\n[5 rows x 27 columns]\n\n# Running aggregations across an index\n\u003e\u003e\u003e df[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])\n     DistanceKilometers  AvgTicketPrice\nsum        9.261629e+07    8.204365e+06\nmin        0.000000e+00    1.000205e+02\nstd        4.578263e+03    2.663867e+02\n```\n\n## Machine Learning in Eland\n\n### Regression and classification\n\nEland allows trained regression and classification models from the scikit-learn, XGBoost, and LightGBM\nlibraries to be serialized and used as inference models in Elasticsearch.\n\n➤ [Eland Machine Learning API documentation](https://eland.readthedocs.io/en/latest/reference/ml.html)\n\n➤ [Read more about Machine Learning in Elasticsearch](https://www.elastic.co/guide/en/machine-learning/current/ml-getting-started.html)\n\n```python\n\u003e\u003e\u003e from sklearn import datasets\n\u003e\u003e\u003e from xgboost import XGBClassifier\n\u003e\u003e\u003e from eland.ml import MLModel\n\n# Train and exercise an XGBoost ML model locally\n\u003e\u003e\u003e training_data = 
datasets.make_classification(n_features=5)\n\u003e\u003e\u003e xgb_model = XGBClassifier(booster=\"gbtree\")\n\u003e\u003e\u003e xgb_model.fit(training_data[0], training_data[1])\n\n\u003e\u003e\u003e xgb_model.predict(training_data[0])\n[0 1 1 0 1 0 0 0 1 0]\n\n# Import the model into Elasticsearch\n\u003e\u003e\u003e es_model = MLModel.import_model(\n    es_client=\"http://localhost:9200\",\n    model_id=\"xgb-classifier\",\n    model=xgb_model,\n    feature_names=[\"f0\", \"f1\", \"f2\", \"f3\", \"f4\"],\n)\n\n# Exercise the ML model in Elasticsearch with the training data\n\u003e\u003e\u003e es_model.predict(training_data[0])\n[0 1 1 0 1 0 0 0 1 0]\n```\n\n### NLP with PyTorch\n\nFor NLP tasks, Eland allows importing PyTorch trained BERT models into Elasticsearch. Models can be either plain PyTorch\nmodels, or supported [transformers](https://huggingface.co/transformers) models from the\n[Hugging Face model hub](https://huggingface.co/models).\n\n```bash\n$ eland_import_hub_model \\\n  --url http://localhost:9200/ \\\n  --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \\\n  --task-type ner \\\n  --start\n```\n\nThe example above will automatically start a model deployment. This is a\ngood shortcut for initial experimentation, but for anything that needs\ngood throughput you should omit the `--start` argument from the Eland\ncommand line and instead start the model using the ML UI in Kibana.\nThe `--start` argument will deploy the model with one allocation and one\nthread per allocation, which will not offer good performance. 
When starting\nthe model deployment using the ML UI in Kibana or the Elasticsearch\n[API](https://www.elastic.co/guide/en/elasticsearch/reference/current/start-trained-model-deployment.html)\nyou will be able to set the threading options to make the best use of your\nhardware.\n\n```python\n\u003e\u003e\u003e import elasticsearch\n\u003e\u003e\u003e from pathlib import Path\n\u003e\u003e\u003e from eland.common import es_version\n\u003e\u003e\u003e from eland.ml.pytorch import PyTorchModel\n\u003e\u003e\u003e from eland.ml.pytorch.transformers import TransformerModel\n\n\u003e\u003e\u003e es = elasticsearch.Elasticsearch(\"http://elastic:mlqa_admin@localhost:9200\")\n\u003e\u003e\u003e es_cluster_version = es_version(es)\n\n# Load a Hugging Face transformers model directly from the model hub\n\u003e\u003e\u003e tm = TransformerModel(model_id=\"elastic/distilbert-base-cased-finetuned-conll03-english\", task_type=\"ner\", es_version=es_cluster_version)\nDownloading: 100%|██████████| 257/257 [00:00\u003c00:00, 108kB/s]\nDownloading: 100%|██████████| 954/954 [00:00\u003c00:00, 372kB/s]\nDownloading: 100%|██████████| 208k/208k [00:00\u003c00:00, 668kB/s] \nDownloading: 100%|██████████| 112/112 [00:00\u003c00:00, 43.9kB/s]\nDownloading: 100%|██████████| 249M/249M [00:23\u003c00:00, 11.2MB/s]\n\n# Export the model in a TorchScrpt representation which Elasticsearch uses\n\u003e\u003e\u003e tmp_path = \"models\"\n\u003e\u003e\u003e Path(tmp_path).mkdir(parents=True, exist_ok=True)\n\u003e\u003e\u003e model_path, config, vocab_path = tm.save(tmp_path)\n\n# Import model into Elasticsearch\n\u003e\u003e\u003e ptm = PyTorchModel(es, tm.elasticsearch_model_id())\n\u003e\u003e\u003e ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)\n100%|██████████| 63/63 [00:12\u003c00:00,  
5.02it/s]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felastic%2Feland","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felastic%2Feland","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felastic%2Feland/lists"}