{"id":19174620,"url":"https://github.com/equalitie/esretriever","last_synced_at":"2025-07-06T14:36:48.533Z","repository":{"id":80787739,"uuid":"263368938","full_name":"equalitie/esretriever","owner":"equalitie","description":"A small library that uses PySpark to get data from Elastic Search - used in Baskerville","archived":false,"fork":false,"pushed_at":"2023-05-22T21:40:32.000Z","size":914,"stargazers_count":1,"open_issues_count":2,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-04T01:36:41.134Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/equalitie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-12T14:59:51.000Z","updated_at":"2020-06-04T19:54:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"21c399ed-a78c-4132-ad66-3f11f5839576","html_url":"https://github.com/equalitie/esretriever","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equalitie%2Fesretriever","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equalitie%2Fesretriever/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equalitie%2Fesretriever/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equalitie%2Fesretriever/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/equalitie","download_u
rl":"https://codeload.github.com/equalitie/esretriever/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240254182,"owners_count":19772386,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T10:18:32.525Z","updated_at":"2025-02-23T00:44:12.902Z","avatar_url":"https://github.com/equalitie.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Retrieve Logs from ElasticSearch using Spark\n\n## Pre-requisites\n\n- Python 3\n- Java 8+ needs to be in place (and in PATH) for Spark (Pyspark version 2.2+) to work [src](https://spark.apache.org/docs/2.2.0/)\n\n## Set-up\n\n- Create a virtual env and activate it with `source /path/to/venv/bin/activate`\n- `cd /path/to/setup.py`\n- `pip install -e .` to install `es-retriever` (note: `-e` flag is important to be able to change the configuration. 
This will be changed in the future)\n\n### Run tests\n`python -m pytest`\n\n## Usage\n\nNOTE: While the script is running, the Spark web UI at `localhost:4040` shows what\nSpark is doing.\n\n### Retrieving data\n\nUnder `es-pyspark-retriever/src/es_retriever/examples` there is a [simple example](/src/es_retriever/examples/simple_retrieve.py) that retrieves data for a single hour of one day and an [example](/src/es_retriever/examples/week_retrieve.py) that retrieves data for one week.\n\nConfiguration information should be given as parameters when initializing the `Config` object, for instance `Config('localhost', 'user', 'password', 'deflect.log', 'deflect_access')`.\n\n#### Common part in all examples is the creation of the Config and EsStorage instances:\n```python\nfrom datetime import datetime, timedelta\n\nfrom pyspark.sql import functions as F  # used by the filter examples\n\nfrom es_retriever.config import Config\nfrom es_retriever.es.storage import EsStorage\n\n\n# create a configuration instance (SERVER is the Elasticsearch host, e.g. 'localhost')\nconfig = Config(SERVER, 'user', 'password', 'deflect.log', 'deflect_access')\n\n# create an EsStorage instance with the given configuration\nstorage = EsStorage(config)\n```\n\n#### Common - storing data\n\n```python\n# note: the following will create multiple files - fileparts\n# write to a file as json\ndf.write.mode('overwrite').json('somefilename')\n# or as csv\ndf.write.mode('overwrite').csv('somefilename')\n```\n\nTo save to a **single** file instead of multiple parts:\n```\ndf.coalesce(1).write.mode('overwrite').json('full/path/to/file')\n```\n\nMode can be:\n\n* `append`: Append contents to existing data.\n* `overwrite`: Overwrite existing data.\n* `error` or `errorifexists`: Throw an exception if data already exists.\n* `ignore`: Silently ignore this operation if data already exists.\n\n#### Example #1 get data for a day\n\nGet data for 1 day from `11-04-18` to `12-04-18`:\n\n```python\n# the dates we need to look for\nsince = datetime(2018, 4, 11)\nuntil = datetime(2018, 4, 12)\n\n# get the data for the period of time\ndf = 
storage.get(since, until)\n```\n\n#### Example #2 get data for an hour of a day\nIf hours / minutes / seconds are specified in since / until, the filtering of the data\nwill take that into account.\n```python\n# the dates we need to look for\nsince = datetime(2018, 4, 11, 10, 00, 00)\nuntil = datetime(2018, 4, 11, 11, 00, 00)\n\n# get the data for the period of time\ndf = storage.get(since, until)\n\n```\n\n#### Example #3 get data for an hour of a day (using hour-of-day field in logs)\n\n```python\n# the dates we need to look for\nsince = datetime(2018, 4, 11, 10, 00, 00)\nuntil = datetime(2018, 4, 11, 11, 00, 00)\n\n# hour filter:\nhour_filter = (\n(F.col('@timestamp') \u003e= since) \u0026 (F.col('@timestamp') \u003c= until)\n)\n# get the data for the period of time\ndf = storage.get(since, until, hour_filter)\n\n# or use the hour-of-day field in the logs:\nhour_filter = (\n(F.col('hour-of-day') \u003e= 10) \u0026 (F.col('hour-of-day') \u003c= 11)\n)\n\n# get the data for the period of time\ndf = storage.get(since, until, hour_filter)\n```\n\n#### Example #4 get data for a week\n```python\n# the dates we need to look for\nsince = datetime(2018, 4, 11)\nuntil = datetime(2018, 4, 18)\n\n# get the data for the period of time\ndf = storage.get(since, until)\n\n```\n#### Example #5 get data for a week - one hour for each day\n##### One way - have every day saved in a different file:\n```python\n# define the start day: 01-04-2018\nsince = datetime(2018, 4, 1, 10, 00, 00)\nuntil = datetime(2018, 4, 1, 11, 00, 00)\n\n# iterate for a week\nfor day_no in range(7):\n    # keep the name for convenience\n    file_name = '{}-{year}.{month}.{day}.{hour}.{minute}.{second}'.format(\n        config.es_base_index,\n        year=since.year,\n        month=str(since.month).zfill(2),\n        day=str(since.day).zfill(2),\n        hour=str(since.hour).zfill(2),\n        minute=str(since.minute).zfill(2),\n        second=str(since.second).zfill(2)\n    )\n\n    # get the data for the 
period of time\n    df = storage.get(\n        since,\n        until\n    )\n\n    print('Retrieving data for {}'.format(file_name))\n\n    # save\n    df.write.mode('overwrite').json(file_name)\n\n    # increment since and until by one day\n    since = since + timedelta(days=1)\n    until = until + timedelta(days=1)\n\n```\n\n##### A simpler way - have all days saved in one file:\n```python\n# define the start day:\nsince = datetime(2018, 4, 1)\nuntil = datetime(2018, 4, 7)\n\n# get only 10 to 11 am\nhour_filter = (\n(F.col('hour-of-day') \u003e= 10) \u0026 (F.col('hour-of-day') \u003c= 11)\n)\n\n# get the data for the period of time\ndf = storage.get(\n    since,\n    until,\n    hour_filter\n)\n\n# save\ndf.write.mode('overwrite').json('data_for_week_somenumber')\n```\n\nA bit more complex hour filter:\n```python\n# define the start day:\nsince = datetime(2018, 4, 1)\nuntil = datetime(2018, 4, 7)\n\n# get only 10 to 11 am or 3 to 4 pm (the two ranges are combined with OR;\n# AND-ing them would match no rows)\nhours_filter = (\n((F.col('hour-of-day') \u003e= 10) \u0026 (F.col('hour-of-day') \u003c= 11)) |\n((F.col('hour-of-day') \u003e= 15) \u0026 (F.col('hour-of-day') \u003c= 16))\n)\n\n# get the data for the period of time\ndf = storage.get(\n    since,\n    until,\n    hours_filter\n)\n```\n\n#### Example #6 regex filters:\n\nTo filter the rows where some field matches a regular expression:\n```\ndf = df.filter(df[\"field / column name\"].rlike('regex expression'))\n```\n\n#### CLI example\n\nWhen the es-pyspark-retriever package is installed, it also installs an `esretrieve` command. 
To use this tool, you first need to create a configuration file in `~/opsdash` with the following format :\n```\n[OpsDash]\nuser: USERNAME HERE\npassword: PASSWORD HERE\n```\n\nThe options so far:\n```\nScript to retrieve data fromElasticSearch using Pyspark\n\npositional arguments:\n  since                 The start date in the format YYYY-MM-DD HH:MM:SS, e.g.\n                        2018-01-01 00:00:00\n  until                 The end date in the format YYYY-MM-DD HH:MM:SS, e.g.\n                        2018-01-02 00:00:00\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]\n                        extra filters separated by space, e.g. to get data for\n                        a specific dnet and a specific ip, use dnet=somednet\n                        client_ip=95.11.1.111\n  -rf [REGEX_FILTER [REGEX_FILTER ...]], --regex_filter [REGEX_FILTER [REGEX_FILTER ...]]\n                        extra regex filters separated by space, e.g.\n                        dnet=\"some_valid_regex\"\n                        client_ip=\"95\\.11\\.\\d{1,3}\\.111\"\n  -sf SQL_FILTER, --sql_filter SQL_FILTER\n                        SQL like filter for more complex quering, e.g.\n                        dnet=\"somednet\" OR day-of-week\u003e=2\n  -o OUTPUT, --output OUTPUT\n                        The full path and name of the file to save the result\n                        to, defaults to current working dir/base_index\n  -b, --banjax          Query Banjax logs instead of Deflect logs\n```\n\nTo get data from deflect.log from \"2018-04-23 10:00:00\" to \"2018-04-23 11:00:00\"\nwhere dnet equals somednet and client_ip equals 95.111.111.111\n```\npython simple_retrieve_cli.py \"2018-04-23 10:00:00\" \"2018-04-23 11:00:00\" --f dnet=somednet client_ip=95.111.111.111 -o /some/path/to/file\n```\n\nSample output of the above:\n```\n18/04/24 15:47:38 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable\n18/04/24 15:47:38 WARN Utils: Your hostname, SOMEHOSTNAME resolves to a loopback address: 127.0.1.1; using SOMEOTHERIP instead (on interface SOMEINTERFACE)\n18/04/24 15:47:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\nGetting data from 2018-04-23 10:00:00 to 2018-04-23 11:00:00, filter(s) Column\u003c((dnet = somednet) AND (client_ip = 95.111.111.111))\u003e\nroot\n |-- @timestamp: timestamp (nullable = true)\n |-- field1: string (nullable = true)\n |-- field2: string (nullable = true)\n  ...\n  ...\n  ...\n\nStarting data retrieval and saving to /some/path/to/file\n\n```\n\n\n### Loading data\nThe common steps are:\n1. Import the necessary modules\n2. Create a configuration instance\n3. Create a Spark session\n4. Use `spark.read.json(path to folder)` or `spark.read.csv(path to folder)`\nto load the data\n\n```python\nfrom es_retriever.config import Config\nfrom es_retriever.spark import get_or_create_spark_session\n\n\nconfig = Config(SERVER, 'user', 'password', 'deflect.log', 'deflect_access')\n# create a spark session using the configuration\nspark = get_or_create_spark_session(config)\n\n# read all files that start with some-folder-name-previously-stored-with-spark\n# - e.g. 
when you have different files for different days, this will load them\n# all into one dataframe\n# note: loading is lazy and will spill to disk if there is not enough RAM\ndf = spark.read.json('/path/to/some-folder-name-previously-stored-with-spark*')\n```\n#### Process after loading:\n##### Cache the dataframe if not loading from file\nSince Spark evaluates lazily, cache the dataframe once the data has been retrieved if you need to run more than one action on it.\nE.g.\n```python\ndf = (\n    storage.get(\n        since,\n        until,\n        hours_filter\n    )\n    .filter(...)\n    .select(...)\n    .cache()  # after filters\n)\n\n# then:\nprint(df.count())\n# if we do not cache, the data will be fetched twice: once for loading and filtering, and once for the count\n\n```\nIn general, it is good practice to *save the data first* and then, in a separate session, load from file and perform actions on the data.\n\n##### Print the dataframe schema\n```python\ndf.printSchema()\n\u003e\u003e\u003e\nroot\n |-- @timestamp: string (nullable = true)\n |-- field1: string (nullable = true)\n |-- field2: string (nullable = true)\n ...\n```\n\n##### How many logs are there?\n```python\ndf.count()\n\u003e\u003e\u003e1276263\n```\n\n##### Get only two columns:\n```python\ndf.select('@timestamp', 'field1').show()\n\u003e\u003e\u003e\n+--------------------+--------------------+\n|          @timestamp|              field1|\n+--------------------+--------------------+\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:34:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             
data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:34:...|             data...|\n|2018-04-01T10:36:...|             data...|\n|2018-04-01T10:36:...|             data...|\n+--------------------+--------------------+\nonly showing top 20 rows\n```\n\n##### Get the rows where a field equals a specific string\n```python\nspecific_df = df.select(\"*\").where(df.field1 == 'hello')\n\nspecific_df.show()  # will output the results\n\n```\n\nA more specific example:\n```python\nspecific_df = df.select(\"*\").where(\n    (df.content_type == 'image/jpeg') \u0026 (df.dnet == 'somednet')\n)\n\nspecific_df.show()  # will output the results\n\n```\n\n## Docker\nTo run the `simple_retrieve_cli.py`:\n- Create a `.env` file according to the `dot_env_example`\n- Modify `docker-compose.yaml` according to the options and filters explained in the [CLI example](#cli-example) section\n\n```\ncommand: python simple_retrieve_cli.py ${ELK_HOST} es-index es-index-type \"start datetime\" \"end datetime\" -u ${ELK_USER} -p -o /data/nameofthefolder\n# for example:\ncommand: python simple_retrieve_cli.py ${ELK_HOST} test.log web_access \"2019-09-01 00:00:00\" \"2019-09-01 01:00:00\" -u ${ELK_USER} -p -o /data/testdata\n```\nThe example above will get all data between \"2019-09-01 00:00:00\" and \"2019-09-01 01:00:00\" and store it under the `/data/testdata_YYYY_MM_dd` folder.\nIf the date range spans more than a single day, the data will be stored in separate folders `/data/testdata_YYYY_MM_day1`, `/data/testdata_YYYY_MM_day2` etc.\n\n- Run: `docker-compose run --rm --service-ports es_retriever` - you will be prompted for your Elasticsearch password. \nThis process will probably take a long time to complete, depending on the date range and the criteria given, so it is better if this is run using e.g. 
`screen`.\nYou can also check the Spark console under `localhost:4040`.\n\n\n## Limitations and Future Work\n- Better docstrings\n- Unit tests to be added\n- EsStorage assumes time-based indices\n- Config assumes SSL for ES\n- Currently, `geoip` and `tags` need to be filtered out; otherwise saving fails.\n- Getting one year's data with 1h sampling (different hour each time)\nneeds some grouping of the requests to be efficient.\n[note](https://stackoverflow.com/a/33537029/3433323)\n\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"\u003e\n\u003cimg alt=\"Creative Commons Licence\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by/4.0/80x15.png\" /\u003e\u003c/a\u003e\u003cbr /\u003e\nThis work is copyright (c) 2020, eQualit.ie inc., and is licensed under a \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"\u003eCreative Commons Attribution 4.0 International License\u003c/a\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fequalitie%2Fesretriever","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fequalitie%2Fesretriever","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fequalitie%2Fesretriever/lists"}