https://github.com/equalitie/esretriever

A small library that uses PySpark to get data from Elastic Search - used in Baskerville
https://github.com/equalitie/esretriever
Last synced: 12 months ago
JSON representation
A small library that uses PySpark to get data from Elastic Search - used in Baskerville
Host: GitHub
URL: https://github.com/equalitie/esretriever
Owner: equalitie
License: other
Created: 2020-05-12T14:59:51.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2023-05-22T21:40:32.000Z (about 3 years ago)
Last Synced: 2025-01-04T01:36:41.134Z (over 1 year ago)
Language: Python
Size: 893 KB
Stars: 1
Watchers: 4
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Retrieve Logs from ElasticSearch using Spark

## Pre-requisites

- Python 3

- Java 8+ needs to be in place (and in PATH) for Spark (Pyspark version 2.2+) to work [src](https://spark.apache.org/docs/2.2.0/)

## Set-up

- Create a virtual env and activate it with `source /path/to/venv/bin/activate`

- `cd /path/to/setup.py`

- `pip install -e .` to install `es-retriever` (note: `-e` flag is important to be able to change the configuration. This will be changed in the future)

### Run tests

`python -m pytest`

## Usage

NOTE: While the script is running, `localhost:4040` displays information about what

spark does.

### Retrieving data

Under `es-pyspark-retriever/src/es_retriever/examples` there is a [simple example](/src/es_retriever/examples/simple_retrieve.py) retrieving data for one day for a single hour and an [example](/src/es_retriever/examples/week_retrieve.py) that retrieves data for one week.

Configuration information should be given as parameters when initializing the `Config` object, for instance `Config('localhost', 'user', 'password', 'deflect.log', 'deflect_access')`

#### Common part in all examples is the creation of the Config and EsStorage instances:

```python

from datetime import datetime

from es_retriever.config import Config

from es_retriever.es.storage import EsStorage

# create a configuration instance

config = Config(SERVER, 'user', 'password', 'deflect.log', 'deflect_access')

# create an EsStorage instance with the given configuration

storage = EsStorage(

   config

)

```

#### Common - storing data

```python

# note: the following will create multiple files - fileparts

# write to a file as json

df.write.mode('overwrite').json('somefilename')

# or as csv

df.write.mode('overwrite').csv('somefilename')

```

To save to a **single** file instead of multiparts:

```

df.coalesce(1).write.mode('overwrite').json('full/path/to/file')

```

Mode can be:

    * `append`: Append contents to existing data.

    * `overwrite`: Overwrite existing data.

    * `error` or `errorifexists`: Throw an exception if data already exists.

    * `ignore`: Silently ignore this operation if data already exists.

#### Example #1 get data for a day

Get data for 1 day from `11-04-18` to `12-4-18`:

```python

# the dates we need to look for

since = datetime(2018, 4, 11)

until = datetime(2018, 4, 12)

# get the data for the period of time

df = storage.get(since, until)

```

#### Example #2 get data for an hour of a day

If hours / minutes / seconds are specified in since / until, the filtering of the data

will take that into account.

```python

# the dates we need to look for

since = datetime(2018, 4, 11, 10, 00, 00)

until = datetime(2018, 4, 11, 11, 00, 00)

# get the data for the period of time

df = storage.get(since, until)

```

#### Example #3 get data for an hour of a day (using hour-of-day field in logs)

```python

# the dates we need to look for

since = datetime(2018, 4, 11, 10, 00, 00)

until = datetime(2018, 4, 11, 11, 00, 00)

# hour filter:

hour_filter = (

(F.col('@timestamp') >= since) & (F.col('@timestamp') <= until)

)

# get the data for the period of time

df = storage.get(since, until, hour_filter)

# or use the hour-of-day field in the logs:

hour_filter = (

(F.col('hour-of-day') >= 10) & (F.col('hour-of-day') <= 11)

)

# get the data for the period of time

df = storage.get(since, until, hour_filter)

```

#### Example #4 get data for a week

```python

# the dates we need to look for

since = datetime(2018, 4, 11)

until = datetime(2018, 4, 18)

# get the data for the period of time

df = storage.get(since, until)

```

#### Example #5 get data for a week - one hour for each day

##### One way - have every day saved in a different file:

```python

# define the start day: 01-04-2018

since = datetime(2018, 4, 1, 10, 00, 00)

until = datetime(2018, 4, 1, 11, 00, 00)

# iterate for a week

for day_no in range(7):

    # keep the name for convenience

    file_name = '{}-{year}.{month}.{day}.{hour}.{minute}.{second}'.format(

        config.es_base_index,

        year=since.year,

        month=str(since.month).zfill(2),

        day=str(since.day).zfill(2),

        hour=str(since.hour).zfill(2),

        minute=str(since.minute).zfill(2),

        second=str(since.second).zfill(2)

    )

    # get the data for the period of time

    df = storage.get(

        since,

        until

    )

    print 'Retrieving data for {}'.format(file_name)

    # save

    df.write.mode('overwrite').json(file_name)

    # increment since and until by one day

    since = since + timedelta(days=1)

    until = until + timedelta(days=1)

```

##### A simpler way - have all days saved in one file:

```python

# define the start day:

since = datetime(2018, 4, 1)

until = datetime(2018, 4, 7)

# get only 10 to 11 am

hour_filter = (

(F.col('hour-of-day') >= 10) & (F.col('hour-of-day') <= 11)

)

# get the data for the period of time

df = storage.get(

    since,

    until,

    hour_filter

)

# save

df.write.mode('overwrite').json('data_for_week_somenumber')

```

A bit more complex hour filter:

```python

# define the start day:

since = datetime(2018, 4, 1)

until = datetime(2018, 4, 7)

# get only 10 to 11 am and 15 to 16 pm

hours_filter = (

(F.col('hour-of-day') >= 10) & (F.col('hour-of-day') <= 11) &

(F.col('hour-of-day') >= 15) & (F.col('hour-of-day') <= 16)

)

# get the data for the period of time

df = storage.get(

    since,

    until,

    hours_filter

)

```

#### Example #6 regex filters:

To filter the rows where some field matches a regex expression:

```

df = df.filter(df["field / column name"].rlike('regex expression'))

```

#### CLI example

When the es-pyspark-retriever package is installed, it also installs an `esretrieve` command. To use this tool, you first need to create a configuration file in `~/opsdash` with the following format :

```

[OpsDash]

user: USERNAME HERE

password: PASSWORD HERE

```

The options so far:

```

Script to retrieve data fromElasticSearch using Pyspark

positional arguments:

  since                 The start date in the format YYYY-MM-DD HH:MM:SS, e.g.

                        2018-01-01 00:00:00

  until                 The end date in the format YYYY-MM-DD HH:MM:SS, e.g.

                        2018-01-02 00:00:00

optional arguments:

  -h, --help            show this help message and exit

  -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]

                        extra filters separated by space, e.g. to get data for

                        a specific dnet and a specific ip, use dnet=somednet

                        client_ip=95.11.1.111

  -rf [REGEX_FILTER [REGEX_FILTER ...]], --regex_filter [REGEX_FILTER [REGEX_FILTER ...]]

                        extra regex filters separated by space, e.g.

                        dnet="some_valid_regex"

                        client_ip="95\.11\.\d{1,3}\.111"

  -sf SQL_FILTER, --sql_filter SQL_FILTER

                        SQL like filter for more complex quering, e.g.

                        dnet="somednet" OR day-of-week>=2

  -o OUTPUT, --output OUTPUT

                        The full path and name of the file to save the result

                        to, defaults to current working dir/base_index

  -b, --banjax          Query Banjax logs instead of Deflect logs

```

To get data from deflect.log from "2018-04-23 10:00:00" to "2018-04-23 11:00:00"

where dnet equals somednet and client_ip equals 95.111.111.111

```

python simple_retrieve_cli.py "2018-04-23 10:00:00" "2018-04-23 11:00:00" --f dnet=somednet client_ip=95.111.111.111 -o /some/path/to/file

```

Sample output of the above:

```

18/04/24 15:47:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

18/04/24 15:47:38 WARN Utils: Your hostname, SOMEHOSTNAME resolves to a loopback address: 127.0.1.1; using SOMEOTHERIP instead (on interface SOMEINTERFACE)

18/04/24 15:47:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

Getting data from 2018-04-23 10:00:00 to 2018-04-23 11:00:00, filter(s) Column<((dnet = somednet) AND (client_ip = 95.111.111.111))>

root

 |-- @timestamp: timestamp (nullable = true)

 |-- field1: string (nullable = true)

 |-- field2: string (nullable = true)

  ...

  ...

  ...

Starting data retrieval and saving to /some/path/to/file

```

### Loading data

The common steps are:

1. Import necessary

2. Create a configuration instance

3. Create a spark instance

4. Use `spark.read.json(path to folder)` or `spark.read.csv(path to folder)`

to load the data

```python

from es_retriever.config import Config

from es_retriever.spark import get_or_create_spark_session

config = Config(SERVER, 'user', 'password', 'deflect.log', 'deflect_access')

# create a spark session using the configuration

spark = get_or_create_spark_session(config)

# read all files that start with some-folder-name-previously-stored-with-spark

# - e.g. when you have different files for different days, this will load them

# all into one dataframe

# note: lazy loading and will spill to disc if ram not enough

df = spark.read.json('/path/to/some-folder-name-previously-stored-with-spark*')

```

#### Process after loading:

##### Cache the dataframe if not loading from file

Since spark operates lazily, we need to cache the dataframe once we get the data if we need to perform stuff on it.

E.g.

```python

df = storage.get(

    since,

    until,

    hours_filter

).filter(...)

.select(...)

.cache() # after filters

# then:

print df.count()

# if we do not cache the data will be fetched twice, once for load and filtering and once on count

```

In general, it is a good practice to *save the data first* and then, on a separate session, load from file and perform actions on the data.

##### Print the dataframe schema

```python

df.printSchema()

>>>

root

 |-- @timestamp: string (nullable = true)

 |-- field1: string (nullable = true)

 |-- field2: string (nullable = true)

 ...

```

##### How many logs are there?

```python

df.count()

>>>1276263

```

##### Get only two columns:

```python

df.select('@timestamp', 'field1').show()

>>>

+--------------------+--------------------+

|          @timestamp|              field1|

+--------------------+--------------------+

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:34:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:34:...|             data...|

|2018-04-01T10:36:...|             data...|

|2018-04-01T10:36:...|             data...|

+--------------------+--------------------+

only showing top 20 rows

```

##### Get the rows that contain a specific string

```python

specific_df = df.select("*").where(df.field1 == 'hello')

specific_df.show()  # will output the results

```

A more specific example:

```python

specific_df = df.select("*").where(

    (df.content_type == 'image/jpeg') & (df.dnet == 'somednet')

)

specific_df.show()  # will output the results

```

## Docker

To run the `simple_retrieve_cli.py`:

- Create a `.env` file according to the `dot_env_example`

- Modify `docker-compose.yaml` according to the options and filters explained in the [CLI example](#cli-example) section

```

command: python simple_retrieve_cli.py ${ELK_HOST} es-index es-index-type "start datetime" "end datetime" -u ${ELK_USER} -p -o /data/nameofthefolder

# for example:

command: python simple_retrieve_cli.py ${ELK_HOST} test.log web_access "2019-09-01 00:00:00" "2019-09-01 01:00:00" -u ${ELK_USER} -p -o /data/testdata

```

The example above will get all data between "2019-09-01 00:00:00" and "2019-09-01 01:00:00" and it will store them under `/data/testdata_YYYY_MM_dd` folder.

If the data range spans over more than a single day, then the data will be stored in separate folders `/data/testdata_YYYY_MM_day1`, `/data/testdata_YYYY_MM_day2` etc.

- Run: `docker-compose run --rm --service-ports es_retriever` - you will be prompted for your elastic search password. 

This process will probably take a lot of time to complete, depending on the date range and the criteria given, so it is better if this is run using e.g. `screen`.

You can also check the spark console under `localhost:4040`

## Limitations and Future Work

- Better docstrings

- Unittests to be added

- EsStorage assumes time based indices

- Config assumes ssl for es

- Currently, `geoip` and `tags` need to be filtered out otherwise save fails.

- Getting one years' data with 1h sampling (different hour each time)

needs some groupping for the requests to be efficient.

[note](https://stackoverflow.com/a/33537029/3433323)






This work is copyright (c) 2020, eQualit.ie inc., and is licensed under a Creative Commons Attribution 4.0 International License.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/equalitie/esretriever

Awesome Lists containing this project

README