# zeek2es.py
This Python application translates [Zeek's](https://zeek.org/) ASCII TSV and JSON
logs into [ElasticSearch's bulk load JSON format](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html#add-multiple-documents).
## Table of Contents:

- [Introduction](#introduction)
- [Installation](#installation)
  - [Elastic v8.0+](#elastic80)
  - [Docker](#docker)
- [Upgrading zeek2es](#upgradingzeek2es)
- [ES Ingest Pipeline](#esingestpipeline)
- [Filtering Data](#filteringdata)
  - [Python Filters](#pythonfilters)
  - [Filter on Keys](#filteronkeys)
- [Command Line Examples](#commandlineexamples)
- [Command Line Options](#commandlineoptions)
- [Requirements](#requirements)
- [Notes](#notes)
  - [Humio](#humio)
  - [JSON Log Input](#jsonloginput)
  - [Data Streams](#datastreams)
  - [Helper Scripts](#helperscripts)
  - [Cython](#cython)

## Introduction

![Kibana](images/kibana.png)
Want to see multiple Zeek logs for the same connection ID (uid)
or file ID (fuid)? Here are the hits from files.log, http.log, and
conn.log for a single uid:

![Kibana](images/multi-log-correlation.png)
You can perform subnet searching on Zeek's 'addr' type:
![Kibana Subnet Searching](images/kibana-subnet-search.png)
You can create time series graphs, such as this NTP and HTTP graph:
![Kibana Time Series](images/kibana-timeseries.png)
IP Addresses can be Geolocated with the `-g` command line option:
![Kibana Mapping](images/kibana-map.png)
Aggregations are simple and quick:
![Kibana Aggregation](images/kibana-aggregation.png)
This application will "just work" when Zeek log formats change. The logic reads
the field names and associated types to set up the mappings correctly in
ElasticSearch.This application will recognize gzip or uncompressed logs. This application assumes
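
As a rough sketch of that idea (not zeek2es's actual code), the `#fields` and `#types` header lines of a TSV log are enough to derive an index mapping; the Zeek-to-ES type table below is an illustrative subset:

```
# Sketch: derive an Elasticsearch mapping from a Zeek TSV log header.
# The type table is an illustrative subset, not zeek2es's own.
import gzip

ZEEK_TO_ES = {
    "time": "date", "addr": "ip", "port": "integer",
    "count": "long", "interval": "double", "bool": "boolean",
}

def mapping_from_header(path):
    opener = gzip.open if path.endswith(".gz") else open
    fields, types = [], []
    with opener(path, "rt") as f:
        for line in f:
            if line.startswith("#fields"):
                fields = line.rstrip("\n").split("\t")[1:]
            elif line.startswith("#types"):
                types = line.rstrip("\n").split("\t")[1:]
            elif not line.startswith("#"):
                break  # header is done once data rows begin
    # Any Zeek type not in the table falls back to "text".
    props = {f: {"type": ZEEK_TO_ES.get(t, "text")} for f, t in zip(fields, types)}
    return {"mappings": {"properties": props}}
```
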
This application will recognize gzip or uncompressed logs. This application assumes
you have ElasticSearch set up on your localhost at the default port.
If you do not have ElasticSearch you can output the JSON to stdout with the `-s -b` command line options
to process with the [jq application](https://stedolan.github.io/jq).
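
To picture that output: bulk format pairs each document with an action line naming its target index, and `-b` removes those action lines, leaving plain JSON documents. A sketch with made-up values:

```
# Sketch of the stdout formats, with made-up values: -s emits bulk format
# (action line + document line); adding -b drops the action line.
import json

action = {"index": {"_index": "zeek_conn_2021-12-31"}}
doc = {"ts": "2021-12-31T00:00:00Z", "uid": "CLbPij1vThLvQ2qDKh", "service": "http"}

print(json.dumps(action))  # the ES bulk "header" that -b removes
print(json.dumps(doc))     # the JSON document itself, ready for jq
```
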
You can add a keyword subfield to text fields with the `-k` command line option. This is useful
for aggregations in Kibana.
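
In mapping terms, `-k` produces Elasticsearch's standard multi-field layout; for the default field (`service`) that looks roughly like this:

```
# What a keyword subfield looks like in the resulting mapping (standard
# Elasticsearch multi-field syntax; shown for the default -k field, service).
service_mapping = {
    "service": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword"}},  # aggregatable copy
    }
}
```
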
If Python is already on your system, there is nothing additional for you to copy over
to your machine other than [Elasticsearch, Kibana](https://www.elastic.co/start), and [zeek2es.py](zeek2es.py),
assuming you already have the [requests](https://docs.python-requests.org/en/latest/) library installed.

## Installation

Assuming you meet the [requirements](#requirements), there is none. You just
copy [zeek2es.py](zeek2es.py) to your host and run it with Python. Once Zeek
logs have been imported with automatic index name generation (meaning, you did not supply the `-i` option)
you will find your indices named "zeek_`zeeklogname`_`date`", where `zeeklogname` is a log name like `conn`
and the `date` is in `YYYY-MM-DD` format. Set your Kibana index pattern to match `zeek*` in this case. If
you named your index with the `-i` option, you will need to create a Kibana index pattern that
matches your naming scheme.
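
The naming scheme is easy to reproduce; a minimal sketch (assuming the date comes from the log's timestamp in UTC, which may differ from zeek2es's exact behavior):

```
# Sketch of the "zeek_<logname>_<YYYY-MM-DD>" naming scheme described above.
# Whether zeek2es derives the date in UTC is an assumption here.
from datetime import datetime, timezone

def index_name(log_name: str, ts: float) -> str:
    day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
    return f"zeek_{log_name}_{day}"

print(index_name("conn", 1640908800.0))  # -> zeek_conn_2021-12-31
```
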
If you are upgrading zeek2es, please see [the section on upgrading zeek2es](#upgradingzeek2es).

### Elastic v8.0+

If you are using Elastic v8.0+, it has security enabled by default. This adds a requirement of a username
and password, plus HTTPS.

If you want to be able to delete indices/data streams with wildcards (as examples in this readme show),
edit `elasticsearch.yml` with the following line:

```
action.destructive_requires_name: false
```

You will also need to change the curl commands in this readme to contain `-k -u elastic:`
where the `elastic` user's password is set with a command like the following:

```
./bin/elasticsearch-reset-password -u elastic -i
```

You can use `zeek2es.py` with the `--user` and `--passwd` command line options to specify your
credentials to ES. You can also supply these options via the extra command line arguments for the helper
scripts.
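
For a sense of what those credentials amount to at the HTTP level, here is a sketch of a bulk request against a secured v8 instance using the [requests](https://docs.python-requests.org/en/latest/) library (the payload is illustrative; this is not zeek2es's internal code):

```
# Sketch: a bulk POST to a secured Elastic v8 instance. The credentials play
# the role of --user/--passwd; verify=False accepts the self-signed cert.
import json
import requests

action = {"index": {"_index": "zeek_conn_2021-12-31"}}
doc = {"ts": "2021-12-31T00:00:00Z", "uid": "CLbPij1vThLvQ2qDKh"}
payload = json.dumps(action) + "\n" + json.dumps(doc) + "\n"

resp = requests.post(
    "https://localhost:9200/_bulk",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
    auth=("elastic", "your_password"),
    verify=False,
)
resp.raise_for_status()
```
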
### Docker

Probably the easiest way to use this code is through Docker. All of the files are in the `docker` directory.
First, you will want to edit the lines with `CHANGEME!!!` in the `.env` file to fit your environment.
You will also need to edit the Elastic password in `docker/zeek2es/entrypoint.sh` to match. It can be found after the `--passwd` option.
Next, you can change directory into the `docker` directory and type the following commands to bring
up a zeek2es and Elasticsearch cluster:

```
docker-compose build
docker-compose up
```

Now you can put logs in the `VOLUME_MOUNT/data/logs` directory (the `VOLUME_MOUNT` you set in the `.env` file).
When logs are CREATED in this directory, zeek2es will begin processing them and pushing them into Elasticsearch.
You can then log in to https://localhost:5601 with the username and password you set up in the `.env` file.
By default there is a self-signed certificate, but you can change that if you edit the docker compose files. Once inside
Kibana you will go to Stack Management->Data Views and create a data view for `logs*` with the timestamp `@timestamp`.
Now you will be able to go to Discover and start searching your logs! Your data is persistent in the `VOLUME_MOUNT/data` directory you set.
If you would like to remove all data, just `rm -rf VOLUME_MOUNT/data`, substituting the directory you set into that remove command.
The next time you start your cluster it will be brand new for more data.

## Upgrading zeek2es

Most upgrades should be as simple as copying the newer [zeek2es.py](zeek2es.py) over
the old one. In some cases, the ES ingest pipeline required for the `-g` command line option
might change during an upgrade. Therefore, it is strongly recommended that you delete
your [ingest pipeline](#esingestpipeline) before you run a new version of zeek2es.py.

## ES Ingest Pipeline

If you need to [delete the "zeekgeoip" ES ingest pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/delete-pipeline-api.html)
used to geolocate IP addresses with the `-g` command line option, you can either do it graphically
through Kibana's Stack Management->Ingest Pipelines, or this command will do it for you:

```
curl -X DELETE "localhost:9200/_ingest/pipeline/zeekgeoip?pretty"
```

Running this command is strongly recommended whenever you update your copy of zeek2es.py.
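
For context, a pipeline of that general shape can be created with Elasticsearch's standard `geoip` processor; the processor layout and target field below are assumptions for illustration, not the exact `zeekgeoip` definition zeek2es installs:

```
# Illustrative geoip ingest pipeline (not the exact "zeekgeoip" definition
# zeek2es installs). Uses Elasticsearch's standard geoip processor.
import json
import requests

pipeline = {
    "description": "Geolocate Zeek originator addresses (example)",
    "processors": [
        {"geoip": {"field": "id.orig_h", "target_field": "geoip_orig",
                   "ignore_missing": True}}
    ],
}
resp = requests.put(
    "http://localhost:9200/_ingest/pipeline/zeekgeoip",
    data=json.dumps(pipeline),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```
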
## Filtering Data

zeek2es provides filtering capabilities for your Zeek logs before they are stored in ElasticSearch. This
functionality can be enabled with the `-a` or `-f` options. The filters are constructed from Python
lambda functions, where the input is a Python dictionary representing the output JSON document. You can add a
filter to only store connection logs where the `service` field is populated using the `-f` option with
this lambda filter file:

```
lambda x: 'service' in x and len(x['service']) > 0
```

Or maybe you'd like to filter for connections that have at least 1,024 bytes, with at least 1 byte coming from
the destination:

```
lambda x: 'orig_ip_bytes' in x and 'resp_ip_bytes' in x and x['orig_ip_bytes'] + x['resp_ip_bytes'] > 1024 and x['resp_ip_bytes'] > 0
```

Simpler lambda filters can be provided on the command line via the `-a` option. This filter will only store
connection log entries where the originator IP address is part of the `192.0.0.0/8` network:

```
python zeek2es.py conn.log.gz -a "lambda x: 'id.orig_h' in x and ipaddress.ip_address(x['id.orig_h']) in ipaddress.ip_network('192.0.0.0/8')"
```

### Python Filters

For power users, the `-f` option will allow you to define a full function (instead of Python's lambda functions) so you can write functions that
span multiple lines.
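
Conceptually, the filter text is evaluated into a callable and each log entry (a dict) is passed through it. A sketch of that mechanism in plain Python (not zeek2es's internal code):

```
# Sketch of the filtering concept: eval the filter text into a callable,
# then keep only the entries for which it returns True.
filter_text = "lambda x: 'service' in x and len(x['service']) > 0"
keep = eval(filter_text)

entries = [
    {"uid": "C1", "service": "http"},
    {"uid": "C2", "service": ""},
]
print([x for x in entries if keep(x)])  # -> only the "http" entry
```
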
### Filter on Keys

In some instances you might want to pull data from one log that depends on another. An
example would be finding all `ssl.log` rows that have a `uid` matching previously
indexed rows from `conn.log`, or vice versa. You can filter by importing your
`conn.log` files with the `-o uid uid.txt` command line. This will log all uids that were
indexed to a file named `uid.txt`. Then, when you import your `ssl.log` files you will provide
the `-e uid uid.txt` command line. This will only import SSL rows
containing `uid` values that are in `uid.txt`, previously built from our import of `conn.log`.
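
The two-pass idea is simple to picture in plain Python (a conceptual sketch, not zeek2es's code): the first pass appends each seen key to a file, and the second pass keeps only entries whose key is in that file:

```
# Conceptual sketch of -o / -e: pass 1 logs every uid to a file (appending,
# as the readme warns); pass 2 keeps only entries whose uid is in that file.
conn_entries = [{"uid": "C1"}, {"uid": "C2"}]
ssl_entries = [{"uid": "C2"}, {"uid": "C3"}]

# Pass 1, like: python zeek2es.py conn.log.gz -o uid uid.txt
with open("uid.txt", "a") as f:
    for entry in conn_entries:
        f.write(entry["uid"] + "\n")

# Pass 2, like: python zeek2es.py ssl.log.gz -e uid uid.txt
with open("uid.txt") as f:
    wanted = set(f.read().split())
print([e for e in ssl_entries if e["uid"] in wanted])  # -> only C2
```
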
## Command Line Examples

A basic import of a single log into a named index:

```
python zeek2es.py your_zeek_log.gz -i your_es_index_name
```

This script can be run in parallel on all connection logs, 10 at a time, with the following command:
```
find /some/dir -name "conn*.log.gz" | parallel -j 10 python zeek2es.py {1} :::: -
```

If you would like to automatically import all conn.log files as they are created in a directory, the following
[fswatch](https://emcrisostomo.github.io/fswatch/) command will do that for you:

```
fswatch -m poll_monitor --event Created -r /data/logs/zeek/ | awk '/^.*\/conn.*\.log\.gz$/' | parallel -j 5 python ~/zeek2es.py {} -g -d :::: -
```

If you have the jq command installed you can perform searches across all your logs for a common
field like connection uid, even without ElasticSearch:

```
find /usr/local/var/logs -name "*.log.gz" -exec python ~/Source/zeek2es/zeek2es.py {} -s -b -z \; | jq -c '. | select(.uid=="CLbPij1vThLvQ2qDKh")'
```

You can use much more complex jq queries than this if you are familiar with jq.

If you want to remove all of your Zeek data from ElasticSearch, this command will do it for you:
```
curl -X DELETE http://localhost:9200/zeek*
```

Since the indices have the date appended to them, you could
delete Dec 31, 2021 with the following command:

```
curl -X DELETE http://localhost:9200/zeek_*_2021-12-31
```

You could delete all conn.log entries with this command:
```
curl -X DELETE http://localhost:9200/zeek_conn_*
```

## Command Line Options

```
$ python zeek2es.py -h
usage: zeek2es.py [-h] [-i ESINDEX] [-u ESURL] [--user USER] [--passwd PASSWD]
[-l LINES] [-n NAME] [-k KEYWORDS [KEYWORDS ...]]
[-a LAMBDAFILTER] [-f FILTERFILE]
[-y OUTPUTFIELDS [OUTPUTFIELDS ...]] [-d DATASTREAM]
[--compress] [-o fieldname filename] [-e fieldname filename]
[-g] [-p SPLITFIELDS [SPLITFIELDS ...]] [-j] [-r] [-t] [-s]
[-b] [--humio HUMIO HUMIO] [-c] [-w] [-z]
filename

Process Zeek ASCII logs into ElasticSearch.
positional arguments:
filename The Zeek log in *.log or *.gz format. Include the full path.

optional arguments:
-h, --help show this help message and exit
-i ESINDEX, --esindex ESINDEX
The Elasticsearch index/data stream name.
-u ESURL, --esurl ESURL
The Elasticsearch URL. Use ending slash. Use https for Elastic v8+. (default: http://localhost:9200)
--user USER The Elasticsearch user. (default: disabled)
--passwd PASSWD The Elasticsearch password. Note this will put your password in this shell history file. (default: disabled)
-l LINES, --lines LINES
Lines to buffer for RESTful operations. (default: 10,000)
-n NAME, --name NAME The name of the system to add to the index for uniqueness. (default: empty string)
-k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...]
A list of text fields to add a keyword subfield. (default: service)
-a LAMBDAFILTER, --lambdafilter LAMBDAFILTER
A Python lambda function, when eval'd will filter your output JSON dict. (default: empty string)
-f FILTERFILE, --filterfile FILTERFILE
A Python function file, when eval'd will filter your output JSON dict. (default: empty string)
-y OUTPUTFIELDS [OUTPUTFIELDS ...], --outputfields OUTPUTFIELDS [OUTPUTFIELDS ...]
A list of fields to keep for the output. Must include ts. (default: empty string)
-d DATASTREAM, --datastream DATASTREAM
Instead of an index, use a data stream that will rollover at this many GB.
Recommended is 50 or less. (default: 0 - disabled)
--compress If a datastream is used, enable best compression.
-o fieldname filename, --logkey fieldname filename
A field to log to a file. Example: uid uid.txt.
Will append to the file! Delete file before running if appending is undesired.
This option can be called more than once. (default: empty - disabled)
-e fieldname filename, --filterkeys fieldname filename
A field to filter with keys from a file. Example: uid uid.txt. (default: empty string - disabled)
-g, --ingestion Use the ingestion pipeline to do things like geolocate IPs and split services. Takes longer, but worth it.
-p SPLITFIELDS [SPLITFIELDS ...], --splitfields SPLITFIELDS [SPLITFIELDS ...]
A list of additional fields to split with the ingestion pipeline, if enabled.
(default: empty string - disabled)
-j, --jsonlogs Assume input logs are JSON.
-r, --origtime Keep the numerical time format, not milliseconds as ES needs.
-t, --timestamp Keep the time in timestamp format.
-s, --stdout Print JSON to stdout instead of sending to Elasticsearch directly.
-b, --nobulk Remove the ES bulk JSON header. Requires --stdout.
--humio HUMIO HUMIO First argument is the Humio URL, the second argument is the ingest token.
-c, --cython Use Cython execution by loading the local zeek2es.so file through an import.
Run python setup.py build_ext --inplace first to make your zeek2es.so file!
-w, --hashdates Use hashes instead of dates for the index name.
-z, --supresswarnings
Supress any type of warning. Die stoically and silently.

To delete indices:
curl -X DELETE http://localhost:9200/zeek*?pretty
To delete data streams:
curl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty
To delete index templates:
curl -X DELETE http://localhost:9200/_index_template/zeek*?pretty
To delete the lifecycle policy:
curl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty
You will need to add -k -u elastic_user:password if you are using Elastic v8+.
```

## Requirements

- A Unix-like environment (macOS works!)
- Python
- [requests](https://docs.python-requests.org/en/latest/) Python library installed, such as with `pip`.

## Notes

### Humio

To import your data into Humio you will need to set up a repository with the `corelight-json` parser. Obtain
the ingest token for the repository and you can import your data with a command such as:

```
python3 zeek2es.py -s -b --humio http://localhost:8080 b005bf74-1ed3-4871-904f-9460a4687202 http.log
```

The URL should be in the format `http://yourserver:8080`; the rest of the path is added by the
`zeek2es.py` script automatically for you.

### JSON Log Input

Since Zeek JSON logs do not have type information like the ASCII TSV versions, only limited type information
can be provided to ElasticSearch. You will notice this most for Zeek "addr" log fields that
are not `id$orig_h` and `id$resp_h`, since the type information is not available to translate the field into
ElasticSearch's "ip" type. Since address fields will not be of type "ip", you will not be able to use
subnet searches, for example, like you could for the TSV logs. Saving Zeek logs in ASCII TSV
format provides for greater long term flexibility.
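
In other words, with JSON input there is no `#types` line to consult, so only fields known in advance can be given richer ES types. A sketch of that limitation (the known-field set is an assumption based on the behavior described above):

```
# Sketch: with JSON logs there are no "#types" to read, so only fields known
# ahead of time (an assumed set, per the behavior described above) can be
# mapped to Elasticsearch's "ip" type; every other addr field stays text.
KNOWN_IP_FIELDS = {"id.orig_h", "id.resp_h"}

def es_type_for_json_field(field: str) -> str:
    return "ip" if field in KNOWN_IP_FIELDS else "text"

print(es_type_for_json_field("id.orig_h"))  # -> ip
print(es_type_for_json_field("tx_hosts"))   # -> text, though it holds addresses
```
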
### Data Streams

You can use data streams instead of indices for large logs with the `-d` command line option. This
option creates index templates beginning with `zeek_`. It also creates a lifecycle policy
named `zeek-lifecycle-policy`. If you would like to delete all of your data streams, lifecycle policies,
and index templates, these commands will do it for you:

```
curl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty
curl -X DELETE http://localhost:9200/_index_template/zeek*?pretty
curl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty
```
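
For reference, a size-based rollover of the kind `-d` describes corresponds to an ILM policy roughly like the following; the policy body is an assumption shown via the standard ILM API, and the exact policy zeek2es creates may differ:

```
# Illustrative ILM policy with a size-based rollover, in the spirit of -d 50.
# The policy name is from the readme; the body is an assumption.
import json
import requests

policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_size": "50gb"}}}
        }
    }
}
resp = requests.put(
    "http://localhost:9200/_ilm/policy/zeek-lifecycle-policy",
    data=json.dumps(policy),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```
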
### Helper Scripts

There are two scripts that will help you make your logs into data streams such as `logs-zeek-conn`.
The first script is [process_logs_as_datastream.sh](process_logs_as_datastream.sh) and, given
a list of logs and directories, will import them as such. The second script
is [process_log.sh](process_log.sh), and it can be used to import logs
one at a time. This script can also be used to monitor logs created in a directory with
[fswatch](https://emcrisostomo.github.io/fswatch/). Both scripts have example command lines
if you run them without any parameters.

```
$ ./process_logs_as_datastream.sh
Usage: ./process_logs_as_datastream.sh NJOBS "ADDITIONAL_ARGS_TO_ZEEK2ES" "LIST_OF_LOGS_DELIMITED_BY_SPACES" DIR1 DIR2 ...

Example:
time ./process_logs_as_datastream.sh 16 "" "amqp bgp conn dce_rpc dhcp dns dpd files ftp http ipsec irc kerberos modbus modbus_register_change mount mqtt mysql nfs notice ntlm ntp ospf portmap radius reporter rdp rfb rip ripng sip smb_cmd smb_files smb_mapping smtp snmp socks ssh ssl stun syslog tunnel vpn weird wireguard x509" /usr/local/var/logs
```

```
$ ./process_log.sh
Usage: ./process_log.sh LOGFILENAME "ADDITIONAL_ARGS_TO_ZEEK2ES"

Example:
fswatch -m poll_monitor --event Created -r /data/logs/zeek | awk '/^.*\/(conn|dns|http)\..*\.log\.gz$/' | parallel -j 16 ./process_log.sh {} "" :::: -
```

You will need to edit these scripts and command lines according to your environment.

Any file named after a log, such as `conn_filter.txt`, in the `lambda_filter_file_dir` (by default your home directory) will be applied as a lambda
filter file to the corresponding log input. This allows you to set up all of your filters in one directory and import multiple log files with
that set of filters in one command with [process_logs_as_datastream.sh](process_logs_as_datastream.sh).

The following lines should delete all Zeek data in ElasticSearch no matter whether you use indices,
data streams, or these helper scripts:

```
curl -X DELETE http://localhost:9200/zeek*?pretty
curl -X DELETE http://localhost:9200/_data_stream/zeek*?pretty
curl -X DELETE http://localhost:9200/_data_stream/logs-zeek*?pretty
curl -X DELETE http://localhost:9200/_index_template/zeek*?pretty
curl -X DELETE http://localhost:9200/_index_template/logs-zeek*?pretty
curl -X DELETE http://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty
```

... or if using Elastic v8+ ...
```
curl -X DELETE -k -u elastic:password https://localhost:9200/zeek*?pretty
curl -X DELETE -k -u elastic:password https://localhost:9200/_data_stream/zeek*?pretty
curl -X DELETE -k -u elastic:password https://localhost:9200/_data_stream/logs-zeek*?pretty
curl -X DELETE -k -u elastic:password https://localhost:9200/_index_template/zeek*?pretty
curl -X DELETE -k -u elastic:password https://localhost:9200/_index_template/logs-zeek*?pretty
curl -X DELETE -k -u elastic:password https://localhost:9200/_ilm/policy/zeek-lifecycle-policy?pretty
```

But to be able to do this in v8+ you will need to configure Elastic as described
in the section [Elastic v8.0+](#elastic80).

### Cython

If you'd like to try [Cython](https://cython.org/), you must run `python setup.py build_ext --inplace`
first to generate your compiled file. You must do this every time you update zeek2es!
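
For reference, a minimal `setup.py` for this kind of build looks roughly like the following; the repository ships its own, so this is only a sketch:

```
# A minimal setup.py of the kind "python setup.py build_ext --inplace" expects;
# sketch only - the zeek2es repository provides its own setup.py.
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("zeek2es.py"))
```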