{"id":30450227,"url":"https://github.com/deflect-ca/baskerville","last_synced_at":"2025-08-23T13:26:06.067Z","repository":{"id":37667620,"uuid":"262622156","full_name":"deflect-ca/baskerville","owner":"deflect-ca","description":"Security Analytics Engine - Anomaly Detection in Web Traffic","archived":false,"fork":false,"pushed_at":"2025-05-26T12:47:09.000Z","size":79130,"stargazers_count":31,"open_issues_count":32,"forks_count":4,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-05-26T13:58:00.098Z","etag":null,"topics":["apache-kafka","apache-spark","big-data","grafana","isolation-forest","machine-learning","prometheus","python3","security-analytics","spark"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deflect-ca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-05-09T17:23:05.000Z","updated_at":"2025-01-31T09:29:06.000Z","dependencies_parsed_at":"2024-06-11T18:40:28.786Z","dependency_job_id":null,"html_url":"https://github.com/deflect-ca/baskerville","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/deflect-ca/baskerville","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deflect-ca%2Fbaskerville","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deflect-ca%2Fbaskerville/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deflect-ca%2Fbaskerville/releases","manifests_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/deflect-ca%2Fbaskerville/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deflect-ca","download_url":"https://codeload.github.com/deflect-ca/baskerville/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deflect-ca%2Fbaskerville/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271749048,"owners_count":24814115,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-23T02:00:09.327Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-kafka","apache-spark","big-data","grafana","isolation-forest","machine-learning","prometheus","python3","security-analytics","spark"],"created_at":"2025-08-23T13:25:56.883Z","updated_at":"2025-08-23T13:26:06.044Z","avatar_url":"https://github.com/deflect-ca.png","language":"Python","readme":"![Unit tests](https://github.com/equalitie/baskerville/workflows/Unit%20tests/badge.svg)\n\n## Contents\n\n- [What is Baskerville](#what-is-baskerville)\n    - [Overview](#overview)\n    - [Technology](#technology)  \n- [Requirements](#requirements)\n- [Useful Definitions](#useful-definitions)\n    - [Runtime](#runtime)    \n    - [Model](#model)    \n    - [Anomaly detector](#anomaly-detector)    \n    - [Request set](#request-set)    \n    - [Subset](#subset)    \n    - [Time 
bucket](#time-bucket)\n- [Baskerville Engine](#baskerville-engine)\n  - [Pipelines](#pipelines):\n    - [Raw Logs](#raw-logs)\n    - [Elastic Search](#elastic-search)\n    - [Kafka](#kafka)\n    - [Training](#training)\n  - [Predictions and ML](#predictions-and-ml)\n- [Installation](#installation)\n- [Configuration](#configuration)\n- [How to run](#how-to-run)\n- [Testing](#testing)\n- [Docs](#docs)\n- [TODO](#todo)\n- [Contributing](#contributing)\n  - [Contributors](#contributors)\n\n## What is Baskerville\n\nManual identification and mitigation of (DDoS) attacks on websites is a difficult and time-consuming task with many challenges. This is why Baskerville was created, to identify the attacks directed to [Deflect](https://deflect.ca) protected \nwebsites as they happen and give the infrastructure the time to respond properly. Baskerville is an analytics engine that leverages Machine Learning to distinguish between normal and abnormal web traffic behavior. \nIn short, Baskerville is a Layer 7 (application layer) DDoS attack mitigation tool.\n\nThe challenges:\n- Be fast enough  to make it count\n- Be able to  adapt to traffic\n (Apache Spark, Apache Kafka)\n- Provide actionable info (A prediction and a score for an IP)\n- Provide reliable predictions (Probation period \u0026 feedback)\n- As with any ML project: not enough labelled data (Using normal-ish data - the anomaly/ novelty detector can accept a small percentage of anomalous data in the dataset)\n\n\n## Overview\n\nBaskerville is the component of the Deflect analysis engine that is used to\ndecide whether IPs connecting to Deflect hosts are authentic normal\nconnections, or malicious bots. In order to make this assessment, Baskerville\ngroups incoming requests into *request sets* by requested host and requesting IP.\n\nFor each request set, a selection of *features* are\ncomputed. 
These are properties of the requests within the request set (e.g.\naverage path depth, number of unique queries, HTML to image ratio...) that are\nintended to help differentiate normal request sets from bot request sets. A semi-supervised\n*novelty detector*, trained offline on the feature vectors of a set of normal\nrequest sets, is used to predict whether new request sets are normal or suspicious. The request sets,\ntheir features, trained models, and details of suspected attacks and attributes,\nare all saved to a relational database (Postgres currently).\n\nPut simply, the Baskerville *engine* is the workhorse that consumes\nand processes input web logs and saves the output. This engine can be run as\nBaskerville *on-line*, which enables the immediate identification of suspicious IPs, or as Baskerville *off-line*, which conducts this same\nanalysis for log files saved locally or in an elasticsearch database.\n\n![Deflect and Baskerville](data/img/deflect_and_baskerville.png?raw=true \"Deflect and Baskerville\")\n\n\n1. The first step is to get the data (Combined log format) and transport it to the processing system\n2. Once we get the logs, the processing system will start transforming the input into feature vectors and subsequently predictions for target-ip pairs\n3. The information that comes out from the processing is stored in a database\n4. At the same time, while the system is running, there are other systems in place to monitor its progress and alert if anything is out of place\n5. 
And eventually, the insight we gained from the process will return to the source so that actions can be taken\n\n## Technology\n- [Apache Spark](https://spark.apache.org)\n- [Apache Kafka](https://kafka.apache.org)\n- [Postgres (with the Timescale extension)](https://github.com/timescale/timescaledb)\n- Metrics:\n  - [Prometheus (with the Postgres adapter for storage)](https://prometheus.io)\n  - [Grafana](https://grafana.com)\n\nFor the Machine Learning:\n- [Pyspark ML](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html)\n- [Isolation Forest](https://github.com/titicaca/spark-iforest) (a slightly modified version)\n- [Scikit Learn](https://scikit-learn.org/stable/) (not actively used because of performance issues)\n\n## Requirements\n- Python \u003e= 3.6 (ideally 3.6 because there have been some issues with 3.7 testing and numpy in the past)\n- Postgres 10\n- If you want to use the IForest model, then https://github.com/titicaca/spark-iforest is required (`pyspark_iforest.ml`)\n- Java 8 needs to be in place (and in PATH) for [Spark](https://spark.apache.org/docs/2.4.5/)\n(Pyspark version \u003e=2.4.4) to work,\n- The required packages in `requirements.txt`\n- Tests need `pytest`, `mock` and `spark-testing-base`\n- For the Elastic Search pipeline: access to the [esretriever](https://github.com/equalitie/esretriever.git)\nrepository \n- Optionally: access to the [Deflect analytics ecosystem](https://github.com/equalitie/deflect-analytics-ecosystem)\nrepository (to run Baskerville dockerized components like Postgres, Kafka, Prometheus, Grafana etc).\n\n## Installation\nFor Baskerville to be fully functional we need to install the following:\n```bash\n# clone and install spark-iforest\ngit clone https://github.com/titicaca/spark-iforest\ncd spark-iforest/python\npip install .\n\n# clone and install esretriever - for the ElasticSearch pipeline\ncd ../../\ngit clone https://github.com/equalitie/esretriever.git\ncd esretriever\npip install .\n\n# Finally, 
clone and install baskerville\ncd ../\ngit clone https://github.com/equalitie/baskerville.git\ncd baskerville\npip install .[test]\n```\n\nNote: On Windows you might have to install `pypandoc` first for pyspark to install.\n\n## Configuration\nSince it is good practice to use environment variables for the configuration, the following can be set and used as follows:\n\n```bash\ncd baskerville\n# must be set:\nexport BASKERVILLE_ROOT=$(pwd)  # full path to baskerville folder\nexport DB_HOST=the db host (docker ip if local)\nexport DB_PORT=5432\nexport DB_USER=postgres user\nexport DB_PASSWORD=secret\n\n# optional - pipeline dependent\nexport ELK_USER=elasticsearch\nexport ELK_PASSWORD=changeme\nexport ELK_HOST=the elasticsearch host:port (localhost:9200)\nexport KAFKA_HOST=kafka host:port (localhost:9092)\n```\n\nA basic configuration for running the raw log pipeline is the following - rename [`baskerville/conf/conf_rawlog_example_baskerville.yaml`](conf/conf_rawlog_example_baskerville.yaml) to `baskerville/conf/baskerville.yaml`:\n```yaml\n---\ndatabase:\n  name: baskerville                   # the database name\n  user: !ENV ${DB_USER}\n  password: !ENV '${DB_PASSWORD}'\n  type: 'postgres'\n  host: !ENV ${DB_HOST}\n  port: !ENV ${DB_PORT}\n  maintenance:                        # Optional, for data partitioning and archiving\n    partition_table: 'request_sets'   # default value\n    partition_by: week                # partition by week or month, default value is week\n    partition_field: created_at       # which field to use for the partitioning, this is the default value, can be omitted\n    strict: False                     # if False, then for the week partition the start and end date will be changed to the start and end of the respective weeks. If true, then the dates will remain unchanged. 
Be careful to be consistent with this.\n    data_partition:                   # Optional: Define the period to create partitions for\n      since: 2020-01-01               # when to start partitioning\n      until: \"2020-12-31 23:59:59\"    # when to stop partitioning\n\nengine:\n  storage_path: !ENV '${BASKERVILLE_ROOT}/data'\n  raw_log:\n    paths:                    # optional, a list of logs to parse - they will be parsed subsequently\n      - !ENV '${BASKERVILLE_ROOT}/data/samples/test_data_1k.json'  # sample data to run baskerville raw log pipeline with\n  cache_expire_time: 604800       # sec (604800 = 1 week)\n  model_id: -1                    # optional, -1 returns the latest model in the database\n  extra_features:                 # useful when we need to calculate more features than the model requests or when there is no model\n      - css_to_html_ratio         # feature names have the following convention: class FeatureCssToHtmlRatio --\u003e 'css_to_html_ratio'\n      - image_to_html_ratio\n      - js_to_html_ratio\n      - minutes_total\n      - path_depth_average\n      - path_depth_variance\n      - payload_size_average\n  data_config:\n    schema: !ENV '${BASKERVILLE_ROOT}/data/samples/sample_log_schema.json'\n  logpath: !ENV '${BASKERVILLE_ROOT}/baskerville.log'\n  log_level: 'ERROR'\n\nspark:\n  app_name: 'Baskerville'   # the application name - can be changed for two different runs - used by the spark UI\n  master: 'local'           # the ip:port of the master node, e.g. 
spark://someip:7077 to submit to a cluster\n  parallelism: -1           # controls the number of tasks, -1 means use all cores - used for local master\n  log_level: 'INFO'         # spark logs level\n  storage_level: 'OFF_HEAP' # which strategy to use for storing dfs - valid values are the ones found here: https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/storagelevel.html default: OFF_HEAP\n  jars: !ENV '${BASKERVILLE_ROOT}/data/jars/postgresql-42.2.4.jar,${BASKERVILLE_ROOT}/data/spark-iforest-2.4.0.jar' # or /path/to/jars/mysql-connector-java-8.0.11.jar\n  spark_driver_memory: '6G' # depends on your dataset and the available ram you have. If running locally 6 - 8 GB should be a good choice, depending on the amount of data you need to process\n  metrics_conf: !ENV '${BASKERVILLE_ROOT}/data/spark.metrics'  # Optional: required only  to export spark metrics\n  jar_packages: 'com.banzaicloud:spark-metrics_2.11:2.3-2.0.4,io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2'  # required to export spark metrics\n  jar_repositories: 'https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases' # Optional: Required only to export spark metrics\n  event_log: True\n  serializer: 'org.apache.spark.serializer.KryoSerializer'\n```\n\nIn [`baskerville/conf/conf_example_baskerville.yaml`](conf/conf_example_baskerville.yaml) you can see all the possible configuration options.\n\nExample of configurations for the other pipelines:\n- kafka: [`baskerville/conf/conf_kafka_example_baskerville.yaml`](conf/conf_kafka_example_baskerville.yaml)\n- es: [`baskerville/conf/conf_es_example_baskerville.yaml`](conf/conf_es_example_baskerville.yaml)\n- training: [`baskerville/conf/conf_training_example_baskerville.yaml`](conf/conf_training_example_baskerville.yaml)\n\n## How to run\nIn general, there are two ways to run Baskerville, a pure python one or 
using `spark-submit`, both are detailed below.\n\nThe full set of options:\n```bash\nusage: main.py [-h] [-s] [-e] [-t] [-c CONF_FILE] pipeline\n\npositional arguments:\n  pipeline              Pipeline to use: es, rawlog, or kafka\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -s, --simulate        Simulate real-time run using kafka\n  -e, --startexporter   Start the Baskerville Prometheus exporter at the\n                        specified in the configuration port\n  -t, --testmodel       Add a test model in the models table\n  -c CONF_FILE, --conf CONF_FILE\n                        Path to config file\n```\n\n- For the python approach (which I find easier when running locally):\n```bash\ncd baskerville/src/baskerville\npython3 main.py [which pipeline to use] [should baskerville register and export metrics? if yes use the -e flag]\n# which means:\npython3 main.py [kafka||rawlog||es] [-e] [-t] [-s]\n```\n\nNote: you can replace `python3 main.py` with `baskerville` like this:\n```bash\nbaskerville [kafka||rawlog||es] [-e] [-t] [-s]\n```\n\n- And for [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html) (which is usually used when you want to submit a Baskerville Application to a cluster):\n\n```bash\ncd baskerville\nexport BASKERVILLE_ROOT=$(pwd)\nspark-submit --jars $BASKERVILLE_ROOT/data/jars/[relevant jars like postgresql-42.2.4.jar,spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar etc] --conf [spark configurations if any - note that these will override the baskerville.yaml configurations] $BASKERVILLE_ROOT/src/baskerville/main.py [kafka||rawlog||es] [-e] -c $BASKERVILLE_ROOT/conf/baskerville.yaml (or the absolute path to your configuration)\n```\n\nExamples:\n```bash\ncd baskerville\nexport BASKERVILLE_ROOT=$(pwd) # or full path to baskerville\n# minimal spark-submit for the raw logs pipeline:\nspark-submit --jars $BASKERVILLE_ROOT/data/jars/postgresql-42.2.4.jar --conf 
spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=2g $BASKERVILLE_ROOT/src/baskerville/main.py rawlog -c $BASKERVILLE_ROOT/conf/baskerville.yaml \n# or\nspark-submit --packages org.postgresql:postgresql:42.2.4 --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=2g $BASKERVILLE_ROOT/src/baskerville/main.py rawlog -c $BASKERVILLE_ROOT/conf/baskerville.yaml \n\n# minimal spark-submit for the elastic search pipeline:\nspark-submit --jars $BASKERVILLE_ROOT/data/jars/postgresql-42.2.4.jar,$BASKERVILLE_ROOT/data/jars/elasticsearch-spark-20_2.11-5.6.5.jar --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=2g $BASKERVILLE_ROOT/src/baskerville/main.py es -c $BASKERVILLE_ROOT/conf/baskerville.yaml \n\n# minimal spark-submit for the kafka pipeline:\nspark-submit --jars $BASKERVILLE_ROOT/data/jars/postgresql-42.2.4.jar,$BASKERVILLE_ROOT/data/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=2g $BASKERVILLE_ROOT/src/baskerville/main.py kafka -c $BASKERVILLE_ROOT/conf/baskerville.yaml\n\n# note: spark.memory.offHeap.size does not have to be 2g; it depends on the size of your data\n```\nThe paths in spark-submit must be absolute and accessible from all the workers.\n\nNote for Windows:\nSpark might not initialize. 
If so, set $HADOOP_HOME as follows and download the appropriate `winutils.exe` from [https://github.com/steveloughran/winutils](https://github.com/steveloughran/winutils)\n\n```bash\nmkdir c:\\hadoop\\bin\nexport HADOOP_HOME=c:\\hadoop\ncp $HOME\\Downloads\\winutils.exe $HADOOP_HOME\\bin\n```\n##### Necessary services/components for each pipeline:\n\nFor Baskerville `kafka` you'll need:\n  - Kafka\n  - Zookeeper\n  - Postgres\n  - Prometheus  [optional]\n  - Grafana     [optional]\n\nFor Baskerville `rawlog` you'll need:\n  - Postgres\n  - Prometheus  [optional]\n  - Grafana     [optional]\n\nFor Baskerville `es` you'll need:\n  - Postgres\n  - Elastic Search\n  - Prometheus  [optional]\n  - Grafana     [optional]\n\n__An ElasticSearch service is not provided.__\n\nFor Baskerville `training` you'll need:\n  - Postgres\n  - Prometheus  [optional]\n  - Grafana     [optional]\n\n#### Using Docker - for the dev environment\nUnder `baskerville/container` there is a Dockerfile that sets up the appropriate versions of Java and Python, and installs Baskerville.\nTo run this, run `docker-compose up` in the directory where the `docker-compose.yaml` is. \n\nNote that you will have to provide the relevant environment variables as defined in the docker-compose file under `args`.\nAn easy way to do this is to use a `.env` file. Rename the `dot_enf_file` to `.env` and modify it accordingly:\n```yaml\n# must be set:\nBASKERVILLE_BRANCH=e.g. 
master, develop etc, the branch to pull from\nDB_HOST=the db host (docker ip if local)\nDB_PORT=5432\nDB_USER=postgres\nDB_PASSWORD=secret\n\n# optional - pipeline dependent\nELK_USER=elasticsearch\nELK_PASSWORD=changeme\nELK_HOST=the elasticsearch host (docker ip if local):port (default port is 9200)\nKAFKA_HOST=kafka ip (docker ip if local):port (default:9092)\n```\nThe docker ip can be retrieved by `$(ipconfig getifaddr en0)`, assuming `en0` is your active network interface.\nThe `${DOCKER_IP}` is used when the systems are running on your local docker. Otherwise, just use the appropriate environment variable.\nE.g. `- ELK_HOST=${ELK_HOST}`, where `${ELK_HOST}` is set in `.env` to the ip of the elasticsearch instance.\n\nThe command part is how you want baskerville to run. E.g. for the rawlog pipeline - with metrics exporter:\n```yaml\ncommand: python ./main.py -c /app/baskerville/conf/baskerville.yaml rawlog -e\n```\nOnce you have test data in Postgres, you can train:\n```yaml\ncommand: python ./main.py -c /app/baskerville/conf/baskerville.yaml training\n```\nOr use the provided test model:\n```yaml\ncommand: python ./main.py -c /app/baskerville/conf/baskerville.yaml rawlog -e -t\n```\nAnd so on. See the [How to run](#how-to-run) section for more options.\n\nMake sure you have a Postgres instance running. You can use [the deflect-analytics-ecosystem's docker-compose to spin up a postgres instance](https://github.com/equalitie/deflect-analytics-ecosystem/blob/master/docker-compose.yaml#L15) \n\n\nIn [deflect-analytics-ecosystem](https://github.com/equalitie/deflect-analytics-ecosystem) you can find a docker-compose that creates all the aforementioned components and runs Baskerville too. **It should be the simplest way to test Baskerville out**. 
To launch the relevant services, comment out what you don't need from the [`docker-compose.yaml`](https://github.com/equalitie/deflect-analytics-ecosystem/blob/master/docker-compose.yaml) and run\n\n```bash\ndocker-compose up\n```\nin the directory with the `docker-compose.yaml` file.\n\n*Note*: For the Kafka service, before you run `docker-compose up` set:\n\n```bash\nexport DOCKER_KAFKA_HOST=current-local-ip or $(ipconfig getifaddr en0) # where en0 is your current active interface\n```\n\n##### How to check it worked as it should?\nThe simplest way is to check the database.\n```sql\n-- Given that baskerville is the database name defined in the configuration:\nSELECT count(id) from request_sets where id_runtime in (select max(id) from runtimes); -- if you used the test data, it should be 1000 after a full successful execution\nSELECT * FROM baskerville.request_sets limit 10;  -- fields should be complete, e.g. features should be something like the following\n```\n\nRequest_set features example:\n```json\n{\n    \"top_page_to_request_ratio\": 1,\n    \"response4xx_total\": 0,\n    \"unique_path_total\": 2,\n    \"request_interval_variance\": 0,\n    \"unique_path_to_request_ratio\": 1,\n    \"image_total\": 2,\n    \"css_total\": 0,\n    \"js_total\": 0,\n    \"css_to_html_ratio\": 0,\n    \"unique_query_to_unique_path_ratio\": 0,\n    \"unique_ua_rate\": 1,\n    \"top_page_total\": 2,\n    \"js_to_html_ratio\": 0,\n    \"request_total\": 2,\n    \"image_to_html_ratio\": 200,\n    \"request_interval_average\": 0,\n    \"unique_path_rate\": 1,\n    \"minutes_total\": 0,\n    \"unique_query_rate\": 0,\n    \"html_total\": 0,\n    \"path_depth_variance\": 50,\n    \"path_depth_average\": 5,\n    \"unique_query_total\": 0,\n    \"unique_ua_total\": 2,\n    \"payload_size_average\": 10.40999984741211,\n    \"response4xx_to_request_ratio\": 0,\n    \"payload_size_log_average\": 9.250617980957031\n}\n\n```\n## Metrics\nGrafana is a metrics visualization web application that 
can be configured to\ndisplay several dashboards with charts, raise alerts when a metric crosses a user-defined threshold and notify through mail or other means. Within Baskerville, under data/metrics, there is a dashboard, importable into Grafana, which presents the statistics of the Baskerville engine in a customisable manner. It is intended to be the principal visualisation and alerting tool of incoming Deflect traffic, displaying metrics in graphical form.\nPrometheus is the metric storage and aggregator that provides Grafana with the charts data.\nBesides the Spark UI - which is usually under `http://localhost:4040` - there is a set of metrics set up for the Baskerville engine, using the Python Prometheus library. To see those metrics,\ninclude the `-e` flag and go to the configured localhost port (`http://localhost:8998/metrics` by default). You will need a configured Prometheus instance and a Grafana instance to be able to\nvisualize them using the auto-generated [baskerville dashboard](data/metrics/Baskerville-metrics-dashboard.json), which is saved in the data directory, and can\nbe imported in Grafana.\nThere is also an [Anomalies Dashboard](/data/metrics/anomalies-dashboard.json) and a [Kafka Dashboard](/data/metrics/kafka-dashboard.json) under [`data/metrics`](data/metrics).\n\n![Baskerville's Anomalies Dashboard](data/img/anomalies_dashboard.jpg?raw=true \"Baskerville's Anomalies Dashboard\")\n\n![Kafka Dashboard](data/img/kafka_dashboard.png?raw=true \"Baskerville's Kafka Dashboard\")\n\nTo view the spark generated metrics, you have to include the configuration for the jar packages:\n```yaml\n  metrics_conf: !ENV '${BASKERVILLE_ROOT}/data/metrics/spark.metrics'\n  jar_packages: 'io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2,com.banzaicloud:spark-metrics_2.11:2.3-2.0.4'\n  jars_repositories: 
'https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases'\n```\n\nAnd also have a prometheus push gateway set up. You can use [the deflect-analytics-ecosystem's docker-compose for that](https://github.com/equalitie/deflect-analytics-ecosystem/blob/master/docker-compose.yaml#L44). \n\n## Testing\n\n__Unit Tests__:\nBasic unittests for the features and the pipelines have been implemented.\nTo run them:\n```\npython3 -m pytest tests/\n```\n\nNote: Many of the tests have to use a spark session, so unfortunately it will take some time for all to complete. (~=2 minutes currently)\n\n__Functional Tests__:\nNo functional tests exist yet. They will be added as the structure stabilizes.\n\n## Docs\nWe use `pdoc3` to generate docs for Baskerville under the `baskerville/docs` folder. \n```shell script\n# use --force to overwrite docs\npdoc3 --html --force --output-dir docs baskerville\n```\nThen open [`docs/baskerville/index.html`](docs/baskerville/index.html) with a browser.\n\n### Useful Definitions\n\n- **Runtime**: The time and details for each time Baskerville runs, from start to finish of a Pipeline. E.g. start time, number of subsets processed etc.\n\n- **Model**: The Machine Learning details that are stored in the database.\n\n- **Anomaly detector**:\nThe structure that holds the ML related steps, like the algorithm, the respective scaler etc. The current algorithm is Isolation Forest (scikit or pyspark implementation). OneClassSVM was a candidate but it was slow and the results less accurate.\n\n- **Request set**:\nA request set is the behaviour (the characteristics of the requests being made) of a specific IP towards a specific target for a certain amount of time (runtime).\n\n- **Subset**:\nGiven that a request set has the duration of a runtime, the subset is the behaviour of the IP-to-target pair for a specific time-window (`time_bucket`).\n\n- **Time bucket**:\nIt is used to define how often Baskerville should consume and process logs. 
The default value is `120` (seconds).\n\n### Baskerville Engine\nThe main Baskerville engine consumes web logs and uses them to compute\nrequest sets (i.e. the groups of requests made by each IP-host pair) and extract the request set features. It applies a trained novelty detection algorithm to predict whether each request set is normal or anomalous within the current time window. It saves the request set\nfeatures and predictions to the Baskerville storage database. It can also cross-reference\nincoming IP addresses with attacks logged in the database, to determine if a\nlabel (known malicious or known benign) can be applied to the request set.\n\n![Baskerville's Batch Processing Flow](data/img/batch_processing_flow.png?raw=true \"Baskerville's Batch Processing Flow\")\n\n\nEach request-set is divided into *subsets*.\nSubsets have a two-minute length (configurable), and the request set\nfeatures (and prediction) are updated at the end of each subset using a feature-specific update method (discussed\n[here](data/feature_overview/feature_overview.md)).\n\nFor nonlinear features, the feature value will be dependent on the subset\nlength, so for this reason, logs are processed in two-minute subsets even when\nnot being consumed live. This is also discussed in depth in the feature document above.\n\n![Baskerville's Time Window Processing Flow](data/img/tw_processing.png?raw=true \"Baskerville's Time Window Processing Flow\")\n\n\nThe Baskerville engine utilises [Apache Spark](https://spark.apache.org/), an\nanalytics framework designed for large-scale data processing. The decision to use Spark in Baskerville was made to ensure that the engine can\nachieve a high enough level of efficiency to consume and process web logs in real time, and thus run continuously as part of the Deflect ecosystem.\n\n![Baskerville's Basic Flow](data/img/basic_flow.png?raw=true \"Baskerville's Basic Flow\")\n\n#### Pipelines\nThere are four main pipelines. 
Three have to do with the different kinds of input and the fourth one with the ML.\nBaskerville is designed to consume web logs from various sources in predefined intervals (`time bucket`, set to 120 seconds by default).\n\n##### Raw logs\nRead from json logs, process, store in postgres.\n![Baskerville's Raw logs Processing Flow](data/img/raw_logs_pipeline.png?raw=true \"Baskerville's Raw logs Processing Flow\")\n\n##### Elastic Search\nRead from Elastic Search, splitting the period into smaller periods so as not to overload the ELK cluster, process, store in postgres. (optionally store logs locally too)\n![Baskerville's Elastic Search Processing Flow](data/img/elastic_search_pipeline.png?raw=true \"Baskerville's Elastic Search Processing Flow\")\n\n##### Kafka\nRead from a topic every `time_bucket` seconds, process, store in postgres.\n![Baskerville's Kafka Processing Flow](data/img/kafka_pipeline.png?raw=true \"Baskerville's Kafka Processing Flow\")\n\n##### Training\nThe pipeline used to train the machine learning model. Reads preprocessed data from Postgres, trains, and saves the model.\n\nPrediction is optional in the first 3 pipelines - which means you may run the pipelines only to process the data, e.g. historic data, and then use the training pipeline to train/update the model.\n\n#### Predictions and ML\nBecause of the lack of a carefully curated labelled dataset, the approach used here is to have an anomaly/novelty detector, like OneClassSVM or Isolation Forest, trained on **mostly normal** traffic. We can train on data from days we know there have been no major incidents and still be accurate enough. More details here: [Deflect Labs Report #5](https://equalit.ie/deflect-labs-report-5-baskerville/)\nThe output of the prediction process is the prediction and an anomaly score to help indicate confidence in the prediction. 
The output accompanies a specific request set, an IP-target pair for a specific (`time_bucket`) window.\n\nNote: Isolation Forest was preferred to OneClassSVM because of speed and better accuracy.\n\n##### Features\nUnder [`src/baskerville/features`](src/baskerville/features) you can find all the currently implemented features, like:\n- Css to html ratio\n- Image to html ratio\n- Js to html ratio\n- Minutes total\n- Path depth average\n- Path depth variance\n- Payload size average\n- Payload size log average\n- Request interval average\n- Request interval variance\n- Request total\n- Response 4xx to request ratio\n- Top page to request ratio\n- Unique path rate\n- Unique path to request ratio\n- Unique query rate\n- Unique query to unique path ratio\n\nand many more.\n\nMost of the features are `updateable`, which means they **take the past into consideration**. For this purpose, we keep a **request set cache** for a predefined amount of time (1 week by default), where we store the details and feature vectors for previous request sets, in order to be used in the updating process. This is a two-layer cache: one layer holds all the unique request sets (unique ip-host pairs) for the week and the other holds only the unique ip-host pairs and their respective details for the current time window. 
More on feature updating [here](data/feature_overview/feature_overview.md).\n\n![Baskerville's Request Set Cache](data/img/request_set_cache.png?raw=true \"Baskerville's Request Set Cache\")\n\n## Building Baskerville image\n* build spark image https://levelup.gitconnected.com/spark-on-kubernetes-3d822969f85b\n```commandline\nwget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz\nmkdir spark\nmv spark-2.4.6-bin-hadoop2.7.tgz spark\ncd spark\ntar -xvzf spark-2.4.6-bin-hadoop2.7.tgz\nexport SPARK_HOME=/root/spark/spark-2.4.6-bin-hadoop2.7\nalias spark-shell=\"$SPARK_HOME/bin/spark-shell\"\n\n$SPARK_HOME/bin/docker-image-tool.sh -r baskerville -t spark2.4.6 -p $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile build\ndocker tag baskerville/spark-py:v2.4.6 equalitie/baskerville:spark246\n```\n\n* build Baskerville worker image\n```commandline\ndocker build -t equalitie/baskerville:worker dockerfiles/worker/\n```\n\n* build the latest Baskerville image with your local changes\n```commandline\ndocker build -t equalitie/baskerville:latest .\ndocker push equalitie/baskerville:latest\n```\n\n## Related Projects\n- ES Retriever: https://github.com/equalitie/esretriever: A Spark wrapper to retrieve data from ElasticSearch\n- Deflect Analytics Ecosystem: https://github.com/equalitie/deflect-analytics-ecosystem:\n    Docker files for all the components baskerville might need.\n- Baskerville client: https://github.com/equalitie/baskerville_client\n\n## TODO\n- Implement full suite of unit and entity tests.\n- Conduct model tuning and feature selection / engineering.\n\n## Contributing\nIf there is an issue or a feature suggestion, please use the [issue list](https://github.com/equalitie/baskerville/issues) with the appropriate labels. 
\nThe standard process is to create the issue, get feedback for it wherever necessary, create a branch named `issue_issuenumber_short_description`, \nimplement the solution for the issue there, along with the relevant tests and then create an MR, so that it can be code reviewed and merged.\n\n### Contributors\n- [Anna](https://github.com/apfitzmaurice)\n- [Anton](https://github.com/mazhurin)\n- [Maria](https://github.com/mkaranasou)\n- [Te-k](https://github.com/Te-k)\n\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"\u003e\n\u003cimg alt=\"Creative Commons Licence\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by/4.0/80x15.png\" /\u003e\u003c/a\u003e\u003cbr /\u003e\nThis work is copyright (c) 2020, eQualit.ie inc., and is licensed under a \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by/4.0/\"\u003eCreative Commons Attribution 4.0 International License\u003c/a\u003e.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeflect-ca%2Fbaskerville","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeflect-ca%2Fbaskerville","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeflect-ca%2Fbaskerville/lists"}