{"id":47219316,"url":"https://github.com/siesta-tool/siesta-demo","last_synced_at":"2026-03-13T17:11:30.065Z","repository":{"id":218567294,"uuid":"746765560","full_name":"siesta-tool/siesta-demo","owner":"siesta-tool","description":"An application-agnostic, open-source tool designed to build incremental indices from continuously streaming event data.","archived":false,"fork":false,"pushed_at":"2025-07-07T20:02:25.000Z","size":16398,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-07T21:40:15.488Z","etag":null,"topics":["data-mining","event-log","event-processing"],"latest_commit_sha":null,"homepage":"https://datalab.csd.auth.gr/tools-apps/siesta/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/siesta-tool.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-22T16:25:00.000Z","updated_at":"2025-07-07T20:02:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"3578c5e2-6035-4eca-891c-9efe10101ef0","html_url":"https://github.com/siesta-tool/siesta-demo","commit_stats":null,"previous_names":["siesta-tool/siesta-demo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/siesta-tool/siesta-demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siesta-tool%2Fsiesta-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siesta-tool%2Fsiesta-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siesta-tool%2Fsie
sta-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siesta-tool%2Fsiesta-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/siesta-tool","download_url":"https://codeload.github.com/siesta-tool/siesta-demo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siesta-tool%2Fsiesta-demo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30471180,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T11:00:43.441Z","status":"ssl_error","status_checked_at":"2026-03-13T11:00:23.173Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-mining","event-log","event-processing"],"created_at":"2026-03-13T17:11:29.447Z","updated_at":"2026-03-13T17:11:30.059Z","avatar_url":"https://github.com/siesta-tool.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SIESTA DEMO\n\n## Overview\nAnalyzing sequential event data tremendously benefits organizations by enabling them to identify potential \nprocess optimizations and extract useful insights.  However, there is a lack of systems that can both \nefficiently query complex ad-hoc patterns and perform advanced sequential pattern mining-like tasks. 
\nTo address this, we have proposed **SIESTA**, which stands for Scalable Infrastructure for sEquential paTtern Analysis, \nand is a scalable tool designed to support both pattern detection and pattern mining in large event log datasets that \nare continuously updated.\nThis repository demonstrates how the entire **SIESTA** infrastructure can be deployed using Docker\ncontainers. \n\n\n## Architecture\n\u003cimg src='siesta_architecture.jpg' width='927'\u003e\nSIESTA's infrastructure is composed of three primary components along with additional supporting tools:\n\n- **Preprocess Component**: Handles incoming data and computes the appropriate indices. It can operate in two modes: \nevents can arrive (i) in batches, e.g., every hour, or (ii) as a real-time stream, \nwhere the data are read from a source such as Apache Kafka.\n- **Database Layer**: After the indices are computed, they are stored in a database.\nSIESTA is designed to be decoupled from the storage mechanism, enhancing its scalability and flexibility.\nIn the current implementation, we use an Object Storage System (OSS) with the S3 protocol via **MinIO**, an open-source\nimplementation of S3.\n- **Query Processor**: This component functions as a RESTful API designed to handle pattern queries. \nIt leverages the stored indices to prune the search space and provide efficient responses.\n- **User Interface**: A React-based web application that gives end users access to all functionalities in a user-friendly manner. An example of the UI can be seen below. 
Pattern Detection (right) and Pattern Mining (left).\n![siesta-ui](https://github.com/user-attachments/assets/0459eab6-9b3c-4794-bcb6-432393d5288e)\n- **Message Broker**: Utilizes **Apache Kafka** for event streaming.\n- **Supporting Services**: **PostgreSQL** (for metadata storage) and **Zookeeper** (for Kafka coordination).\n\nAll containers communicate within an internal **Docker network**.\n\n## Prerequisites\nEnsure that you have the following installed on your system:\n\n- [Docker](https://www.docker.com/)\n- [Docker Compose](https://docs.docker.com/compose/)\n\n## Installation \u0026 Deployment\nTo deploy the entire SIESTA infrastructure, run the following command from the root directory:\n\n```bash\ndocker-compose up -d\n```\n\n## Configuration\nAll images for the **SIESTA** infrastructure are available on [DockerHub](https://hub.docker.com/u/mavroudo), \nwith descriptions of the required environment variables. Each repository also contains instructions for building the \nimages from scratch.\n\n### Environment Variables\nBelow are the essential environment variables used in the **docker-compose.yml** file, categorized by service:\n\n#### **Preprocess Component**\n- `s3accessKeyAws`: MinIO access key (default: `minioadmin`)\n- `s3secretKeyAws`: MinIO secret key (default: `minioadmin`)\n- `s3ConnectionTimeout`: Connection timeout for S3 (default: `600000`)\n- `s3endPointLoc`: Endpoint location of MinIO (default: `http://minio:9000`)\n- `kafkaBroker`: Kafka broker URL (default: `http://kafka:9092`)\n- `kafkaTopic`: Kafka topic name (default: `test`)\n- `POSTGRES_ENDPOINT`: PostgreSQL connection endpoint (default: `postgres:5432/metrics`)\n- `POSTGRES_USERNAME`: PostgreSQL username (default: `admin`)\n- `POSTGRES_PASSWORD`: PostgreSQL password (default: `admin`)\n\n#### **Query Processor**\n- `master.uri`: Spark master URI (default: `local[4]` or `local[*]`)\n- `database`: Database type used (`s3` for object storage)\n- `delta`: Boolean flag for batch 
(`false`) or streaming (`true`) processing\n- `s3.endpoint`: MinIO endpoint (default: `http://minio:9000`)\n- `s3.user`: MinIO access key (default: `minioadmin`)\n- `s3.key`: MinIO secret key (default: `minioadmin`)\n- `s3.timetout`: S3 connection timeout (default: `600000`)\n- `server.port`: Port for the query processor application (default: `8090`)\n\n#### **MinIO (Object Storage)**\n- `MINIO_ROOT_USER`: MinIO root username (default: `minioadmin`)\n- `MINIO_ROOT_PASSWORD`: MinIO root password (default: `minioadmin`)\n\n#### **PostgreSQL Database**\n- `POSTGRES_DB`: Name of the PostgreSQL database (default: `metrics`)\n- `POSTGRES_USER`: PostgreSQL username (default: `admin`)\n- `POSTGRES_PASSWORD`: PostgreSQL password (default: `admin`)\n\n#### **Kafka \u0026 Zookeeper**\n- `ZOOKEEPER_CLIENT_PORT`: Port for Zookeeper client connections (default: `2181`)\n- `KAFKA_BROKER_ID`: Unique identifier for Kafka broker (default: `1`)\n- `KAFKA_ZOOKEEPER_CONNECT`: Zookeeper connection string (default: `zookeeper:2181`)\n- `KAFKA_LISTENER_SECURITY_PROTOCOL_MAP`: Security protocol mapping (default: `INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT`)\n- `KAFKA_ADVERTISED_LISTENERS`: Advertised listeners for Kafka broker (default: `INSIDE://:9093,OUTSIDE://siesta-kafka:9092`)\n- `KAFKA_LISTENERS`: Kafka listeners configuration (default: `INSIDE://:9093,OUTSIDE://:9092`)\n- `KAFKA_INTER_BROKER_LISTENER_NAME`: Inter-broker listener name (default: `INSIDE`)\n- `KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR`: Replication factor for Kafka's internal offsets topic (default: `1`)\n\n## Usage Examples\nHere are a few example executions to help you get started:\n\n### Example 1: Submitting a logfile for preprocessing\nThe first thing that we have to do is preprocess a logfile. 
In `/datasets` we provide two small datasets for testing.\nYou can preprocess a logfile either through the [UI](http://localhost:80/preprocessing), which lets\nyou upload a logfile, or by sending a curl\nrequest to the preprocessing component. More information about the functionalities of the preprocess component can be \nfound in its [repo](https://github.com/siesta-tool/SequenceDetectionPreprocess). Let's test the second way. Assuming that \nwe want to preprocess the `test.withTimestamp` logfile that is inside the `/datasets` folder, we have to execute the following:\n```bash\ncurl -X 'POST' 'http://localhost:8000/preprocess' \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"spark_master\": \"local[*]\",\n    \"file\": \"test.withTimestamp\",\n    \"logname\": \"test\"\n  }'\n```\nNotes:\n1. The logfile is visible to the preprocess component because a Docker volume maps the datasets folder to the place where\nit stores the uploaded files.\n2. This will start a preprocess task and return a unique id that we can use to monitor the task's progress.\n3. Once the task is completed, we have to refresh the indices in the UI, and then we will be able to see the new index\n   (called \"test\").\n\n### Example 2: Submitting a detection query for the processed log\nNow that the test logfile has been indexed, we can submit queries to the query processor. Again, this can be done\neither through the [UI](http://localhost:80/indexes/test) or by sending a request directly to the query processor. Let's try\nto find all traces in the test logfile that contain the pattern \u003cA_,B_\u003e (i.e., an instance of the event A followed by \nan instance of the event B). The `_` indicates that we want a single instance of the event. 
SIESTA also supports\nother symbols: \n- `+`: one or more occurrences\n- `*`: zero or more occurrences\n- `!`: negation\n- `||`: disjunction between two events (e.g., A or B)\n\nThe curl request will be:\n```bash\ncurl -X 'POST' 'http://localhost:8090/detection' \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"log_name\":\"test\",\"pattern\":{\"eventsWithSymbols\":[{\"name\":\"A\",\"position\":0,\"symbol\":\"_\"},{\"name\":\"B\",\"position\":1,\"symbol\":\"_\"}]},\"returnAll\":null,\"hasGroups\":false,\"groups-config\":null,\"whyNotMatchFlag\":false,\"wnm-config\":null}'\n```\nAnd the response will be something like this:\n```json\n{\"occurrences\":[\n  {\"traceID\":\"2\",\"occurrences\":[{\"occurrence\":[{\"name\":\"A\",\"position\":0},{\"name\":\"B\",\"position\":1}]}]},\n  {\"traceID\":\"1\",\"occurrences\":[{\"occurrence\":[{\"name\":\"A\",\"position\":0},{\"name\":\"B\",\"position\":2}]}]},\n  {\"traceID\":\"3\",\"occurrences\":[{\"occurrence\":[{\"name\":\"A\",\"position\":0},{\"name\":\"B\",\"position\":3}]}]}],\n  \"performance statistics\":{\"time for pruning in ms\":1114,\"time for validation in ms\":1,\"response time in ms\":1115}}\n```\nThe response says that the pattern \u003cA,B\u003e exists in the traces with ids 1, 2 and 3; it also contains some information about\nthe performance of the query.\nNote that several fields in the curl request were not set. These correspond to the different\nfilters that are allowed by the detection query, such as the minimum (or maximum) timespan between two events\nand the time window of interest. We can also group different traces together (at query time) and even request explanations \nabout unexpected outputs. 
All of these are hard to configure from the command line, which is why we advise new users to use\nthe UI.\n\n### Example 3: Submitting a mining query for the processed log\nSIESTA efficiently mines declarative constraints based on 21 different templates from the DECLARE language. \nThese templates act as constraint blueprints and can be utilized in applications such as outlier detection, \nconformance checking, and event prediction. To extract all the constraints from the example log, we can simply go to the\n**Mining Constraint** tab in the [UI](http://localhost:80/indexes/test), check all boxes, set the minimum support\nthat a constraint should have in order to appear in the final set, and then click submit (or,\nagain, submit a curl request).\n```bash\ncurl -X 'GET' 'http://localhost:8090/declare/?log_database=test\u0026support=0.9'\n```\nAnd the response will be a long list of constraints:\n```json lines\n{\"existence patterns\":{\"absence\":[{\"n\":3,\"ev\":\"D\",\"support\":\"1.000\"},\n  {\"n\":2,\"ev\":\"D\",\"support\":\"1.000\"},{\"n\":3,\"ev\":\"C\",\"support\":\"1.000\"}],\n  \"exactly\":[],\n  \"existence\":[{\"n\":2,\"ev\":\"A\",\"support\":\"1.000\"},{\"n\":1,\"ev\":\"A\",\"support\":\"1.000\"},\n    {\"n\":1,\"ev\":\"C\",\"support\":\"1.000\"}],\n  \"co-existence\":[{\"evA\":\"A\",\"evB\":\"C\",\"support\":\"1.000\"}],\n  \"not co-existence\":null,\n  \"choice\":[{\"evA\":\"B\",\"evB\":\"C\",\"support\":\"1.000\"},\n    {\"evA\":\"B\",\"evB\":\"D\",\"support\":\"1.000\"},{\"evA\":\"A\",\"evB\":\"B\",\"support\":\"1.000\"},\n    {\"evA\":\"A\",\"evB\":\"C\",\"support\":\"1.000\"},{\"evA\":\"A\",\"evB\":\"D\",\"support\":\"1.000\"},\n    {\"evA\":\"C\",\"evB\":\"D\",\"support\":\"1.000\"}],\n  \"exclusive choice\":[{\"evA\":\"D\",\"evB\":\"B\",\"support\":\"1.000\"}],\n  \"responded existence\":[{\"evA\":\"B\",\"evB\":\"C\",\"support\":\"1.000\"},\n    
{\"evA\":\"B\",\"evB\":\"A\",\"support\":\"1.000\"},{\"evA\":\"A\",\"evB\":\"C\",\"support\":\"1.000\"},\n    {\"evA\":\"C\",\"evB\":\"A\",\"support\":\"1.000\"},{\"evA\":\"D\",\"evB\":\"A\",\"support\":\"1.000\"},\n    {\"evA\":\"D\",\"evB\":\"C\",\"support\":\"1.000\"}]},\n  \"position patterns\":{\"first\":[{\"ev\":\"A\",\"support\":\"1.000\"}], \"last\":[]},\n  \"ordered relations\":{\"mode\":\"simple\", \n    \"response\":[{\"evA\":\"D\",\"evB\":\"A\",\"support\":\"1.000\"},\n      {\"evA\":\"B\",\"evB\":\"C\",\"support\":\"1.000\"},\n      {\"evA\":\"A\",\"evB\":\"C\",\"support\":\"0.900\"}],\n  \"precedence\":[{\"evA\":\"A\",\"evB\":\"C\",\"support\":\"1.000\"},\n    {\"evA\":\"A\",\"evB\":\"B\",\"support\":\"1.000\"},{\"evA\":\"C\",\"evB\":\"D\",\"support\":\"1.000\"},\n    {\"evA\":\"A\",\"evB\":\"D\",\"support\":\"1.000\"}],\n  \"succession\":[{\"evA\":\"A\",\"evB\":\"C\",\"support\":\"0.900\"}],\n  \"not-succession\":[{\"evA\":\"D\",\"evB\":\"C\",\"support\":\"1.000\"},\n    {\"evA\":\"B\",\"evB\":\"D\",\"support\":\"1.000\"}, {\"evA\":\"D\",\"evB\":\"B\",\"support\":\"1.000\"}]},\n  \"ordered relations alternate\":{\"mode\":\"alternate\",\"response\":[{\"evA\":\"D\",\"evB\":\"A\",\"support\":\"1.000\"}],\n    \"precedence\":[{\"evA\":\"C\",\"evB\":\"D\",\"support\":\"1.000\"},{\"evA\":\"A\",\"evB\":\"D\",\"support\":\"1.000\"}],\n    \"succession\":[],\"not-succession\":[]},\n  \"ordered relations chain\":{\"mode\":\"chain\",\"response\":[{\"evA\":\"D\",\"evB\":\"A\",\"support\":\"1.000\"}],\n    \"precedence\":[{\"evA\":\"C\",\"evB\":\"D\",\"support\":\"1.000\"}],\"succession\":[],\n    \"not-succession\":[{\"evA\":\"D\",\"evB\":\"C\",\"support\":\"1.000\"},\n      {\"evA\":\"B\",\"evB\":\"D\",\"support\":\"1.000\"},\n      {\"evA\":\"D\",\"evB\":\"B\",\"support\":\"1.000\"},{\"evA\":\"A\",\"evB\":\"D\",\"support\":\"1.000\"}]}}\n```\nObviously, this looks better in the UI.\n\n### _Bonus example:_ Incremental Declare Mining\nA recently added feature is 
**Incremental Declare Mining**. This allows users to perform a small post-processing step after \nindex building to extract key statistics. These statistics enhance the query processor’s ability to efficiently extract Declare constraints.\nAlthough this post-processing step is integrated into the preprocessing component, it can also be used independently. \nFor example, a company might choose to index its log files every few hours but may only need to mine the most recent constraints once a week.\nAdditionally, the query processor will automatically utilize these statistics when available, requiring no extra configuration. \nThis feature is particularly beneficial in **big data scenarios**, where event volume grows significantly.\nTo demonstrate this, we have included an extended example in the [`incremental_mining`](./incremental_mining) folder, \nshowcasing how we artificially generate large log files and incrementally extract constraints.\n\n## Datasets\nTwo sample event logs are available in the `/datasets` folder. The datasets are formatted in two ways:\n- **Custom Format (`.withTimestamp`)**: Contains trace IDs, event types, and timestamps separated by `**/delab/**`.\n- **XES Format (`.xes`)**: A standard format used in **Business Process Management**.\n\n\n## Publications\nSIESTA has been featured in various publications:\n\n1. [**Sequence detection in event log files**](https://openproceedings.org/2021/conf/edbt/p68.pdf) - Conference EDBT 2021\n2. [**SIESTA: A scalable infrastructure of sequential pattern analysis**](https://ieeexplore.ieee.org/abstract/document/9984935) - IEEE Transactions on Big Data 2022\n3. [**A Comprehensive Scalable Framework for Cloud-Native Pattern Detection with Enhanced Expressiveness**](https://arxiv.org/pdf/2401.09960) - arXiv 2024\n4. 
[**Exploiting General Purpose Big-Data Frameworks in Process Mining: The Case of Declarative Process Discovery**](https://link.springer.com/chapter/10.1007/978-3-031-70396-6_11) - Conference BPM 2024\n5. [**Declarative process mining in big data scenarios using an application-agnostic framework**](https://link.springer.com/article/10.1007/s44311-025-00013-9) - Process Science 2025\n6. [**Discovering Comprehensive Branched Declarative Process Constraints**](https://doi.org/10.1007/978-3-032-02929-4_9) - Conference Forum BPM 2025 \n\n## License\nThis project is licensed under the **MIT License**. See the `LICENSE` file for details.\n\n\n\n## Additional Resources\n\n- [Demo Video](https://drive.google.com/file/d/13RsQ9CfXDP2DcFnHz_38C-HtHHNHm-W2/view?usp=sharing)\n  It can also be found in the assets folder (in case there is an issue with the above link)\n- [Preprocessing Component](https://github.com/siesta-tool/SequenceDetectionPreprocess) \n- [Query Processor](https://github.com/siesta-tool/SequenceDetectionQueryExecutor)\n- [User Interface](https://github.com/siesta-tool/siesta-ui)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiesta-tool%2Fsiesta-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsiesta-tool%2Fsiesta-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiesta-tool%2Fsiesta-demo/lists"}