<div align="center">
<img src="docs/images/logo.png" height="70%" width="70%" alt=""/>
</div>

[![License](https://img.shields.io/github/license/lsds/LightSaber.svg?branch=master)](https://github.com/lsds/LightSaber/blob/master/LICENCE.md)
# Introduction

As an ever-growing amount of data is acquired and analyzed in real time, stream processing engines have become an essential part of any data processing stack. Given the importance of this class of applications, modern stream processing engines must be designed for efficient execution on multi-core CPUs.
However, it is challenging to analyze conceptually infinite data streams with high throughput and low latency while providing fault-tolerance semantics.
This project offers two systems that help tackle this problem.



## LightSaber <img src="docs/images/logo.png" align="right" height="20%" width="20%" alt=""/>

LightSaber is a stream processing engine that balances parallelism and incremental processing when executing window aggregation queries
on multi-core CPUs. LightSaber operates on in-order data streams and achieves up to an order of magnitude higher throughput than existing systems.

See application examples and how to configure LightSaber [here](#running-lightsaber).

<div align="center">
<img src="docs/images/architecture.png" alt="" height="90%" width="90%"/>
</div>


## Scabbard <img src="docs/images/Scabbard_logo.png" align="right" height="7%" width="7%" alt=""/>
Scabbard is the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth.
It tightly couples the persistence operations with the operator graph through a novel persistent operator graph model and
dynamically reduces the required disk bandwidth at runtime through adaptive data compression.
Scabbard is based on the query execution engine and compiler from LightSaber.

See application examples and how to configure Scabbard [here](#running-scabbard).

<div align="center">
<img src="docs/images/Scabbard_arch.png" height="70%" width="70%" alt=""/>
</div>

## Getting started

The `prepare-software.sh` script will guide you through the installation of our system locally.
The script is tested on **Ubuntu 18.04.5 LTS**.
If an error occurs, you may have to manually
remove and re-add the symbolic links to the compiler binaries in `/usr/lib/ccache/`.

```
$ git clone https://github.com/lsds/LightSaber.git
$ cd LightSaber
$ ./scripts/prepare-software.sh
$ ./scripts/build.sh
```

Otherwise, use the Dockerfile:
```
$ git clone https://github.com/lsds/LightSaber.git
$ cd LightSaber
$ docker build --tag="lightsaber" .
$ docker run -ti lightsaber
```

### Setting up variables before running the code
When running a query, the **LightSaber system is used by default**.
**To enable the features of Scabbard, we have to set the variables defined [here](#scabbard-configuration)**.

Skip the next part if you don't want to change the folder where code/data is stored and you have
installed LLVM in the `$HOME` directory.

Before running any query, set the path (the default is the `$HOME` directory) where files are stored in the
SystemConf.cpp file:
```
SystemConf::FILE_ROOT_PATH = ...
```
and the path for the LLVM/Clang source files in src/CMakeLists (the default is the `$HOME` directory):
```
set(USER_PATH "...")
```

### Adding new applications
When compiling in `Release` mode, add the `-UNDEBUG` flag in the `CMakeLists.txt` to enable `assert`:
```
target_compile_options(exec ... -UNDEBUG)
```

### Start with unit tests
```
$ ./build/test/unit_tests/ds_unit_tests
$ ./build/test/unit_tests/internals_unit_tests
$ ./build/test/unit_tests/operators_unit_tests
```

## Running LightSaber

### Running a microbenchmark (e.g., Projection)
```
$ ./build/test/benchmarks/microbenchmarks/TestProjection
```

### Running a cluster monitoring application with sample data
```
$ ./build/test/benchmarks/applications/cluster_monitoring
```

### Running benchmarks from the paper
You can find the results in `build/test/benchmarks/applications/`.
```
$ cd scripts/lightsaber-bench
$ ./run-benchmarks-lightsaber.sh
```

### LightSaber configuration

Variables in **SystemConf.h** configure the LightSaber runtime. Each of them also corresponds to a command-line argument available to all LightSaber applications:

###### --threads _N_
Sets the number of CPU worker threads (`WORKER_THREADS` variable). The default value is `1`. **CPU worker threads are pinned to physical cores**. The threads are pinned to core ids based on the underlying hardware (e.g., if there are multiple sockets with n cores each, the first n threads are pinned to the first socket, and so on).

###### --batch-size _N_
Sets the batch size in bytes (`BATCH_SIZE` variable). The default value is `131072`, i.e. 128 KB.

###### --bundle-size _N_
Sets the bundle size in bytes (`BUNDLE_SIZE` variable), which is used for generating data in memory.
It has to be a multiple of the `BATCH_SIZE`. The default value is `131072`, i.e. 128 KB, the same as the `BATCH_SIZE`.

###### --slots _N_
Sets the number of intermediate query result slots (`SLOTS` variable). The default value is `256`.

###### --partial-windows _N_
Sets the maximum number of window fragments in a query task (`PARTIAL_WINDOWS` variable). The default value is `1024`.

###### --circular-size _N_
Sets the circular buffer size in bytes (`CIRCULAR_BUFFER_SIZE` variable).
The default value is `4194304`, i.e. 4 MB.

###### --unbounded-size _N_
Sets the intermediate result buffer size in bytes (`UNBOUNDED_BUFFER_SIZE` variable). The default value is `524288`, i.e. 512 KB.

###### --hashtable-size _N_
Sets the hash table size in number of buckets (`HASH_TABLE_SIZE` variable). Hash tables hold partial window aggregate results. The default value is `512`.

###### --performance-monitor-interval _N_
Sets the performance monitor interval in msec (`PERFORMANCE_MONITOR_INTERVAL` variable).
The default value is `1000`, i.e. 1 sec. It controls how often LightSaber prints performance statistics, such as throughput and latency, on standard output.

###### --latency `true`|`false`
Determines whether LightSaber measures task latency (`LATENCY_ON` variable). The default value is `false`.

###### --parallel-merge `true`|`false`
Determines whether LightSaber uses parallel aggregation when merging window fragments (`PARALLEL_MERGE_ON` variable). The default value is `false`.

###### To enable NUMA-aware scheduling

Set the `HAVE_NUMA` flag in the respective CMakeLists.txt (e.g., in `test/benchmarks/applications/CMakeLists.txt`) and recompile the code.

###### To ingest/output data with TCP

Set the `TCP_INPUT`/`TCP_OUTPUT` flag in the respective CMakeLists.txt (e.g., in `test/benchmarks/applicationsWithCheckpoints/CMakeLists.txt`) and recompile the code.
Check the `test/benchmarks/applications/RemoteBenchmark` folder for code samples that create TCP sources/sinks.

###### To ingest/output data with RDMA

Set the `RDMA_INPUT`/`RDMA_OUTPUT` flag in the respective CMakeLists.txt (e.g., in `test/benchmarks/applicationsWithCheckpoints/CMakeLists.txt`) and recompile the code.
Check the `test/benchmarks/applications/RemoteBenchmark` folder for code samples that create RDMA sources/sinks.



## Running Scabbard

### Running a microbenchmark (e.g., Aggregation) with persistent input streams and 1-sec checkpoints
```
$ ./build/test/benchmarks/microbenchmarks/TestPersistentAggregation
```

### Running a cluster monitoring application with persistence using sample data
```
$ ./build/test/benchmarks/applicationsWithCheckpoints/cluster_monitoring_checkpoints --circular-size 33554432 --unbounded-size 524288 --batch-size 524288 --bundle-size 524288 --query 1 --checkpoint-duration 1000 --disk-block-size 65536 --checkpoint-compression true --persist-input true --lineage true --latency true --threads 1
```

### Running benchmarks from the paper
You can find the results in `build/test/benchmarks/applicationsWithCheckpoints/`.
```
$ cd scripts/scabbard-bench/paper/
$ ./run-benchmarks-...-FIG_X.sh
```

### Scabbard configuration

In addition to [LightSaber's system variables](#lightsaber-configuration), we can configure the Scabbard runtime with variables specific to its fault-tolerance semantics.
Each of them also corresponds to a command-line argument available to all Scabbard applications:

###### --compression-monitor-interval _N_
Sets the query compression decision update interval in msec (`COMPRESSION_MONITOR_INTERVAL` variable). The default value is `4000`, i.e. 4 sec.

###### --checkpoint-duration _N_
Sets the checkpoint interval in msec (`CHECKPOINT_INTERVAL` variable). The default value is `1000`, i.e. 1 sec.

###### --disk-block-size _N_
Sets the size of blocks on disk in bytes (`BLOCK_SIZE` variable). The default value is `16384`, i.e. 16 KB.

###### --create-merge `true`|`false`
Determines whether Scabbard generates merge tasks to avoid resource starvation due to asynchronous execution (`CREATE_MERGE_WITH_CHECKPOINTS` variable). The default value is `false`.

###### --checkpoint-compression `true`|`false`
Determines whether Scabbard compresses data before storing it on disk (`CHECKPOINT_COMPRESSION` variable).
The default value is `false`.

###### --persist-input `true`|`false`
Determines whether Scabbard persists its input streams (`PERSIST_INPUT` variable). The default value is `false`.

###### --lineage `true`|`false`
Enables the dependency tracking required for exactly-once results (`LINEAGE_ON` variable). The default value is `false`.

###### --adaptive-compression `true`|`false`
Enables adaptive compression (`ADAPTIVE_COMPRESSION_ON` variable). The default value is `false`.

###### --adaptive-interval _N_
Sets the interval in msec that triggers the code generation of new compression functions based on collected statistics (`ADAPTIVE_COMPRESSION_INTERVAL` variable). The default value is `4000`, i.e. 4 sec.

###### --recover `true`|`false`
If set to `true`, Scabbard attempts to recover using previously persisted data (`RECOVER` variable). The default value is `false`.


## How to cite Scabbard
* **[VLDB]** Georgios Theodorakis, Fotios Kounelis, Peter R. Pietzuch, and Holger Pirk. Scabbard: Single-Node Fault-Tolerant Stream Processing, VLDB, 2022
```
@inproceedings{Theodorakis2022,
 author = {Georgios Theodorakis and Fotios Kounelis and Peter R. Pietzuch and Holger Pirk},
 title = {{Scabbard: Single-Node Fault-Tolerant Stream Processing}},
 series = {VLDB '22},
 year = {2022},
 publisher = {ACM},
}
```

## How to cite LightSaber
* **[SIGMOD]** Georgios Theodorakis, Alexandros Koliousis, Peter R. Pietzuch, and Holger Pirk. LightSaber: Efficient Window Aggregation on Multi-core Processors, SIGMOD, 2020
```
@inproceedings{Theodorakis2020,
 author = {Georgios Theodorakis and Alexandros Koliousis and Peter R. Pietzuch and Holger Pirk},
 title = {{LightSaber: Efficient Window Aggregation on Multi-core Processors}},
 booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
 series = {SIGMOD '20},
 year = {2020},
 publisher = {ACM},
 address = {Portland, OR, USA},
}
```

### Other related publications
* **[EDBT]** Georgios Theodorakis, Peter R. Pietzuch, and Holger Pirk. SlideSide: A Fast Incremental Stream Processing Algorithm for Multiple Queries, EDBT, 2020
* **[ADMS]** Georgios Theodorakis, Alexandros Koliousis, Peter R. Pietzuch, and Holger Pirk. Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation, ADMS, 2018 [[code]](https://github.com/grtheod/Hammerslide)
* **[SIGMOD]** Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Alexander Wolf, Paolo Costa, and Peter Pietzuch. Saber: Window-Based Hybrid Stream Processing for Heterogeneous Architectures, SIGMOD, 2016