{"id":19957904,"url":"https://github.com/pathwaycom/pathway-benchmarks","last_synced_at":"2025-08-02T09:10:18.007Z","repository":{"id":185050930,"uuid":"628939332","full_name":"pathwaycom/pathway-benchmarks","owner":"pathwaycom","description":"Benchmarks for data processing systems: Pathway, Spark, Flink, Kafka Streams","archived":false,"fork":false,"pushed_at":"2025-03-17T10:37:17.000Z","size":4926,"stargazers_count":70,"open_issues_count":0,"forks_count":4,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-05-26T06:52:28.868Z","etag":null,"topics":["benchmark-framework","flink","kafka-streams","latency","pagerank","pathway","spark-streaming","streaming","streaming-data","wordcount"],"latest_commit_sha":null,"homepage":"https://pathway.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pathwaycom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-17T09:42:18.000Z","updated_at":"2025-05-25T08:00:43.000Z","dependencies_parsed_at":"2025-04-09T09:10:56.657Z","dependency_job_id":null,"html_url":"https://github.com/pathwaycom/pathway-benchmarks","commit_stats":null,"previous_names":["pathwaycom/pathway-benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pathwaycom/pathway-benchmarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pathwaycom%2Fpathway-benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pathwaycom%2Fpathway-benchmarks/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/pathwaycom%2Fpathway-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pathwaycom%2Fpathway-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pathwaycom","download_url":"https://codeload.github.com/pathwaycom/pathway-benchmarks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pathwaycom%2Fpathway-benchmarks/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263047678,"owners_count":23405280,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark-framework","flink","kafka-streams","latency","pagerank","pathway","spark-streaming","streaming","streaming-data","wordcount"],"created_at":"2024-11-13T01:39:12.552Z","updated_at":"2025-07-01T23:06:13.948Z","avatar_url":"https://github.com/pathwaycom.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://pathway.com/logo-light.svg\" /\u003e\u003cbr /\u003e\u003cbr /\u003e\n\u003c/div\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/pathwaycom/pathway-benchmarks/blob/main/LICENSE\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/license/pathwaycom/pathway-benchmarks?style=plastic\" alt=\"Contributors\"/\u003e\u003c/a\u003e\n        \u003cimg src=\"https://img.shields.io/badge/OS-Linux-green\" alt=\"Linux\"/\u003e\n        \u003cimg 
src=\"https://img.shields.io/badge/OS-macOS-green\" alt=\"macOS\"/\u003e\n      \u003cbr\u003e\n    \u003ca href=\"https://discord.gg/pathway\"\u003e\n        \u003cimg src=\"https://img.shields.io/discord/1042405378304004156?logo=discord\"\n            alt=\"chat on Discord\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://twitter.com/intent/follow?screen_name=pathway_com\"\u003e\n        \u003cimg src=\"https://img.shields.io/twitter/follow/pathway_com?style=social\u0026logo=twitter\"\n            alt=\"follow on Twitter\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n[Pathway](https://www.pathway.com) is a reactive data processing framework designed for high-throughput and low-latency real-time data processing. Pathway's unified Rust engine runs code in both batch and streaming modes using the same Python API syntax. \n\nThis repository contains benchmarks that compare the performance of Pathway against state-of-the-art technologies designed for streaming and batch data processing tasks, including Flink, Spark and Kafka Streams. For a complete write-up of the benchmarks, read our corresponding [benchmarking article](https://pathway.com/blog/streaming-benchmarks-pathway-fastest-engine-on-the-market).\n\n![WordCount and PageRank Results](images/bm-wordcount-and-pagerank.png)\n\nThe benchmarks are reproducible using the code in this repository. Find the instructions below under \"Reproducing the benchmarks\". \n\n## Benchmarks\n\nThe repository contains two types of benchmarks:\n\n- The **WordCount Benchmark** reads words from the stream and stores the count of each word in the output stream. For the sake of output compaction, only changed entries are streamed. In this benchmark, a Kafka cluster is used for input and output.\n- The **PageRank Benchmark** gets graph edges from the input stream, computes the PageRank of each node, and streams it to the output. 
In this benchmark, filesystem input and empty output are used.\n\nThe two benchmarks each represent a type of workload Pathway aims to support: online streaming tasks and graph processing tasks. The graph-processing benchmark (i.e. PageRank) is evaluated in three modes: batch, streaming, and a mixed batch-online mode we call backfilling, which evaluates the ability of the engine to switch from batch to online midway.\n\nFor a full discussion of the results obtained, read our [benchmarking article](https://pathway.com/blog/streaming-benchmarks-pathway-fastest-engine-on-the-market).\n\n## Machine specs\n\nAll results reported below were obtained on dedicated machines with a 12-core AMD Ryzen 9 5900X processor, 128GB of RAM, and SSD drives. For all multithreaded benchmarks we explicitly allocate cores to ensure that threads maximally share L3 cache. This is important, as internally the CPU is assembled from two 6-core halves, and thread communication between the halves carries a performance cost. For this reason we report results on up to 6 cores for all frameworks.\n\nAll experiments are run using Docker, enforcing limits on used CPU cores and RAM consumption.\n\n\n## Results\n\nThis section presents the results of the benchmarks. The results show that:\n1. Pathway is on par with or outperforms state-of-the-art solutions for common online streaming tasks (WordCount).\n2. Pathway outperforms the other benchmarked engines for iterative graph processing tasks in **batch**.\n3. Pathway outperforms the other benchmarked engines for iterative graph processing tasks in **streaming**.\n4. Pathway is uniquely able to handle mixed **batch-and-streaming** workloads at scale.\n\n### WordCount benchmark\nThe graph below shows results of the WordCount benchmark. 95% latency is reported in milliseconds (y-axis) per throughput value (x-axis). 
Out of the four tested solutions, Flink and Pathway are on par, both clearly outperforming Spark Structured Streaming and Kafka Streams.\n\nPathway clearly outperforms the default Flink setup in terms of sustained throughput, and dominates the Flink minibatching setup in terms of latency across the entire throughput range we could measure. For most throughputs, Pathway also achieves lower latency than the better of the two Flink setups.\n\n![WordCount Graph](images/bm-wordcount-lineplot.png)\n\n\n### PageRank benchmark (batch)\nThe table below shows results of the PageRank benchmark in batch mode. We report the total running time in seconds to process the dataset. The standard code logic is an idiomatic (join-based) implementation. Additionally, two implementations that are not directly comparable, marked with (*), are benchmarked for Spark.\n\nThe fastest performance is achieved by the Spark GraphX implementation and the more aggressively optimized Pathway build. The formulation (and syntax) of the GraphX algorithm is different from the others. In an apples-to-apples comparison of equivalent logic in Table APIs, Pathway is the fastest, followed by Flink and Spark.\n\n![PageRank Batch Results](images/bm-pagerank-batch.png)\n\n\n### PageRank benchmark (streaming)\nThe table below shows results of the PageRank benchmark in streaming mode. We report the total running time in seconds to process the dataset while updating the PageRank results every 1000 edges. \n\nWe evaluate only two systems on the streaming PageRank task: Pathway and Flink. We don’t test Kafka Streams because it was suboptimal on the streaming WordCount task. 
Moreover, no Spark variant supports such a complicated streaming computation: GraphX doesn’t support streaming, Spark Structured Streaming doesn’t allow chaining multiple groupby’s and reductions, and Spark Continuous Streaming is too limited to support even simple streaming benchmarks.\n\nWe see that while both systems are able to run the streaming benchmark, Pathway maintains a large advantage over Flink. It is hard to say whether this advantage is “constant” (with a factor of about 50x) or increases “asymptotically” with dataset size. Indeed, extending the benchmarks to tests on larger datasets than those reported in Table 2 is problematic as Flink’s performance is degraded by memory issues.\n\n![PageRank Streaming Results](images/bm-pagerank-streaming.png)\n\n\n### PageRank benchmark (backfilling)\nThe table below shows results of the PageRank benchmark in a backfilling scenario that mixes batch and streaming. \n\nPathway again offers superior performance, completing the first of the datasets considered approximately 20x faster than Flink. The first large batch is processed by Pathway in times comparable to the pure batch scenario. \n\nFor backfilling on the complete LiveJournal dataset, Flink either ran out of memory or failed to complete the task on 6 cores within 2 hours, depending on the setup.\n\n![PageRank Backfilling Results](images/bm-pagerank-backfill.png)\n\nFor a full discussion of the results obtained, read our [benchmarking article](https://pathway.com/blog/streaming-benchmarks-pathway-fastest-engine-on-the-market).\n\nThe following sections contain information necessary for reproducing the benchmarks. 
\n\n## Reproducing the benchmarks\n\nThe repository provides a single Python script to run each benchmark:\n- Execute the `run_wordcount.py` script in the `wordcount-online-streaming` directory to launch the WordCount benchmark on all tested solutions.\n- Execute the `run_pagerank.py` script in the `pagerank-iterative-graph-processing` directory to launch the PageRank benchmark on all tested solutions. You will first need to download and preprocess the datasets (see below).\n\nYou will need a machine with at least 12 CPU cores to reproduce these benchmarks. Alternatively, you can edit the scripts mentioned above to run the benchmarks on fewer cores.\n\nAll benchmarks are run in Docker containers. Before launching the Python scripts, make sure you have the latest version of Docker installed and your Docker daemon is running. You may have to increase the memory allocated to each container (to at least 4GB) and allocate the necessary number of CPU cores to your Docker containers. \n\n\n## Accessing the datasets\n\n### WordCount\n\nThe WordCount benchmarks are run on a dataset of 76 million words taken uniformly at random from a dictionary of 5000 random words of 7 lowercase letters each. We split the dataset into two parts: we use 16 million words as a burn-in period (to disregard high latency at engine start-up) and we include only the latencies of the remaining 60 million words in the final results. The dataset is generated automatically when you run the `run_wordcount.py` script. \n\nNote that if your username contains characters such as a dot, you should set the USER= variable before launching; otherwise you may run into an error message, because the `docker-compose` project name is built from the username.\n\nResults are stored in the `results` subdirectory.\n\n\n### PageRank\n\nThe PageRank benchmarks are run on various subsets of the [Stanford LiveJournal dataset](https://snap.stanford.edu/data/soc-LiveJournal1.html). 
You can download and preprocess the datasets by running the `get_datasets.sh` script in the `pagerank-iterative-graph-processing/datasets` directory.\n\nResults are stored in the `results` subdirectory.\n\n### Dataset Citation\n\n  author  = Jure Leskovec and Andrej Krevl,\n  title   = SNAP Datasets: Stanford Large Network Dataset Collection,\n  url     = http://snap.stanford.edu/data,\n  month   = jun,\n  year    = 2014\n\n\n\n## Repository organization\n\nThe repository is structured as follows:\n\n- `wordcount-online-streaming` contains the scripts and files necessary to run the WordCount benchmark. This is where you will find the `run_wordcount.py` script to reproduce the benchmark yourself.\n- `pagerank-iterative-graph-processing` contains the scripts and files necessary to run the PageRank benchmark. This is where you will find the `run_pagerank.py` script to reproduce the benchmark yourself.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpathwaycom%2Fpathway-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpathwaycom%2Fpathway-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpathwaycom%2Fpathway-benchmarks/lists"}