{"id":17299865,"url":"https://github.com/spycsh/hesse","last_synced_at":"2025-04-14T12:31:30.351Z","repository":{"id":39878678,"uuid":"452635999","full_name":"Spycsh/hesse","owner":"Spycsh","description":"a temporal graph analytics library based on Flink Stateful Functions","archived":false,"fork":false,"pushed_at":"2023-06-08T02:58:19.000Z","size":1967,"stargazers_count":11,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-28T01:48:13.832Z","etag":null,"topics":["docker","flink-statefun","graph","iterative-graph-processing","kafka","rocksdb","temporal-data","temporal-graph-processing","temporal-query","traversal-bfs"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Spycsh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-27T10:25:46.000Z","updated_at":"2023-05-13T03:36:57.000Z","dependencies_parsed_at":"2023-02-17T05:45:55.088Z","dependency_job_id":null,"html_url":"https://github.com/Spycsh/hesse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spycsh%2Fhesse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spycsh%2Fhesse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spycsh%2Fhesse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spycsh%2Fhesse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Spycsh","download_url":"https://codeload.github.com/Spycsh/hesse/tar.g
z/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248881358,"owners_count":21176836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","flink-statefun","graph","iterative-graph-processing","kafka","rocksdb","temporal-data","temporal-graph-processing","temporal-query","traversal-bfs"],"created_at":"2024-10-15T11:24:11.201Z","updated_at":"2025-04-14T12:31:30.294Z","avatar_url":"https://github.com/Spycsh.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# hesse\n\nA temporal graph library based on Flink Stateful Functions\n\n## Features\n\nThis project aims to build an event-driven distributed temporal graph analytics library on top of [Flink Stateful Functions](https://nightlies.apache.org/flink/flink-statefun-docs-stable/).\nIt provides efficient storage of temporal graphs and supports different types of concurrent queries on different graph algorithms in arbitrary event-time windows.\nThe support of arbitrary event-time window query means that Hesse will recover the graph state by applying all the activities in the event-time window for temporal queries of different applications.\n\n## Architecture\n\nThe core architecture is divided into the Ingress, the storage, query, coordination, application layers, and the Egress.\nFlink Stateful Functions guarantee that each Function has its own context and communicates with each other by message passing.\nTherefore, Hesse is to some extent a group of event-driven functions.\nThey provide powerful functionalities and flexibility in 
the FaaS way.\nCurrently, the Kafka ingress and egress are used. The containers are built and run in the Docker environment.\n\nThe basic architecture is shown as follows:\n\n![arch old](doc/arch_hesse.png)\n\n## How to use\n\nThere are different scenarios so far you can try out, I write a simple [script](scripts/scenarios_config.py) to help you select the right scenario.\nFor developers, you can write your own `docker-compose.yml` and `module.yaml` to add your services or scenarios.\n\n```\npython scripts/scenarios_config.py\n```\n\nbuild the environment and start the containers\n\n```shell\ndocker-compose down\ndocker-compose build\ndocker-compose up\n```\n\nCurrently, five algorithms are implemented for queries, Connected Component algorithm, Strongly Connected Component algorithm,\nGNNSampling algorithm, Single Source Shortest Path algorithm, and PageRank algorithm.\nMake sure to select the right ingress file and type using `scripts/scenarios_config.py`.\n\nTo see the results of a query, you can execute the following command:\n\n```shell\ndocker-compose exec kafka kafka-console-consumer --bootstrap-server kafka:9092 --topic query-results --partition 0 --from-beginning --property print.key=true --property key.separator=\" ** \"\n```\n\nNotice that you should set a reasonable delay starting time for the query producer image using the config script based on how large your dataset is\nbecause you may want to see correct results after your graph is fully established.\n\nAnother way (**highly recommended**) is to decouple the whole `docker-compose up` into three stages: 1) edge producing, 2) edge storage 3) query producing and processing\nby using the following commands:\n\n```shell\ndocker-compose down\ndocker-compose build\n# open the benchmarks subscriber beforehand\ndocker-compose up -d hesse-benchmarks\n# do edge producing, and you can check topic temporal-graph\n# to see whether your dataset is fully pushed into topic\ndocker-compose up -d 
hesse-graph-producer\n# hesse, statefun worker and statefun manager will store\n# the edges into RocksDB state backend\ndocker-compose up -d statefun-worker\n# do query producing, and hesse will handle these queries\n# and show the results in topic query-results\ndocker-compose up -d hesse-query-producer\n```\n\nApart from the graph datasets and query stream that users can configure by editing the `docker-compose.yml`,\nusers can configure other system parameters by editing `hesse.properties` and `log4j2.properties` in the `resources` folder.\n\n\u003e 2023/06/08\n\nNotice! If you do not want to use a streaming edge producer, you can replace it by using native Kafka producer:\n\n```shell\ndocker-compose up -d kafka\ndocker-compose exec -T kafka kafka-console-producer --broker-list kafka:9092 --topic temporal-graph --property \"parse.key=true\" --property \"key.separator=:\"  \u003c $path_to_your_graph_dataset\n```\n\nThis will extremely boost the ingestion speed that help you ingest a huge existing graph dataset into hesse in only a few seconds. The original `hesse-graph-producer` image is not suitable for large graph dataset cold start. 
It is recommended to directly use the native producer.\n\nPlease remember that in this way, you must make sure your dataset contains edges with the format like `53: {\"src_id\": \"53\", \"dst_id\": \"28\", \"timestamp\": \"9986248\"}`, which can be obtained by running `scripts/convert_to_json_kv.py`.\n\n## Demo\n\nThis [demo](doc/demo.md) gives an example demo to start with the project.\n\n## Applications\n\nHere are some examples of queries, and the corresponding JSON query strings as streaming ingress:\n\n* What is the connected component id of vertex 151 between time 0 and 400000 and all the vertex ids in the same connected component?\n\n```json\n{\"query_id\": \"1\", \"user_id\": \"1\", \"vertex_id\": \"151\", \"query_type\": \"connected-components\", \"start_t\": \"0\", \"end_t\":\"400000\"}\n```\n\n* What is the single source shortest path from vertex 151 to the other approachable vertexes and the path distance between time 0 and 400000?\n\n```json\n{\"query_id\": \"2\", \"user_id\": \"1\", \"vertex_id\": \"151\", \"query_type\": \"single-source-shortest-path\", \"start_t\": \"0\", \"end_t\":\"400000\"}\n```\n\n* What is the neighbourhood spanning from (Breadth-First search) vertex 151 with hop size 2 and sample size 2 between time 0 and 400000?\n\n```json\n{\"query_id\": \"3\", \"user_id\": \"1\", \"vertex_id\": \"151\", \"query_type\": \"gnn-sampling\", \"start_t\": \"0\", \"end_t\":\"400000\", \"parameter_map\": {\"h\": \"2\", \"k\": \"2\"}}\n```\n\n* What is the PageRank value in 10 iterations of all the vertexes between time 0 and 150?\n\n```json\n{\"query_id\": \"4\", \"user_id\": \"1\", \"vertex_id\": \"all\", \"query_type\": \"pagerank\", \"start_t\": \"0\", \"end_t\":\"150\", \"parameter_map\": {\"iterations\":\"10\"}}\n```\n\nHesse even allows you to put these different types of queries with different parameters into one file to feed into the query stream.\nFor more examples, see the `datasets/query` folder.\n\n## Storage paradigms\n\nHesse offers 
four kinds of configurable [storage paradigms](https://github.com/Spycsh/hesse/blob/main/src/main/resources/hesse.properties),\nand can be configured based on users' graph stream for better performance. We recommend you configure storage paradigm as `TM` and `iTM` and specify appropriate bucket number if you want to do analytics on large-scale graph. Details will be demonstrated in future paper to this repository.\n\n\u003c!-- ## Benchmarking\n\nGraph Stream Dataset: refer to [link](https://snap.stanford.edu/data/email-Eu-core-temporal.html)\n\nQuery Stream (synthetic): refer to [link](https://github.com/Spycsh/hesse/blob/main/datasets/query/generate_synthetic_queries.py)\n\nStorage Paradigm: iTM with bucket number 128\n\n*  Overall latencies in **milliseconds** handling 100 concurrent queries given\n   different query types and time windows\n\n![overall-latencies-handling-one-hundred-query](doc/macro_benchmarking_2.PNG)\n\n* Average latencies in **milliseconds** handling one query given different query\n  types and time windows\n\n![avg-latency-handling-one-query](doc/macro_benchmarking_1.PNG) --\u003e\n\n## Advanced Tips\n\nThese are still in experiments and tips for developers\n\n* inspect the topics\n\n```\ndocker-compose exec kafka kafka-console-consumer --bootstrap-server kafka:9092 --topic \u003creplace topic name here\u003e --from-beginning\n```\n\nHere are the exposed topics:\n\n|topic name|usage|io form|\n|---|---|---|\n|temporal-graph|graph ingress stream|ingress|\n|query|query stream|ingress|\n|query-results|results of queries|egress|\n|storage-time|time of storage used for benchmarking|egress|\n|filter-time|time of filtering edge activities at arbitrary time windows|egress|\n\n* delete corrupted topics\n\n```shell\ndocker exec hesse_kafka_1 kafka-topics --list --zookeeper zookeeper:2181\ndocker exec hesse_kafka_1 kafka-topics --delete --zookeeper zookeeper:2181 --topic example-temporal-graph\n```\n\n* streaming mode\n\nYou can not only feed query 
records from files, but also in streaming. Just feed the query into the topic `query`,\nhesse will process them and egress the results to topic `query-results`. The streaming query should be like\n`\u003cquery_id\u003e:\u003cjson_query_body\u003e`. Here is an example:\n\n```shell\ndocker-compose exec kafka kafka-console-producer --broker-list kafka:9092 --topic query --property parse.key=true --property key.separator=:\n\u003e5:{\"query_id\":\"5\", \"user_id\": \"1\", \"vertex_id\": \"151\", \"query_type\": \"connected-components\", \"start_t\": \"0\", \"end_t\": \"300000\"}\n\ndocker-compose exec kafka kafka-console-consumer --bootstrap-server kafka:9092 --topic query-results --partition 0 --from-beginning --property print.key=true --property key.separator=\" ** \"\n```\n\n* Hot Deploy\n\nAfter changing code in hesse (for example, add an application algorithm), you can do a hot-redeploy\n\n```shell\ndocker-compose up -d --build hesse\n```\n\n## Benchmark\n\nPlease refer to [https://github.com/Spycsh/hesse-benchmarks](https://github.com/Spycsh/hesse-benchmarks).\n\n## Already Done\n\n- [x] Architecture design and Docker environment\n- [x] Kafka Graph Ingress and Query Ingress Stream\n- [x] Connected Component, Strongly Connected Component, GNNSampling algorithms based on Graph Traversal\n- [x] A basic non-benchmarking storage paradigm using TreeMap with persistence\n- [x] Query support for three algorithms on arbitrary time windows\n- [x] Query cache\n- [x] Time calculation for query\n- [x] User-configurable Implementation of different storage paradigms\n- [x] Performance benchmarking for different storage paradigms\n- [x] Measurement of time for queries of three algorithms\n- [x] add Logger and set logger level to eliminate the effect of print statements on time measurement\n- [x] Measurement of time for ingestion of edges\n- [x] Break storage TreeMap buckets into different ValueSpecs and see the performances\n- [x] Query Concurrency investigation on different 
concurrent applications\n- [x] Support of Single-Source-Shortest-Path algorithm and PageRank\n- [x] Micro-benchmarking and Macro-benchmarking on different datasets\n- [x] Deployment on serverless platforms such as AWS Lambda\n\n## TODO\n\n- [ ] Add more graph algorithms\n\n[comment]: \u003c\u003e (- [ ] Efficient storage and retrieval of properties on edges and vertices)\n\n[comment]: \u003c\u003e (- [ ] Modification/Deletion of edges \u0026#40;extra fields in ingress stream\u0026#41;)\n\n[comment]: \u003c\u003e (- [ ] Performance benchmarking compared with other temporal graph engines)\n\n[comment]: \u003c\u003e (- [ ] A CLI for developers to add UDF functions to the system)\n\n[comment]: \u003c\u003e (- [ ] LRU cache refactoring in Query handler function)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspycsh%2Fhesse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspycsh%2Fhesse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspycsh%2Fhesse/lists"}