{"id":22392944,"url":"https://github.com/nuald/lambda-sample","last_synced_at":"2026-04-16T05:31:42.964Z","repository":{"id":193414836,"uuid":"109089025","full_name":"nuald/lambda-sample","owner":"nuald","description":"A Sample of Lambda Architecture Project","archived":false,"fork":false,"pushed_at":"2021-03-09T22:02:29.000Z","size":1148,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-07-27T17:55:28.389Z","etag":null,"topics":["cassandra","lambda-architecture","mqtt","redis","scala"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nuald.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-11-01T05:02:00.000Z","updated_at":"2022-11-19T00:06:23.000Z","dependencies_parsed_at":"2023-09-08T04:55:54.788Z","dependency_job_id":null,"html_url":"https://github.com/nuald/lambda-sample","commit_stats":null,"previous_names":["nuald/lambda-sample"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nuald/lambda-sample","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuald%2Flambda-sample","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuald%2Flambda-sample/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuald%2Flambda-sample/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuald%2Flambda-sample/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nuald","download_url":"https://codeload.github.com/nuald/
lambda-sample/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuald%2Flambda-sample/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31872616,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"online","status_checked_at":"2026-04-16T02:00:06.042Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra","lambda-architecture","mqtt","redis","scala"],"created_at":"2024-12-05T04:22:15.077Z","updated_at":"2026-04-16T05:31:42.939Z","avatar_url":"https://github.com/nuald.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A Sample of Lambda architecture project\n\nA boilerplate project for detecting IoT sensor anomalies using the Lambda architecture.\nIt assumes two data processing layers: the fast one (to ensure SLA requirements for\nlatency) and another one based on machine learning (to ensure a low rate of false\npositives). 
One may imagine it as a data flow graph from one origin (events)\nwith two branches (those processing layers), hence the Lambda architecture (λ symbol).\n\nThe system layers:\n\n - Speed: based on the standard deviation\n - Batch: Random Forest classification (using the [Smile](https://haifengl.github.io/smile/) engine)\n - Serving: [Akka](https://akka.io/) cluster with an in-memory Redis database and persistence with Cassandra\n\nThe data flow:\n\n 1. MQTT messages are produced by the IoT emulator ([Producer](src/main/scala/mqtt/Producer.scala) actor). Messages are serialized into the [Smile binary format](https://github.com/FasterXML/smile-format-specification).\n 2. The MQTT subscriber saves the messages into the Cassandra database ([Consumer](src/main/scala/mqtt/Consumer.scala) actor).\n 3. The Random Forest model is continuously trained on the messages ([Trainer](src/main/scala/analyzer/Trainer.scala) actor). The model is saved into Redis using Java serialization.\n 4. The HTTP endpoint requests the computation using the trained model and heuristics ([Analyzer](src/main/scala/analyzer/Analyzer.scala) actor).\n\nData serialization is an important topic because it affects the performance of\nthe whole system: the speed of the serialization/deserialization process and the\nsize of the generated content (which is transferred over the network and saved\nto/loaded from databases) both directly affect latency. 
[Protocol buffers](https://developers.google.com/protocol-buffers/)\nis a good candidate for data serialization, as it meets these requirements and\nis supported by many programming languages.\n\n# Table of Contents\n\n* [A Sample of Lambda architecture project](#a-sample-of-lambda-architecture-project)\n  * [Requirements](#requirements)\n    * [Cluster client requirements](#cluster-client-requirements)\n  * [Usage](#usage)\n    * [IoT emulation](#iot-emulation)\n    * [Interactive processing](#interactive-processing)\n      * [Preparing the data set](#preparing-the-data-set)\n      * [Fast analysis](#fast-analysis)\n      * [Fitting the model](#fitting-the-model)\n      * [Using the model](#using-the-model)\n    * [Processing Cluster](#processing-cluster)\n      * [Performance testing](#performance-testing)\n\n## Requirements\n\nPlease install:\n\n - [Eclipse Mosquitto](https://mosquitto.org/) MQTT broker\n - [Apache Cassandra](http://cassandra.apache.org/) NoSQL database\n - [Python 2](https://www.python.org/), required by Cassandra\n - [Redis](https://redis.io/) in-memory data store\n - [SBT](http://www.scala-sbt.org/) build tool\n\n**Linux** Please use the package manager shipped with the distribution.\nFor example, you may use pacman for ArchLinux (please note that Cassandra requires Java 8):\n\n    $ pacman -S --needed mosquitto redis sbt python2\n    $ git clone https://aur.archlinux.org/cassandra.git\n    $ cd cassandra\n    $ makepkg -si\n    $ archlinux-java set java-8-openjdk/jre\n\nOptionally, you may install:\n\n - [Graphviz](http://www.graphviz.org/) visualization software (its dot utility is used for\n the Decision Tree visualization in the sample REPL session)\n - [Hey](https://github.com/rakyll/hey) HTTP load generator (used for the performance tests)\n - [Scala](https://www.scala-lang.org/download/) shell (used for [the helper script](start.sc) for clustering)\n\n**Linux** Please use the package manager shipped with the distribution.\nFor example, 
you may use pacman for ArchLinux:\n\n    $ pacman -S --needed graphviz hey scala\n\n### Cluster client requirements\n\nThe cluster clients only use the Scala shell and SBT (and Git to clone the source code).\nSee the platform-specific notes below.\n\n**MacOS** You may use [Homebrew](https://brew.sh/):\n\n    $ brew install git scala sbt\n\n**Windows** You may use [Scoop](http://scoop.sh/):\n\n    $ scoop install git scala sbt\n\n**Linux** Please use the package manager shipped with the distribution.\nIf the repositories do not contain SBT, then follow the\n[Installing sbt on Linux](http://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html)\ninstructions. For example, you may use pacman for ArchLinux:\n\n    $ pacman -S --needed git scala sbt\n\nTo verify the installations, please clone the project and run SBT and the Scala script in dry-run mode:\n\n    $ git clone https://github.com/nuald/lambda-sample.git\n    $ cd lambda-sample\n    $ sbt compile\n    $ scala start.sc client --dry-run\n\nIt should create the target jar (`target/scala-2.13` directory) and configurations (`target/env` directory).\n\n## Usage\n\nRun the spec-based unit tests to ensure that the code works correctly:\n\n    $ sbt test\n\nRun the servers:\n\n    $ mosquitto\n    $ cassandra -f\n    $ redis-server\n\nConfigure a Cassandra data store:\n\n    $ cqlsh -f resources/cassandra/schema.sql\n\n*NOTE: For dropping the keyspace please use: `$ cqlsh -e \"drop keyspace sandbox;\"`.*\n\nRun the system (for convenience, all microservices are packaged into one system):\n\n    $ sbt run\n\nPlease note that Random Forest classification requires at least two classes in the input\ndata (meaning the analyzed messages should contain anomalies). 
Please use the\nMQTT producer described below to generate anomalies;\notherwise, the [Trainer](src/main/scala/analyzer/Trainer.scala) actor will report errors.\n\n### IoT emulation\n\nModify the sensor values with the Producer (it is both the MQTT publisher\nand an HTTP endpoint for controlling the event flow): http://localhost:8081\n\n![](resources/img/producer.png?raw=true)\n\nVerify the messages by subscribing to the required MQTT topic:\n\n    $ mosquitto_sub -t sensors/power\n\n### Interactive processing\n\nVerify the data stores with the Dashboard: http://localhost:8080\n\n![](resources/img/timeline.png?raw=true)\n\nVerify the entries data store using CQL:\n\n    $ cqlsh -e \"select * from sandbox.entry limit 10;\"\n\nAn example REPL session (with `sbt console`) consists of 4 parts:\n\n1. Preparing the data set\n2. Fast analysis\n3. Fitting the model (full analysis)\n4. Using the model for prediction\n\n#### Preparing the data set\n\nDump the entries into a CSV file:\n\n    $ cqlsh -e \"copy sandbox.entry(sensor,ts,value,anomaly) to 'list.csv';\"\n\nRead the CSV file and extract the features and the labels for a particular sensor:\n\n```scala\nimport smile.data._\nimport smile.data.formula._\nimport smile.data.`type`._\n\nimport scala.jdk.CollectionConverters._\n\n// Declare the class to get better visibility on the data\ncase class Row(sensor: String, ts: String, value: Double, anomaly: Int)\n\n// Read the values from the CSV file\nval iter = scala.io.Source.fromFile(\"list.csv\").getLines()\n\n// Get the data\nval l = iter.map(_.split(\",\") match {\n  case Array(a, b, c, d) =\u003e Row(a, b, c.toDouble, d.toInt)\n}).toList\n\n// Get the sensor name for further analysis\nval name = l.head.sensor\n\n// Prepare the data frame for the given sensor\nval data = DataFrame.of(\n  l.filter(_.sensor == name)\n    .map(row =\u003e Tuple.of(\n      Array(\n        row.value.asInstanceOf[AnyRef],\n        row.anomaly.asInstanceOf[AnyRef]),\n      DataTypes.struct(\n       
 new StructField(\"value\", DataTypes.DoubleType),\n        new StructField(\"anomaly\", DataTypes.IntegerType))))\n    .asJava)\n\n// Declare formula for the features and the labels\nval formula = \"anomaly\" ~ \"value\"\n\n```\n\n#### Fast analysis\n\nFast analysis (labels are ignored because we don't use any training here):\n\n```scala\n// Get the first 200 values\nval values = data.stream()\n  .limit(200).map(_.getDouble(0)).iterator().asScala.toArray\n\n// Use the fast analyzer for the sample values\nval samples = Seq(10, 200, -100)\nsamples.map(sample =\u003e analyzer.Analyzer.withHeuristic(sample, values))\n\n```\n\n#### Fitting the model\n\nFit and save the Random Forest model:\n\n```scala\nimport scala.language.postfixOps\n\nimport lib.Common.using\nimport java.io._\nimport scala.sys.process._\nimport smile.classification.randomForest\n\n// Fit the model\nval rf = randomForest(formula, data)\n\n// Get the dot diagram for a sample tree\nval desc = rf.trees()(0).dot\n\n// View the diagram (macOS example)\ns\"echo $desc\" #| \"dot -Tpng\" #| \"open -a Preview -f\" !\n\n// Set up the implicit for the using() function\nimplicit val logger = akka.event.NoLogging\n\n// Serialize the model\nusing(new ObjectOutputStream(new FileOutputStream(\"target/rf.bin\")))(_.close) { out =\u003e\n  out.writeObject(rf)\n}\n\n```\n\n#### Using the model\n\nLoad and use the model:\n\n```scala\n// Set up the implicit for the using() function\nimplicit val logger = akka.event.NoLogging\n\nimport lib.Common.using\nimport java.io._\nimport smile.classification.RandomForest\n\n// Deserialize the model\nval futureRf = using(new ObjectInputStream(new FileInputStream(\"target/rf.bin\")))(_.close) { in =\u003e\n  in.readObject().asInstanceOf[RandomForest]\n}\nval rf = futureRf.get\n\n// Use the loaded model for the sample values\nval samples = Seq(10.0, 200.0, -100.0)\nsamples.map { sample =\u003e\n  val posteriori = new Array[Double](2)\n  val prediction = rf.predict(\n    Tuple.of(\n    
  Array(sample),\n      DataTypes.struct(\n        new StructField(\"value\", DataTypes.DoubleType))),\n    posteriori)\n  (prediction, posteriori)\n}\n\n```\n\n### Processing Cluster\n\nVerify the endpoint for anomaly detection:\n\n    $ curl http://localhost:8082/\n\nCheck the latest analyzer snapshots:\n\n    $ redis-cli hgetall fast-analysis\n    $ redis-cli hgetall full-analysis\n\n*NOTE: For deleting the snapshots please use: `$ redis-cli del fast-analysis full-analysis`.*\n\nVerify the history of detected anomalies using CQL:\n\n    $ cqlsh -e \"select * from sandbox.analysis limit 10;\"\n\nIn most cases it's better to use specialized solutions for clustering,\nfor example, [Kubernetes](https://doc.akka.io/docs/akka-management/current/kubernetes-deployment/forming-a-cluster.html).\nHowever, in this sample project the server and clients are configured manually\nfor demonstration purposes.\n\nRun the server (please use the external IP):\n\n    $ scala start.sc server --server-host=\u003chost\u003e\n\nRun the client (please use the external IP):\n\n    $ scala start.sc client --server-host=\u003cserver-host\u003e --client-host=\u003cclient-host\u003e --client-port=\u003cport\u003e\n\n#### Performance testing\n\nBy default, the server runs its own analyzer; however, this may skew\nthe metrics because the local analyzer works much faster than the remote ones.\nTo normalize the metrics you may use the `--no-local-analyzer` option:\n\n    $ scala start.sc server --server-host=\u003chost\u003e --no-local-analyzer\n\nThe endpoint supports two analysis modes: the usual one and a stress-test-compatible one.\nThe latter eliminates the side effects of connectivity issues to Cassandra\nand Redis (the analyzers use cached results instead of fetching new data and recalculating):\n\n    $ curl http://localhost:8082/stress\n\nTo get the performance metrics manually, please use the hey tool:\n\n    $ hey -n 500 -c 10 -t 10 http://127.0.0.1:8082/\n    $ 
hey -n 500 -c 10 -t 10 http://127.0.0.1:8082/stress\n\nAlternatively, the stats are available via the Dashboard: http://localhost:8080\n\n![](resources/img/performance.png?raw=true)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuald%2Flambda-sample","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnuald%2Flambda-sample","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuald%2Flambda-sample/lists"}