{"id":21244587,"url":"https://github.com/lac-dcc/lushu","last_synced_at":"2025-09-06T03:46:42.688Z","repository":{"id":149993492,"uuid":"615437873","full_name":"lac-dcc/lushu","owner":"lac-dcc","description":"System to recognize infinite languages and react to string events","archived":false,"fork":false,"pushed_at":"2023-12-10T20:02:40.000Z","size":1927,"stargazers_count":26,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-05T17:43:31.980Z","etag":null,"topics":["reactive-programming","string-event","unbound-data"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lac-dcc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-17T17:36:07.000Z","updated_at":"2024-12-31T15:20:23.000Z","dependencies_parsed_at":"2023-07-26T21:01:23.810Z","dependency_job_id":"af0a0850-3168-413f-8f20-fe8eeb428e1c","html_url":"https://github.com/lac-dcc/lushu","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lac-dcc/lushu","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Flushu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Flushu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Flushu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Flushu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/lac-dcc","download_url":"https://codeload.github.com/lac-dcc/lushu/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lac-dcc%2Flushu/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264666021,"owners_count":23646570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["reactive-programming","string-event","unbound-data"],"created_at":"2024-11-21T01:28:58.742Z","updated_at":"2025-07-10T21:30:54.520Z","avatar_url":"https://github.com/lac-dcc.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cspan\u003e\u003cimg src=\"docs/images/lushu-logo.png\" alt=\"Lushu Logo\" width=\"100\" height=\"100\"\u003e\n\u003c/div\u003e\n\n# Lushu\n\n_Lushu_ (short for the Chinese 记录树, 录树) is a system that detects and\nreacts to user-defined string events in a never-ending stream of text, in real\ntime. The idea is to have pluggable reactions (JVM functions) triggered\nwhenever a string event occurs. A reaction can obfuscate the string, count\noccurrences, send an alert email, etc.\nTo learn more about Lushu, read its companion [paper](https://homepages.dcc.ufmg.br/~fernando/publications/papers/Lushu23.pdf) or watch its [video tutorial](https://youtu.be/s17i2BhI_Eo).\n\n## Running\n\n### Video Tutorial\n\nCheck out this [3-minute Lushu Tutorial](https://youtu.be/s17i2BhI_Eo) on\nYouTube.\n\n### Simulate Lushu\n\nRun `gradle fatJar` to generate the file `./Lushu/build/libs/Lushu.jar`. 
Run it\nfollowing the example:\n\n```sh\ncat example/log/test/cpf-is-sensitive.log | \\\n  java -jar ./Lushu/build/libs/Lushu.jar \\\n    ./example/config.yaml ./example/log/train/cpf-is-sensitive.log\n```\n\nYou should see output like the following:\n\n```\nTraining Lushu Grammar with file './example/log/train/cpf-is-sensitive.log'\n----------------------------------------\nTraining with log: The user \u003cs\u003e000.000.000-01\u003c/s\u003e logged in on 2023-04-29 21:42:04.\nTraining with log: A new user \u003cs\u003e586.431.715-65\u003c/s\u003e was created on 2023-04-30 12:48:53.\nTraining with log: The user \u003cs\u003e000.000.000-01\u003c/s\u003e sent a message to user \u003cs\u003e417.231.715-86\u003c/s\u003e on 2023-04-30 12:52:47.\nTraining with log: The product with ID RZbhCMwa was added to the cart by user \u003cs\u003e316.819.054-49\u003c/s\u003e on 2023-04-30 12:53:36.\n----------------------------------------\nFinished training grammar\n\nA new user ***** was created on 2023-04-30 13:16:51.\nA payment of $1957800,00 was processed on 2023-04-30 13:16:51.\nThe user ***** downloaded video.mp4 on 2023-04-30 13:16:51.\n...\n```\n\n### Generate example Lushu Grammar\n\nRun `gradle grammarJar` to generate the file\n`./Lushu/build/libs/Grammar.jar`. Run it following the example:\n\n```sh\ncat example/log/test/simple-ip.log | \\\n  java -jar ./Lushu/build/libs/Grammar.jar ./example/config.yaml\n```\n\nYou should see output like the following:\n\n```\nR0 :: [023]{4,4}[-]{1,1}[04]{2,2}[-]{1,1}[29]{2,2} | R1\nR1 :: [0]{2,2}[:]{1,1}[0]{2,2}[:]{1,1}[0]{2,2}[,]{1,1}[123456789]{3,3} | R2\nR2 :: [RScdeimov]{4,8} | R3\nR3 :: [ehoqrstu]{5,7} | R4\nR4 :: [acfmoryz]{4,5} | R5\nR5 :: [0123456789]{1,3}[.]{1,1}[0123456789]{1,3}[.]{1,1}[0123456789]{1,3}[.]{1,1}[0123456789]{1,3} | [glo]{3,3} | R6\nR6 :: [ehr]{4,4} | R7\nR7 :: [abl]{3,3} | R8\nR8 :: [abl]{3,3}\n```\n\nNote that the first production of the grammar in rule `R5` has the format of an\nIP address. 
This is because the file `example/log/test/simple-ip.log` we gave as\ninput contains examples of IP addresses at that position.\n\n### Run the Merger\n\nRun `gradle mergerJar` to generate the file `./Lushu/build/libs/Merger.jar`. Run\nit following the example:\n\n```sh\necho '8.8.8.8 0.0.0.0' | java -jar ./Lushu/build/libs/Merger.jar ./example/config.yaml\n```\n\nYou should get the result:\n\n```\n[08]{1,1}[.]{1,1}[08]{1,1}[.]{1,1}[08]{1,1}[.]{1,1}[08]{1,1}\n```\n\nNotice that both IP addresses `8.8.8.8` and `0.0.0.0` were merged into a single\nregular expression. Try different combinations and different numbers of words!\nHere are some more examples of words you can input:\n\n- Date: `2023/03/26 2023/02/26 2023/12/11 1999/09/09`\n- Timestamp: `00:00:00 12:34:56 12:34:57`\n- Key in KV database: `key1#secondary key2#secondary`\n\nAlso, try specifying different YAML configuration files. You may find it easier\nto edit the example file in `./example/config.yaml`.\n\n## Testing\n\nTo test, run `gradle test`. Find all source code for the tests under\n`./Lushu/src/test/`.\n\n## Theory\n\nLushu includes a novel way to merge regular expressions, based on a lattice we\ncall the Regex Lattice. The meet of two regexes in the Regex Lattice indicates\nthe result of their merge. A single word may be composed of multiple lattice\nnodes. It all depends on how we structure the lattice. For instance, if we say\nthat punctuation characters are \"blacklisted\" by \"alpha\" characters, then their\nmeet will go to the lattice top. This can be configured by the following\n`config.yaml` file:\n\n```yaml\nlatticeBase:\n  alpha:\n    interval: 1,32\n    charset: \"abcdefghijklmnopqrstuvwxyz\"\n  punct:\n    interval: 1,2\n    charset: \"\\\"!#\\\\$%\u0026'()*+,-./:;\u003c\u003e=?@\\\\[\\\\]^_`{}|~\\\\\\\\\"\n    blacklist:\n      - alpha\n```\n\nArbitrary text does not initially come in the format we require, so the first\nthing we do is split it into words separated by spaces. 
We call these words\n_tokens_. Each token might be composed of multiple lattice nodes. For instance,\nsuppose we have two tokens, `ab:c` and `de:fg`. They are first transformed to\n_primitive_ lattice nodes:\n\n```\n[a]{1,1}[b]{1,1}[:]{1,1}[c]{1,1}\n[d]{1,1}[e]{1,1}[:]{1,1}[f]{1,1}[g]{1,1}\n```\n\nThese are called _primitive_ because the charset for each node is a single\ncharacter, and the interval is (1,1). Then, we _reduce_ these primitive nodes\ninto a more compact format. We collapse as much as possible, using the lattice\nmeet to check if the GLB is the Top node. If it is the Top node, we do not merge\nthe nodes. For our example:\n\n```\nreduce([a]{1,1}[b]{1,1}[:]{1,1}[c]{1,1}) ==\u003e\n  [ab]{2,2}[:]{1,1}[c]{1,1}\n\nreduce([d]{1,1}[e]{1,1}[:]{1,1}[f]{1,1}[g]{1,1}) ==\u003e\n  [de]{2,2}[:]{1,1}[fg]{2,2}\n```\n\nFinally, to turn these two regular expressions into one, we perform a _zip_ and\nthen a _map_ operation (in the functional sense). The _zip_ operation checks\nthat the lists have the same size and forms pairs like `([ab]{2,2},\n[de]{2,2})`. For each pair, we map their elements to their lattice meet. In\na pseudo-functional syntax:\n\n```\nmap(zip(nodes1, nodes2), (first, second) =\u003e {\n     lattice.meet(first, second)\n})\n```\n\nIf the meet of any pair is the Top node, the words are not mergeable. Otherwise,\nwe merge them. The result for our example would be:\n\n```\nmerge(ab:c, de:fg) =\n  map(zip(reduce(ab:c), reduce(de:fg)), (first, second) -\u003e {\n    lattice.meet(first, second).then { it -\u003e\n        when(it) {\n            is Top: not mergeable\n            else: it\n        }\n    }\n  })\n\n==\u003e merge(ab:c, de:fg) = [abde]{2,2}[:]{1,1}[cfg]{1,2}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flac-dcc%2Flushu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flac-dcc%2Flushu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flac-dcc%2Flushu/lists"}