https://github.com/lac-dcc/lushu

System to recognize infinite languages and react to string events
https://github.com/lac-dcc/lushu

reactive-programming string-event unbound-data

Last synced: 4 months ago
JSON representation

System to recognize infinite languages and react to string events

Host: GitHub
URL: https://github.com/lac-dcc/lushu
Owner: lac-dcc
License: gpl-3.0
Created: 2023-03-17T17:36:07.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-12-10T20:02:40.000Z (about 2 years ago)
Last Synced: 2025-04-05T17:43:31.980Z (9 months ago)
Topics: reactive-programming, string-event, unbound-data
Language: Jupyter Notebook
Homepage:
Size: 1.84 MB
Stars: 26
Watchers: 2
Forks: 1
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          


  





# Lushu

_Lushu_ (short for the Chinese 记录树, 录树), is a system that detects and

reacts to user-defined string events in a never-ending stream of text, in real

time. The idea is to have pugglable reactions (JVM functions) be triggered

whenever the string event occurs. That reaction can be obfuscation of the

string, counting occurrences, sending an alert email, etc.

To know more about Lushu, read its companion [paper](https://homepages.dcc.ufmg.br/~fernando/publications/papers/Lushu23.pdf) or watch its [video tutorial](https://youtu.be/s17i2BhI_Eo).

## Running

### Video Tutorial

Check out this [3-minute Lushu Tutorial](https://youtu.be/s17i2BhI_Eo) on

YouTube.

### Simulate Lushu

Run `gradle fatJar` to generate the file `./Lushu/build/libs/Lushu.jar`. Run it

following the example:

```sh

cat example/log/test/cpf-is-sensitive.log | \

  java -jar ./Lushu/build/libs/Lushu.jar \

    ./example/config.yaml ./example/log/train/cpf-is-sensitive.log

```

You should see an output like the following:

```

Training Lushu Grammar with file './example/log/train/cpf-is-sensitive.log'

----------------------------------------

Training with log: The user 000.000.000-01 logged in on 2023-04-29 21:42:04.

Training with log: A new user 586.431.715-65 was created on 2023-04-30 12:48:53.

Training with log: The user 000.000.000-01 sent a message to user 417.231.715-86 on 2023-04-30 12:52:47.

Training with log: The product with ID RZbhCMwa was added to the cart by user 316.819.054-49 on 2023-04-30 12:53:36.

----------------------------------------

Finished training grammar

A new user ***** was created on 2023-04-30 13:16:51.

A payment of $1957800,00 was processed on 2023-04-30 13:16:51.

The user ***** downloaded video.mp4 on 2023-04-30 13:16:51.

...

```

### Generate example Lushu Grammar

Run `gradle grammarJar` to generate the file

`./Lushu/build/libs/Grammar.jar`. Run it following the example:

```sh

cat example/log/test/simple-ip.log | \

  java -jar ./Lushu/build/libs/Grammar.jar ./example/config.yaml

```

You should see an ouput like the following:

```

R0 :: [023]{4,4}[-]{1,1}[04]{2,2}[-]{1,1}[29]{2,2} | R1

R1 :: [0]{2,2}[:]{1,1}[0]{2,2}[:]{1,1}[0]{2,2}[,]{1,1}[123456789]{3,3} | R2

R2 :: [RScdeimov]{4,8} | R3

R3 :: [ehoqrstu]{5,7} | R4

R4 :: [acfmoryz]{4,5} | R5

R5 :: [0123456789]{1,3}[.]{1,1}[0123456789]{1,3}[.]{1,1}[0123456789]{1,3}[.]{1,1}[0123456789]{1,3} | [glo]{3,3} | R6

R6 :: [ehr]{4,4} | R7

R7 :: [abl]{3,3} | R8

R8 :: [abl]{3,3}

```

Note that the first production of the grammar in rule `R5` has the format of an

IP address. This is because the file `example/log/test/simple-ip.log` we gave as

an input contains examples of IP addresses at that position.

### Run the Merger

Run `gradle mergerJar` to generate the file `./Lushu/build/libs/Merger.jar`. Run

it following the example:

```sh

echo '8.8.8.8 0.0.0.0' | java -jar ./Lushu/build/libs/Merger.jar ./example/config.yaml

```

You should get the result:

```

[08]{1,1}[.]{1,1}[08]{1,1}[.]{1,1}[08]{1,1}[.]{1,1}[08]{1,1}

```

Notice that both IP addresses `8.8.8.8` and `0.0.0.0` were merged into a single

regular expression. Try different combinations, and different number of words!

Here are some more examples of words you can input:

- Date: `2023/03/26 2023/02/26 2023/12/11 1999/09/09`

- Timestamp: `00:00:00 12:34:56 12:34:57`

- Key in KV database: `key1#secondary key2#secondary`

Also, try specifying different YAML configuration files. You may find it easier

to edit the example file in `./example/config.yaml`.

## Testing

To test, run `gradle test`. Find all source code for the tests under

`./Lushu/src/test/`.

## Theory

Lushu includes a novel way to merge regular expressions, based on a lattice we

call the Regex Lattice. The meet of two regexes in the Regex Lattice indicates

the result of their merge. A single word may be composed of multiple lattice

nodes. It all depends on how we structure the lattice. For instance, if we say

that punctuations are "blacklisted" by "alpha" characters, then their meet will

go to the lattice top. This can be configured by the following `config.yaml`

file:

```yaml

latticeBase:

  alpha:

    interval: 1,32

    charset: "abcdefghijklmnopqrstuvwxyz"

  punct:

    interval: 1,2

    charset: "\"!#\\$%&'()*+,-./:;<>=?@\\[\\]^_`{}|~\\\\"

    blacklist:

      - alpha

```

Arbitrary text is not in the format we require, originally. So the first thing

we do with text is divide it into words separated by space. We call these words

_tokens_. Each token might be composed of multiple lattice nodes. For instance,

suppose we have two tokens, `ab:c` and `de:fg`. They are first transformed to

_primitive_ lattice nodes:

```

[a]{1,1}[b]{1,1}[:]{1,1}[c]{1,1}

[d]{1,1}[e]{1,1}[:]{1,1}[f]{1,1}[g]{1,1}

```

These are called _primitive_ because the charset for each node is a single

character, and the interval is (1,1). Then, we _reduce_ these primitive nodes

into a more compact format. We collapse as much as possible, using the lattice

meet to check if the GLB is the Top node. If it is the top node, we do not merge

the nodes. For our example:

```

reduce([a]{1,1}[b]{1,1}[:]{1,1}[c]{1,1}) ==>

  [ab]{2,2}[:]{1,1}[c]{1,1}

reduce([d]{1,1}[e]{1,1}[:]{1,1}[f]{1,1}[g]{1,1}) ==>

  [de]{2,2}[:]{1,1}[fg]{2,2}

```

Finally, two turn these two regular expressions into one, we perform a _zip_ and

then a _map_ operation (in the functional sense). The _zip_ operation checks

that the lists must have the same size and forms pairs like `([ab]{2,2},

[de]{2,2})`. For each pair, we map their elements to their lattice meet. In

a pseudo-functional syntax:

```

map(zip(nodes1, nodes2), (first, second) => {

     lattice.meet(first, second)

})

```

If the lattice goes to top, the words are not mergeable. Otherwise, we merge

them. The result for our example would be:

```

merge(ab:c, de:fg) =

  map(zip(reduce(ab:c), reduce(de:fg)), (first, second) -> {

    lattice.meet(first, second).then { it ->

        when(it) {

            is Top: not mergeable

            else: it

        }

  })

==> merge(ab:c, de:fg) = [abde]{2,2}[:]{1,1}[cfg]{1,2}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lac-dcc/lushu

Awesome Lists containing this project

README