https://github.com/dataunitylab/semantic-regex

Host: GitHub
URL: https://github.com/dataunitylab/semantic-regex
Owner: dataunitylab
License: mit
Created: 2022-05-12T21:07:39.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2026-01-12T21:42:20.000Z (3 months ago)
Last Synced: 2026-01-13T01:55:41.995Z (3 months ago)
Language: Python
Size: 548 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
- Citation: CITATION.cff

# Learning from Uncurated Regular Expressions

[![CI](https://github.com/dataunitylab/semantic-regex/actions/workflows/ci.yml/badge.svg)](https://github.com/dataunitylab/semantic-regex/actions/workflows/ci.yml)
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/dataunitylab/semantic-regex/main.svg)](https://results.pre-commit.ci/latest/github/dataunitylab/semantic-regex/main)

Dependencies of all Python code are managed with [Pipenv](https://pipenv.pypa.io/en/latest/) and can be installed with `pipenv install`.
Note that the dataset from the [Sherlock](https://github.com/mitmedialab/sherlock-project) project should be available in a copy of that repository placed alongside the directory for this project.
[`jq`](https://jqlang.github.io/jq/) is also required for some JSON processing.

## Model training

1. Download all regular expressions from regex101

`./download_patterns.sh`

This will create a directory `regex101`, which holds the individual regular expressions, and a file `patterns.json`, which contains only the expression strings.

2. Compile a database of all the downloaded regular expressions

`pipenv run python compile_db.py < patterns.json > patterns_final.json`

`patterns_final.json` contains the subset of the expressions in `patterns.json` that are supported by Hyperscan.
This step will also create `hs.db`, which holds the compiled regular expressions used during preprocessing.
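
The filtering idea in this step can be sketched as follows. This is a minimal illustration only, not the code in `compile_db.py`: it uses Python's built-in `re` module as a stand-in for Hyperscan (so the accepted subset will differ), it assumes the input is a plain list of expression strings, and the helper name `filter_supported` is hypothetical.

```python
import re


def filter_supported(patterns):
    """Keep only the patterns that the regex engine can compile.

    Assumption: Python's `re` is used here instead of Hyperscan,
    purely to show the keep-what-compiles idea.
    """
    supported = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error:
            continue  # skip expressions the engine rejects
        supported.append(pattern)
    return supported


if __name__ == "__main__":
    # Hypothetical sample input; the second entry has an unclosed group.
    sample = [r"^\d{5}$", "(unclosed", r"[a-z]+@[a-z]+\.com"]
    print(filter_supported(sample))
```

Dropping unsupported expressions up front keeps the list of surviving patterns aligned with what the compiled database can actually match.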

3. Preprocess the data to generate feature vectors

`pipenv run python preprocess.py train`

This will generate `preprocessed_train.txt`, which contains all the feature vectors extracted using the regular expressions.
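
As a rough illustration of regex-derived features, the sketch below turns a column of values into a vector with one entry per pattern. It is an assumption about the general idea, not a description of `preprocess.py`: the function name `column_features` is hypothetical, using the fraction of matching values as the feature is a guess, and Python's `re` stands in for the compiled Hyperscan database.

```python
import re


def column_features(values, patterns):
    """Return one feature per pattern: the fraction of values it matches.

    Assumption: match fraction is an illustrative feature choice only.
    """
    compiled = [re.compile(p) for p in patterns]
    if not values:
        # An empty column yields an all-zero vector of the right length.
        return [0.0] * len(compiled)
    return [
        sum(1 for value in values if rx.search(value)) / len(values)
        for rx in compiled
    ]


if __name__ == "__main__":
    # Hypothetical column values and patterns.
    column = ["12345", "abc", "90210", "x1"]
    print(column_features(column, [r"^\d{5}$", r"\d"]))
```

With a scheme like this, the vector length equals the number of patterns, so each position in the vector corresponds to one regular expression.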

4. Train the model on the extracted features

`pipenv run python train.py`

The model architecture will be stored in `nn_model_sherlock.json` with the weights in `nn_model_sherlock.weights.keras`.

## Evaluation

First, the test data must be preprocessed.

`pipenv run python preprocess.py test`

Then, the model can be evaluated.

`pipenv run python test.py`

## Model explanation

Explanations of predictions for an individual class can be generated using [SHAP](https://shap.readthedocs.io/en/latest/).
First, follow the steps for training the model above.
The file `patterns_final.json` will be used to match the patterns back to the original regular expressions.

`pipenv run python find_patterns.py > pattern_ids.txt`

This file of pattern IDs will then be used to label the SHAP plot with the ID of the regular expression.
To generate the SHAP plot in `shap.png`, run the command below, passing one of the semantic types defined by Sherlock as the argument.

`pipenv run python explain.py `

The IDs displayed in the SHAP plot can be used to look up a regular expression by ID in the `regex101/patterns` directory or to view it directly on regex101 at the URL `https://regex101.com/library/`.
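
The step of matching final patterns back to their original IDs can be pictured with the sketch below. Everything in it is an assumption made for illustration: the function name `pattern_ids`, the in-memory mapping from ID to expression string, and the sample IDs are hypothetical, and `find_patterns.py` may work differently.

```python
def pattern_ids(final_patterns, id_to_pattern):
    """Return, for each final pattern in order, the ID of a matching source expression.

    Assumption: IDs are resolved by exact string equality; patterns with
    no matching source expression map to None.
    """
    pattern_to_id = {}
    for pattern_id, pattern in id_to_pattern.items():
        # If several IDs share the same expression string, keep the first.
        pattern_to_id.setdefault(pattern, pattern_id)
    return [pattern_to_id.get(pattern) for pattern in final_patterns]


if __name__ == "__main__":
    # Hypothetical IDs and expressions.
    source = {"id1": "x", "id2": r"\d+", "id3": "x"}
    for pattern_id in pattern_ids([r"\d+", "x"], source):
        print(pattern_id)
```

Emitting the IDs in the same order as the final patterns means line *n* of the output can label feature *n* in a plot.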
