https://github.com/armijnhemel/proximity_matcher_webservice
Webservice for proximity matching based on TLSH and vantage point trees
- Host: GitHub
- URL: https://github.com/armijnhemel/proximity_matcher_webservice
- Owner: armijnhemel
- License: apache-2.0
- Created: 2022-02-20T12:13:04.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-12-23T16:36:26.000Z (over 1 year ago)
- Last Synced: 2023-12-23T17:45:02.743Z (over 1 year ago)
- Topics: tlsh
- Language: Python
- Size: 91.8 KB
- Stars: 6
- Watchers: 4
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Webservice for proximity matching using TLSH and Vantage Point Trees
This repository contains a proof of concept webservice for proximity matching,
to quickly find the closest match for a hash in a collection of known hashes.

One use case could be to find the closest match for an open source license
text in a collection of known license texts (as license texts tend to vary
slightly), or to find a file in a collection of known files, for example files
that have already been vetted or cleared by an open source software compliance
team or security audit team.

This tool should not be used in production as is, but should serve as an
example of how to do these things. If you want to use it in production, keep
in mind that this service has no built-in security or user management, so it
should not be used in an untrusted environment.

## TLSH and exact matches
TLSH stands for "Trendmicro Locality Sensitive Hash" and, as the name implies,
it is a locality sensitive hash (LSH). Whereas regular cryptographic hashes
such as MD5 or SHA256 were designed in such a way that changing a single
byte results in vastly different hashes (barring some edge cases), TLSH
was designed so that similar files generate similar hashes, making
it possible to compute a distance between two hashes (and thus files).

With regular cryptographic hashes it is only possible to answer the question
"are these two files identical?", but with TLSH it is also possible to answer
the question "how close are these two files?".

When combining these two approaches a few questions become much easier to
answer, such as "how close is my source code tree to upstream?", or "is there
a maintenance risk?", and so on.

As TLSH is significantly slower than MD5, SHA1 or SHA256, it usually makes
sense to first filter using regular hashes and only use TLSH when there is
no exact match.

## Vantage Point Trees, HAC-T and other related work
Given a TLSH hash, it is easy to compare it to a list of TLSH hashes
sequentially, but as more hashes are added this becomes increasingly
expensive. Researchers at Trendmicro ran into the same problem and wanted to
fix it: as malware continuously morphs to evade detection by regular
hashes, TLSH is used to find close matches to known malware. To do efficient
lookups and correctly label malware they use an algorithm named "HAC-T"[2],
which is based on Vantage Point Trees (VPTs) and clustering. Vantage Point
Trees with TLSH allow much quicker lookups than a regular list[3].

Because there was no need for clustering here (which requires labeling
files), only a modified version of the Vantage Point Tree part of HAC-T is
used. The modifications are:

* removing dead code
* inlining code for performance
* pickling support
* cleanups to make things more Python and less C++

The original code can be found in the TLSH GitHub repository in the
directory `tlshCluster`.

# Running the proximity matching server
## Requirements
To run the proximity matching server you need a recent Python 3, plus a few
external packages:

* `click`
* `flask`
* `gevent`
* `gunicorn`
* `requests`
* `tlsh`

Be warned: some of these might be named differently on different distributions,
and some distributions ship old versions of `tlsh`. In case you are using
the Nix package manager there is also a `shell.nix` file that you can use to
quickly set up your environment.

## Preparing the data set
To prepare the data you first need a list of TLSH hashes. This can
be done, for example, by walking a directory of files, computing the TLSH
checksum for each file and writing the checksums to a file. See the
directory `examples` for inspiration.

After you have a file with TLSH hashes, run the `create_vpt_pickle.py` tool
to create a vantage point tree and save it to a Python pickle. As creating
the vantage point tree can take some time, this is not something that should
be done at startup of the webservice (unless it is the only way, see below).

The `create_vpt_pickle.py` tool can be invoked as follows:

```
$ python3 create_vpt_pickle.py -i /tmp/tlsh-hashes.txt -o /tmp/licenses-tlsh.pickle
```

By default the resulting `.pickle` is optimized, resulting in a smaller pickle
file, faster loading and less memory usage. The optimization process itself
can consume a very significant amount of memory. If optimization is causing
problems, then it can be turned off using the `--no-optimize` flag, as follows:

```
$ python3 create_vpt_pickle.py -i /tmp/tlsh-hashes.txt -o /tmp/licenses-tlsh.pickle --no-optimize
```

The data structure uses `__slots__` to squeeze more performance out of the
code. This means that the data structure is static and cannot be changed while
the program is running. If a dataset needs to be changed or updated, it should
be regenerated and the program needs to be restarted.

## Starting the server
### Configuring the server
The server needs to load data to be useful. This data should be in pickle form
and can be generated by `create_vpt_pickle.py` as described above.

The location of the pickle can be set through a configuration file. The
location of the configuration file in turn can be set through an environment
variable.

An example configuration file can be found in the file `proximity_config.py`.
Currently it looks like this:

```
TLSH_PICKLE_FILE = '/tmp/licenses-tlsh.pickle'
KNOWN_TLSH_HASHES = '/tmp/hashes.txt'
```

`TLSH_PICKLE_FILE` points to the pickle file with TLSH hashes as created by
`create_vpt_pickle.py`.

`KNOWN_TLSH_HASHES` is only used by the optimized version of the server and is
a file with one TLSH hash per line. These are the known hashes which are cached
to avoid tree traversals (see below).

The location of the configuration is set by the environment variable
`PROXIMITY_CONFIGURATION`, for example:

```console
export PROXIMITY_CONFIGURATION=/tmp/proximity_config.py
```

### Flask
```
$ export PROXIMITY_CONFIGURATION=/path/to/config
$ export FLASK_APP=proximity_server.py
$ flask run
```

This will launch the webservice, which can then be accessed on the advertised
URL.

There is also an optimized version of the server called
`proximity_server_opt.py` that keeps an additional list of TLSH hashes that
are known in the data set in memory, to be able to very quickly send responses
for already known hashes. Currently the list of the hashes (one TLSH hash per
line) is hardcoded to `/tmp/tlsh-hashes.txt`. This will be changed in the
future. Starting the webservice should be adapted accordingly:

```
$ export PROXIMITY_CONFIGURATION=/path/to/config
$ export FLASK_APP=proximity_server_opt.py
$ flask run
```

The optimized version can use significantly more memory. If it is already
possible to filter known hashes before sending the hashes to the webservice
then it is recommended to use the regular version instead.

### Gunicorn
When using the Gunicorn WSGI server it is easier to use multiple workers.
Please note: right now a complete copy of all the data is kept in memory per
worker. The `--preload` option is used to reduce memory usage. Sharing data to
reduce memory usage further is a TODO.

```
$ export PROXIMITY_CONFIGURATION=/path/to/config
$ gunicorn -w 4 -b 127.0.0.1:5000 --preload proximity_server:app
```

or the optimized version (see documentation above):
```
$ export PROXIMITY_CONFIGURATION=/path/to/config
$ gunicorn -w 4 -b 127.0.0.1:5000 -t 60 --preload proximity_server_opt:app
```

Please note that the optimized server uses the `-t` setting, which sets
a timeout.

If you want to use `keep-alive` connections, you need to add the
`--worker-class gevent` parameter, for example:

```
$ export PROXIMITY_CONFIGURATION=/path/to/config
$ gunicorn -w 8 -b 127.0.0.1:5000 --worker-class gevent --preload proximity_server_opt:app
```

Whether or not this will improve performance a lot depends on your data.
As the program is CPU-bound it might not matter that much, according to
blog posts[4].

# Interpreting results
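The responses shown below can be post-processed with a few lines of standard-library Python. The distance threshold used here is an arbitrary illustration; the service itself currently applies no thresholds.

```python
import json

# One response per line, as returned by the webservice
# (samples taken from this README).
responses = [
    '{"distance": 24, "match": true, "tlsh": "T1D931A61B13441BE75BF2178236AABDC4B08DC02EBB276E01186DF388537B12ED4B7190"}',
    '{"distance": 0, "match": true, "tlsh": "T16082932E774543B206C203916A4F6CCFA32BD478723E11656469C1AD236BE35C3BFAD9"}',
]

THRESHOLD = 30  # hypothetical cut-off for "close enough"

for line in responses:
    result = json.loads(line)
    if not result['match']:
        continue  # the parameter was not a valid TLSH hash
    if result['distance'] == 0:
        print('exact match:', result['tlsh'])
    elif result['distance'] <= THRESHOLD:
        print('near match at distance', result['distance'])
```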
The results of the webservice are fairly minimal JSON and look like
this:

```
{"distance": 24, "match": true, "tlsh": "T1D931A61B13441BE75BF2178236AABDC4B08DC02EBB276E01186DF388537B12ED4B7190"}
{"distance": 0, "match": true, "tlsh": "T16082932E774543B206C203916A4F6CCFA32BD478723E11656469C1AD236BE35C3BFAD9"}
{"distance": 0, "match": true, "tlsh": "T13BB2553EA74513B206D202505A0F94DFE32BD07C32678A646499C15D23ABD3593FFBEA"}
{"distance": 0, "match": true, "tlsh": "T1D731658B13481BA756E216D3B266BDC4F15AE02E3B135E02186DE394576B83EC0F7495"}
{"distance": 17, "match": true, "tlsh": "T15231B54702841FA30AE2174231AAAEC0708DC02D3F236E041C7AF244537B42FD9B7081"}
```

Each line is a valid JSON response (as returned by the webserver). The fields:
1. `distance`: the TLSH distance. `0` means that there is no difference with a
file known to the webservice.
2. `match`: whether or not there was a match. Right now, if a valid TLSH hash
was used as a parameter, the result is always `true`. Only if the parameter
is not a valid TLSH hash will the result be `false`. If thresholds are
implemented (which is the plan) this might change.
3. `tlsh`: the TLSH hash of the match found by the webservice.

As the webservice does not include any of the "business logic" but is only used
for finding the closest match, the data has to be further
processed, for example by cross-correlating it with other data. Alternatively the
server could be extended to also return this information, for example by
loading it from an external database or some other resource.

# Caveats
There are some caveats when creating the database.
## Not all files are suitable
There are some files that are not very suitable for this approach because of
the file format: the structure of the file either enforces a lot of
similarity, or compression ensures that there is very little, or a
combination of both.

As an example, PNG files always start with a very similar header (often
differing in only a few bytes), which will cause TLSH to consider those parts
very similar, but the payload of the data (the `IDAT` chunks) is
zlib-compressed, and compression might lead to vastly different outcomes.
Metadata, such as XMP data (if not compressed), could also lead to strange
results.

Examples of other files that are not very suitable are files with a lot of
markup or that are generated, such as SVG files or Visual Studio build files.

## Unbalanced trees
Although the vantage point trees try to be balanced, they are not necessarily
balanced. Because the VPT is a recursive data structure stored in a pickle, you
could get errors like the following:

```
Traceback (most recent call last):
File "/home/armijn/git/proximity_matcher_webservice/src/create_vpt_pickle.py", line 34, in <module>
vpt.pickle_tree(root, pickle_file)
File "/home/armijn/git/proximity_matcher_webservice/src/vpt/__init__.py", line 111, in pickle_tree
pickle.dump(root, pickle_file)
RecursionError: maximum recursion depth exceeded while calling a Python object
```

This is observed when, for example, first sorting the TLSH hashes. Some
proposed solutions:

1. do not sort the TLSH hashes
2. use less data
3. do not use a pickle, but recreate the vantage point tree whenever the
webservice is started. This might be expensive (several minutes) if a lot of
data is involved.
4. do not include useless data where there will not be a useful match such as
graphics files
5. partition the data and use multiple instances of the webservice. A good
partitioning could for example be by programming language, or extension.
6. raise the recursion limit (currently already raised to `1000` above the
system default)

# Comparison to snippet matching

Although the proximity matching method seems similar to what snippet scanners
do, there are some very clear differences:

1. the method used here works on whole files only, whereas snippet scanners work
by searching for snippets (hence the name) of known files
2. the result is a (shortest) distance to a known file in a set of files,
while the result of snippet scanners is a match with for example a
percentage ("files A and B match for 97%") and line numbers or byte ranges
where the files match.
3. proximity matching will not easily or reliably detect reuse of
small pieces of code.

Snippet scanning and proximity matching each have their own uses and can be
used to solve different problems. They complement each other but should *not*
be used in place of each other.

# Defensive publications
1. Automated clearing of software source code files using proximity matching
and parsing file contents

# Links
1. official TLSH website
2. papers describing HAC-T and vantage point trees
3. explanation of TLSH and vantage point trees (first 13 minutes)
4.