https://github.com/sri-csl/safedocs-yarn-public

Host: GitHub
URL: https://github.com/sri-csl/safedocs-yarn-public
Owner: SRI-CSL
License: other
Created: 2022-03-29T12:57:49.000Z (about 3 years ago)
Default Branch: slim
Last Pushed: 2022-05-04T14:46:58.000Z (about 3 years ago)
Last Synced: 2025-04-13T12:27:33.634Z (about 1 month ago)
Language: Python
Size: 803 KB
Stars: 1
Watchers: 15
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

This directory contains a dynamic analysis framework built using DynamoRIO.
Together, these tools and techniques are called YARN.

For optimal use, YARN requires specific parser binary builds to be
located in `parsers/xpdf-4.01.01_build/xpdf-4.01.01.zip`,
`parsers/mupdf-1.16.1_build/mupdf-1.16.1.zip`, and
`parsers/poppler0840_build/poppler-0840.zip`. If you have a copy of
the `slim` branch, these files will be missing.

Please note that this README and the one at tracetools/README.md may
be out-of-date.

This is messy and crufty research code. Use at your own risk.

# Requirements

- Docker, installed and running.
- Git lfs extension installed
- Binary ninja license (see the "Installing" section for information on
where it should be copied). To avoid API version mismatch issues,
binary ninja binaries are included in this repository. A license is
not strictly required unless you plan on developing/using anything
that relies on signatures/moment-of-recognition: either
`tracetools/tools/pt_tracker.py` or
`tracetools/tools/poppler_jpeg.py`.

The DynamoRio-based YARN instrumentation tool does not work with
macOS, even when run indirectly via docker.

# git clone-ing the source code

Make sure you have the git lfs extension installed before cloning
anything. If it isn't installed then none of the *.zip files in
./parsers will be valid zip files.
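
A quick way to confirm that the archives were fetched correctly is to check whether each one is a real zip file or still a Git LFS pointer. The sketch below is not part of YARN; the function names are hypothetical, and it assumes only that an unfetched LFS file is a small text pointer beginning with the standard `version https://git-lfs.github.com/spec/v1` line.

```python
import zipfile
from pathlib import Path

# First bytes of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_archive(path):
    """Return 'zip', 'lfs-pointer', or 'unknown' for the file at `path`."""
    path = Path(path)
    if zipfile.is_zipfile(path):
        return "zip"
    with path.open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"
    return "unknown"


def check_parsers(root="parsers"):
    """Map every *.zip file under `root` to its classification."""
    return {str(p): classify_archive(p) for p in sorted(Path(root).rglob("*.zip"))}


if __name__ == "__main__":
    # Run from the repository root; prints one line per archive found.
    for name, kind in check_parsers().items():
        print(f"{kind:12} {name}")
```

Any archive reported as `lfs-pointer` was cloned without the git lfs extension and still needs to be fetched.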

# Installing

If you have one, copy your binary ninja license to
`third-party/binaryninja/license.dat` (both headless and regular
binary ninja packages work).

After you have properly installed docker on your system, build the docker image:
```bash
./build.sh
```

This will build a docker image for the DynamoRIO-based tools by
default. The image is named `mr_memtrace-analysis-dev`.

If you find yourself needing to debug the `build.sh` file, run it with
the `--no-cache` option to force docker to rebuild the image from
scratch.

# Running the YARN docker container

Starting the YARN docker container:
```bash
docker run -it --rm mr_memtrace-analysis-dev:latest /bin/bash
```

If you didn't have a binary ninja license when you built your docker
image but later obtained one, you can mount your local `license.dat`
file to the container's `/home/user/.binaryninja/license.dat` using
docker's `-v` option.

Assuming `license.dat` is in your current directory:

```bash
docker run -it --rm -v"$(pwd)/license.dat:/home/user/.binaryninja/license.dat" mr_memtrace-analysis-dev:latest /bin/bash
```

If you plan on doing any tool development, you can mount your local
memtrace directory (repository root) to `/processor`.

```bash
docker run -it --rm -v"$(pwd):/processor" mr_memtrace-analysis-dev:latest /bin/bash
```

This should be run from the root of the memtrace directory in which
you will be working on memtrace or memtrace-tool scripts. If you edit
any files that relate to the dynamorio instrumentation (e.g.,
mem-trace.c), you will need to run `make` from your container's
`/processor` directory.

Note: if the filesystem where your docker containers live has limited
storage you may wish to tell docker to store the results/logs
generated by memtrace (stored in the container's `/results/` directory)
elsewhere using the `-v` (volume) option to specify the host directory, e.g.,

```bash
docker run -it --rm -v"/media/largedisk/results:/results" -v"$(pwd):/processor" mr_memtrace-analysis-dev:latest /bin/bash
```
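
The mounts above can be combined in any mix. As a minimal sketch (not part of YARN), the hypothetical helper `docker_run_argv` below assembles the `docker run` command line from whichever host paths are supplied. The image name and container paths are the ones used in this README; everything else is an assumption made for illustration.

```python
import shlex
from pathlib import Path

IMAGE = "mr_memtrace-analysis-dev:latest"  # image name produced by ./build.sh


def docker_run_argv(source_dir=None, results_dir=None, license_file=None,
                    command=("/bin/bash",)):
    """Build a `docker run` argv with the bind mounts described above.

    Every option is placed before the image name; anything after the
    image name is passed to the container as its command.
    """
    argv = ["docker", "run", "-it", "--rm"]
    mounts = [
        (source_dir, "/processor"),
        (results_dir, "/results"),
        (license_file, "/home/user/.binaryninja/license.dat"),
    ]
    for host_path, container_path in mounts:
        if host_path is not None:
            # Bind mounts need an absolute host path.
            argv += ["-v", f"{Path(host_path).resolve()}:{container_path}"]
    return argv + [IMAGE, *command]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(docker_run_argv(source_dir=".",
                                     results_dir="/media/largedisk/results")))
```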

# Running an instrumented parser run

Use `run_trace.py` to execute an instrumented run of a parser. It
supports a small number of parser/parser families including poppler's
pdftotext and pdftops as well as mupdf's mutool conversion to ps and
text.

`run_trace.py` wraps the output generated by the DynamoRIO tools in a
structured format that all the processing tools in memtrace-tools
understand.

`run_trace.py` still has a lot of hard-coded cruft in there, so for now
it is best to run it inside the provided docker container.

You must specify the path to at least 1 input to be processed as
arguments to `run_trace.py`, i.e.,
```bash
> ./run_trace.py path/to/foo.pdf path/to/bar.pdf
```

By default, `run_trace.py` will execute poppler's pdftops. To see
what other parser families/binaries are supported, execute
`./run_trace.py --list`. If you would like to run a non-default parser,
specify the parser family using the `-p` option, the version using `-v`,
and the binary using `-b`, e.g.,

```bash
> ./run_trace.py --list
Parser family mupdf:
input type: pdf:
version: 1.18.0
supported binaries: (name/command)
- mutool: mutool clean -s -ggg {in_file} out.pdf
- mutops: mutool convert -F ps -o out.ps {in_file}
- mutotext: mutool convert -F txt -o out.txt {in_file}
- mutotext-decrypt-user: mutool convert -p user -F txt -o out.txt {in_file}
- mutotext-decrypt-owner: mutool convert -p owner -F txt -o out.txt {in_file}
Parser family poppler:
input type: pdf:
version: 0840
supported binaries: (name/command)
- pdftops: utils/pdftops {in_file} out.ps
- pdf-fullrewrite: test/pdf-fullrewrite {in_file} out.pdf
- pdftocairo: utils/pdftocairo -png {in_file} out
- pdftotext: utils/pdftotext {in_file} out.txt
- pdftotext-decrypt-user: utils/pdftotext -upw user {in_file} out.txt
- pdftotext-decrypt-owner: utils/pdftotext -opw owner {in_file} out.txt
version: eval1_sri
supported binaries: (name/command)
- pdftops: utils/pdftops {in_file} out.ps
> ./run_trace.py -p mupdf -v 1.18.0 -b mutops path/to/foo.pdf path/to/bar.pdf
```

(Note: if support needs to be added for a different parser family,
version, and/or binary, it must be added to a JSON configuration
file in `./parser-settings`. The contents/semantics/format of these
files are currently undocumented.)

`./run_trace.py` will create a directory containing the run's results
under `/results` (or the directory specified by the `-r` option). The
subdirectory will be given a randomly generated name that starts with
`res_`. You may use `-t` to tag the generated results with a
more memorable name (this merely creates a symbolic link). Most
memtrace postprocessing tools require the path (or symbolic link) to
the result directory to be processed.

The generated results directory contains information including:
- process's address space layout (address map, in mmap.*.log)
- Binary event log generated by instrumentation (in memcalltrace.*.log, one per thread)
- command's standard output/error content (in `suprocess.out`)
- command invoked, exit value, runtime, etc (in info.txt)
- a copy of the input file

All binaries/libraries loaded by the parser will be cached in a
`/results/bins_*` directory (by default) -- this is done once per
instrumented parser binary. Each `/results/bins_*` directory contains
all results directories generated by its corresponding parser binary
(cached in the `/results/bins_*/data` directory). The
`/results/res_*` directories are merely symbolic links.

## Postprocessing instrumentation results

Tools for postprocessing instrumentation results live in the memtrace
subrepo/directory. See tracetools/README.md for information on
analyzing YARN's instrumentation's output.

# Running instrumentation on arbitrary executables

Use `./test_trace.py` directly to apply memtrace instrumentation to arbitrary
executables. E.g., to instrument a binary located at `./ls`,
```
> ./test_trace.py -R -b --parser ./ls --parser-args '' .
```

This will perform an instrumented run of `./ls` (because of `--parser
./ls`) called with no arguments (`--parser-args ''`). Tracing will
include basic block information (because `-b` is specified).
Tracing will begin when `main` is called and end when it returns (use
the `-e [fn]` argument to override). The path to the result directory
is then printed (because `-R` is specified). The trailing dot (`.`)
is treated as the binary's input file by the instrumentation. If the
binary doesn't process any input files, this final positional argument
can be any arbitrary file. If the binary does process an input file,
this argument should be the path to the input file. If the
binary needs to take the path as a command-line argument, update the
value of `--parser-args` to reflect this. E.g., if you want `./ls -l
/root` to be called, then specify the argument using the `{in_file}`
placeholder in `--parser-args`, i.e.,
```
> ./test_trace.py -b -R --parser ./ls --parser-args '-l {in_file}' /root
```
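
To make the placeholder's effect concrete, here is a small sketch of the substitution as described above. It is only an illustration: how `test_trace.py` actually expands `{in_file}` is not documented here, and `expand_parser_command` is a hypothetical name.

```python
import shlex


def expand_parser_command(parser, parser_args, in_file):
    """Return the argv that `parser_args` describes for `in_file`."""
    # Split first, then substitute, so an input path containing spaces
    # stays a single argument.
    args = [arg.replace("{in_file}", in_file) for arg in shlex.split(parser_args)]
    return [parser, *args]


print(expand_parser_command("./ls", "-l {in_file}", "/root"))  # ['./ls', '-l', '/root']
print(expand_parser_command("./ls", "", "."))                  # ['./ls']
```

In the second call the input file never reaches the command line, which matches the first example: the trailing `.` is seen by the instrumentation only.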

If you get the following error:
"tracetools.results_data.ResultsException: Something went wrong and no
mmap log exists. Did memory tracker log ever get enabled/populated?"

This means that nothing ever got logged. This is likely due to the
entrypoint (by default `main`, otherwise specified using the `-e`
parameter) never being invoked. Check the spelling of the symbol name
and try running the application within gdb to determine which
functions do get invoked.

# Running tools outside a container
This is left as an exercise for the reader.

The Makefile builds the DynamoRIO-based memory and callstack tracing
tools. It also has three "test" targets (test1, test2, test3) that
run pdfto{text,html} against PDFs in ../tests. Output is saved in
./build/memcalltrace.pdfto*.log. Be aware that the output generated by
these tests can range from several hundred megabytes to several
hundred gigabytes (and possibly larger).

# Printing/parsing trace output

tracetools/tools/print_log.py is a standalone python3 tool that parses the
output generated by the memcalltrace tool and prints out the contents
in a human-readable format.

Please see tracetools/README.md for more information.

# Funding statement

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR001119C0074.

# License
This code is released under the MIT License

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
