https://github.com/sri-csl/safedocs-recognizer
DARPA SafeDocs TA1 software suite to bundle and orchestrate various format-aware tracing tools.
- Host: GitHub
- URL: https://github.com/sri-csl/safedocs-recognizer
- Owner: SRI-CSL
- License: mit
- Created: 2021-10-28T15:24:34.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-05-05T14:44:57.000Z (about 3 years ago)
- Last Synced: 2024-12-24T18:29:20.429Z (5 months ago)
- Language: Python
- Size: 1.34 MB
- Stars: 1
- Watchers: 16
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# safedocs-recognizer
DARPA SafeDocs TA1 software suite to bundle and orchestrate various format-aware tracing tools.

## How to run
First, copy (or symlink) your documents into the `localdocs` directory, then create the document index.
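The symlink variant of this step can be sketched in Python. This is a minimal illustration, not part of the repository: the helper name `link_documents` is hypothetical, and it assumes the corpus is a flat directory of files and that `localdocs` accepts a flat layout, which the README does not confirm.

```python
from pathlib import Path


def link_documents(source_dir, localdocs="localdocs"):
    """Symlink every regular file in source_dir into localdocs.

    Hypothetical helper: assumes a flat corpus directory. Returns the
    names that were newly linked, in sorted order.
    """
    dest = Path(localdocs)
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in sorted(Path(source_dir).iterdir()):
        if not src.is_file():
            continue  # skip subdirectories and other non-files
        target = dest / src.name
        if target.exists() or target.is_symlink():
            continue  # leave entries that are already present alone
        target.symlink_to(src.resolve())
        linked.append(src.name)
    return linked
```

Calling the helper a second time on the same directory links nothing new, so it is safe to re-run after adding files. With the documents in place, the index is built with the script below.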
```
sh build_index.sh
```

The database should then be started to store processing results.
```
docker compose up
```

Build the CLI tool
```
go build
```

Build the tooling
```
sh build-components.sh
```

### Examples
#### Running tools without recognizer harness
```
docker run --rm -i mr_file-features stdin < pdf-sample.pdf
```

```
docker run --rm -i mr_qpdf_10.1.0 stdin < pdf-sample.pdf
```

#### mupdf example within recognizer
Baseline and non-baseline processing. For performance reasons, and to avoid multiple passes over 1 million files, the consensus component combines the bitcov and cfg tools.
```
./recognizer process --tag mr_mupdf_1.16.1 --subset evalThree --universe univA --baseline
./recognizer process --tag mr_mupdf_1.16.1 --subset evalThree10kTest --universe univA
./recognizer process --tag mr_file-features --subset evalThree --baseline
```

Integrated components

Derive model
```
./recognizer bitcov --parser mupdf --universe univA
./recognizer bitcov --parser mupdf --universe univB
```

Metrics comparing 10k non-baseline files with models A and B
```
./recognizer bitcov-diff --model mupdf_univA_model.png --parser mupdf
./recognizer bitcov-diff --model mupdf_univB_model.png --parser mupdf
```

Derive model
```
./recognizer flat-cfg --parser mupdf --universe univA
./recognizer flat-cfg --parser mupdf --universe univB
```

Metrics comparing 10k non-baseline files with models A and B
```
./recognizer flat-cfg-diff --parser mupdf --model mupdf_univA_flat_cfg_model.txt
./recognizer flat-cfg-diff --parser mupdf --model mupdf_univB_flat_cfg_model.txt
```

#### Misc
Helper scripts

Extract the PDF object that QPDF fails to parse
```
docker run --rm -i mr_file-features stdin < localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley \
  | awk -v pdf_object="$(docker run --rm -i mr_qpdf stdin < localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley | awk -f invalid_object.awk)" -f extract_bytes.awk
```
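
Both stages of the pipeline above invoke a tool image the same way: `docker run --rm -i <tag> stdin`, with the document supplied on standard input. A minimal sketch of that invocation from Python follows. The function names `tool_command` and `run_tool` are hypothetical and not part of the recognizer, and the sketch assumes `docker` is on `PATH` and the image has already been built with `build-components.sh`.

```python
import subprocess


def tool_command(tag):
    """Return the argv used in the README examples for a tool image."""
    return ["docker", "run", "--rm", "-i", tag, "stdin"]


def run_tool(tag, path):
    """Feed the file at `path` to the tool image on stdin.

    Returns the tool's raw stdout as bytes and raises
    subprocess.CalledProcessError if the container exits non-zero.
    """
    with open(path, "rb") as doc:
        result = subprocess.run(
            tool_command(tag),
            stdin=doc,
            stdout=subprocess.PIPE,
            check=True,
        )
    return result.stdout
```

For example, `run_tool("mr_qpdf_10.1.0", "pdf-sample.pdf")` would mirror the second direct-run example. The format of the returned bytes depends on the tool and is not described here.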