{"id":13579411,"url":"https://github.com/trailofbits/polytracker","last_synced_at":"2025-05-14T23:06:17.920Z","repository":{"id":38419681,"uuid":"218835361","full_name":"trailofbits/polytracker","owner":"trailofbits","description":"An LLVM-based instrumentation tool for universal taint tracking, dataflow analysis, and tracing.","archived":false,"fork":false,"pushed_at":"2025-04-08T19:07:38.000Z","size":38279,"stargazers_count":560,"open_issues_count":46,"forks_count":45,"subscribers_count":39,"default_branch":"master","last_synced_at":"2025-04-13T20:39:13.992Z","etag":null,"topics":["dataflow-analysis","instrumentation","llvm","taint-analysis","taint-tracking"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trailofbits.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-10-31T18:37:16.000Z","updated_at":"2025-04-10T05:15:38.000Z","dependencies_parsed_at":"2023-12-05T00:27:55.975Z","dependency_job_id":"f30c1d53-a082-4da1-97e8-de83de6022b5","html_url":"https://github.com/trailofbits/polytracker","commit_stats":{"total_commits":1991,"total_committers":17,"mean_commits":"117.11764705882354","dds":0.7021597187343044,"last_synced_commit":"02e096ba4a0c1d3dea478cd6d687e625c20d8bdd"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2Fpolytracker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2
Fpolytracker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2Fpolytracker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2Fpolytracker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trailofbits","download_url":"https://codeload.github.com/trailofbits/polytracker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254243360,"owners_count":22038046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataflow-analysis","instrumentation","llvm","taint-analysis","taint-tracking"],"created_at":"2024-08-01T15:01:39.144Z","updated_at":"2025-05-14T23:06:12.902Z","avatar_url":"https://github.com/trailofbits.png","language":"C++","funding_links":[],"categories":["\u003ca name=\"cpp\"\u003e\u003c/a\u003eC++"],"sub_categories":[],"readme":"# PolyTracker\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"logo/polytracker_name.png?raw=true\" width=\"256\" title=\"PolyTracker\"\u003e\n\u003c/p\u003e\n\u003cbr /\u003e\n\n[![PyPI version](https://badge.fury.io/py/polytracker.svg)](https://badge.fury.io/py/polytracker)\n[![Tests](https://github.com/trailofbits/polytracker/workflows/Tests/badge.svg)](https://github.com/trailofbits/polytracker/actions)\n[![Slack Status](https://slack.empirehacking.nyc/badge.svg)](https://slack.empirehacking.nyc)\n\nPolyTracker is a tool originally created for the _Automated Lexical Annotation\nand Navigation of Parsers_, a backronym devised solely for the purpose of\nreferring 
to it as _The ALAN Parsers Project_. However, it has evolved into a\ngeneral purpose tool for efficiently performing data-flow and control-flow\nanalysis of programs. PolyTracker is an LLVM pass that instruments programs to\ntrack which bytes of an input file are operated on by which functions. It\noutputs a database containing the data-flow information, as well as a runtime\ntrace. PolyTracker also provides a Python library for interacting with and\nanalyzing its output, as well as an interactive Python REPL.\n\nPolyTracker can be used in conjunction with\n[PolyFile](https://github.com/trailofbits/polyfile) to automatically determine\nthe semantic purpose of the functions in a parser. It also has an experimental\nfeature capable of generating a context free grammar representing the language\naccepted by a parser.\n\nUnlike dynamic instrumentation alternatives like\n[Taintgrind](https://github.com/wmkhoo/taintgrind), PolyTracker imposes\nnegligible performance overhead for almost all inputs, and is capable of\ntracking every byte of input at once. PolyTracker started as a fork of the LLVM\nDataFlowSanitizer and takes much inspiration from the\n[Angora Fuzzer](https://github.com/AngoraFuzzer/Angora). However, unlike the\nAngora system, PolyTracker is able to track the entire _provenance_ of a taint.\nIn February of 2021, the LLVM DataFlowSanitizer added a new feature for tracking\ntaint provenance called [_origin tracking_](https://reviews.llvm.org/D95835).\nHowever, it is only able to track at most 16 taints at once, while PolyTracker\ncan track up to 2\u003csup\u003e31\u003c/sup\u003e-1.\n\nThis README serves as the general usage guide for installing PolyTracker and\ncompiling/instrumenting binaries. 
For programmatically interacting with or\nextending PolyTracker through its Python API, as well as for interacting with\nruntime traces produced from instrumented code,\n[consult the Python documentation](https://trailofbits.github.io/polytracker/latest/).\n\n## Quickstart\n\nPolyTracker is controlled via a Python script called `polytracker`. You can\ninstall it by running\n\n```shell-script\npip3 install polytracker\n```\n\nPolyTracker requires a very particular system environment to run, so almost all\nusers are likely to run it in a containerized environment. Luckily,\n`polytracker` makes this easy. All you need to do is have `docker` installed,\nthen run:\n\n```shell-script\npolytracker docker pull\n```\n\nand\n\n```shell-script\npolytracker docker run\n```\n\nThe latter command will mount the current working directory into the PolyTracker\nDocker container, and allow you to build and run instrumented programs.\n\nThe `polytracker` control script—which you can run from either your host system\nor from inside the Docker container—has a variety of commands, both for\ninstrumenting programs as well as analyzing the resulting artifacts. For\nexample, you can explore the dataflows in the execution, reconstruct the\ninstrumented program's control flow graph, and even extract a context free\ngrammar matching the inputs accepted by the program. You can explore these\ncommands by running\n\n```shell-script\npolytracker --help\n```\n\nThe `polytracker` script is also a REPL, if run with no command line arguments:\n\n```python\n$ polytracker\nPolyTracker (4.0.0)\nhttps://github.com/trailofbits/polytracker\nType \"help\" or \"commands\"\n\u003e\u003e\u003e commands\n```\n\n## Instrumenting a simple C/C++ program\n\nPolyTracker also comes with a `build` command. This command allows the user to\nrun any build command in a [Blight](https://github.com/trailofbits/blight)\ninstrumented environment. 
This will produce a `blight_journal.jsonl` file that\nrecords all commands run during the build. If you have a C/C++ target, you can\ninstrument it by invoking `polytracker build` and passing your build command:\n\n```shell-script\npolytracker build gcc -g -o my_binary my_source.c\n```\n\nTo instrument a build target, use the `instrument-targets` command. By default\nthe command will use the `blight_journal.jsonl` in your current working\ndirectory to build an instrumented version of your build target. The\ninstrumented build target will be built using the same flags as the original\nbuild target.\n\n```shell-script\npolytracker instrument-targets my_binary\n```\n\n`build` also supports more complex programs that use a build system like\nautotools or CMake:\n\n```shell-script\npolytracker build cmake .. -DCMAKE_BUILD_TYPE=Release\npolytracker build ninja\n# or\npolytracker build ./configure\npolytracker build make\n```\n\nThen run `instrument-targets` on any targets of the build:\n\n```shell-script\npolytracker instrument-targets a.bin b.so\n```\n\nThen `a.instrumented.bin` and `b.instrumented.so` will be the instrumented\nversions. See the Dockerfiles in the\n[examples](https://github.com/trailofbits/polytracker/tree/master/examples)\ndirectory for examples of how real-world programs can be instrumented.\n\n## Running and Analyzing an Instrumented Program\n\nThe instrumented software will write its output to the path specified in\n`POLYDB`, or `polytracker.tdag` if omitted. 
This is a binary file that can be\noperated on by running:\n\n```python\nfrom polytracker import PolyTrackerTrace, taint_dag\n\ntrace = PolyTrackerTrace.load(\"polytracker.tdag\")\ntdfile = trace.tdfile\n\nfirst_node = list(tdfile.nodes)[0]\nprint(f\"First node affects control flow: {first_node.affects_control_flow}\")\n\n# Operate on all Range nodes\nfor index, node in enumerate(tdfile.nodes):\n  if isinstance(node, taint_dag.TDRangeNode):\n    print(f\"Node {index}: first {node.first}, last {node.last}\")\n\n# Access taint forest\ntdforest = trace.taint_forest\nn1 = tdforest.get_node(1)\nprint(\n  f\"Forest node {n1.label}. Parent labels: {n1.parent_labels}, \"\n  f\"source: {n1.source.path if n1.source is not None else None}, \"\n  f\"affects control flow: {n1.affected_control_flow}\"\n)\n```\n\nYou can also run an instrumented binary directly from the REPL:\n\n```python\n$ polytracker\nPolyTracker (4.0.0)\nhttps://github.com/trailofbits/polytracker\nType \"help\" or \"commands\"\n\u003e\u003e\u003e trace = run_trace(\"path_to_binary\", \"path_to_input_file\")\n```\n\nThis will automatically run the instrumented binary in a Docker container, if\nnecessary.\n\n\u003e :warning: **If running PolyTracker inside Docker or a VM**: PolyTracker can be\n\u003e very slow if running in a virtualized environment and either the input file\n\u003e or, especially, the output database are located in a directory mapped or\n\u003e mounted from the host OS. This is particularly true when running PolyTracker\n\u003e in Docker from a macOS host. The solution is to write the database to a path\n\u003e inside of the container/VM and then copy it out to the host system at the very\n\u003e end.\n\nThe Python API documentation is available\n[here](https://trailofbits.github.io/polytracker/latest/).\n\n## Runtime Parameters and Instrumentation Tuning\n\nAt runtime, PolyTracker instrumentation looks for a number of configuration\nparameters specified through environment variables. 
This allows one to modify\ninstrumentation parameters without needing to recompile the binary.\n\n### Environment Variables\n\nPolyTracker accepts configuration parameters in the form of environment\nvariables to avoid recompiling target programs. The current set of environment\nvariables that PolyTracker supports is:\n\n```shell-script\nPOLYDB: A path to which to save the output database (default is polytracker.tdag)\n\nWLLVM_ARTIFACT_STORE: Provides a path to an existing directory to store artifact/manifest for all build targets\n\nPOLYTRACKER_TAINT_ARGV: Set to '1' to use argv as a taint source.\n\nPOLYTRACKER_STDIN_SOURCE: Set to '1' to use stdin as a taint source.\n\nPOLYTRACKER_STDOUT_SINK: Set to '1' to use stdout as a taint sink.\n\nPOLYTRACKER_STDERR_SINK: Set to '1' to use stderr as a taint sink.\n```\n\nPolytracker will set its configuration parameters in the following order:\n\n1. If a parameter is specified via an environment variable, use that value\n2. Else if a default value for the parameter exists, use the default\n3. Else throw an error\n\n### ABI Lists\n\nDFSan uses ABI lists to determine what functions it should automatically\ninstrument, what functions it should ignore, and what custom function wrappers\nexist. See the\n[dfsan documentation](https://clang.llvm.org/docs/DataFlowSanitizer.html) for\nmore information.\n\n### Creating custom ignore lists from pre-built libraries\n\nAttempting to build large software projects can be time consuming, especially\nolder/unsupported ones. It's even more time consuming to try and modify the\nbuild system such that it supports changes, like dfsan's/our instrumentation.\n\nThere is a script located in `polytracker/scripts` that you can run on any ELF\nlibrary and it will output a list of functions to ignore. We use this when we do\nnot want to track information going through a specific library like libpng, or\nother sub components of a program. 
The `Dockerfile-listgen.demo` exists to build\ncommon open source libraries so we can create these lists.\n\nThis script is a slightly tweaked version of what DataFlowSanitizer has, which\nfocuses on ignoring system libraries. The original script can be found in\n`dfsan_rt`.\n\n## Building the Examples\n\nCheck out this Git repository. From the root, either build the base PolyTracker\nDocker image:\n\n```shell-script\npip3 install -e \".[dev]\" \u0026\u0026 polytracker docker rebuild\n```\n\nor just pull the latest prebuilt version from DockerHub:\n\n```shell-script\ndocker pull trailofbits/polytracker:latest\n```\n\nFor a demo of PolyTracker running on the [MuPDF](https://mupdf.com/) parser run\nthis command:\n\n```shell-script\ndocker build -t trailofbits/polytracker-demo-mupdf -f examples/pdf/Dockerfile-mupdf.demo .\n```\n\n`mutool_track` will be built in `/polytracker/the_klondike/mupdf/build/debug`.\nRunning `mutool_track` will output `polytracker.tdag`, which contains the\ninformation provided by the taint analysis.\n\nFor a demo of PolyTracker running on Poppler utils version 0.84.0 run this\ncommand:\n\n```shell-script\ndocker build -t trailofbits/polytracker-demo-poppler -f examples/pdf/Dockerfile-poppler.demo .\n```\n\nAll the poppler utils will be located in\n`/polytracker/the_klondike/poppler-0.84.0/build/utils`.\n\n```shell-script\ncd /polytracker/the_klondike/poppler-0.84.0/build/utils\n./pdfinfo_track some_pdf.pdf\n```\n\n## Hacking on PolyTracker Using the Docker Environment\n\nSuppose you want to get a little more in-depth in extending the PolyTracker\ncodebase or in analysing TDAG traces, and you don't want to mess with your\nlocal environment by installing an LLVM version that is heavily customized.\n\nIf you're working in Ubuntu and starting from a relatively clean 22.04 or 24.04\nbase, the [linked Gist](https://gist.github.com/kaoudis/cf412abafea5ca4054c852f9e5905aab)\ndetails steps to get a working passthrough version of the PolyTracker base 
container.\nThe base container provides a development environment with all dependencies\nthat you can directly work in, or can extend (as we've done in the example\nDockerfiles).\n\n## Running Tests\n\nRunning both the Python and C++ unit tests should be done inside the PolyTracker\nDocker container.\n\nThe Catch2 unit tests in `unittests/` live in\n`/polytracker-build/unittests/src/taintdag/` within the container. Run the test binary\nwithin the Docker container with\n\n```shell-script\n  cd /polytracker-build/unittests/src/taintdag/ \u0026\u0026 ./tests-taintdag\n```\n\nThe Python unit tests in `tests/` require local test C++ programs that the test\nfixtures will instrument. Run them using Pytest in the working\n\n```shell-script\n  pytest tests\n```\n\nOr use pytest to run a single test file with\n\n```shell-script\n  pytest tests/test_foo.py\n```\n\n## Current Status and Known Issues\n\nPolyTracker currently only runs on Linux, because that is the only system\nsupported by the DataFlowSanitizer. This limitation is just due to a lack of\nsupport for the semantics of other OSes' system calls, which could be added in the\nfuture. However, this means that running PolyTracker on a non-Linux system will\nrequire Docker to be installed.\n\nTaints will not propagate through dynamically loaded libraries unless those\nlibraries were compiled from source using PolyTracker, _or_ there is specific\nsupport for the library calls implemented in PolyTracker. There _is_ currently\nsupport for propagating taint through the majority of uninstrumented C standard\nlibrary calls. To be clear, programs that use uninstrumented functions will\nstill run normally; however, operations performed by unsupported library calls\nwill not propagate taint. 
We are currently working on adding robust support for\nC++ programs, but currently the best results will be from C programs.\n\nIf there are issues with Docker, try performing a system prune and build with\n`--no-cache` for both PolyTracker and whatever demo you are trying to run.\n\nThe worst case performance of PolyTracker is exercised when a single byte in\nmemory is simultaneously tainted by a large number of input bytes from the\nsource file. This is most common when instrumenting compression and\ncryptographic algorithms that have large block sizes. There are a number of\nmitigations for this behavior currently being researched and developed.\n\n## Publications and Current Use Cases\n\nHere are some of the publicly available things we've done with PolyTracker. If you know of anything else you'd like to see listed here, please let us know!\n\n- The [Format Analysis Workbench](https://github.com/galoisinc/faw) integrates several PolyTracker features from different versions of the codebase, namely grammar extraction and blind spot detection.\n- Harmon, Carson, Bradford Larsen, and Evan A. Sultanik. \"[Toward automated grammar extraction via semantic labeling of parser implementations.](https://bradfordlarsen.com/files/publications/semantic-labeling-langsec-2020.pdf)\" 2020 IEEE Security and Privacy Workshops (SPW). IEEE, 2020.\n- Brodin, Henrik, Marek Surovič, and Evan Sultanik. \"[Blind spots: Identifying exploitable program inputs.](https://langsec.org/spw23/papers/Brodin_LangSec23.pdf)\"\n  2023 IEEE Security and Privacy Workshops (SPW). IEEE, 2023.\n- Henrik used PolyTracker's blind spots (`mapping` and `cavities` more precisely) trace analysis functionality to pinpoint a CVE and [wrote about it on the Trail of Bits blog](https://blog.trailofbits.com/2023/03/30/acropalypse-polytracker-blind-spots/).\n- Kaoudis, Kelly, Henrik Brodin, and Evan Sultanik. 
\"[Automatically Detecting Variability Bugs Through Hybrid Control and Data Flow Analysis.](https://langsec.org/spw23/papers/Kaoudis_LangSec23.pdf)\"\n  2023 IEEE Security and Privacy Workshops (SPW). IEEE, 2023.\n- Evan Sultanik, Marek Surovič, Henrik Brodin, Kelly Kaoudis, Facundo Tuesca, Carson Harmon, Lisa Overall, Joseph Sweeney, and Bradford Larsen.\n  \"[PolyTracker: Whole-Input Dynamic Information Flow Tracing.](https://github.com/trailofbits/publications/blob/master/papers/issta24-polytracker.pdf)\" In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2024.\n\n## License and Acknowledgements\n\nThis research was developed by [Trail of Bits](https://www.trailofbits.com/)\nwith funding from the Defense Advanced Research Projects Agency (DARPA) under\nthe SafeDocs program as a subcontractor to [Galois](https://galois.com). It is\nlicensed under the [Apache 2.0 license](LICENSE). © 2019, Trail of Bits.\n\n## Maintainers\n\nPlease contact us using `firstname.lastname@trailofbits.com`.\n\n[Evan Sultanik](https://github.com/ESultanik)\u003cbr /\u003e\n[Henrik Brodin](https://github.com/hbrodin)\u003cbr /\u003e\n[Kelly Kaoudis](https://github.com/kaoudis)\u003cbr /\u003e\n\n## Past Maintainers\n\n[Marek Surovič](https://github.com/surovic)\u003cbr /\u003e\n[Facundo Tuesca](https://github.com/facutuesca)\u003cbr /\u003e \u003cbr /\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrailofbits%2Fpolytracker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrailofbits%2Fpolytracker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrailofbits%2Fpolytracker/lists"}