{"id":25772851,"url":"https://github.com/risicle/cpytraceafl","last_synced_at":"2025-02-27T04:20:27.172Z","repository":{"id":62565218,"uuid":"236180484","full_name":"risicle/cpytraceafl","owner":"risicle","description":"CPython bytecode instrumentation and forkserver tools for fuzzing pure python and mixed python/c code using AFL","archived":false,"fork":false,"pushed_at":"2021-04-04T12:27:56.000Z","size":86,"stargazers_count":28,"open_issues_count":3,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-10T07:34:43.000Z","etag":null,"topics":["afl-fuzz","bytecode-manipulation","coverage","cpython","python","tracing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/risicle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-25T14:25:58.000Z","updated_at":"2024-03-30T04:24:01.000Z","dependencies_parsed_at":"2022-11-03T17:46:44.596Z","dependency_job_id":null,"html_url":"https://github.com/risicle/cpytraceafl","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/risicle%2Fcpytraceafl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/risicle%2Fcpytraceafl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/risicle%2Fcpytraceafl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/risicle%2Fcpytraceafl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/risicle","download_url":"https://codeload.github.co
m/risicle/cpytraceafl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240976846,"owners_count":19887546,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["afl-fuzz","bytecode-manipulation","coverage","cpython","python","tracing"],"created_at":"2025-02-27T04:20:26.462Z","updated_at":"2025-02-27T04:20:27.141Z","avatar_url":"https://github.com/risicle.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cpytraceafl\n\nCPython bytecode instrumentation and forkserver tools for fuzzing python code using AFL.\n\nThe tools in this repository enable coverage-guided fuzzing of pure python and mixed python/c\ncode using [American Fuzzy Lop](https://github.com/google/AFL) (even better,\n[AFL++](https://github.com/vanhauser-thc/AFLplusplus)).\n\nThere are three main parts to this:\n\n - A bytecode rewriter using a technique inspired by Ned Batchelder's \"wicked hack\"\n   detailed at https://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.html.\n   In this case, the rewriter identifies \"basic blocks\" in the python bytecode and abuses the\n   `code` object's `lnotab` (line-number table) to mark each basic block as a new \"line\".\n   These new \"lines\" are what trigger CPython's line-level trace hooks. 
The result of this being\n   that we can get our trace hook executed on every new basic block.\n - A minimal \u0026 fast tracehook written in C, tallying visited locations to sysv shared memory.\n - A basic forkserver implementation.\n\nPreparing code for fuzzing involves a couple of steps. The first thing that should happen in\nthe python process is a call to `install_rewriter()`. It's important that this is done very\nearly as any modules that are imported before this will not be properly instrumented.\n\n```python\nfrom cpytraceafl.rewriter import install_rewriter\n\ninstall_rewriter()\n```\n\n`install_rewriter()` can optionally be provided with a `selector` controlling which code objects\nare instrumented and to what degree.\n\nFollowing this, modules can be imported as normal and will be instrumented by the monkeypatched\n`compile` functions. It's usually a good idea to initialize the test environment next,\nperforming as many setup procedures as possible before the input file is read. This may\ninclude doing an initial run of the function under test to ensure any internal imports or caches\nare set up. This is because we want to minimize work that has to be done post-fork - any work\ndone now only has to be done once.\n\nAfter calling\n\n```python\nfrom cpytraceafl import fuzz_from_here\n\nfuzz_from_here()\n```\n\nthe `fork()` will have been made and tracing started. You now simply read your input file and\ncall your function under test.\n\nExamples for fuzzing some common packages are provided in [examples/](./examples/).\n\nAs for hooking this script up to AFL, I tend to use the included\n[dummy-afl-qemu-trace](./dummy-afl-qemu-trace) shim script to fool AFL's QEMU mode into\ncommunicating directly with the python process.\n\n## Fuzzing mixed python/c code\n\nAs of version 0.4.0, `cpytraceafl` can gather trace information from C extension modules that\nhave been compiled with AFL instrumentation (e.g. using `llvm_mode`). 
This means that it can\nbe used to seamlessly fuzz projects which have a mix of python and C \"speedups\". This is\nimportant not only because a lot of python format-parsing packages use this approach, but\nbecause issues revealed in native code are far more likely to have security implications.\n\nIncluding instrumented native code requires a little more care when preparing a target for\nfuzzing. For instance, it's important to ensure the `cpytraceafl.tracehook` module has been\nimported and it has had its `set_map_start(...)` function provided with a valid memory\narea *before* any instrumented extension modules are loaded. This is because simply loading an\ninstrumented native module will cause it to attempt to log its execution trace somewhere.\n\nThe example [pillow_pcx_example.py](./examples/pillow_pcx_example.py) demonstrates a fuzzing\ntarget taking the necessary precautions into account.\n\nIt's possible that you're _only_ interested in tracing the native code, using `cpytraceafl`\njust as a driver, in which case you can omit the early `install_rewriter()` call and all\nthe weirdness involved with that.\n\n## Regular expressions\n\n[cpytraceafl-regex](https://github.com/risicle/cpytraceafl-regex) is a companion,\n`re`-replacement regex implementation with added instrumentation that should aid AFL in\ngenerating examples that pass regular expressions used in the target code, or\nexercise them in interesting ways. 
Without this, AFL will just see regular expressions\nas a black box that will act as a barrier to path exploration.\n\n## Trophy cabinet\n\n`cpytraceafl` has been used to find:\n\n - Pillow: [CVE-2020-10177](https://nvd.nist.gov/vuln/detail/CVE-2020-10177),\n   [CVE-2020-10378](https://nvd.nist.gov/vuln/detail/CVE-2020-10378),\n   [CVE-2020-10379](https://nvd.nist.gov/vuln/detail/CVE-2020-10379),\n   [CVE-2020-10994](https://nvd.nist.gov/vuln/detail/CVE-2020-10994),\n   [CVE-2020-11538](https://nvd.nist.gov/vuln/detail/CVE-2020-11538).\n - bsdiff4: [CVE-2020-15904](https://nvd.nist.gov/vuln/detail/CVE-2020-15904)\n - asyncpg: [CVE-2020-17446](https://nvd.nist.gov/vuln/detail/CVE-2020-17446)\n - clickhouse-driver: [CVE-2020-26759](https://nvd.nist.gov/vuln/detail/CVE-2020-26759)\n\n## Q \u0026 A\n\n### Is there any point in fuzzing python? Isn't it too slow?\n\nWell, yes and no. My experience has been that fuzzing python code is simply \"a bit different\"\nfrom fuzzing native code - you tend to be looking for different things. In terms of raw speed,\nfuzzing python is certainly not fast, but iteration rates I tend to work with aren't completely\ndissimilar to what I'm used to getting with AFL's Qemu mode (of course, no two fuzzing targets\nare really directly comparable).\n\nBecause of the memory-safe nature of pure python code, it's also more uncommon for issues\nuncovered through fuzzing to be security issues - logical flaws in parsing tend to lead to\nunexpected/unhandled exceptions. 
So it's still a rather useful tool in simply looking for bugs.\nIt can be used, for example, to generate a corpus of example inputs for your test suite which\nexercise a large amount of the code.\n\nHowever, note that while *pure* python code may be memory safe, as soon as you start using\nthe C api, Cython, or even start playing with the `ctypes` module, it is *not*.\n\n### Does basic block analysis make any sense for python code?\n\nFrom a rigorous academic stance, and for some uses, possibly not - you've got to keep in mind\nthat half the bytecode instructions could result in calls out to more arbitrary python or\n(uninstrumented) native code that could have arbitrary side effects. But for our needs it works\nwell enough (recall that AFL coverage analysis is robust to random instrumentation\nsites being omitted through `AFL_INST_RATIO` or `AFL_INST_LIBS`).\n\n### Doesn't abusing `lnotab` break python's debugging mechanisms?\n\nAbsolutely it does. Don't use instrumented programs to debug problematic cases - use it to\ngenerate problematic inputs. Analyze them with instrumentation turned off.\n\n### I'm getting `undefined symbol: __afl_area_ptr`\n\nLooks like you're trying to import an (instrumented) native extension module before the\n`cpytraceafl.tracehook` module has been loaded (which is what provides that symbol).\n\n### I'm getting Segmentation Faults after importing an instrumented native module\n\nYou probably also need to provide `cpytraceafl.tracehook.set_map_start(...)` with a valid\nwriteable memory area before the import. Assuming you're not interested in the trace associated\nwith the import process, this can just be a dummy which you later discard. I'd recommend either\nusing an `mmap` object or `sysv_ipc.SharedMemory`. 
When `fuzz_from_here()` is called, this will\nbe replaced with the right one.\n\nIt's also possible the instrumented module was built with a different AFL `MAP_SIZE_POW2` from\nthat in `cpytraceafl.MAP_SIZE_BITS`.\n\n### Do I need a specially-built/instrumented version of cpython to use this?\n\nNo, you can use your normal distribution-installed python. If you're just looking at\nfuzzing pure python, you don't need to even think about building any binaries with\nfunny compilers.\n\nYou may be interested in building c/c++/cython-based modules or their underlying native\nlibraries with instrumentation if that's what you're trying to fuzz, but I suspect using\na natively-instrumented _cpython_ would be quite complicated and extremely slow.\n\n### Do you have any tips on detecting memory errors in cpython extensions?\n\nI have tended to use `tcmalloc`'s debugging modes with `TCMALLOC_PAGE_FENCE` and\n`TCMALLOC_PAGE_FENCE_NEVER_RECLAIM` enabled. In fact I have\n[a fork](https://github.com/gperftools/gperftools/compare/master...risicle:ris-extras)\nof `gperftools` containing some additional `tcmalloc` hacks I've found useful.\n\nOne problem with this, of course, is that much of cpython's memory is allocated\nusing its own memory pool allocator, which is largely invisible to the `malloc`\nimplementation. So I've also got\n[a patch for cpython](https://gist.github.com/risicle/12c6f20518807699d816b8cb4389b840)\nwhich adds a very basic canary mechanism to its pool allocator (at the slight expense of\nmemory efficiency).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frisicle%2Fcpytraceafl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frisicle%2Fcpytraceafl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frisicle%2Fcpytraceafl/lists"}