{"id":13586041,"url":"https://github.com/trailofbits/polyfile","last_synced_at":"2026-02-12T00:14:05.244Z","repository":{"id":35723074,"uuid":"193975534","full_name":"trailofbits/polyfile","owner":"trailofbits","description":"A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer","archived":false,"fork":false,"pushed_at":"2025-04-28T12:06:20.000Z","size":7997,"stargazers_count":353,"open_issues_count":19,"forks_count":23,"subscribers_count":33,"default_branch":"master","last_synced_at":"2025-05-12T04:50:06.187Z","etag":null,"topics":["file-format-detection","file-formats","libmagic","polyglots","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trailofbits.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-06-26T20:48:20.000Z","updated_at":"2025-05-03T22:09:17.000Z","dependencies_parsed_at":"2023-01-16T04:15:48.660Z","dependency_job_id":"be31be82-86ca-4091-941f-ca4106b21a68","html_url":"https://github.com/trailofbits/polyfile","commit_stats":{"total_commits":656,"total_committers":9,"mean_commits":72.88888888888889,"dds":0.02896341463414631,"last_synced_commit":"438628fea2d32ee97b9f23a7aef7ffa3fdc80a0a"},"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2Fpolyfile","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2F
polyfile/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2Fpolyfile/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trailofbits%2Fpolyfile/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trailofbits","download_url":"https://codeload.github.com/trailofbits/polyfile/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254259370,"owners_count":22040819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["file-format-detection","file-formats","libmagic","polyglots","python"],"created_at":"2024-08-01T15:05:17.616Z","updated_at":"2026-02-12T00:14:05.189Z","avatar_url":"https://github.com/trailofbits.png","language":"Python","readme":"# PolyFile\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"logo/polyfile_name.png?raw=true\" width=\"256\" title=\"PolyFile\"\u003e\n\u003c/p\u003e\n\u003cbr /\u003e\n\n[![PyPI version](https://badge.fury.io/py/polyfile.svg)](https://badge.fury.io/py/polyfile)\n[![Tests](https://github.com/trailofbits/polyfile/workflows/Tests/badge.svg)](https://github.com/trailofbits/polyfile/actions)\n[![Slack Status](https://slack.empirehacking.nyc/badge.svg)](https://slack.empirehacking.nyc)\n\nA utility to identify and map the semantic and syntactic structure of files,\nincluding polyglots, chimeras, and schizophrenic files. It has [a pure-Python implementation of libmagic](#file-support) and can act as a drop-in replacement for the [`file` command](https://github.com/file/file). 
However, unlike `file`, PolyFile can recursively identify embedded files, like [binwalk](https://github.com/ReFirmLabs/binwalk).\n\nPolyFile can be used in conjunction with its sister tool\n[PolyTracker](https://github.com/trailofbits/polytracker) for\n_Automated Lexical Annotation and Navigation of Parsers_, a backronym\ndevised solely for the purpose of collectively referring to the tools\nas _The ALAN Parsers Project_.\n\n## Quickstart\n\nYou can install the latest stable version of PolyFile from PyPI:\n```\npip3 install polyfile\n```\n\nTo install PolyFile from source, in the same directory as this README, run:\n```\npip3 install .\n```\n\nImportant: Before installing from source, make sure Java is installed. Java is used to\nrun the Kaitai Struct compiler, which compiles the file format definitions.\n\nThis will automatically install the `polyfile` and `polymerge` executables in your path.\n\n## Usage\n\nRunning `polyfile` on a file with no arguments will mimic the behavior of `file --keep-going`:\n```console\n$ polyfile png-polyglot.png\nPNG image data, 256 x 144, 8-bit/color RGB, non-interlaced\nBrainfu** Program\nMalformed PDF\nPDF document, version 1.3,  1 pages\nZIP end of central directory record Java JAR archive \n```\nTo generate an interactive hex viewer for the file, use the `--html` option:\n```console\n$ polyfile --html output.html png-polyglot.png\nFound a file of type application/pdf at byte offset 0\nFound a file of type application/x-brainfuck at byte offset 0\nFound a file of type image/png at byte offset 0\nFound a file of type application/zip at byte offset 0\nFound a file of type application/java-archive at byte offset 0\nSaved HTML output to output.html\n```\n\nRun `polyfile --help` for full usage instructions.\n\n### Interactive Debugger\n\nPolyFile has an interactive debugger both for its file matching and parsing. 
It can be used to debug a libmagic pattern \ndefinition, determine why a specific file fails to be classified as the expected MIME type, or step through a parser.\nYou can run PolyFile with the debugger enabled using the `-db` option.\n\n### File Support\n\nPolyFile has a cleanroom, [pure Python implementation of the libmagic file classifier](#libmagic-implementation), and supports all 263 MIME types that it can identify.\n\nIt currently has support for parsing and semantically mapping the following formats:\n* PDF, using an instrumented version of [Didier Stevens' public domain, permissive, forensic parser](https://blog.didierstevens.com/programs/pdf-tools/)\n* ZIP, including recursive identification of all ZIP contents\n* JPEG/JFIF, using its [Kaitai Struct grammar](https://formats.kaitai.io/jpeg/index.html)\n* [iNES](https://wiki.nesdev.com/w/index.php/INES)\n* [Any other format](https://formats.kaitai.io/index.html) specified in a [KSY grammar](https://doc.kaitai.io/user_guide.html)\n\nFor an example that exercises all of these file formats, run:\n```bash\ncurl -v --silent https://www.sultanik.com/files/ESultanikResume.pdf | polyfile --html ESultanikResume.html -\n```\n\nPrior to PolyFile version 0.3.0, it used the [TrID database](http://mark0.net/soft-trid-deflist.html) for file\nidentification rather than the libmagic file definitions. This proved to be very slow (since TrID has many duplicate\nentries) and prone to false positives (since TrID's file definitions are much simpler than libmagic's). The original\nTrID matching code is still shipped with PolyFile and can be invoked programmatically, but it is not used by default.\n\n### Output Format\n\nPolyFile has several options for outputting its results, specified by its `--format` option. For computer-readable output, PolyFile has an extension of the [SBuD](https://github.com/corkami/sbud) JSON format described [in the documentation](docs/json_format.md). 
Prior to version 0.5.0 this was the default output format of PolyFile. However, the default output format now mimics the behavior of the `file` command. To restore the original behavior, use the `--format sbud` option.\n\n### libmagic Implementation\n\nPolyFile has a cleanroom implementation of [libmagic (used in the `file` command)](https://github.com/file/file).\nIt can be invoked programmatically by running:\n```python\nfrom polyfile.magic import MagicMatcher\n\nwith open(\"file_to_test\", \"rb\") as f:\n    # the default instance automatically loads all file definitions\n    for match in MagicMatcher.DEFAULT_INSTANCE.match(f.read()):\n        for mimetype in match.mimetypes:\n            print(f\"Matched MIME: {mimetype}\")\n        print(f\"Match string: {match!s}\")\n```\nTo load a specific or custom file definition:\n```python\nfrom polyfile.magic import MagicMatcher\n\nlist_of_paths_to_definitions = [\"def1\", \"def2\"]\nmatcher = MagicMatcher.parse(*list_of_paths_to_definitions)\nwith open(\"file_to_test\", \"rb\") as f:\n    for match in matcher.match(f.read()):\n        ...\n```\n\n## Extending PolyFile\n\nInstructions on extending PolyFile to support more file formats with new matchers and parsers are provided [in the documentation](docs/extending_polyfile.md).\n\n## License and Acknowledgements\n\nThis research was developed by [Trail of\nBits](https://www.trailofbits.com/) with funding from the Defense\nAdvanced Research Projects Agency (DARPA) under the SafeDocs program\nas a subcontractor to [Galois](https://galois.com). It is licensed under the [Apache 2.0 license](LICENSE).\n© 2019, Trail of Bits.\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrailofbits%2Fpolyfile","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrailofbits%2Fpolyfile","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrailofbits%2Fpolyfile/lists"}