{"id":20325771,"url":"https://github.com/aboutcode-org/extractcode","last_synced_at":"2025-10-27T19:10:57.916Z","repository":{"id":40423256,"uuid":"301779951","full_name":"aboutcode-org/extractcode","owner":"aboutcode-org","description":"A mostly universal file extraction library and CLI tool to extract almost any archive in a reasonably safe way on Linux, macOS and Windows.","archived":false,"fork":false,"pushed_at":"2024-08-14T05:03:14.000Z","size":26216,"stargazers_count":35,"open_issues_count":47,"forks_count":17,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-12-09T02:10:54.238Z","etag":null,"topics":["7zip","archive","bzip2","cab","cpio","decompression","extract","extractor","gzip","iso9660","libarchive","lzma","tar","xz","zip","zstd"],"latest_commit_sha":null,"homepage":"https://www.aboutcode.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aboutcode-org.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.rst","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-06T15:55:34.000Z","updated_at":"2024-11-14T21:03:14.000Z","dependencies_parsed_at":"2024-05-16T04:59:14.731Z","dependency_job_id":"ea7ec44d-d0ba-4197-b3d7-1b341ad86fc6","html_url":"https://github.com/aboutcode-org/extractcode","commit_stats":{"total_commits":446,"total_committers":24,"mean_commits":"18.583333333333332","dds":"0.23991031390134532","last_synced_commit":"a945cc51d9aa9d973903eb110fc9acd7ecf95ed1"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/aboutcode-org%2Fextractcode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fextractcode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fextractcode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fextractcode/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aboutcode-org","download_url":"https://codeload.github.com/aboutcode-org/extractcode/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230511479,"owners_count":18237657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["7zip","archive","bzip2","cab","cpio","decompression","extract","extractor","gzip","iso9660","libarchive","lzma","tar","xz","zip","zstd"],"created_at":"2024-11-14T19:41:36.601Z","updated_at":"2025-10-27T19:10:57.825Z","avatar_url":"https://github.com/aboutcode-org.png","language":"Python","readme":"============\nExtractCode\n============\n\n- license: Apache-2.0\n- copyright: copyright (c) nexB. Inc. 
and others\n- homepage_url: https://github.com/nexB/extractcode\n- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode\n\nSupports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.\n\n\n**ExtractCode is a (mostly) universal archive extractor.**\n\nInstall with::\n\n    pip install extractcode[full]\n\n\nWhy another extractor?\n----------------------\n\n**It will extract!**\n\nExtractCode will extract things where other archive and compressed file extractors may fail.\n\nExtractCode supports one of the largest numbers of archive formats, listed in the\nlong `List of supported archive formats`_ found at the bottom of this document.\n\n- Say you want to extract the tarball of the Linux kernel source code on Windows.\n  It contains paths that differ only by case and therefore will not extract\n  correctly on Windows: some files may be munged or the extraction may fail.\n\n- Or a tarball (on any OS) may contain the exact same path multiple times. 
In\n  these cases the paths showing up earlier in the archive may be \"hidden\" and\n  overwritten by the same path showing up later in the archive, giving the\n  impression that there is only one file.\n\n- Or an archive may be slightly damaged but most files can still be extracted.\n\n- Or the extracted files have such permissions that you cannot read them, or\n  they are not owned by you.\n\n- Or the archive may contain weird paths, including relative paths, that may be\n  problematic to extract.\n\n- Or the archive may contain special file types (character/device files) that\n  may be problematic to extract.\n\n- Or an archive may be a virtual disk or a file system image that would\n  typically need to be mounted to be accessed, and may require root access\n  and guesswork to find out which partition and filesystem are at play and\n  which driver to use.\n\nIn all these cases, ExtractCode will extract and try hard to do the right thing to\nobtain the actual archived content when other tools may fail.\n\nIt can also recursively extract any type of (nested) archives-in-archives.\n\n\nAs a downside, the extracted content may not be exactly what would be extracted\nfor a typical usage of the contained files: for instance some files may be\nrenamed, special files and symlinks are skipped, and permissions and owners are\nchanged. This is fine for the primary use case, which is analysis of file\ncontent for software composition or forensic analysis.\n\nBehind the scenes, ExtractCode uses multiple tools such as:\n\n- the Python standard library,\n- a custom ctypes binding to libarchive,\n- the 7zip command line tool, and\n- optionally libguestfs on Linux.\n\nWith these, it is possible to extract a large number of common and less common\narchives and compressed file types. 
ExtractCode tries to extract things in the\nsame way on all supported OSes, including auto-renaming files that would have\ninvalid, non-extractible names on certain filesystems or when there are multiple\ncopies of the same path in a given archive (which is possible in a tar).\n\nThe extraction is driven by a \"voting\" system that considers the file\nextension(s) and name, the filetype and mimetype (using a ctypes binding to\nlibmagic) to select the most appropriate extractor or decompressor function.\nIt can handle multi-level archives such as tar.gz and can recursively extract\nany nested archives.\n\nVisit https://aboutcode.org and https://github.com/nexB/ for support and download.\n\n\nWe run CI tests on:\n\n - Azure pipelines https://dev.azure.com/nexB/extractcode/_build\n\n\nInstallation\n------------\n\nTo install this package with its full capability (where the binaries for\n7zip and libarchive are installed), use the `full` extra option::\n\n    pip install extractcode[full]\n\nIf you want to use the version of binaries (possibly) provided by your operating\nsystem, use the `minimal` option::\n\n    pip install extractcode\n\nIn this case, you must provide a working and compatible libarchive and\n7zip, installed and configured in one of these ways so that ExtractCode can\nfind them:\n\n- **an extractcode-libarchive and extractcode-7z plugin**: See the standard ones at\n  https://github.com/nexB/scancode-plugins/tree/main/builtins\n  These can either bundle a libarchive library and a 7z executable or expose\n  system-installed ones.\n  A plugin does so by providing plugin entry points as ``scancode_location_provider``\n  for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``\n  subclass with a ``get_locations()`` method that must return a mapping with\n  this key:\n\n    - 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL\n\n  See for example:\n\n    - 
https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40\n    - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17\n\n  And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``\n  should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``\n  method that must return a mapping with this key:\n\n    - 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable\n\n  See for example:\n\n    - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40\n    - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18\n\n- use **environment variables** to point to installed binaries:\n\n    - EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL\n    - EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable\n\n\n- **a system-installed libarchive and 7zip executable** available in the system **PATH**.\n\n\nThe supported versions of the binary tools are:\n\n- libarchive 3.5.x\n- 7zip 16.5.x\n\nDevelopment\n-----------\n\nTo set up the development environment::\n\n    ./configure --dev\n    source venv/bin/activate\n\n\nTo run unit tests::\n\n    pytest -vvs -n 2\n\n\nTo clean up the development environment::\n\n    ./configure --clean\n\n\nTo run the command line tool in the activated environment::\n\n    ./extractcode -h\n\n\nConfiguration with environment variables\n----------------------------------------\n\nExtractCode will use these environment variables if set:\n\n- EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive\n  shared library used to support some of the archive formats. 
If not provided,\n  ExtractCode will look for a plugin-provided libarchive library path. See\n  https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.\n  If no plugin contributes libarchive, then a final attempt is made to look for\n  it in the PATH using standard DLL loading techniques.\n\n- EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support\n  some of the archive formats. If not provided, ExtractCode will look for a\n  plugin-provided 7z executable path. See\n  https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.\n  If no plugin contributes 7z, then a final attempt is made to look for\n  it in the PATH.\n\n- EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from\n  libguestfs to use to extract VM images. If not provided, ExtractCode will look\n  in the PATH for an installed ``guestfish`` executable instead.\n\n\n\nAdding support for VM images extraction\n---------------------------------------\n\nAdding support for VM images requires the manual installation of the\nlibguestfs-tools system package. This is supported only on Linux.\nOn Debian and Ubuntu you can use this command::\n\n    sudo apt-get install libguestfs-tools\n\n\nOn Ubuntu only, an additional manual step is required because the kernel executable\nfile cannot be read by regular users, which libguestfs requires.\n\nRun these commands as a temporary and immediate fix::\n\n    sudo chmod 0644 /boot/vmlinuz-*\n    for k in /boot/vmlinuz-*\n        do sudo dpkg-statoverride --add --update root root 0644 \"$k\"\n    done\n\nYou likely want both this temporary fix and a more permanent fix; otherwise each\nkernel update will revert to the default permissions and ExtractCode will stop\nworking for VM image extraction.\n\nTherefore follow these instructions:\n\n1. 
As root (using sudo), create the file /etc/kernel/postinst.d/statoverride with this\ncontent, devised by Kees Cook (@kees) in\nhttps://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725/comments/3 ::\n\n    #!/bin/sh\n    version=\"$1\"\n    # passing the kernel version is required\n    [ -z \"${version}\" ] \u0026\u0026 exit 0\n    dpkg-statoverride --update --add root root 0644 /boot/vmlinuz-${version}\n\n2. Set executable permissions::\n\n    sudo chmod +x /etc/kernel/postinst.d/statoverride\n\nSee also these links for a complete discussion:\n\n    - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725\n    - https://bugzilla.redhat.com/show_bug.cgi?id=1670790\n    - https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24\n\n\nAlternative\n-----------\n\nThese other tools are related and were considered before creating ExtractCode.\n\nThese tools provide built-in, original extraction capabilities:\n\n- https://libarchive.org/ (integrated in ExtractCode) (BSD license)\n- https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)\n- https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)\n\nThese tools are command line tools wrapping other extraction tools and are\nsimilar to ExtractCode but with different goals:\n\n- https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)\n- https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)\n\n\n\nList of supported archive formats\n-------------------------------------\n\nExtractCode can extract the following archive formats:\n\nArchive format kind: docs\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: Office doc\n     - extensions: .docx, .dotx, .docm, .xlsx, .xltx, .xlsm, .xltm, .pptx, .ppsx, .potx, .pptm, .potm, .ppsm, .odt, .odf, .sxw, .stw, .ods, .ots, .sxc, .stc, .odp, .otp, .odg, .otg, .sxi, .sti, .sxd, .sxg, .std, .sdc, .sda, .sdd, .smf, .sdw, .sxm, .stw, .oxt, .sldx, .epub\n     - 
filetypes : zip archive, microsoft word 2007+, microsoft excel 2007+, microsoft powerpoint 2007+\n     - mimetypes : application/zip, application/vnd.openxmlformats\n\n  name: Dia diagram doc\n     - extensions: .dia\n     - filetypes : gzip compressed\n     - mimetypes : application/gzip\n\n  name: Graffle diagram doc\n     - extensions: .graffle\n     - filetypes : gzip compressed\n     - mimetypes : application/gzip\n\n  name: SVG Compressed doc\n     - extensions: .svgz\n     - filetypes : gzip compressed\n     - mimetypes : application/gzip\n\nArchive format kind: regular\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: Tar\n     - extensions: .tar\n     - filetypes : .tar, tar archive\n     - mimetypes : application/x-tar\n\n  name: Zip\n     - extensions: .zip, .zipx\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: Java archive\n     - extensions: .war, .sar, .ear\n     - filetypes : zip archive\n     - mimetypes : application/zip, application/java-archive\n\n  name: xz\n     - extensions: .xz\n     - filetypes : xz compressed\n     - mimetypes : application/x-xz\n\n  name: lzma\n     - extensions: .lzma\n     - filetypes : lzma compressed\n     - mimetypes : application/x-xz\n\n  name: Gzip\n     - extensions: .gz, .gzip, .wmz, .arz\n     - filetypes : gzip compressed, gzip compressed data\n     - mimetypes : application/gzip\n\n  name: bzip2\n     - extensions: .bz, .bz2, bzip2\n     - filetypes : bzip2 compressed\n     - mimetypes : application/x-bzip2\n\n  name: lzip\n     - extensions: .lzip\n     - filetypes : lzip compressed\n     - mimetypes : application/x-lzip\n\n  name: RAR\n     - extensions: .rar\n     - filetypes : rar archive\n     - mimetypes : application/x-rar\n\n  name: ar archive\n     - extensions: .ar\n     - filetypes : current ar archive\n     - mimetypes : application/x-archive\n\n  name: 7zip\n     - extensions: .7z\n     - filetypes : 7-zip archive\n     - mimetypes : 
application/x-7z-compressed\n\n  name: cpio\n     - extensions: .cpio\n     - filetypes : cpio archive\n     - mimetypes : application/x-cpio\n\n  name: Z\n     - extensions: .z\n     - filetypes : compress'd data\n     - mimetypes : application/x-compress\n\nArchive format kind: regular_nested\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: Tar xz\n     - extensions: .tar.xz, .txz, .tarxz\n     - filetypes : xz compressed\n     - mimetypes : application/x-xz\n\n  name: Tar lzma\n     - extensions: tar.lzma, .tlz, .tarlz, .tarlzma\n     - filetypes : lzma compressed\n     - mimetypes : application/x-lzma\n\n  name: Tar gzip\n     - extensions: .tgz, .tar.gz, .tar.gzip, .targz, .targzip, .tgzip\n     - filetypes : gzip compressed\n     - mimetypes : application/gzip\n\n  name: Tar lzip\n     - extensions: .tar.lz, .tar.lzip\n     - filetypes : lzip compressed\n     - mimetypes : application/x-lzip\n\n  name: Tar lz4\n     - extensions: .tar.lz4\n     - filetypes : lz4 compressed\n     - mimetypes : application/x-lz4\n\n  name: Tar zstd\n     - extensions: .tar.zst, .tar.zstd\n     - filetypes : zstandard compressed\n     - mimetypes : application/x-zstd\n\n  name: Tar bzip2\n     - extensions: .tar.bz2, .tar.bz, .tar.bzip, .tar.bzip2, .tbz, .tbz2, .tb2, .tarbz2\n     - filetypes : bzip2 compressed\n     - mimetypes : application/x-bzip2\n\n  name: lz4\n     - extensions: .lz4\n     - filetypes : lz4 compressed\n     - mimetypes : application/x-lz4\n\n  name: zstd\n     - extensions: .zst, .zstd\n     - filetypes : zstandard compressed\n     - mimetypes : application/x-zstd\n\n  name: Tar 7zip\n     - extensions: .tar.7z, .tar.7zip, .t7z\n     - filetypes : 7-zip archive\n     - mimetypes : application/x-7z-compressed\n\n  name: Tar Z\n     - extensions: .tz, .tar.z, .tarz\n     - filetypes : compress'd data\n     - mimetypes : application/x-compress\n\n\nArchive format kind: 
package\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: Ruby Gem package\n     - extensions: .gem\n     - filetypes : .tar, tar archive\n     - mimetypes : application/x-tar\n\n  name: Android app\n     - extensions: .apk\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: Android library\n     - extensions: .aar\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: Mozilla extension\n     - extensions: .xpi\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: iOS app\n     - extensions: .ipa\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: Springboot Java Jar package\n     - extensions: .jar\n     - filetypes : bourne-again shell script executable (binary data)\n     - mimetypes : text/x-shellscript\n\n  name: Java Jar package\n     - extensions: .jar, .zip\n     - filetypes : java archive\n     - mimetypes : application/java-archive\n\n  name: Java Jar package\n     - extensions: .jar\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: Python package\n     - extensions: .egg, .whl, .pyz, .pex\n     - filetypes : zip archive\n     - mimetypes : application/zip\n\n  name: Microsoft cab\n     - extensions: .cab\n     - filetypes : microsoft cabinet\n     - mimetypes : application/vnd.ms-cab-compressed\n\n  name: Microsoft MSI Installer\n     - extensions: .msi\n     - filetypes : msi installer\n     - mimetypes : application/x-msi\n\n  name: Apple pkg or mpkg package installer\n     - extensions: .pkg, .mpkg\n     - filetypes : xar archive\n     - mimetypes : application/octet-stream\n\n  name: Xar archive v1\n     - extensions: .xar\n     - filetypes : xar archive\n     - mimetypes : application/octet-stream, application/x-xar\n\n  name: Nuget\n     - extensions: .nupkg\n     - filetypes : zip archive, microsoft ooxml\n     - mimetypes : application/zip, application/octet-stream\n\n  name: Static Library\n   
  - extensions: .a, .lib, .out, .ka\n     - filetypes : current ar archive, current ar archive random library\n     - mimetypes : application/x-archive\n\n  name: Debian package\n     - extensions: .deb, .udeb\n     - filetypes : debian binary package\n     - mimetypes : application/vnd.debian.binary-package, application/x-archive\n\n  name: RPM package\n     - extensions: .rpm, .srpm, .mvl, .vip\n     - filetypes : rpm\n     - mimetypes : application/x-rpm\n\n  name: Apple dmg\n     - extensions: .dmg, .sparseimage\n     - filetypes : zlib compressed\n     - mimetypes : application/zlib\n\nArchive format kind: file_system\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: ISO CD image\n     - extensions: .iso, .udf, .img\n     - filetypes : iso 9660 cd-rom, high sierra cd-rom\n     - mimetypes : application/x-iso9660-image\n\n  name: SquashFS disk image\n     - extensions:\n     - filetypes : squashfs\n     - mimetypes :\n\n  name: QEMU QCOW2 disk image\n     - extensions: .qcow2, .qcow, .qcow2c, .img\n     - filetypes : qemu qcow2 image, qemu qcow image\n     - mimetypes : application/octet-stream\n\n  name: VMDK disk image\n     - extensions: .vmdk\n     - filetypes : vmware4 disk image\n     - mimetypes : application/octet-stream\n\n  name: VirtualBox disk image\n     - extensions: .vdi\n     - filetypes : virtualbox disk image\n     - mimetypes : application/octet-stream\n\nArchive format kind: patches\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: Patch\n     - extensions: .diff, .patch\n     - filetypes : diff, patch\n     - mimetypes : text/x-diff\n\nArchive format kind: special_package\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n  name: InstallShield Installer\n     - extensions: .exe\n     - filetypes : installshield\n     - mimetypes : application/x-dosexec\n\n  name: Nullsoft Installer\n     - extensions: .exe\n     - filetypes : nullsoft installer\n     - mimetypes : 
application/x-dosexec\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboutcode-org%2Fextractcode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faboutcode-org%2Fextractcode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboutcode-org%2Fextractcode/lists"}