{"id":20325789,"url":"https://github.com/aboutcode-org/back2source-data","last_synced_at":"2025-03-04T10:27:33.225Z","repository":{"id":244217036,"uuid":"812928456","full_name":"aboutcode-org/back2source-data","owner":"aboutcode-org","description":"Checking if package sources and binaries match","archived":false,"fork":false,"pushed_at":"2025-01-27T01:37:57.000Z","size":122688,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-27T02:28:52.157Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aboutcode-org.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.rst","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-10T07:11:10.000Z","updated_at":"2025-01-27T01:38:07.000Z","dependencies_parsed_at":"2024-08-14T10:06:07.940Z","dependency_job_id":null,"html_url":"https://github.com/aboutcode-org/back2source-data","commit_stats":null,"previous_names":["aboutcode-org/back2source-data"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fback2source-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fback2source-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fback2source-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aboutcode-org%2Fback2source-data/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aboutcode-org","download_url":"https://codeload.github.com/aboutcode-org/back2source-data/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241829324,"owners_count":20027063,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T19:41:42.268Z","updated_at":"2025-03-04T10:27:28.212Z","avatar_url":"https://github.com/aboutcode-org.png","language":"Python","readme":"=======================================================================\nback2source-data: Checking if package's sources and binaries match.\n=======================================================================\n\nback2source is designed to provide accurate information about whether and which source code of a\npackage was used and built in the binaries. It can also help to determine if the source archive\nmatches the version control checkout. For instance, it was used to positively detect\npotentially malicious scripts in xz-utils that needed review: they were present only in the\nrelease source archives and missing from the source code repositories.\n\nThe goal of back2source is to effectively and automatically help software development teams trust but\nverify that the packages they use do not contain unknown statically linked or other third-party\npackages, whether these come from a trustworthy origin or are malicious.\n\nback2source consists of a set of pipelines and pipeline options in ScanCode.io and command line\ntools. 
back2source supports the analysis of binaries including byte-compiled Java, JavaScript with\nmap files, ELF with DWARF debug symbols, Go binaries and plain source archives.\n\nThis repository contains the scan results of running back2source on many packages.\n\nTo validate the accuracy of back2source at scale, we collected a list of about 1000 open source\npackages from the Fedora Linux distribution.\n\nFor each of these packages, we collected the source archive and the built binary package URLs.\n\nThen, we used a command line client to run ScanCode.io's back2source analysis on each pair of\npackage URLs.\n\nFinally, we ran a script to summarize the results and produce a table of the key results from these\nanalyses.\n\nThis repository contains both the summary and detail results of these analyses: a summary JSON\ntogether with the detailed JSON results of the back2source analysis for each pair of source\nand binary packages.\n\nA couple of interesting trends emerged from this analysis:\n\nFirst, using the current code, only a few packages are not reported as missing some\nsource code. After investigation, it turns out that these reports are mostly false positives related to the use of\nthe standard libraries. We plan to improve back2source's capabilities to weed out these incorrect\nreports. The actual false positives are bugs to be fixed.\n\nFor the true (but noisy) positives pointing to the standard library, we will implement a feature\nto detect accurately if a problematic file path is part of the standard library or the tool chain or\nother common files that may not be present in the source code when the package is built, but are\ninstead the result of a system-wide package installation. 
Such packages are commonly not considered\nexplicit dependencies because they are thought to be used only in development, but their code is in practice\ncommonly reused and injected in the build of native binaries.\n\nSecond, there are several insightful findings that seem to be mostly oversights, rather than\nmalicious cases.\n\nFor instance, all required Go dependencies are statically linked in a single executable (ELF, Mach-O\nor Windows PE). The corresponding source code of the third-party packages is seldom included in the\nsource archive.\nThese dependencies may represent a large number of \"ghost\" packages that are silently ignored by\nmost analysis tools and most package manifests. As a result, these binary packages harbor \"unknown\nunknowns\" problems that go unnoticed:\n\n- open source license compliance violations where required license notices, copyright statements and\n  other due credits are completely missing.\n\n- security vulnerability risks where packages with known vulnerabilities may be included unknowingly.\n\nAnother case is C and C++ code built with CMake and Qt, as is common\nfor KDE-based utilities, where a significant volume of the compiled code comes from actual third-party\ndependencies. Typical C++ \"includes\" contain the full function definitions that are inlined and\nstatically linked in the resulting binaries. These are reported as missing sources, because they are\nnever part of what is considered the actual source for the package, but are rather build-time dependencies.\nWe will need to account for these build-time dependencies to ensure they are part of the source code\nside of the analysis. As with Go, these dependencies may be ignored by most detection tools and\nthey may be subject to vulnerabilities.\n\nTo understand the magnitude of the issue, a Go package like asnmap consists of ten Go source files,\nbut is compiled from 350 files, 340 of which come from unreported dependencies. But this is not\nalways the case. 
The Go-based aerc has a source RPM that contains vendored code for all of its\ncompiled dependencies.\n\n\nContent of this repository\n-----------------------------------------------\n\n- d2d-summary.csv: the summary of the analysis. This is the main attraction. The interesting columns\n  are the following, and all these counters should have a value of zero:\n    - codebase_resources_not_deployed: these are source files not part of the binaries\n    - codebase_resources_requires_review: these are deployed binary files for which we could not\n      find one or more source files\n    - codebase_resources_discrepancies_total: this is the total number of files with an issue\n    - For DWARF-based analysis we have these columns:\n        - dwarf_compiled_paths_not_mapped_total: The total number of DWARF compilation unit paths\n          found in an ELF for which we could not find the corresponding source in the source\n          archive.\n        - dwarf_included_paths_not_mapped_total: The total number of DWARF \"include\" paths\n          found in an ELF for which we could not find the corresponding source in the source\n          archive.\n- etc/scripts: the scripts used to execute the back2source analysis\n- data: a directory with a d2d-details.json and d2d-summary.json file for each pair of packages.\n  The file tree is organized as a mirror of the original web site tree.\n- package-pairs.csv: the list of current download URLs for each analyzed package pair\n\n\nInstructions to re-run this experiment:\n-----------------------------------------------\n\n1. Clone this git repository using `git clone https://github.com/aboutcode-org/back2source-data`\n2. Create a virtualenv and install requirements using `pip install --requirement requirements.txt`.\n3. Optionally, run the script using `python3 etc/scripts/get_fedora_urls.py`.\n   This generates a file named `package-pairs.csv`. This file is already present in this repo.\n5. 
Install purldb following the instructions at https://github.com/nexB/purldb\n7. Run `python3 etc/scripts/run_d2d.py`. This runs the analysis and generates a summary file\n   named `d2d-summary.csv`\n\n\nLicense\n-------\n\nSPDX-License-Identifier: Apache-2.0\n\nThe ScanCode.io and PurlDB software is licensed under the Apache License version 2.0.\nData generated with ScanCode.io is provided as-is without warranties.\nScanCode is a trademark of nexB Inc.\n\n\nFunding\n-----------\n\nThis project is funded through NGI0 Entrust - https://nlnet.nl/entrust, a fund established by\nNLnet - https://nlnet.nl with financial support from the European Commission's Next Generation\nInternet https://ngi.eu program. Learn more at the NLnet project page\nhttps://nlnet.nl/project/Back2source\n\nIt also receives ongoing support from nexB and other contributors.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboutcode-org%2Fback2source-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faboutcode-org%2Fback2source-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faboutcode-org%2Fback2source-data/lists"}