https://github.com/aboutcode-org/back2source-data
Checking if package sources and binaries match
- Host: GitHub
- URL: https://github.com/aboutcode-org/back2source-data
- Owner: aboutcode-org
- Created: 2024-06-10T07:11:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-27T01:37:57.000Z (12 months ago)
- Last Synced: 2025-01-27T02:28:52.157Z (12 months ago)
- Language: Python
- Size: 117 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- Code of conduct: CODE_OF_CONDUCT.rst
=======================================================================
back2source-data: Checking if package's sources and binaries match.
=======================================================================
back2source is designed to provide accurate information about whether, and which, source code of a
package was used to build its binaries. It can also help determine whether the source archive
matches the version control checkout. For instance, it was used to positively detect potentially
malicious scripts in xz-utils that needed review: they were present only in the release source
archives and were missing from the source code repositories.
The goal of back2source is to effectively and automatically help software development teams trust
but verify that the packages they use do not contain unknown statically linked or other third-party
packages, whether from a trustworthy origin or malicious.
back2source consists of a set of pipelines and pipeline options in ScanCode.io, plus command line
tools. back2source supports the analysis of binaries including byte-compiled Java, JavaScript with
map files, ELF with DWARF debug symbols, Go binaries and plain source archives.
This repository contains the scan results of running back2source on many packages.
To validate the accuracy of back2source at scale, we collected a list of about 1000 open source
packages from the Fedora Linux distribution.
For each of these packages, we collected the source archive and the built binary package URLs.
Then, we used a command line client to run scancode.io's back2source analysis on each pair of
package URLs.
Finally, we ran a script to summarize the results and produce a table of the key results from these
analyses.
This repository contains both the summary and the detailed results of these analyses: a summary JSON
together with the detailed JSON results of the back2source analysis for each pair of source and
binary packages.
There are a couple of interesting trends that emerged from this analysis:
First, with the current code, only a few packages are not reported as missing some source code.
After investigation, it turns out that these reports are mostly false positives related to the use
of the standard libraries. We plan to improve back2source's capabilities to weed out these incorrect
reports. The true false positives are bugs to be fixed.
For the true, but noisy, positives pointing to the standard library, we will implement a feature
to detect accurately whether a problematic file path is part of the standard library, the tool
chain, or other common files that may not be present in the source code when the package is built,
but are instead the result of a system-wide package installation. Such packages are commonly not
considered explicit dependencies because they are thought to be used only in development, but their
code is in practice commonly reused and injected in the build of native binaries.
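As a rough illustration of that planned filtering, the sketch below separates unmapped paths into "system-provided" and "needs review" groups. It is not the back2source implementation: the prefix list and the function names `is_system_path` and `split_unmapped` are hypothetical, and a real detector would likely need more than a fixed list of directories.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes for files installed system-wide (standard library,
# tool chain, development headers). Not taken from back2source.
SYSTEM_PREFIXES = (
    "/usr/include",
    "/usr/lib/gcc",
    "/usr/lib/golang",
    "/usr/lib64",
)


def is_system_path(path, prefixes=SYSTEM_PREFIXES):
    """Return True if `path` is equal to or located under one of `prefixes`.

    The comparison is done on whole path components, so "/usr/includes/x.h"
    does not match the "/usr/include" prefix.
    """
    candidate = PurePosixPath(path)
    for prefix in prefixes:
        base = PurePosixPath(prefix)
        if candidate == base or base in candidate.parents:
            return True
    return False


def split_unmapped(paths):
    """Split unmapped paths into (system_paths, paths_needing_review)."""
    system, review = [], []
    for path in paths:
        (system if is_system_path(path) else review).append(path)
    return system, review
```

With such a split, only the second list would still need human review, while the first could be reported separately as build-time inputs.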
Second, there are several insightful findings that seem to be mostly oversights, rather than
malicious cases.
For instance, all required Go dependencies are statically linked in a single executable (ELF, Mach-O
or Windows PE). The corresponding source code of these third-party packages is seldom included in
the source archive.
They may represent a large number of "ghost" packages that are silently ignored by
most analysis tools and most package manifests. As a result these binary packages harbor "unknown
unknowns" problems that go unnoticed:
- open source license compliance violations where required license notices, copyright statements and
other due credits are completely missing.
- security vulnerability risks where packages with known vulnerabilities may be included unknowingly.
Another case is C and C++ code built with CMake and Qt, as is common for KDE-based utilities, where
a significant volume of the compiled code comes from actual third-party dependencies. Typical C++
"includes" contain the full function definitions that are inlined and statically linked in the
resulting binaries. These are reported as missing sources, because they are never part of what is
considered the actual source for the package, but are rather build-time dependencies.
We will need to account for these build-time dependencies to ensure they are part of the source code
side of the analysis. Like with Go, these dependencies may be ignored by most detection tools and
they may be subject to vulnerabilities.
To understand the magnitude of the issue, a Go package like asnmap consists of ten Go source files,
but is compiled from 350 files, 340 of which come from unreported dependencies. This is not always
the case, however: the Go aerc code has a source RPM that contains vendored code for all of its
compiled dependencies.
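The asnmap figures can be restated as a share of the compiled files. The short sketch below only redoes that arithmetic; the helper name `dependency_share` is made up for illustration.

```python
def dependency_share(total_compiled, own_sources):
    """Return (files_from_dependencies, their share of all compiled files)."""
    from_deps = total_compiled - own_sources
    return from_deps, from_deps / total_compiled


# asnmap figures quoted above: 350 compiled files, 10 of them its own sources.
from_deps, share = dependency_share(350, 10)
print(f"{from_deps} of 350 files ({share:.1%}) come from dependencies")
# prints "340 of 350 files (97.1%) come from dependencies"
```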
Content of this repository
-----------------------------------------------
- d2d-summary.csv: the summary of the analysis. This is the main attraction. The interesting columns
  are listed below, and all of these counters should have a value of zero:
  - codebase_resources_not_deployed: these are source files not part of the binaries
  - codebase_resources_requires_review: these are deployed binary files for which we could not
    find one or more source files
  - codebase_resources_discrepancies_total: this is the total number of files with an issue
  - For DWARF-based analysis we have these columns:
    - dwarf_compiled_paths_not_mapped_total: the total number of DWARF compilation unit paths
      found in an ELF for which we could not find the corresponding source in the source
      archive.
    - dwarf_included_paths_not_mapped_total: the total number of DWARF "include" paths
      found in an ELF for which we could not find the corresponding source in the source
      archive.
- etc/scripts: the scripts used to execute the back2source analysis
- data: a directory with a d2d-details.json and d2d-summary.json file for each pair of packages.
The file tree is organized as a mirror of the original web site tree.
- package-pairs.csv: the list of current download URLs for each analyzed package pair
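Since all of the counters listed above should be zero, one way to explore the summary is to list the rows where any of them is not. The sketch below assumes that the counter names appear verbatim as column headers in `d2d-summary.csv` and treats empty or non-numeric cells as zero; the function names are illustrative and not part of the repository scripts.

```python
import csv

# Counter columns described in this README; assumed to be CSV header names.
COUNTER_COLUMNS = (
    "codebase_resources_not_deployed",
    "codebase_resources_requires_review",
    "codebase_resources_discrepancies_total",
    "dwarf_compiled_paths_not_mapped_total",
    "dwarf_included_paths_not_mapped_total",
)


def to_int(value):
    """Convert a CSV cell to int; empty, missing or non-numeric cells count as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def nonzero_counters(row):
    """Return {column: value} for each counter in `row` that is above zero."""
    counters = {column: to_int(row.get(column)) for column in COUNTER_COLUMNS}
    return {column: value for column, value in counters.items() if value > 0}


def rows_with_issues(csv_path):
    """Yield (row, nonzero_counters) for each summary row with a non-zero counter."""
    with open(csv_path, newline="", encoding="utf-8") as summary_file:
        for row in csv.DictReader(summary_file):
            counters = nonzero_counters(row)
            if counters:
                yield row, counters
```

For example, `for row, counters in rows_with_issues("d2d-summary.csv"): print(counters)` would print only the non-zero counters of each flagged package pair.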
Instructions to re-run this experiment:
-----------------------------------------------
1. Clone this git repository using `git clone https://github.com/aboutcode-org/back2source-data`
2. Create a virtualenv and install requirements using `pip install --requirement requirements.txt`.
3. Optionally, run the script using `python3 etc/scripts/get_fedora_urls.py`.
This generates a file named `package-pairs.csv`. This file is already present in this repo.
4. Install purldb following the instructions at https://github.com/nexB/purldb
5. Run `python3 etc/scripts/run_d2d.py`. This runs the analysis and generates a summary file
named `d2d-summary.csv`
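After a run, it can be useful to confirm that every analyzed pair has both result files. The sketch below assumes the results follow the layout described for the `data` directory, with a `d2d-details.json` and a `d2d-summary.json` side by side in the same directory; the function name `find_incomplete_results` is hypothetical and not part of `etc/scripts`.

```python
from pathlib import Path

# Result file names as described for the `data` directory in this README.
EXPECTED_FILES = ("d2d-details.json", "d2d-summary.json")


def find_incomplete_results(data_dir):
    """Return {directory: [missing file names]} for incomplete result directories.

    A result directory is any directory under `data_dir` that contains at
    least one of the expected result files.
    """
    data_dir = Path(data_dir)
    result_dirs = {
        path.parent for name in EXPECTED_FILES for path in data_dir.rglob(name)
    }
    incomplete = {}
    for directory in sorted(result_dirs):
        missing = [
            name for name in EXPECTED_FILES if not (directory / name).is_file()
        ]
        if missing:
            incomplete[directory] = missing
    return incomplete
```

Calling `find_incomplete_results("data")` from the repository root returns an empty dictionary when every result directory has both files.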
License
-------
SPDX-License-Identifier: Apache-2.0
The ScanCode.io and PurlDB software is licensed under the Apache License version 2.0.
Data generated with ScanCode.io is provided as-is without warranties.
ScanCode is a trademark of nexB Inc.
Funding
-----------
This project is funded through NGI0 Entrust - https://nlnet.nl/entrust, a fund established by
NLnet - https://nlnet.nl with financial support from the European Commission's Next Generation
Internet https://ngi.eu program. Learn more at the NLnet project page
https://nlnet.nl/project/Back2source
It also receives ongoing support from nexB and other contributors.