https://github.com/mithro/msvcpp-normalize-pe
Normalize Windows PE files for byte-for-byte reproducible MSVC++ builds by patching timestamps, GUIDs, and debug metadata
https://github.com/mithro/msvcpp-normalize-pe
build-tools coff debugging deterministic-builds msvc pe portable-executable python reproducible-builds windows
Last synced: 21 days ago
JSON representation
Normalize Windows PE files for byte-for-byte reproducible MSVC++ builds by patching timestamps, GUIDs, and debug metadata
- Host: GitHub
- URL: https://github.com/mithro/msvcpp-normalize-pe
- Owner: mithro
- License: apache-2.0
- Created: 2025-11-14T21:49:22.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-16T23:35:37.000Z (7 months ago)
- Last Synced: 2026-03-29T21:01:57.282Z (2 months ago)
- Topics: build-tools, coff, debugging, deterministic-builds, msvc, pe, portable-executable, python, reproducible-builds, windows
- Language: Python
- Size: 240 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# msvcpp-normalize-pe - Normalize PE Files for Reproducible MSVC++ Builds
[](https://msvcpp-normalize-pe.readthedocs.io/en/latest/)
[](https://pypi.org/project/msvcpp-normalize-pe/)
[](https://pypi.org/project/msvcpp-normalize-pe/)
[](https://github.com/mithro/msvcpp-normalize-pe/actions)
[](https://github.com/mithro/msvcpp-normalize-pe/actions/workflows/test-windows.yml)
**⚠️ AI-Assisted Development Notice: This project was developed as an experiment in AI-assisted "vibe coding" using Claude Code. While the code has comprehensive tests and linting, it was primarily generated through AI assistance. The implementation is based on reverse-engineering PE file formats and may have edge cases or behaviors that haven't been thoroughly tested with all possible MSVC configurations. Use with caution in production environments and verify results with your specific toolchain.**
A Python tool to patch Windows PE (Portable Executable) files to make MSVC builds reproducible by normalizing timestamps, GUIDs, and other non-deterministic debug metadata.
## The Problem
When compiling Windows executables with Microsoft Visual C++ (MSVC), even with the `/Brepro` flag enabled, builds are **not fully reproducible**. The same source code compiled twice produces different binaries due to non-deterministic debug information:
- **COFF Header TimeDateStamp**: Build timestamp in PE header
- **Debug Directory Timestamps**: 4 separate timestamps in debug entries (CODEVIEW, VC_FEATURE, POGO, REPRO)
- **CODEVIEW GUID**: Random GUID linking .exe to .pdb file
- **CODEVIEW Age**: Incremental counter that varies between builds
- **REPRO Hash**: Composite hash containing the GUID and timestamps
This makes **binary verification in CI impossible** - you can't verify that committed binaries match the source code because every rebuild produces different bytes, even though the executable code is identical.
## The Solution
This tool patches all non-deterministic fields in PE files to fixed, deterministic values:
- **All timestamps** → `0x00000001` (January 1, 1970 + 1 second)
- **CODEVIEW GUID** → `00000000-0000-0000-0000-000000000000`
- **CODEVIEW Age** → `1`
- **REPRO Hash** → All zeros
After patching, **identical source code produces byte-for-byte identical binaries**, enabling reproducible builds and CI verification.
## What Gets Patched
### Fields Patched (8 total)
1. **PE COFF Header TimeDateStamp** (offset varies, typically 0xC0-0x100)
2. **Debug CODEVIEW Entry Timestamp**
3. **Debug CODEVIEW GUID** (16 bytes)
4. **Debug CODEVIEW Age** (4 bytes)
5. **Debug VC_FEATURE Entry Timestamp**
6. **Debug POGO Entry Timestamp**
7. **Debug REPRO Entry Timestamp**
8. **Debug REPRO Hash** (36 bytes)
### What Doesn't Change
- **All executable code** (.text section)
- **All program data** (.data, .rdata sections)
- **Import/Export tables**
- **Section headers**
- **Relocations**
The binary behaves **identically at runtime** - only metadata used for debugging is normalized.
## Installation
### From PyPI (Recommended)
```bash
pip install msvcpp-normalize-pe
```
### From Source
```bash
git clone https://github.com/mithro/msvcpp-normalize-pe.git
cd msvcpp-normalize-pe
pip install .
```
### Using uv
```bash
uv pip install msvcpp-normalize-pe
```
## Usage
### Command Line
After installation, the `msvcpp-normalize-pe` command is available:
```bash
# Basic usage
msvcpp-normalize-pe program.exe
# Custom timestamp
msvcpp-normalize-pe program.exe 1234567890
# Verbose output
msvcpp-normalize-pe --verbose program.exe
# See all options
msvcpp-normalize-pe --help
```
### Python API
You can also use msvcpp-normalize-pe as a library in your Python code:
```python
from pathlib import Path
from msvcpp_normalize_pe import patch_pe_file
result = patch_pe_file(Path("program.exe"), timestamp=1, verbose=True)
if result.success:
print(f"Patched {result.patches_applied} fields")
else:
print(f"Errors: {result.errors}")
```
### Example Output
```
[1/1] COFF header: 0x829692a8 -> 0x00000001
[2/?] Debug CODEVIEW timestamp: 0x829692a8 -> 0x00000001
[3/?] Debug CODEVIEW GUID: e97b6ac706ea9b2dd577392d2bf08df7 -> 00000000000000000000000000000000
[4/?] Debug CODEVIEW Age: 7 -> 1
[5/?] Debug VC_FEATURE timestamp: 0x829692a8 -> 0x00000001
[6/?] Debug POGO timestamp: 0x829692a8 -> 0x00000001
[7/?] Debug REPRO timestamp: 0x829692a8 -> 0x00000001
[8/?] Debug REPRO hash: 20000000e97b6ac7... -> 000000000000000000...
Total: 8 timestamp(s) patched in program.exe
```
## Integration with Build Systems
### Makefile Integration (Native MSVC)
```makefile
# Native MSVC builds
ifeq ($(USE_NATIVE_MSVC),1)
program.exe: program.cpp
cl.exe /O2 /Zi program.cpp /link /DEBUG:FULL /Brepro
msvcpp-normalize-pe program.exe 1
endif
```
### CI/CD Verification Workflow
```yaml
name: Verify Binary Reproducibility
jobs:
verify:
runs-on: windows-latest
steps:
- name: Build from source
run: |
cl.exe /O2 program.cpp /link /DEBUG:FULL /Brepro
msvcpp-normalize-pe program.exe 1
- name: Compare with committed binary
run: |
fc /b program.exe committed/program.exe
```
## Requirements
- **Python 3.9+** (type hints, dataclasses)
- **Target files**: Windows PE executables (.exe) or DLLs (.dll)
- **Architecture**: Works with both 32-bit (PE32) and 64-bit (PE32+) binaries
No runtime dependencies - uses only Python standard library (`struct`, `sys`, `pathlib`, `dataclasses`).
## Limitations and Known Issues
### What This Tool Fixes
- ✅ Makes **PE executables** reproducible (timestamps, GUIDs)
- ✅ Works with **native MSVC** (cl.exe + link.exe)
- ✅ Preserves **debugging capability** (PDB files still work)
### What This Tool Cannot Fix
- ❌ **PDB files remain non-deterministic** (~11% of PDB content varies)
- PDB files contain thousands of small differences (padding, internal offsets, GUIDs)
- Microsoft's PDB format has fundamental non-determinism issues
- Industry solution: Use clang-cl + lld-link instead of native MSVC
- ❌ **Does not work with stripped binaries** (no debug directory to patch)
### Alternative: Use clang-cl + lld-link
For **fully reproducible builds** including PDB files, use LLVM's Windows toolchain:
```bash
clang-cl /O2 /std:c++17 program.cpp /link /DEBUG:FULL /Brepro /TIMESTAMP:1
```
The `/TIMESTAMP:` flag is **only supported by lld-link**, not native MSVC link.exe.
## Technical Details
### PE File Structure
The tool parses the PE file structure to locate and patch:
1. **DOS Header** (offset 0x3C) → PE signature offset
2. **PE Signature** (offset varies) → Verify "PE\0\0"
3. **COFF Header** (after PE sig) → TimeDateStamp at +4
4. **Optional Header** (after COFF) → Contains Data Directories
5. **Data Directory #6** → Debug Directory (RVA + Size)
6. **Debug Directory Entries** → 28-byte structures with timestamps
7. **CODEVIEW RSDS Structure** → GUID at +4, Age at +20
8. **REPRO Hash** → Full hash data
### Why /Brepro Isn't Enough
MSVC's `/Brepro` flag:
- ✅ Removes **some** non-determinism
- ✅ Uses hash-based timestamps instead of wall clock time
- ❌ Still produces **different hashes** for each build
- ❌ GUID remains random
- ❌ Age field increments
This is because `/Brepro` computes a hash of build inputs, but includes random/variable data in that hash.
## Comparison with Alternatives
### vs. ducible
[ducible](https://github.com/jasonwhite/ducible) is an older tool with similar goals:
- ❌ **Unmaintained** (last update 2018)
- ❌ Only patches COFF header timestamp
- ❌ Does not patch Debug Directory timestamps
- ❌ Does not patch GUIDs or Age fields
### vs. clang-cl + lld-link
Using LLVM's toolchain:
- ✅ **Fully reproducible** (including PDB files)
- ✅ Supports `/TIMESTAMP:` flag
- ❌ Not always possible (may need native MSVC for compatibility)
This tool fills the gap when you **must use native MSVC** but still want reproducible .exe files.
## Research and References
The non-determinism of MSVC builds with debug symbols is well-documented:
- **Microsoft PDB Repository Issue #9**: PDB non-determinism issues (GUIDs, padding, uninitialized buffers)
- **Chromium Project**: Uses clang-cl + lld-link specifically for reproducible builds
- **Bazel Team**: Marked `/experimental:deterministic` as "not planned" because "PDBs are not deterministic"
- **Reproducible Builds Mailing List** (Dec 2024): "there is no way to really solve this issue" with MSVC
- **Stack Overflow** (Nov 2024): "No complete solution currently exists for achieving fully reproducible MSVC builds with debug symbols"
## License
Apache License 2.0 - See LICENSE file
## Contributing
Contributions welcome! Please test thoroughly with your build system before submitting PRs.
## Credits
Developed as part of the [ghidra-optimized-stdvector-decompiler](https://github.com/mithro/ghidra-optimized-stdvector-decompiler) project to enable CI verification of demo binaries compiled with multiple MSVC versions.