https://github.com/bramalkema/openxml-audit
Validate Office files in pure Python with Open XML SDK parity, pytest fixtures, and CI hooks.
docx odf openxml pptx pytest python validation xlsx
- Host: GitHub
- URL: https://github.com/bramalkema/openxml-audit
- Owner: BramAlkema
- License: mit
- Created: 2026-01-14T21:40:37.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-04-28T00:39:15.000Z (about 1 month ago)
- Last Synced: 2026-04-28T02:31:01.853Z (about 1 month ago)
- Topics: docx, odf, openxml, pptx, pytest, python, validation, xlsx
- Language: Python
- Homepage: https://bramalkema.github.io/openxml-audit/
- Size: 40.4 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# OpenXML Audit
Validate OOXML (PPTX/DOCX/XLSX) and ODF files in pure Python — no .NET required.
A Python port of Microsoft's [Open XML SDK](https://github.com/OfficeDev/Open-XML-SDK) validation logic. Check whether generated or modified Office files will open cleanly, directly from Python scripts, CI pipelines, or anywhere .NET isn't practical.
Also supports OASIS OpenDocument Format (ODT/ODS/ODP) with staged conformance levels.
## Evidence ladder
Validation is the floor tier. Whether a file *survives* depends on more than ECMA legality — it also has to load in the target app, survive a save, behave correctly at runtime, and ideally match what the app itself would author. `openxml-audit` organizes this as an evidence ladder (`openxml_audit.EvidenceTier`):
1. **`schema-valid`** — parses against ECMA/OASIS schemas *(this is what `openxml-audit validate` checks)*
2. **`loadable`** — the target app opens without repair
3. **`roundtrip-preserved`** — the app's save does not rewrite the intent
4. **`slideshow-verified`** — runtime behavior matches intent
5. **`ui-authored`** — the app itself produced this structure
Tiers 2–5 are backed by curated corpora of target-app-authored XML. The first corpus lives at [`docs/pptx_oracle/`](docs/pptx_oracle/README.md) — PowerPoint animation/timing, where "schema-valid but silently rewritten" is the dominant failure mode. DOCX and XLSX corpora can follow the same layout when the research starts.
```python
from openxml_audit import EvidenceTier
from openxml_audit.pptx import check_capability
check_capability("pptx.anim.effect.entr.fade", minimum_tier=EvidenceTier.LOADABLE)
```
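The ordering of the ladder is what makes `minimum_tier` meaningful: evidence at a higher tier also satisfies every tier below it. A minimal standalone sketch of that comparison, assuming the tiers are ordered as listed above. The `Tier` enum and `meets` helper are hypothetical stand-ins, not the library's `EvidenceTier` implementation.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical stand-in for the evidence ladder, lowest tier first."""

    SCHEMA_VALID = 1
    LOADABLE = 2
    ROUNDTRIP_PRESERVED = 3
    SLIDESHOW_VERIFIED = 4
    UI_AUTHORED = 5


def meets(observed: Tier, minimum: Tier) -> bool:
    """Return True when the observed evidence is at or above the required tier."""
    return observed >= minimum


print(meets(Tier.ROUNDTRIP_PRESERVED, Tier.LOADABLE))  # True
print(meets(Tier.SCHEMA_VALID, Tier.LOADABLE))  # False
```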
## Features
- **OOXML Validation**: Package structure, schema, semantic, properties, and format-specific checks for PPTX/DOCX/XLSX — 100% parity with Open XML SDK v3.4.1 without the .NET dependency
- **ODF Validation**: Staged conformance levels — foundation, schema-core (Relax NG), semantic-core, and security-core for ODT/ODS/ODP
- **Evidence ladder**: Validation is the floor tier. Curated PPTX corpora (`docs/pptx_oracle/`) verify loadability, roundtrip preservation, and runtime behavior above it — for features like animation/timing where "schema-valid" isn't enough
- **Fast**: about 1.2x the .NET SDK's run time cold and 2.2x warm (lower is better); validates a 798K DOCX in 101ms
- **pytest Plugin**: `assert_valid_pptx`, `assert_valid_docx`, `assert_valid_xlsx`, `assert_valid_odf` — zero config
- **CI Ready**: GitHub Action, pre-commit hook, and parallel batch validation
- **Multiple Output Formats**: Text, JSON, and XML output
## Why validate?
Libraries that generate Office files routinely produce corrupt output — python-pptx has 12+ open corruption issues, docxtpl has 7, XlsxWriter 25+. These surface as "PowerPoint found a problem" dialogs for end users or silent failures in CI. With AI agents now generating slides and reports, the problem is getting worse.
openxml-audit catches these before your users do — same checks Microsoft's SDK runs, in pure Python.
| Ecosystem | Examples | How openxml-audit helps |
|-----------|----------|------------------------|
| File generators | python-pptx, python-docx, openpyxl, XlsxWriter | Validate output in tests and CI — catch corruption before release |
| Template engines | docxtpl, pptx-template | Jinja2 rendering can break XML structure — validate after render |
| Data pipelines | pandas `to_excel`, tablib, django-import-export | Assert valid exports in pipeline tests |
| AI/LLM agents | Auto-PPT, GenFilesMCP, Docling | AI-generated Office files are unreliable — validate and retry |
| Government / ODF | Suite Numerique, odfpy | ODF conformance for EU regulatory requirements |
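The "validate and retry" pattern from the AI/LLM row can be sketched as a small loop. This is an illustration only: `generate` and `is_valid` are caller-supplied callables (for example, a wrapper around your generator and `is_valid_pptx`), and `generate_until_valid` is a hypothetical helper, not part of openxml-audit.

```python
from typing import Callable


def generate_until_valid(
    generate: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Call generate(attempt) until is_valid(path) passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        path = generate(attempt)
        if is_valid(path):
            return path
    raise RuntimeError(f"no valid file after {max_attempts} attempts")


# Demo with stand-ins: pretend only the second attempt produces a valid file.
result = generate_until_valid(
    generate=lambda attempt: f"deck-{attempt}.pptx",
    is_valid=lambda path: path == "deck-2.pptx",
)
print(result)  # deck-2.pptx
```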
## Performance
Pure Python, but close to .NET: lxml does the heavy XML lifting in C. In the table, the ratio is openxml-audit time divided by .NET SDK time, so lower is better.
| Benchmark | .NET SDK | openxml-audit | Ratio |
|-----------|----------|---------------|-------|
| Cold start (6 files, mixed formats) | 994ms | 1,175ms | 1.2x |
| Warm (798K DOCX) | 46ms | 101ms | 2.2x |
| Warm (1.4MB PPTX) | — | 83ms | — |
| Warm (114K XLSX) | — | 29ms | — |
Batch validation supports `--parallel N` for multiprocess speedup. The pytest plugin uses session-scoped fixtures so schema loading happens once per test run.
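Batch validation parallelizes well because each file is checked independently. The sketch below fans a per-file check out over a worker pool. It is not how `--parallel N` is implemented: it uses threads rather than processes, and `looks_like_ooxml` is a hypothetical stand-in that only checks for the `[Content_Types].xml` part that OPC packages carry. In real use you would pass a validator function as `check`.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def looks_like_ooxml(path: Path) -> bool:
    """Cheap stand-in check: a ZIP archive that contains [Content_Types].xml."""
    try:
        with zipfile.ZipFile(path) as package:
            return "[Content_Types].xml" in package.namelist()
    except (zipfile.BadZipFile, OSError):
        return False


def check_many(
    paths: Iterable[Path],
    check: Callable[[Path], bool] = looks_like_ooxml,
    workers: int = 4,
) -> dict:
    """Run check over every path concurrently and map each path to its result."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(check, paths)))
```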
## Installation
```bash
pip install openxml-audit
```
Or install from source:
```bash
git clone https://github.com/BramAlkema/openxml-audit.git
cd openxml-audit
pip install -e .
```
## Quick Start
### Command Line
```bash
# Validate a single file
openxml-audit presentation.pptx
# Validate an OASIS OpenDocument file
openxml-audit document.odt
# Validate with JSON output
openxml-audit presentation.pptx --output json
# Validate with XML output
openxml-audit presentation.pptx --output xml
# Validate all matching files in a directory
openxml-audit ./presentations/ --recursive
# Validate against a specific Office version
openxml-audit presentation.pptx --format Office2007
# Limit maximum errors reported
openxml-audit presentation.pptx --max-errors 10
```
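When driving the CLI from a Python script, it can help to build the argument list explicitly instead of formatting a shell string. A minimal sketch using only the flags shown above; `build_cli_args` is a hypothetical helper, and nothing is assumed here about the structure of the tool's JSON output.

```python
from typing import List, Optional


def build_cli_args(
    target: str,
    output: Optional[str] = None,
    office_format: Optional[str] = None,
    max_errors: Optional[int] = None,
    recursive: bool = False,
) -> List[str]:
    """Assemble an argv list for the openxml-audit CLI from the documented flags."""
    args = ["openxml-audit", target]
    if output is not None:
        args += ["--output", output]
    if office_format is not None:
        args += ["--format", office_format]
    if max_errors is not None:
        args += ["--max-errors", str(max_errors)]
    if recursive:
        args.append("--recursive")
    return args


print(build_cli_args("presentation.pptx", output="json", max_errors=10))
```

The resulting list can be handed to `subprocess.run(args, capture_output=True, text=True)` without any shell quoting.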
### Python API
```python
from openxml_audit import validate_pptx, is_valid_pptx, OpenXmlValidator

# Quick check
if is_valid_pptx("presentation.pptx"):
    print("File is valid!")

# Detailed validation
result = validate_pptx("presentation.pptx")
if not result.is_valid:
    print(f"Found {result.error_count} errors, {result.warning_count} warnings")
    for error in result.errors:
        print(f" [{error.severity.value}] {error.description}")

# With custom options
from openxml_audit import FileFormat

validator = OpenXmlValidator(
    file_format=FileFormat.OFFICE_2019,
    max_errors=100,
    schema_validation=True,
    semantic_validation=True,
)
result = validator.validate("presentation.pptx")
```
## Documentation
- [ADRs](docs/adr/README.md) — evidence-ladder mission and PPTX evidence ownership
- [PPTX oracle corpus](docs/pptx_oracle/README.md) — curated PowerPoint timing
fixtures and XML-first methodology
- [Parity contract](docs/parity_contract.md) — SDK calibration and drift rules
## ODF Validation Depth
ODF validation is staged by explicit conformance level.
| Level | Includes | Does not include |
|---|---|---|
| `foundation` | package/manifest integrity + XML parse sweep | Relax NG schema-core routing, semantic-core rules, security-core checks |
| `schema-core` | foundation + Relax NG validation for routed XML members | semantic-core and security-core checks |
| `semantic-core` | foundation + semantic-core rule families (`ODFSEM*`) | Relax NG schema-core routing, security-core checks |
| `security-core` | semantic-core + signature/encryption structural checks (`ODFSEC*`) | full cryptographic trust guarantees unless crypto verification backend is configured |
Rule registry and policy references:
- semantic rule IDs: `openxml_audit.odf.get_odf_semantic_rules()`
- security policy: `docs/odf_security_policy.md`
- reference calibration/drift contract: `docs/odf_validation_contract.md`
### CLI Conformance Selection
Use `--odf-level` when validating ODF files:
```bash
# foundation
openxml-audit file.odt --validator odf --odf-level foundation
# semantic-core (default)
openxml-audit file.odt --validator odf --odf-level semantic-core
# security-core
openxml-audit file.odt --validator odf --odf-level security-core
```
Schema-core uses bundled OASIS Relax NG schemas by default:
```bash
openxml-audit file.odt \
--validator odf \
--odf-level schema-core
```
Pass `--odf-schema-routes` only when you want to override or extend routing. It accepts either
shape:
- versioned mapping:
  - `{"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}}`
- flat legacy mapping:
  - `{"content.xml": "schemas/odf/content.rng"}`
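The two shapes differ only in whether routes are keyed by ODF version. The sketch below shows how a flat mapping could be lifted into the versioned shape, assuming a flat mapping applies to a single default version. `normalize_routes` and that default are illustrative assumptions, not the tool's documented behavior.

```python
from collections.abc import Mapping


def normalize_routes(routes: Mapping, default_version: str = "1.3") -> dict:
    """Return routes in the versioned shape, wrapping a flat mapping if needed."""
    if all(isinstance(value, Mapping) for value in routes.values()):
        # Already versioned: copy into plain dicts.
        return {version: dict(members) for version, members in routes.items()}
    if all(isinstance(value, str) for value in routes.values()):
        # Flat legacy shape: package member -> schema path.
        return {default_version: dict(routes)}
    raise ValueError("routes mix versioned and flat entries")


print(normalize_routes({"content.xml": "schemas/odf/content.rng"}))
```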
Security-core crypto verification hook:
```bash
openxml-audit file.odt \
--validator odf \
--odf-level security-core \
--odf-verify-cryptography
```
### API Conformance Selection
```python
from openxml_audit import FileFormat
from openxml_audit.odf import OdfValidator

# foundation
foundation = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=False,
    semantic_validation=False,
    security_validation=False,
)

# schema-core (bundled schemas by default)
schema_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
)

# schema-core with custom routes
schema_core_custom = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=False,
    security_validation=False,
    relaxng_validation=True,
    schema_routes={"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}},
)

# semantic-core
semantic_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=False,
)

# security-core
security_core = OdfValidator(
    file_format=FileFormat.ODF_1_3,
    schema_validation=True,
    semantic_validation=True,
    security_validation=True,
    verify_cryptography=False,  # set True when crypto backend is available
)
```
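The constructor calls above differ only in a handful of flags, so a lookup table keyed by level name can keep call sites short. This sketch simply restates the keyword arguments from the block above. `LEVEL_KWARGS` and `kwargs_for_level` are hypothetical names, and the commented usage line assumes `OdfValidator` accepts exactly those keywords.

```python
LEVEL_KWARGS = {
    "foundation": dict(
        schema_validation=False, semantic_validation=False, security_validation=False
    ),
    "schema-core": dict(
        schema_validation=True,
        semantic_validation=False,
        security_validation=False,
        relaxng_validation=True,
    ),
    "semantic-core": dict(
        schema_validation=True, semantic_validation=True, security_validation=False
    ),
    "security-core": dict(
        schema_validation=True,
        semantic_validation=True,
        security_validation=True,
        verify_cryptography=False,
    ),
}


def kwargs_for_level(level: str) -> dict:
    """Return a copy of the keyword arguments for a conformance level name."""
    try:
        return dict(LEVEL_KWARGS[level])
    except KeyError:
        raise ValueError(f"unknown ODF level: {level!r}") from None


# Usage (requires openxml-audit to be installed):
# validator = OdfValidator(file_format=FileFormat.ODF_1_3, **kwargs_for_level("foundation"))
```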
### ODF Benchmarking
```bash
# Benchmark an ODF file (5 iterations by default)
python scripts/odf/benchmark_validation.py document.odt
# More iterations, with security checks
python scripts/odf/benchmark_validation.py document.odt --iterations 20 --security
# Foundation-only (skip schema/semantic)
python scripts/odf/benchmark_validation.py document.odt --no-schema --no-semantic
```
Reports avg/min/max/P95 with per-phase breakdown (package_structure, xml_parse, schema, semantic, security).
OOXML benchmark: `python scripts/benchmark_validation.py presentation.pptx`
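For reference, the same summary statistics can be reproduced from raw per-iteration timings with the standard library. This is a sketch, not the benchmark script's code: `summarize` is a hypothetical helper, and it assumes the nearest-rank definition of P95, which may differ from what the script uses.

```python
from statistics import mean


def summarize(timings_ms: list) -> dict:
    """Return avg/min/max/P95 for a non-empty list of timings in milliseconds."""
    if not timings_ms:
        raise ValueError("need at least one timing")
    ordered = sorted(timings_ms)
    # Nearest-rank P95: smallest value with at least 95% of samples at or below it.
    rank = -(-95 * len(ordered) // 100)  # integer ceil(0.95 * n)
    return {
        "avg": mean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


print(summarize([101.0, 98.5, 103.2, 99.1, 120.4]))
```

With only five iterations (the default mentioned above), nearest-rank P95 coincides with the maximum, so more iterations are needed before P95 says anything the maximum does not.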
### Known ODF Limitations
- Schema-core validates bundled routed members by default; use `schema_routes` to extend or
override routing for additional XML parts.
- Security-core validates structure/policy, not full cryptographic trust by default.
- CLI `--odf-level` only applies when the selected/auto-detected validator is ODF.
### ODF Reference Calibration
Compare Python results against external validators (ODF Toolkit, OPF) using the scripts in `scripts/odf/`:
| Script | Purpose |
|--------|---------|
| `run_reference_validators.py` | Run Python + external validators on pinned corpus |
| `compare_reference_results.py` | Diff results into mismatch families |
| `check_reference_drift.py` | Enforce drift policy against baseline |
| `bootstrap_reference_validators.py` | Auto-build external validator commands |
CI workflow: `.github/workflows/odf-reference-calibration.yml` — builds ODF Toolkit and OPF at runtime via Maven/Docker.
Set command templates via `--odf-toolkit-cmd` / `--opf-cmd` or env vars `ODF_TOOLKIT_CMD` / `OPF_ODF_VALIDATOR_CMD`. Placeholders: `{file}`, `{file_dir}`, `{file_name}`, `{file_stem}`, `{file_suffix}`.
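As an illustration of how such a template expands, the sketch below fills the five placeholders from a file path. The meaning of each placeholder is inferred from its name (for example, `{file_stem}` as the file name without its suffix), and both `expand_template` and the example command are hypothetical; check the scripts for the exact semantics.

```python
from pathlib import PurePosixPath


def expand_template(template: str, file: str) -> str:
    """Fill {file}, {file_dir}, {file_name}, {file_stem}, {file_suffix} in a template."""
    path = PurePosixPath(file)
    return template.format(
        file=str(path),
        file_dir=str(path.parent),
        file_name=path.name,
        file_stem=path.stem,
        file_suffix=path.suffix,
    )


# Hypothetical validator command, used only to show the expansion.
command = expand_template(
    "my-validator --in {file} --report {file_dir}/{file_stem}.json",
    "corpus/sample.odt",
)
print(command)
```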
## Google Workspace Roundtrip Oracle
The `gsuite` engine in the oracle dispatcher round-trips OOXML files
through Google's import/export pipeline (upload → convert to native
Google Slides → export back to .pptx → diff) and classifies what
GSuite drops, transforms, or normalizes. See
[`specs/031-gsuite-roundtrip-oracle.md`](specs/031-gsuite-roundtrip-oracle.md)
for the full design.
### One-time setup
GSuite uploads require **domain-wide delegation**: since Google's 2024
policy change, service accounts have no storage quota of their own, so
the service account has to act on behalf of a Workspace user.
The setup is a one-time ceremony per Workspace:
1. **Create a GCP project** in the Google Cloud console
   (e.g., `openxml-audit-oracle`).
2. **APIs & Services → Library**, enable **Google Drive API**.
3. **IAM & Admin → Service Accounts**, create one (e.g.,
   `oracle-roundtrip`); skip the project IAM role grant.
4. On the new SA → **Keys → Add key → JSON**. Save to
   `~/.config/openxml-audit/google_service_account.json` and
   `chmod 600` it.
5. Note the SA's OAuth client ID (in **Show domain-wide delegation**
   on the SA page).
6. In the Google Workspace Admin Console, go to **Security → Access and data
   control → API controls → Domain-wide Delegation → Add new**.
   Paste the OAuth client ID; scope:
   `https://www.googleapis.com/auth/drive`. This requires Workspace
   super-admin rights and is done once per Workspace.
7. In Drive, create a folder owned by the impersonation subject
   (e.g., `openxml-audit-oracle-staging`) to hold in-flight oracle
   uploads. Copy its folder ID from the URL.
Install the optional dependency group:
```bash
pip install -e ".[gsuite]"
```
### Running
Three env vars wire it up:
```bash
export GSUITE_ORACLE_CREDS=~/.config/openxml-audit/google_service_account.json # default; override only if elsewhere
export GSUITE_ORACLE_SUBJECT=info@yourdomain.example # the user the SA impersonates
export GSUITE_ORACLE_FOLDER_ID=1abcDEFghijKLM... # the staging folder ID
```
Then:
```bash
python -m openxml_audit.oracle gsuite presentation.pptx
python -m openxml_audit.oracle gsuite ./corpus/ --output gsuite-report.json
```
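If you wrap the oracle in your own script, the three environment variables above can be read and checked up front. A small sketch, assuming the default credentials path given above and treating the other two variables as required. `load_gsuite_config` and `GsuiteConfig` are hypothetical names, not part of the package.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

DEFAULT_CREDS = "~/.config/openxml-audit/google_service_account.json"


@dataclass(frozen=True)
class GsuiteConfig:
    creds: Path
    subject: str
    folder_id: str


def load_gsuite_config(env: Mapping[str, str] = os.environ) -> GsuiteConfig:
    """Read the three GSUITE_ORACLE_* variables, failing early on missing ones."""
    missing = [
        name
        for name in ("GSUITE_ORACLE_SUBJECT", "GSUITE_ORACLE_FOLDER_ID")
        if not env.get(name)
    ]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    creds = Path(env.get("GSUITE_ORACLE_CREDS", DEFAULT_CREDS)).expanduser()
    return GsuiteConfig(creds, env["GSUITE_ORACLE_SUBJECT"], env["GSUITE_ORACLE_FOLDER_ID"])
```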
The report classifies each roundtrip across a `LossClass` taxonomy:
`theme_loss`, `master_loss`, `style_loss`, `font_loss`,
`media_re_encoded`, `metadata_churn`, `structural_normalization`
(parts GSuite *added*), `content_preserved_lossy`,
`content_changed`, `unmapped`. Multiple classes may fire per file.
Drive uploads are deleted in `finally` after each roundtrip — the
oracle never leaves files in your account.
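The cleanup guarantee comes from wrapping each roundtrip in `try`/`finally`. A minimal sketch of that pattern with an in-memory stand-in for Drive. `FakeDrive`, `roundtrip`, and their methods are hypothetical and do not mirror the oracle's internals or the Google Drive client API.

```python
class FakeDrive:
    """In-memory stand-in for a Drive client (hypothetical interface)."""

    def __init__(self) -> None:
        self.files = {}

    def upload(self, name: str, data: bytes) -> str:
        file_id = f"id-{len(self.files) + 1}"
        self.files[file_id] = data
        return file_id

    def export(self, file_id: str) -> bytes:
        if not self.files[file_id]:
            raise ValueError("conversion failed")
        return self.files[file_id]

    def delete(self, file_id: str) -> None:
        self.files.pop(file_id, None)


def roundtrip(drive: FakeDrive, name: str, data: bytes) -> bytes:
    """Upload, export, and always delete the staged upload afterwards."""
    file_id = drive.upload(name, data)
    try:
        return drive.export(file_id)
    finally:
        # Runs on success and on failure, so nothing is left behind.
        drive.delete(file_id)
```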
## Open XML SDK (Standalone)
Run the .NET SDK validator separately (requires .NET SDK 8.x or Docker):
```bash
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /path/to/file.pptx
dotnet run --project scripts/sdk_compare/OpenXmlSdkValidator.csproj -- /path/to/file.pptx # JSON
# Via Docker
docker run --rm -v "$PWD:/work" -w /work mcr.microsoft.com/dotnet/sdk:8.0 \
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /work/path/to/file.pptx
```
Supports PPTX/DOCX/XLSX and variants. Configured for Office 2019.
## GitHub Action
Validate Office files in your PRs automatically:
```yaml
# .github/workflows/validate-office-files.yml
name: Validate Office Files
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: BramAlkema/openxml-audit@main
        with:
          changed-only: "true"  # only validate files changed in the PR
```
Options:
| Input | Default | Description |
|-------|---------|-------------|
| `path` | `.` | Directory or file to validate |
| `format` | `Office2019` | Office version to validate against |
| `changed-only` | `false` | Only validate files changed in the PR |
| `recursive` | `true` | Search subdirectories |
| `max-errors` | `100` | Maximum errors per file |
## Pre-commit Hook
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/BramAlkema/openxml-audit
    rev: v0.5.0
    hooks:
      - id: openxml-audit
```
Validates any `.pptx`, `.docx`, `.xlsx`, `.odt`, `.ods`, or `.odp` file before commit.
## Examples
Ready-to-run scripts in [`examples/`](examples/):
| Script | Description |
|--------|-------------|
| [`validate_python_pptx.py`](examples/validate_python_pptx.py) | Generate a PPTX with python-pptx and validate it |
| [`validate_openpyxl.py`](examples/validate_openpyxl.py) | Generate an XLSX with openpyxl and validate it |
| [`validate_odf.py`](examples/validate_odf.py) | Validate an ODF file (ODT/ODS/ODP) |
| [`ci_validation.py`](examples/ci_validation.py) | Validate all Office files in a directory (CI-ready, OOXML + ODF) |
## CI Workflows
| Workflow | Trigger | Purpose |
|----------|---------|---------|
| `parity-gate.yml` | PR / push | Enforce OOXML parity + perf budget against SDK baseline |
| `calibrate-parity.yml` | Weekly / dispatch | Calibrate against Open XML SDK upstream |
| `sdk-update.yml` | Quarterly / dispatch | Track upstream SDK version changes |
| `odf-reference-calibration.yml` | Dispatch | Run ODF reference validators and drift checks |
| `validate-inputs.yml` | Push to `inputs/` | Validate dropped files with both Python and .NET SDK |
| `release.yml` | Tag push (`v*`) | Build and publish to PyPI |
| `pages.yml` | Push to `main` | Deploy documentation site |
OOXML parity details: `docs/parity_contract.md`. ODF reference contract: `docs/odf_validation_contract.md`.
## pytest Plugin
Fixtures are registered automatically — just `pip install openxml-audit` and use them:
```python
# generate_pptx, generate_docx, generate_xlsx, and generate_odt stand for
# your own file-producing code.
def test_my_presentation(assert_valid_pptx, tmp_path):
    output = tmp_path / "output.pptx"
    generate_pptx(output)
    assert_valid_pptx(output)  # fails with detailed errors if invalid


def test_my_document(assert_valid_docx, tmp_path):
    output = tmp_path / "output.docx"
    generate_docx(output)
    assert_valid_docx(output)


def test_my_spreadsheet(assert_valid_xlsx, tmp_path):
    output = tmp_path / "output.xlsx"
    generate_xlsx(output)
    assert_valid_xlsx(output)


def test_odf_file(assert_valid_odf, tmp_path):
    output = tmp_path / "output.odt"
    generate_odt(output)
    assert_valid_odf(output)
```
CLI options:
```bash
# Validate against a specific Office version
pytest --openxml-format Office2007
# Limit errors collected per file
pytest --openxml-max-errors 50
```
Available fixtures: `openxml_validator`, `assert_valid_pptx`, `assert_valid_docx`, `assert_valid_xlsx`, `assert_valid_odf`.
## Integration Helpers
```python
from pptx import Presentation  # python-pptx, used in the save example below

# Context manager
from openxml_audit import validation_context

with validation_context(raise_on_invalid=True) as validator:
    result = validator.validate("presentation.pptx")

# Decorator: validate after save
from openxml_audit import validate_on_save

@validate_on_save(raise_on_invalid=True)
def create_presentation(output_path: str) -> None:
    Presentation().save(output_path)

# Decorator: require valid input
from openxml_audit import require_valid_pptx

@require_valid_pptx()
def process(input_path: str) -> dict: ...
```
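To make the decorator idea concrete, here is a sketch of how a validate-after-save wrapper can be built with `functools.wraps`. It is not the library's implementation of `validate_on_save`: `validate_after` is a hypothetical name, the check is injected as a plain callable, and the wrapped function is assumed to take the output path as its first argument.

```python
import functools
from typing import Callable


def validate_after(is_valid: Callable[[str], bool]) -> Callable:
    """Build a decorator that checks the output path after the function returns."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(output_path: str, *args, **kwargs):
            result = func(output_path, *args, **kwargs)
            if not is_valid(output_path):
                raise ValueError(f"{output_path} failed validation")
            return result

        return wrapper

    return decorator


# Stand-in check for the demo; in practice pass something like is_valid_pptx.
@validate_after(lambda path: path.endswith(".pptx"))
def save_deck(output_path: str) -> str:
    return output_path


print(save_deck("out.pptx"))  # out.pptx
```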
## API Reference
### `OpenXmlValidator` / `OdfValidator`
```python
OpenXmlValidator(file_format=FileFormat.OFFICE_2019, max_errors=1000,
                 schema_validation=True, semantic_validation=True)

OdfValidator(file_format=FileFormat.ODF_1_3, max_errors=1000,
             schema_validation=True, semantic_validation=True,
             security_validation=False, strict=True)
```
Both expose:
- `validate(path) -> ValidationResult`
- `validate_with_timings(path) -> (ValidationResult, dict[str, float])`
- `is_valid(path) -> bool`
### `ValidationResult`
| Property | Type | Description |
|----------|------|-------------|
| `is_valid` | `bool` | No ERROR-severity issues |
| `errors` | `list[ValidationError]` | All errors and warnings |
| `error_count` / `warning_count` | `int` | Counts by severity |
| `file_path` | `str` | Validated file path |
| `file_format` | `FileFormat` | Version validated against |
### `ValidationError`
| Property | Type | Description |
|----------|------|-------------|
| `error_type` | `ValidationErrorType` | `PACKAGE`, `BINARY`, `SCHEMA`, `SEMANTIC`, `RELATIONSHIP`, `MARKUP_COMPATIBILITY` |
| `severity` | `ValidationSeverity` | `ERROR`, `WARNING`, `INFO` |
| `description` | `str` | Human-readable message |
| `part_uri` | `str \| None` | Affected part URI |
| `path` | `str \| None` | XPath to affected element |
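
Because each error carries a `part_uri`, results are easy to group for reporting. The sketch below works on any objects exposing the attributes in the table above. `group_by_part` is a hypothetical helper, and the demo uses `SimpleNamespace` stand-ins with invented messages rather than real `ValidationError` instances.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Iterable


def group_by_part(errors: Iterable) -> dict:
    """Map each part URI to the descriptions of the issues reported for it."""
    grouped = defaultdict(list)
    for error in errors:
        # part_uri is typed str | None; None is grouped under a placeholder key.
        grouped[error.part_uri or "<package>"].append(error.description)
    return dict(grouped)


# Invented sample data, only to show the grouping.
demo = [
    SimpleNamespace(part_uri="/ppt/slides/slide1.xml", description="unexpected child element"),
    SimpleNamespace(part_uri=None, description="missing content type"),
    SimpleNamespace(part_uri="/ppt/slides/slide1.xml", description="invalid attribute value"),
]
for part, messages in group_by_part(demo).items():
    print(part, len(messages))
```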
### Supported Formats
| OOXML | ODF |
|-------|-----|
| `OFFICE_2007` through `MICROSOFT_365` (default: `OFFICE_2019`) | `ODF_1_2`, `ODF_1_3` (default: `ODF_1_3`) |
### Convenience Functions
- `validate_pptx(path) -> ValidationResult`
- `is_valid_pptx(path) -> bool`
## Works Well With
These libraries create Office files — openxml-audit checks them:
| Library | Format | Description |
|---------|--------|------|
| [python-pptx](https://github.com/scanny/python-pptx) | PPTX | Create and update PowerPoint files |
| [python-docx](https://github.com/python-openxml/python-docx) | DOCX | Create and update Word files |
| [openpyxl](https://openpyxl.readthedocs.io/) | XLSX | Create and update Excel files |
```python
from pptx import Presentation
from openxml_audit import validate_pptx

Presentation().save("output.pptx")
result = validate_pptx("output.pptx")
if not result.is_valid:
    print(f"{result.error_count} issues found")
```
## Contributing
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for dev setup and guidelines.
## Looking for Maintainers
This project is actively looking for co-maintainers — especially people working with:
- Office file generation pipelines (python-pptx, python-docx, openpyxl)
- ODF tooling and OASIS conformance
- Open XML SDK internals
If you're interested, open an issue or reach out.
## Funding
If this project saves you time, consider sponsoring its development:
[GitHub Sponsors](https://github.com/sponsors/BramAlkema)
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for a full list of changes by version.
## License
[MIT](LICENSE)
## Acknowledgments
Based on the validation logic from Microsoft's [Open XML SDK](https://github.com/OfficeDev/Open-XML-SDK) for .NET.