https://github.com/talker93/visqol-python

A pure Python implementation of Google's ViSQOL (Virtual Speech Quality Objective Listener) for objective audio/speech quality assessment.
audio-analysis audio-codec audio-processing audio-quality mos numba objective-metric perceptual-audio pesq polqa speech-quality tflite visqol

Host: GitHub
URL: https://github.com/talker93/visqol-python
Owner: talker93
License: apache-2.0
Created: 2026-03-23T04:47:41.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-05-26T04:36:13.000Z (2 months ago)
Last Synced: 2026-05-26T06:28:52.644Z (2 months ago)
Topics: audio-analysis, audio-codec, audio-processing, audio-quality, mos, numba, objective-metric, perceptual-audio, pesq, polqa, speech-quality, tflite, visqol
Language: Python
Homepage: https://pypi.org/project/visqol-python/
Size: 994 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

awesome-python-scientific-audio - visqol-python - python) [:package:](https://pypi.org/project/visqol-python/) - Port of Google's ViSQOL audio/speech quality metric (MOS-LQO) that installs without Bazel. (Audio Related Packages)

README

# ViSQOL (Python)

[![PyPI version](https://img.shields.io/pypi/v/visqol-python)](https://pypi.org/project/visqol-python/)

[![CI](https://github.com/talker93/visqol-python/actions/workflows/ci.yml/badge.svg)](https://github.com/talker93/visqol-python/actions/workflows/ci.yml)

[![Python](https://img.shields.io/pypi/pyversions/visqol-python)](https://pypi.org/project/visqol-python/)

[![License](https://img.shields.io/github/license/talker93/visqol-python)](LICENSE)

A pure Python implementation of [Google's ViSQOL](https://github.com/google/visqol) (Virtual Speech Quality Objective Listener) for objective audio/speech quality assessment.

ViSQOL compares a reference audio signal with a degraded version and outputs a **MOS-LQO** (Mean Opinion Score - Listening Quality Objective) score on a scale of **1.0 – 5.0**.

## Features

- **Two modes**: Audio mode (music/general audio at 48 kHz) and Speech mode (speech at 16 kHz)

- **High accuracy**: 12/12 conformance tests pass against the official C++ implementation

  - Audio mode: 9/10 tests produce **identical** MOS scores (diff = 0.000000), 1 test diff = 0.000117

  - Speech mode (polynomial): diff = 0.001057

  - Speech mode (lattice TFLite): diff = 0.002341

- **Two speech quality mappers** matching C++ ViSQOL:

  - **Lattice (default)** — deep-lattice TFLite network (`--use_lattice_model=true` in C++); requires the optional `[lattice]` extra

  - **Polynomial (fallback)** — legacy exponential fit (`--use_lattice_model=false` in C++)

- **Pure Python**: no C/C++ compilation required (the optional `[lattice]` extra adds the Google `ai-edge-litert` TFLite runtime as a binary wheel)

- **Minimal dependencies**: 4 core pip packages (`numpy`, `scipy`, `soundfile`, `libsvm-official`)

- **Optional Numba acceleration**: `pip install visqol-python[accel]` for JIT-compiled Gammatone filterbank (parallel) and a fused NSIM + DP patch matching kernel

- **Optional pyFFTW backend**: `pip install visqol-python[fftw]` routes alignment / xcorr FFTs through FFTW3 — **~16× overall speedup**, RTF 0.036 (vs C++ estimate 0.093)

- **Batch & parallel evaluation**: `measure_batch(parallel=True)` for multi-process execution across CPU cores

- **Fully typed**: PEP 561 `py.typed`, strict mypy, ruff-enforced code style

## Installation

```bash
pip install visqol-python
```

For **C++-default-equivalent speech mode** (deep-lattice TFLite mapper):

```bash
pip install visqol-python[lattice]   # requires Python ≥ 3.10
```

For **Numba-accelerated** Gammatone filtering and the fused NSIM + DP kernel:

```bash
pip install visqol-python[accel]
```

For **FFTW3-backed alignment FFTs** via pyFFTW:

```bash
pip install visqol-python[fftw]
```

Install everything (lattice + numba + fftw):

```bash
pip install visqol-python[all]
```

Or install from source:

```bash
git clone https://github.com/talker93/visqol-python.git
cd visqol-python
pip install -e ".[dev]"
```

> **Note on speech mode parity**: Without the `[lattice]` extra, speech mode falls back to the polynomial mapping (equivalent to running C++ ViSQOL with `--use_lattice_model=false`). The polynomial can over-predict MOS by 1–2 points on degraded speech vs the C++ default. Install `[lattice]` whenever you need numbers that line up with the C++ default behaviour (see [issue #1](https://github.com/talker93/visqol-python/issues/1)).
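Because the fallback to the polynomial mapper happens automatically, it can be useful to check which optional backends are importable before trusting speech-mode numbers. The sketch below is not part of the package; the module names (`ai_edge_litert`, `numba`, `pyfftw`) are assumptions based on the usual import names of the PyPI packages behind each extra, and `available_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the packages behind each extra; this README does
# not list them, so adjust if your environment differs.
OPTIONAL_BACKENDS = {
    "lattice": "ai_edge_litert",
    "accel": "numba",
    "fftw": "pyfftw",
}


def available_extras(backends=OPTIONAL_BACKENDS):
    """Map each extra name to True if its backend module can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in backends.items()
    }


for extra, present in available_extras().items():
    print(f"[{extra}] {'installed' if present else 'missing'}")
```

If `[lattice]` shows as missing, speech-mode scores will come from the polynomial mapper described in the note above.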

## Quick Start

### Python API

```python
from visqol import VisqolApi

# Audio mode (default) - for music and general audio
api = VisqolApi()
api.create(mode="audio")
result = api.measure("reference.wav", "degraded.wav")
print(f"MOS-LQO: {result.moslqo:.4f}")

# Speech mode - for speech signals
api = VisqolApi()
api.create(mode="speech")
result = api.measure("ref_speech.wav", "deg_speech.wav")
print(f"MOS-LQO: {result.moslqo:.4f}")
```

### Using NumPy Arrays

```python
import soundfile as sf
from visqol import VisqolApi

# sf.read returns the samples as a NumPy array plus the file's sample rate
ref, sr = sf.read("reference.wav")
deg, _  = sf.read("degraded.wav")

api = VisqolApi()
api.create(mode="audio")
result = api.measure_from_arrays(ref, deg, sample_rate=sr)
print(f"MOS-LQO: {result.moslqo:.4f}")
```

### Batch Evaluation

```python
from visqol import VisqolApi

api = VisqolApi()
api.create(mode="audio")

file_pairs = [
    ("ref1.wav", "deg1.wav"),
    ("ref2.wav", "deg2.wav"),
    ("ref3.wav", "deg3.wav"),
]

# Sequential with progress callback
results = api.measure_batch(
    file_pairs,
    progress_callback=lambda done, total: print(f"{done}/{total}"),
)

# Multi-process parallel, here limited to 4 worker processes
results = api.measure_batch(file_pairs, parallel=True, max_workers=4)

# A failed pair comes back as an Exception instance instead of a result
for pair, result in zip(file_pairs, results):
    if isinstance(result, Exception):
        print(f"{pair}: FAILED — {result}")
    else:
        print(f"{pair}: MOS-LQO = {result.moslqo:.4f}")
```
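For larger batches it may be handy to separate scored pairs from failures and report an average. The following is a minimal sketch, not part of the package: `summarize_batch` is a hypothetical helper, and the stand-in objects only mimic the `moslqo` attribute and the exception-in-place behaviour shown in the loop above.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_batch(pairs, results):
    """Split batch output into scores and failures, and average the scores."""
    scores = {}
    failures = {}
    for pair, result in zip(pairs, results):
        if isinstance(result, Exception):
            failures[pair] = str(result)
        else:
            scores[pair] = result.moslqo
    average = mean(scores.values()) if scores else None
    return scores, failures, average


# Stand-in data with made-up scores; real results come from api.measure_batch().
demo_pairs = [("ref1.wav", "deg1.wav"), ("ref2.wav", "deg2.wav"), ("ref3.wav", "deg3.wav")]
demo_results = [
    SimpleNamespace(moslqo=4.0),
    ValueError("bad file"),
    SimpleNamespace(moslqo=3.0),
]

scores, failures, average = summarize_batch(demo_pairs, demo_results)
print(f"scored={len(scores)} failed={len(failures)} mean MOS-LQO={average}")
```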

### Command Line

```bash
# Audio mode (default)
python -m visqol -r reference.wav -d degraded.wav

# Speech mode
python -m visqol -r reference.wav -d degraded.wav --speech_mode

# Verbose output (per-patch details)
python -m visqol -r reference.wav -d degraded.wav -v
```

**CLI options:**

| Flag | Description |
|------|-------------|
| `-r`, `--reference` | Path to reference WAV file (required) |
| `-d`, `--degraded` | Path to degraded WAV file (required) |
| `--speech_mode` | Use speech mode (16 kHz) |
| `--no_lattice_model` | Speech mode: disable lattice TFLite mapper, use polynomial fallback |
| `--lattice_model` | Custom path to lattice `.tflite` model (speech mode) |
| `--unscaled_speech` | Don't scale polynomial speech MOS to 5.0 (polynomial only) |
| `--model` | Custom SVR model file path (audio mode only) |
| `--search_window` | Search window radius (default: 60) |
| `--verbose`, `-v` | Show detailed per-patch results |
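
When the CLI is driven from a script, building the argument list programmatically avoids shell quoting problems. This is a sketch under assumptions: `build_visqol_command` is a hypothetical helper that uses only flags from the table above, and it does not parse the CLI's output because the output format is not described here.

```python
import sys


def build_visqol_command(reference, degraded, speech=False,
                         use_lattice=True, verbose=False):
    """Assemble an argv list for `python -m visqol` from the documented flags."""
    cmd = [sys.executable, "-m", "visqol",
           "-r", str(reference), "-d", str(degraded)]
    if speech:
        cmd.append("--speech_mode")
        if not use_lattice:
            # Only meaningful in speech mode, per the flag table.
            cmd.append("--no_lattice_model")
    if verbose:
        cmd.append("-v")
    return cmd


cmd = build_visqol_command("reference.wav", "degraded.wav",
                           speech=True, use_lattice=False)
print(" ".join(cmd[1:]))
# To run it: subprocess.run(cmd, check=True, capture_output=True, text=True)
```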

## Output

The `measure()` method returns a `SimilarityResult` object with:

| Field | Description |
|-------|-------------|
| `moslqo` | MOS-LQO score (1.0 – 5.0) |
| `vnsim` | Mean NSIM across all patches |
| `fvnsim` | Per-frequency-band mean NSIM |
| `fstdnsim` | Per-frequency-band std of NSIM |
| `fvdegenergy` | Per-frequency-band degraded energy |
| `patch_sims` | List of per-patch similarity details |
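
The per-band fields can help locate where a degradation sits in the spectrum. The sketch below assumes `fvnsim` is a sequence with one mean NSIM value per Gammatone band, which the table suggests but does not state outright; `weakest_bands` is a hypothetical helper, and the stand-in object uses made-up numbers.

```python
from types import SimpleNamespace


def weakest_bands(result, count=3):
    """Return (band_index, mean_nsim) for the `count` lowest-scoring bands."""
    ranked = sorted(enumerate(result.fvnsim), key=lambda item: item[1])
    return ranked[:count]


# Stand-in with invented values; a real result would come from api.measure().
fake_result = SimpleNamespace(
    moslqo=3.9,
    vnsim=0.81,
    fvnsim=[0.95, 0.90, 0.60, 0.88, 0.72],
)

print(weakest_bands(fake_result, count=2))
```

Band indices here are positions in the `fvnsim` sequence; mapping them to centre frequencies would need the filterbank's band layout, which this sketch does not assume.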

## Modes

### Audio Mode (default)

- Target sample rate: **48 kHz**

- 32 Gammatone frequency bands (50 Hz – 15 000 Hz)

- Quality mapping: SVR (Support Vector Regression) model

- Best for: music, environmental audio, codecs

### Speech Mode

- Target sample rate: **16 kHz**

- 21 Gammatone frequency bands (50 Hz – 8 000 Hz)

- VAD (Voice Activity Detection) based patch selection

- Quality mapping (choose one):

  - **Deep-lattice TFLite (default)** — same mapper as C++ ViSQOL's default `--use_lattice_model=true`; requires `pip install visqol-python[lattice]`

  - **Exponential polynomial (fallback)** — same as C++ `--use_lattice_model=false`; used automatically when the lattice runtime is not installed

- Toggle from Python: `api.create(mode="speech", use_lattice_model=False)`

- Toggle from CLI: `--no_lattice_model`

- Best for: speech, VoIP, telephony

## Performance

Measured on Apple M-series, Python 3.13, audio mode on the `guitar48_stereo` 12.5 s conformance case (3-run average):

| Configuration | RTF | Typical Time | Speedup vs pure Python |
|---|---|---|---|
| Pure Python + NumPy/SciPy | 0.58 | ~7 s | 1.0× |
| + `[accel]` (Numba JIT) | 0.067 | ~0.84 s | 8.7× |
| + `[accel] [fftw]` (Numba + FFTW3) | **0.036** | **~0.45 s** | **16×** |

> RTF (Real-Time Factor) < 1.0 means faster than real-time.

> With Numba + pyFFTW the Python implementation runs at **2.6× the C++ estimated speed** (C++ RTF ≈ 0.093).
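
The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes the usual definition of RTF as processing time divided by audio duration, which matches the table (for example, 0.45 s for a 12.5 s clip); the helper names are illustrative only.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent computing / duration of the audio being scored."""
    return processing_seconds / audio_seconds


def speedup(baseline_rtf, accelerated_rtf):
    """How many times faster the accelerated configuration is."""
    return baseline_rtf / accelerated_rtf


CLIP_SECONDS = 12.5  # length of the guitar48_stereo case quoted above

rtf_full = round(real_time_factor(0.45, CLIP_SECONDS), 3)
vs_pure_python = round(speedup(0.58, 0.036), 1)
vs_cpp_estimate = round(speedup(0.093, 0.036), 1)

print(rtf_full)         # 0.036
print(vs_pure_python)   # 16.1
print(vs_cpp_estimate)  # 2.6
```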

Stage-level breakdown of the v3.6.0 fully-accelerated path:

| Stage | Time | % |
|---|---|---|
| Gammatone filterbank | 0.179 s | 40% |
| DP Patch matching (fused NSIM kernel) | 0.131 s | 29% |
| Global alignment (pyFFTW rfft/irfft) | 0.091 s | 20% |
| Fine alignment + NSIM | 0.043 s | 10% |
| Other (SPL, postproc, SVR, …) | 0.003 s | < 1% |

## Project Structure

```
visqol-python/
├── visqol/                    # Main package
│   ├── __init__.py            # Package exports & version
│   ├── api.py                 # Public API (VisqolApi)
│   ├── visqol_manager.py      # Pipeline orchestrator
│   ├── visqol_core.py         # Core algorithm
│   ├── audio_utils.py         # Audio I/O & SPL normalization
│   ├── signal_utils.py        # Envelope, cross-correlation
│   ├── analysis_window.py     # Hann window
│   ├── gammatone.py           # ERB + Gammatone filterbank + spectrogram
│   ├── patch_creator.py       # Patch creation (Image + VAD modes)
│   ├── patch_selector.py      # DP-based optimal patch matching
│   ├── alignment.py           # Global alignment via cross-correlation
│   ├── nsim.py                # NSIM similarity metric
│   ├── quality_mapper.py      # SVR & exponential quality mapping
│   ├── numba_accel.py         # Optional Numba JIT kernels (DP, NSIM, Gammatone)
│   ├── __main__.py            # CLI entry point
│   ├── py.typed               # PEP 561 type marker
│   └── model/                 # Bundled SVR model
│       └── libsvm_nu_svr_model.txt
├── tests/                     # Tests & benchmarks (pytest)
│   ├── conftest.py            # Shared fixtures & CLI options
│   ├── test_quick.py          # Smoke tests (no external data needed)
│   ├── test_conformance.py    # Full conformance tests (needs testdata)
│   ├── test_parallel_correctness.py  # Numba parallel correctness tests
│   └── bench_*.py             # Performance benchmarks
├── .github/workflows/
│   ├── ci.yml                 # CI: lint + type-check + matrix test (Python × NumPy)
│   └── publish.yml            # Auto-publish to PyPI on tag push
├── pyproject.toml             # Package metadata & build config
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```

## Conformance Test Results

Tested against the [official C++ ViSQOL v3.3.3](https://github.com/google/visqol) expected values:

| Test Case | Mode | Expected MOS | Python MOS | Δ |
|-----------|------|-------------|------------|---|
| strauss_lp35 | Audio | 1.3889 | 1.3889 | 0.000000 |
| steely_lp7 | Audio | 2.2502 | 2.2502 | 0.000000 |
| sopr_256aac | Audio | 4.6823 | 4.6823 | 0.000000 |
| ravel_128opus | Audio | 4.4651 | 4.4651 | 0.000000 |
| moonlight_128aac | Audio | 4.6843 | 4.6843 | 0.000000 |
| harpsichord_96mp3 | Audio | 4.2237 | 4.2237 | 0.000000 |
| guitar_64aac | Audio | 4.3497 | 4.3497 | 0.000000 |
| glock_48aac | Audio | 4.3325 | 4.3325 | 0.000000 |
| contrabassoon_24aac | Audio | 2.3469 | 2.3468 | 0.000117 |
| castanets_identity | Audio | 4.7321 | 4.7321 | 0.000000 |
| speech_CA01 (polynomial) | Speech | 3.3745 | 3.3756 | 0.001057 |
| speech_CA01 (lattice) | Speech | 3.3130 | 3.3153 | 0.002341 |

Both speech values come from running the C++ ViSQOL binary directly with the corresponding `--use_lattice_model` flag, so they represent ground-truth parity targets.
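
A parity check of this kind boils down to comparing each score against its expected value within a tolerance. The sketch below reuses three rows from the table; the tolerance values are hypothetical, since the threshold used by the project's own conformance tests is not stated here, and `failing_cases` is an illustrative helper rather than part of the test suite.

```python
# (name, expected_mos, python_mos) copied from the table above
CASES = [
    ("contrabassoon_24aac", 2.3469, 2.3468),
    ("speech_CA01 (polynomial)", 3.3745, 3.3756),
    ("speech_CA01 (lattice)", 3.3130, 3.3153),
]


def failing_cases(cases, tolerance):
    """Names of cases whose absolute MOS difference exceeds `tolerance`."""
    return [
        name
        for name, expected, actual in cases
        if abs(expected - actual) > tolerance
    ]


# Hypothetical tolerances, chosen only to show how the check behaves.
loose = failing_cases(CASES, tolerance=0.005)
strict = failing_cases(CASES, tolerance=0.001)
print("loose:", loose)
print("strict:", strict)
```

With the looser tolerance every listed case passes; with the stricter one both speech cases are flagged, which reflects their larger Δ values in the table.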

## References

- [Google ViSQOL (C++)](https://github.com/google/visqol) — the original implementation this project is ported from

- Hines, A., Gillen, E., Kelly, D., Skoglund, J., Kokaram, A., & Harte, N. (2015). *ViSQOLAudio: An Objective Audio Quality Metric for Low Bitrate Codecs.* The Journal of the Acoustical Society of America.

- Chinen, M., Lim, F. S., Skoglund, J., Gureev, N., O'Gorman, F., & Hines, A. (2020). *ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric.* 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX).

## License

Apache License 2.0. See [LICENSE](LICENSE) for details.

This project is a Python port of [Google's ViSQOL](https://github.com/google/visqol), which is also licensed under Apache 2.0.
