https://github.com/dima-xd/tonara
Pure-Dart Mandarin tone detection (tones 1–4) from raw PCM - pYIN pitch tracking + an MLP classifier hitting ~93.7% accuracy. No native code, no FFI.
https://github.com/dima-xd/tonara
audio chinese dart dsp flutter language-learning machine-learning mandarin pitch-detection pyin speech tone-detection
Last synced: 11 days ago
JSON representation
Pure-Dart Mandarin tone detection (tones 1–4) from raw PCM - pYIN pitch tracking + an MLP classifier hitting ~93.7% accuracy. No native code, no FFI.
- Host: GitHub
- URL: https://github.com/dima-xd/tonara
- Owner: dima-xd
- License: mit
- Created: 2026-05-30T13:40:03.000Z (15 days ago)
- Default Branch: master
- Last Pushed: 2026-05-30T13:45:08.000Z (15 days ago)
- Last Synced: 2026-05-30T14:03:12.181Z (15 days ago)
- Topics: audio, chinese, dart, dsp, flutter, language-learning, machine-learning, mandarin, pitch-detection, pyin, speech, tone-detection
- Language: Dart
- Homepage: https://abuchi.lol
- Size: 0 Bytes
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# Tonara
Pure-Dart **Mandarin Chinese tone detection** (tones 1–4) from raw PCM audio.
No native code, no FFI - only `dart:typed_data` and `dart:math`.
## Pipeline
```
PCM audio
-> VAD (RMS gate)
-> pre-emphasis (y[n] = x[n] − 0.97·x[n−1])
-> Hann framing (frame 1024, hop 256)
-> pYIN pitch tracking (CMNDF + Beta(2,18) prior + Viterbi)
-> voiced F0 contour (octave/spike/edge cleanup)
-> semitone-relative contour (resample to N=20) + shape/duration features
-> learned MLP classifier (default - ~93.7% LOSO-CV, every tone ≥90%)
or rule-based + KNN k=5 (opt-in, transparent - ~82%)
-> optional DTW reference comparison
-> ToneResult
```
## Install
```yaml
dependencies:
tonara: ^0.1.0
```
## Usage
```dart
import 'dart:typed_data';
import 'package:tonara/tonara.dart';
final analyzer = TonaraAnalyzer(sampleRate: 16000);
// Single syllable.
final ToneResult result = analyzer.analyze(samples); // Float32List in [-1, 1]
if (result.error == null) {
print('Tone ${result.tone} (confidence ${result.confidence})');
print(result.feedback);
}
// Compare against a reference recording.
final scored = analyzer.analyzeWithReference(
samples,
reference: nativeSpeakerSamples,
expectedTone: 2,
);
print('similarity: ${scored.similarityScore}');
// Real-time streaming - one ToneFrame per detected syllable.
await for (final ToneFrame frame in analyzer.stream(micChunks)) {
print('syllable ${frame.syllableIndex}: tone ${frame.result.tone}');
}
```
## The seven features
| Feature | Meaning |
| --- | --- |
| `linearSlope` | overall least-squares slope |
| `quadraticCoeff` | x² coefficient (positive ⇒ U-shape ⇒ tone 3) |
| `midpointDip` | midpoint minus endpoint mean (negative ⇒ dip) |
| `pitchRange` | max − min of the raw Hz contour |
| `startToMidSlope` | slope of the first half |
| `midToEndSlope` | slope of the second half |
| `normalizedVariance` | variance of the z-scored contour |
## Classification
Two classifiers are available; `TonaraAnalyzer(useModel: ...)` selects between
them (default `true`):
- **Learned model** (`tone_model.dart`) - a two-layer MLP (32 -> 48 -> 24 -> 4)
over the semitone-relative contour plus shape/duration summary features.
**~93.7% leave-one-speaker-out** on a corpus of 2500+ labeled Mandarin
clips, every tone ≥90%. Default.
- **Rule-based + KNN** (`classify`) - a transparent decision tree on 7 shape
features with a k = 5 KNN fallback. ~82%. Use it when you want interpretable
decisions or no embedded weights.
### Rule-based decision tree (`useModel: false`)
Slope/curvature features are measured on a normalized **[-1, 1]** x-axis, so the
thresholds are independent of the contour length. The cut points were tuned
against the training corpus (see below).
1. `pitchRange < 5` -> tone 0 (neutral / unvoiced)
2. `pitchRange < 22 || normalizedVariance < 0.08` -> tone 1 (level - a level tone
has the least movement; `pitchRange` in Hz is its only robust cue, since
z-scoring inflates a flat contour's slope)
3. `startToMidSlope < −0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4`
-> tone 3 (dip: does not rise in the first half, then rises)
4. `linearSlope < −0.4` -> tone 4 (falling)
5. `linearSlope > 1.0 && startToMidSlope > −0.1` -> tone 2 (rising throughout)
6. otherwise -> KNN (k = 5) over 40 hand-tuned prototypes
> These differ from a naive reading of the original design in ways the data
> forced: (a) the x-axis is normalized so the `linearSlope`/`quadraticCoeff`
> thresholds are reachable at all; (b) tone 3 is separated from tone 2 by the
> **first-half** slope (a citation third tone also ends higher than it starts,
> so overall slope can't tell them apart); (c) a small pitch range means *level*
> tone 1, not tone 0.
## Validation on real audio
The learned model was trained and validated on a corpus of **2500+ labeled
single-syllable Mandarin recordings** (multiple native speakers; the tone and
speaker are encoded in each filename). The audio itself is not distributed -
only the trained weights ship, in `lib/src/tone_model.dart`. Drop your own
labeled `.wav` clips into `audio/train/` to retrain:
```bash
dart run tool/train_model.dart # prints LOSO-CV, regenerates tone_model.dart
```
`train_model.dart` reports honest accuracy via **leave-one-speaker-out
cross-validation** (each speaker is classified by a model trained only on the
others), then ships weights trained on every speaker.
**Learned model - 93.7% LOSO-CV** with **every tone above 90%**:
| | t1 | t2 | t3 | t4 |
|--|----|----|----|----|
| accuracy | 98% | **90%** | **91%** | 96% |
The model is a two-hidden-layer MLP (32 -> 48 -> 24 -> 4) over the
semitone-relative contour plus shape/duration summary features.
Tones 2 and 3 are the hard pair. Tone-3 citation recordings include both full
dipping (˅) and *reduced* realizations - a low fall (no final rise, looks like
tone 4) or a low rise (no initial fall, looks like tone 2). These
"half-third-tones" are acoustically ambiguous from F0 alone, so the raw model
makes *confident* errors on the fuzzy tone-2/3 boundary that no amount of extra
features, network depth, or loss weighting could fix (all plateaued tone 3 at
~88%). Because tones 1 and 4 carry large margins (98% / 96%), the classifier
applies a **per-class decision bias** (`decisionBias` in `train_model.dart`)
that favours tones 2 and 3 at the boundary, pulling slack from tones 1/4 so all
four clear 90%. This is a deliberate balance choice, not a raw accuracy gain;
overall sits at ~93.7%.
The **rule-based fallback** (`useModel: false`) reaches ~82%. Its main
confusions come from z-score normalization erasing the level-tone flatness cue.
The learned model avoids this by classifying the *semitone-relative* contour,
which preserves both shape and the small magnitude of a level tone.
## Pitch & preprocessing notes
- **Pre-emphasis is off by default** (`applyPreEmphasis: false`). It is a
high-pass that attenuates the fundamental and roughly halves voiced-frame
detection, so the pitch path runs on the clean signal. Enable it only for
spectral experiments.
- The raw F0 contour is cleaned before feature extraction: octave-error repair,
a 3-point median filter, and a one-frame edge trim (`refineF0`).
- Real recordings vary widely in level; peak-normalise input before `analyze`
(the harness does this) so the fixed RMS gate behaves consistently.
## Development
```bash
dart pub get
dart analyze
dart test
dart run example/main.dart
```
## Notes & limitations
- The KNN prototypes in `lib/src/reference_data.dart` are hand-tuned from the
phonetics literature, not trained on a corpus; classification is heuristic.
- pYIN frequency resolution is sharpened by parabolic interpolation of the
CMNDF minimum (~ sub-Hertz on a clean tone).
- Pre-emphasis is applied in the full pipeline; the single-frame `pyinFrame`
entry point operates on whatever frame you pass it.