https://github.com/dima-xd/tonara

Pure-Dart Mandarin tone detection (tones 1–4) from raw PCM - pYIN pitch tracking + an MLP classifier hitting ~93.7% accuracy. No native code, no FFI.

audio chinese dart dsp flutter language-learning machine-learning mandarin pitch-detection pyin speech tone-detection

Host: GitHub
URL: https://github.com/dima-xd/tonara
Owner: dima-xd
License: mit
Created: 2026-05-30T13:40:03.000Z (about 1 month ago)
Default Branch: master
Last Pushed: 2026-05-30T13:45:08.000Z (about 1 month ago)
Last Synced: 2026-05-30T14:03:12.181Z (about 1 month ago)
Topics: audio, chinese, dart, dsp, flutter, language-learning, machine-learning, mandarin, pitch-detection, pyin, speech, tone-detection
Language: Dart
Homepage: https://abuchi.lol
Size: 0 Bytes
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# Tonara

Pure-Dart **Mandarin Chinese tone detection** (tones 1–4) from raw PCM audio.

No native code, no FFI - only `dart:typed_data` and `dart:math`.

## Pipeline

```
PCM audio
  -> VAD (RMS gate)
  -> pre-emphasis (y[n] = x[n] − 0.97·x[n−1]; optional, off by default)
  -> Hann framing (frame 1024, hop 256)
  -> pYIN pitch tracking (CMNDF + Beta(2,18) prior + Viterbi)
  -> voiced F0 contour (octave/spike/edge cleanup)
  -> semitone-relative contour (resample to N=20) + shape/duration features
  -> learned MLP classifier   (default - ~93.7% LOSO-CV, every tone ≥90%)
     or rule-based + KNN k=5  (opt-in, transparent - ~82%)
  -> optional DTW reference comparison
  -> ToneResult
```
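The pre-emphasis and framing steps above can be sketched in a few lines of plain Dart. This is an illustrative sketch, not Tonara's internal code: the function names `preEmphasis` and `hannFrames` are hypothetical, the periodic form of the Hann window is an assumption, and dropping a trailing partial frame is a choice made here for simplicity.

```dart
import 'dart:math' as math;
import 'dart:typed_data';

/// Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1], with y[0] = x[0].
Float32List preEmphasis(Float32List x, {double alpha = 0.97}) {
  final y = Float32List(x.length);
  if (x.isEmpty) return y;
  y[0] = x[0];
  for (var n = 1; n < x.length; n++) {
    y[n] = x[n] - alpha * x[n - 1];
  }
  return y;
}

/// Splits [x] into Hann-windowed frames of length [frame], advancing by [hop].
/// A trailing partial frame is dropped (an assumption of this sketch).
List<Float32List> hannFrames(Float32List x, {int frame = 1024, int hop = 256}) {
  // Periodic Hann window: w[i] = 0.5 - 0.5 * cos(2*pi*i / frame).
  final window = Float32List(frame);
  for (var i = 0; i < frame; i++) {
    window[i] = 0.5 - 0.5 * math.cos(2 * math.pi * i / frame);
  }
  final frames = <Float32List>[];
  for (var start = 0; start + frame <= x.length; start += hop) {
    final f = Float32List(frame);
    for (var i = 0; i < frame; i++) {
      f[i] = x[start + i] * window[i];
    }
    frames.add(f);
  }
  return frames;
}
```

With the defaults, a 2048-sample buffer yields five overlapping frames (starts 0, 256, 512, 768, 1024).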

## Install

```yaml
dependencies:
  tonara: ^0.1.0
```

## Usage

```dart
import 'dart:typed_data';

import 'package:tonara/tonara.dart';

// `samples`, `nativeSpeakerSamples` and `micChunks` come from your app.
final analyzer = TonaraAnalyzer(sampleRate: 16000);

// Single syllable.
final ToneResult result = analyzer.analyze(samples); // Float32List in [-1, 1]
if (result.error == null) {
  print('Tone ${result.tone} (confidence ${result.confidence})');
  print(result.feedback);
}

// Compare against a reference recording.
final scored = analyzer.analyzeWithReference(
  samples,
  reference: nativeSpeakerSamples,
  expectedTone: 2,
);
print('similarity: ${scored.similarityScore}');

// Real-time streaming - one ToneFrame per detected syllable.
// `await for` must run inside an `async` function.
await for (final ToneFrame frame in analyzer.stream(micChunks)) {
  print('syllable ${frame.syllableIndex}: tone ${frame.result.tone}');
}
```
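`analyze` expects a `Float32List` in [-1, 1], while many recorders hand back raw 16-bit PCM bytes. The sketch below shows one way to bridge the two. It assumes mono, little-endian, signed 16-bit input; `pcm16ToFloat32` is a hypothetical helper written for this example, not part of the Tonara API.

```dart
import 'dart:typed_data';

/// Converts little-endian signed 16-bit PCM bytes to floats in [-1, 1).
/// A trailing odd byte, if any, is ignored.
Float32List pcm16ToFloat32(Uint8List bytes) {
  final data = ByteData.sublistView(bytes);
  final out = Float32List(bytes.length ~/ 2);
  for (var i = 0; i < out.length; i++) {
    // Dividing by 32768 maps the int16 range [-32768, 32767] into [-1, 1).
    out[i] = data.getInt16(i * 2, Endian.little) / 32768.0;
  }
  return out;
}
```

The result can then be passed as `samples` in the calls shown above.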

## The seven features

| Feature | Meaning |
| --- | --- |
| `linearSlope` | overall least-squares slope |
| `quadraticCoeff` | x² coefficient (positive ⇒ U-shape ⇒ tone 3) |
| `midpointDip` | midpoint minus endpoint mean (negative ⇒ dip) |
| `pitchRange` | max − min of the raw Hz contour |
| `startToMidSlope` | slope of the first half |
| `midToEndSlope` | slope of the second half |
| `normalizedVariance` | variance of the z-scored contour |
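To make three of these definitions concrete, here is a minimal sketch that fits `y = a + b·x + c·x²` by least squares on an evenly spaced x-axis in [-1, 1]. It is not Tonara's feature extractor: the `ShapeFeatures` class and `shapeFeatures` function are hypothetical names, the contour normalization applied before fitting is not reproduced, and taking the midpoint as index `n ~/ 2` is an assumption.

```dart
/// Three of the shape features from the table, for illustration only.
class ShapeFeatures {
  const ShapeFeatures(this.linearSlope, this.quadraticCoeff, this.midpointDip);
  final double linearSlope;
  final double quadraticCoeff;
  final double midpointDip;
}

/// Fits y = a + b*x + c*x^2 on evenly spaced x in [-1, 1].
/// Because the grid is symmetric, sum(x) = sum(x^3) = 0, which gives
/// closed forms for b (slope) and c (curvature).
ShapeFeatures shapeFeatures(List<double> y) {
  final n = y.length;
  if (n < 3) throw ArgumentError('need at least 3 points');
  var sy = 0.0, sxy = 0.0, sx2 = 0.0, sx2y = 0.0, sx4 = 0.0;
  for (var i = 0; i < n; i++) {
    final x = -1.0 + 2.0 * i / (n - 1);
    final x2 = x * x;
    sy += y[i];
    sxy += x * y[i];
    sx2 += x2;
    sx2y += x2 * y[i];
    sx4 += x2 * x2;
  }
  final slope = sxy / sx2;
  final quad = (n * sx2y - sx2 * sy) / (n * sx4 - sx2 * sx2);
  // Midpoint minus the mean of the two endpoints; negative means a dip.
  final dip = y[n ~/ 2] - (y.first + y.last) / 2;
  return ShapeFeatures(slope, quad, dip);
}
```

A pure U-shape such as `[1, 0.25, 0, 0.25, 1]` gives a slope of 0, a quadratic coefficient of 1 and a midpoint dip of -1, matching the "positive ⇒ U-shape" and "negative ⇒ dip" readings in the table.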

## Classification

Two classifiers are available; `TonaraAnalyzer(useModel: ...)` selects between
them (default `true`):

- **Learned model** (`tone_model.dart`) - a two-hidden-layer MLP
  (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration
  summary features. **~93.7% leave-one-speaker-out** accuracy on a corpus of
  2500+ labeled Mandarin clips, every tone ≥90%. Default.
- **Rule-based + KNN** (`classify`) - a transparent decision tree on 7 shape
  features with a k = 5 KNN fallback. ~82% accuracy. Use it when you want
  interpretable decisions or no embedded weights.
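For readers unfamiliar with the notation, 32 -> 48 -> 24 -> 4 means 32 inputs, hidden layers of 48 and 24 units, and 4 outputs (one per tone). A forward pass through such a network can be sketched as below. The ReLU hidden activations and softmax output are assumptions (the README does not state which activations `tone_model.dart` uses), and the `dense`, `relu`, `softmax` and `forward` names are hypothetical.

```dart
import 'dart:math' as math;

/// One dense layer: out[j] = b[j] + sum_i w[j][i] * x[i].
List<double> dense(List<double> x, List<List<double>> w, List<double> b) {
  final out = List<double>.filled(w.length, 0.0);
  for (var j = 0; j < w.length; j++) {
    var sum = b[j];
    for (var i = 0; i < x.length; i++) {
      sum += w[j][i] * x[i];
    }
    out[j] = sum;
  }
  return out;
}

/// ReLU activation (assumed; the real model's activation is not documented).
List<double> relu(List<double> v) => [for (final e in v) math.max(0.0, e)];

/// Numerically stable softmax: subtract the maximum before exponentiating.
List<double> softmax(List<double> z) {
  final m = z.reduce(math.max);
  final exps = [for (final e in z) math.exp(e - m)];
  final total = exps.reduce((a, b) => a + b);
  return [for (final e in exps) e / total];
}

/// Forward pass: ReLU after every layer except the last, then softmax.
List<double> forward(
  List<double> input,
  List<List<List<double>>> weights,
  List<List<double>> biases,
) {
  var h = input;
  for (var l = 0; l < weights.length; l++) {
    h = dense(h, weights[l], biases[l]);
    if (l < weights.length - 1) h = relu(h);
  }
  return softmax(h);
}
```

For the shape above, `weights` would hold three matrices of sizes 48×32, 24×48 and 4×24, and the returned list would hold four class probabilities.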

### Rule-based decision tree (`useModel: false`)

Slope/curvature features are measured on a normalized **[-1, 1]** x-axis, so the
thresholds are independent of the contour length. The cut points were tuned
against the training corpus (see below).

1. `pitchRange < 5` -> tone 0 (neutral / unvoiced)
2. `pitchRange < 22 || normalizedVariance < 0.08` -> tone 1 (level - a level tone
   has the least movement; `pitchRange` in Hz is its only robust cue, since
   z-scoring inflates a flat contour's slope)
3. `startToMidSlope < −0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4`
   -> tone 3 (dip: does not rise in the first half, then rises)
4. `linearSlope < −0.4` -> tone 4 (falling)
5. `linearSlope > 1.0 && startToMidSlope > −0.1` -> tone 2 (rising throughout)
6. otherwise -> KNN (k = 5) over 40 hand-tuned prototypes

> These differ from a naive reading of the original design in ways the data
> forced: (a) the x-axis is normalized so the `linearSlope`/`quadraticCoeff`
> thresholds are reachable at all; (b) tone 3 is separated from tone 2 by the
> **first-half** slope (a citation third tone also ends higher than it starts,
> so overall slope can't tell them apart); (c) a small pitch range means *level*
> tone 1, not tone 0.
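The numbered rules above translate directly into an ordered chain of `if` statements. The sketch below copies the thresholds from the list; the function name `ruleBasedTone` and the use of `null` to signal "fall through to KNN" are conventions of this example, not necessarily those of Tonara's `classify`.

```dart
/// Applies rules 1-5 in order. Returns the tone (0-4) when a rule fires,
/// or null when the caller should fall back to the KNN step (rule 6).
int? ruleBasedTone({
  required double pitchRange,
  required double normalizedVariance,
  required double linearSlope,
  required double quadraticCoeff,
  required double startToMidSlope,
  required double midToEndSlope,
}) {
  // Rule 1: almost no pitch movement at all.
  if (pitchRange < 5) return 0;
  // Rule 2: level tone.
  if (pitchRange < 22 || normalizedVariance < 0.08) return 1;
  // Rule 3: dip, i.e. falls in the first half and rises in the second.
  if (startToMidSlope < -0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4) {
    return 3;
  }
  // Rule 4: falling.
  if (linearSlope < -0.4) return 4;
  // Rule 5: rising throughout.
  if (linearSlope > 1.0 && startToMidSlope > -0.1) return 2;
  // Rule 6: undecided, defer to KNN.
  return null;
}
```

Order matters: a dipping contour with a positive overall slope is caught by rule 3 before rule 5 has a chance to label it tone 2.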

## Validation on real audio

The learned model was trained and validated on a corpus of **2500+ labeled
single-syllable Mandarin recordings** (multiple native speakers; the tone and
speaker are encoded in each filename). The audio itself is not distributed -
only the trained weights ship, in `lib/src/tone_model.dart`. Drop your own
labeled `.wav` clips into `audio/train/` to retrain:

```bash
dart run tool/train_model.dart   # prints LOSO-CV, regenerates tone_model.dart
```

`train_model.dart` reports honest accuracy via **leave-one-speaker-out
cross-validation** (each speaker is classified by a model trained only on the
others), then ships weights trained on every speaker.
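The leave-one-speaker-out procedure can be sketched independently of the model. In the example below, `Clip`, `Fold`, `leaveOneSpeakerOut` and `losoAccuracy` are hypothetical names, and the `fit` callback stands in for whatever training routine is used; this is an illustration of the protocol, not the code in `tool/train_model.dart`. Accuracy is pooled over all held-out clips, which is one of several reasonable ways to aggregate folds.

```dart
/// A labeled clip; only the fields needed for splitting are modeled here.
class Clip {
  const Clip(this.speaker, this.tone);
  final String speaker;
  final int tone;
}

/// One cross-validation fold: all clips of [heldOut] form the test set.
class Fold {
  const Fold(this.heldOut, this.train, this.test);
  final String heldOut;
  final List<Clip> train;
  final List<Clip> test;
}

/// Builds one fold per distinct speaker, in order of first appearance.
List<Fold> leaveOneSpeakerOut(List<Clip> clips) {
  final speakers = {for (final c in clips) c.speaker};
  return [
    for (final s in speakers)
      Fold(
        s,
        [for (final c in clips) if (c.speaker != s) c],
        [for (final c in clips) if (c.speaker == s) c],
      ),
  ];
}

/// Pooled accuracy: [fit] trains on a fold's training clips and returns a
/// predictor, which is then scored only on the held-out speaker.
double losoAccuracy(
  List<Clip> clips,
  int Function(Clip) Function(List<Clip> train) fit,
) {
  var correct = 0;
  for (final fold in leaveOneSpeakerOut(clips)) {
    final predict = fit(fold.train);
    for (final c in fold.test) {
      if (predict(c) == c.tone) correct++;
    }
  }
  return correct / clips.length;
}
```

Because no clip from the held-out speaker is ever seen during training, the score reflects generalization to new voices rather than memorization of a speaker's pitch range.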

**Learned model - 93.7% LOSO-CV** with **every tone at or above 90%**:

| | tone 1 | tone 2 | tone 3 | tone 4 |
|--|----|----|----|----|
| accuracy | 98% | **90%** | **91%** | 96% |

The model is a two-hidden-layer MLP (32 -> 48 -> 24 -> 4) over the
semitone-relative contour plus shape/duration summary features.

Tones 2 and 3 are the hard pair. Tone-3 citation recordings include both full
dipping (˅) and *reduced* realizations - a low fall (no final rise, looks like
tone 4) or a low rise (no initial fall, looks like tone 2). These
"half-third-tones" are acoustically ambiguous from F0 alone, so the raw model
makes *confident* errors on the fuzzy tone-2/3 boundary that no amount of extra
features, network depth, or loss weighting could fix (all plateaued tone 3 at
~88%).

Because tones 1 and 4 carry large margins (98% / 96%), the classifier
applies a **per-class decision bias** (`decisionBias` in `train_model.dart`)
that favours tones 2 and 3 at the boundary, pulling slack from tones 1/4 so all
four reach at least 90%. This is a deliberate balance choice, not a raw accuracy
gain; overall accuracy sits at ~93.7%.
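A per-class decision bias amounts to adding a fixed offset to each class score before taking the argmax. The sketch below illustrates the mechanism only: `biasedArgmax` is a hypothetical name, the bias values in the usage note are invented for the example, and whether Tonara adds its `decisionBias` to logits or to probabilities is not stated in this README.

```dart
/// Returns the index of the largest biased score, scores[i] + bias[i].
/// Ties keep the lowest index. Both lists must have the same length.
int biasedArgmax(List<double> scores, List<double> bias) {
  if (scores.length != bias.length || scores.isEmpty) {
    throw ArgumentError('scores and bias must be non-empty and equal length');
  }
  var best = 0;
  for (var i = 1; i < scores.length; i++) {
    if (scores[i] + bias[i] > scores[best] + bias[best]) best = i;
  }
  return best;
}
```

With scores `[0.40, 0.35, 0.15, 0.10]` for tones 1 to 4, a zero bias picks index 0 (tone 1), while an illustrative bias of `[0.0, 0.1, 0.1, 0.0]` tips the borderline case to index 1 (tone 2). Clear-cut predictions are unaffected, which is why the trade costs tones 1 and 4 only a little.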

The **rule-based fallback** (`useModel: false`) reaches ~82%. Its main
confusions come from z-score normalization erasing the level-tone flatness cue.
The learned model avoids this by classifying the *semitone-relative* contour,
which preserves both shape and the small magnitude of a level tone.
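The difference is easy to see in code. The sketch below converts an F0 contour in Hz to semitones and resamples it to a fixed length. It is an assumption-laden illustration: the reference pitch is taken to be the contour's own geometric mean (Tonara's actual reference is not documented here), linear interpolation is assumed for resampling, and `semitoneRelative` and `resample` are hypothetical names.

```dart
import 'dart:math' as math;

/// Converts F0 values (Hz, all > 0) to semitones relative to the contour's
/// geometric mean: s = 12 * log2(f) - mean(12 * log2(f)).
List<double> semitoneRelative(List<double> f0Hz) {
  final logs = [for (final f in f0Hz) 12 * math.log(f) / math.ln2];
  final mean = logs.reduce((a, b) => a + b) / logs.length;
  return [for (final s in logs) s - mean];
}

/// Linear interpolation of [y] at fractional index [pos].
double _lerpAt(List<double> y, double pos) {
  final lo = pos.floor();
  final hi = math.min(lo + 1, y.length - 1);
  final t = pos - lo;
  return y[lo] * (1 - t) + y[hi] * t;
}

/// Linearly resamples [y] to [n] evenly spaced points (n >= 2).
List<double> resample(List<double> y, int n) {
  if (n < 2 || y.isEmpty) throw ArgumentError('need n >= 2 and non-empty y');
  if (y.length == 1) return List<double>.filled(n, y.first);
  return [
    for (var i = 0; i < n; i++) _lerpAt(y, i * (y.length - 1) / (n - 1)),
  ];
}
```

A nearly level contour drifting from 200 Hz to 204 Hz spans about a third of a semitone, so it stays visibly small next to a rise of several semitones. Z-scoring the same contour would stretch it to unit variance and make it look like a rise.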

## Pitch & preprocessing notes

- **Pre-emphasis is off by default** (`applyPreEmphasis: false`). It is a
  high-pass that attenuates the fundamental and roughly halves voiced-frame
  detection, so the pitch path runs on the clean signal. Enable it only for
  spectral experiments.
- The raw F0 contour is cleaned before feature extraction: octave-error repair,
  a 3-point median filter, and a one-frame edge trim (`refineF0`).
- Real recordings vary widely in level; peak-normalise input before `analyze`
  (the harness does this) so the fixed RMS gate behaves consistently.
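Two of the steps in the list above are simple enough to sketch. The helpers below are illustrative only: `peakNormalize` and `median3` are hypothetical names, the normalization target of 1.0 is an assumption, and Tonara's `refineF0` additionally performs octave-error repair and edge trimming, which are not shown.

```dart
import 'dart:typed_data';

/// Scales [x] so that its largest absolute sample equals [target].
/// An all-zero (silent) buffer is returned as an unchanged copy.
Float32List peakNormalize(Float32List x, {double target = 1.0}) {
  var peak = 0.0;
  for (final s in x) {
    if (s.abs() > peak) peak = s.abs();
  }
  if (peak == 0.0) return Float32List.fromList(x);
  final gain = target / peak;
  return Float32List.fromList([for (final s in x) s * gain]);
}

/// 3-point median filter over an F0 contour; the two endpoints are copied
/// through unchanged. Each output uses only the original neighbours.
List<double> median3(List<double> f0) {
  final out = List<double>.of(f0);
  for (var i = 1; i < f0.length - 1; i++) {
    final w = [f0[i - 1], f0[i], f0[i + 1]]..sort();
    out[i] = w[1];
  }
  return out;
}
```

A single-frame octave jump such as `[200, 400, 202, 204]` is flattened to `[200, 202, 204, 204]`, which is the kind of spike the median filter is meant to remove.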

## Development

```bash
dart pub get
dart analyze
dart test
dart run example/main.dart
```

## Notes & limitations

- The KNN prototypes in `lib/src/reference_data.dart` are hand-tuned from the
  phonetics literature, not trained on a corpus, so the KNN fallback of the
  rule-based classifier is heuristic.
- pYIN frequency resolution is sharpened by parabolic interpolation of the
  CMNDF minimum (roughly sub-Hertz on a clean tone).
- When enabled, pre-emphasis is applied only in the full pipeline; the
  single-frame `pyinFrame` entry point operates on whatever frame you pass it.
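Parabolic interpolation, mentioned in the list above, fits a parabola through the minimum of the difference function and its two neighbours and takes the parabola's vertex as the refined lag. The sketch below shows the standard three-point formula; `refineLag` is a hypothetical name and this is not a copy of Tonara's pYIN code.

```dart
/// Refines the integer lag [tau] of a local minimum in [d] by fitting a
/// parabola through d[tau-1], d[tau], d[tau+1]. The vertex offset is
/// 0.5 * (a - c) / (a - 2b + c). Falls back to [tau] at the array edges
/// or when the three points are collinear.
double refineLag(List<double> d, int tau) {
  if (tau <= 0 || tau >= d.length - 1) return tau.toDouble();
  final a = d[tau - 1], b = d[tau], c = d[tau + 1];
  final denom = a - 2 * b + c;
  if (denom == 0) return tau.toDouble();
  return tau + 0.5 * (a - c) / denom;
}
```

At 16 kHz, an integer lag of 40 samples corresponds to 400 Hz; if the true minimum sits at 40.3 samples, the refined estimate is about 397 Hz (16000 / 40.3), a difference of roughly 3 Hz that integer lags alone cannot resolve.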
