https://github.com/helloooideeeeea/realtimecutvadlibrary

A real-time Voice Activity Detection (VAD) library for iOS and macOS using Silero models powered by ONNX Runtime. Includes advanced noise suppression and audio preprocessing with WebRTC APM, supporting seamless WAV data output with header metadata.

Host: GitHub
URL: https://github.com/helloooideeeeea/realtimecutvadlibrary
Owner: helloooideeeeea
License: mit
Created: 2025-02-03T12:55:14.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-04-06T08:34:39.000Z (7 months ago)
Last Synced: 2025-04-12T08:59:16.778Z (7 months ago)
Topics: ios, macos, onnxruntime, silero-vad, vad, webrtc-audio-processing
Language: Swift
Homepage:
Size: 4.02 MB
Stars: 7
Watchers: 1
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# RealTime Silero VAD iOS/macOS Library

A real-time Voice Activity Detection (VAD) library for iOS and macOS using Silero models. It detects human voice in real time, so developers can build efficient voice-based features into their applications.

---

## Features

- **Real-time Voice Activity Detection (VAD)**

- **Supports Silero Model Versions v4 and v5**

- **Customizable audio sample rates**

- **Outputs WAV data with automatic sample rate conversion to 16 kHz**

- **iOS and macOS support**

- **Supports CocoaPods and Swift Package Manager (SPM)**

- **🆕 Real-time PCM stream callback (`voiceDidContinueWithPCMFloatData`)**

---

## Sample iOS App Demo

Check out the sample iOS app demonstrating real-time VAD:

[Sample iOS App Demo](https://github.com/user-attachments/assets/6e4d6ae5-4d34-4114-930b-f399bcf123ba)

---

## Installation

### Using CocoaPods

Add the following to your `Podfile` to integrate the library:

```ruby
pod 'RealTimeCutVADLibrary', '~> 1.0.9'
```

Then, run:

```bash
pod install
```

### Using Swift Package Manager (SPM)

You can also integrate the library using Swift Package Manager. Add the following to your `Package.swift` file:

```swift
dependencies: [
    .package(url: "https://github.com/helloooideeeeea/RealTimeCutVADLibrary.git", from: "1.0.9")
]
```

Or, add the URL directly in Xcode via **File > Swift Packages > Add Package Dependency** (newer Xcode versions label this **File > Add Packages…** or **File > Add Package Dependencies…**).
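For context, the `dependencies` entry above sits inside a full manifest. The following is a minimal sketch of such a `Package.swift`, not taken from this repository: the package and target name `MyApp` is hypothetical, and the product name `RealTimeCutVADLibrary` is an assumption based on the repository name, so check the library's own manifest before relying on it.

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    dependencies: [
        .package(url: "https://github.com/helloooideeeeea/RealTimeCutVADLibrary.git", from: "1.0.9")
    ],
    targets: [
        .target(
            name: "MyApp", // hypothetical target name
            dependencies: [
                // Assumption: the library product is named after the repository.
                .product(name: "RealTimeCutVADLibrary", package: "RealTimeCutVADLibrary")
            ]
        )
    ]
)
```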

---

## Usage

Import the library and set up VAD in your `ViewController`:

```swift
import AVFoundation
import UIKit
import RealTimeCutVADLibrary

// ViewController must also conform to the library's VAD delegate protocol
// (conformance not shown here) for `delegate = self` to compile.
class ViewController: UIViewController {

    var vadManager: VADWrapper?

    override func viewDidLoad() {
        super.viewDidLoad()

        // Initialize VAD Manager
        vadManager = VADWrapper()

        // Set VAD delegate to receive callbacks
        vadManager?.delegate = self

        // Set Silero model version (v4 or v5). Version v5 is recommended.
        vadManager?.setSileroModel(.v5)

        // Setting custom thresholds is optional. If not called, the recommended default values are used.
        // vadManager?.setThresholdWithVadStartDetectionProbability(0.7, vadEndDetectionProbability: 0.7, voiceStartVadTrueRatio: 0.5, voiceEndVadFalseRatio: 0.95, voiceStartFrameCount: 10, voiceEndFrameCount: 57)

        // Set audio sample rate (8, 16, 24, or 48 kHz)
        vadManager?.setSamplerate(.SAMPLERATE_48)
    }

    // Call this with each audio buffer captured from the microphone.
    func process(buffer: AVAudioPCMBuffer) {
        // Retrieve audio channel data from the buffer
        guard let channelData = buffer.floatChannelData else {
            return
        }

        // Extract frame length from the audio buffer
        let frameLength = UInt(buffer.frameLength)

        // Select the first channel as mono audio data
        let monoData = channelData[0] // UnsafeMutablePointer<Float>

        // Send the audio data directly to VAD processing
        vadManager?.processAudioData(withBuffer: monoData, count: frameLength)

        // ❌ Deprecated: do NOT copy the samples into an [NSNumber] array first.
        // That path is slow due to NSNumber conversion.
    }
}
```

The detection thresholds can be customized with `setThresholdWithVadStartDetectionProbability`:

```swift
vadManager?.setThresholdWithVadStartDetectionProbability(<#T##Float#>,
    vadEndDetectionProbability: <#T##Float#>,
    voiceStartVadTrueRatio: <#T##Float#>,
    voiceEndVadFalseRatio: <#T##Float#>,
    voiceStartFrameCount: <#T##Int32#>,
    voiceEndFrameCount: <#T##Int32#>)
```

By adjusting these parameters, you can fine-tune the strictness of voice segmentation to better suit your application needs.
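To build intuition for how a probability threshold, a ratio, and a frame count can work together, here is a minimal sketch in plain Swift. It is not the library's implementation: `VADThresholds`, `shouldStartVoice`, and `shouldEndVoice` are hypothetical names, the decision rule is an assumed reading of the parameter names, and the values are the ones from the commented-out call in the usage example.

```swift
/// Hypothetical mirror of the six threshold parameters; not a library type.
struct VADThresholds {
    var vadStartDetectionProbability: Float = 0.7
    var vadEndDetectionProbability: Float = 0.7
    var voiceStartVadTrueRatio: Float = 0.5
    var voiceEndVadFalseRatio: Float = 0.95
    var voiceStartFrameCount: Int = 10
    var voiceEndFrameCount: Int = 57
}

/// Assumed rule: voice starts when, among the last `voiceStartFrameCount`
/// probabilities, the share at or above the start probability reaches the ratio.
func shouldStartVoice(recent: [Float], t: VADThresholds) -> Bool {
    guard recent.count >= t.voiceStartFrameCount else { return false }
    let window = recent.suffix(t.voiceStartFrameCount)
    let hits = window.filter { $0 >= t.vadStartDetectionProbability }.count
    return Float(hits) / Float(window.count) >= t.voiceStartVadTrueRatio
}

/// Assumed rule: voice ends when, among the last `voiceEndFrameCount`
/// probabilities, the share below the end probability reaches the ratio.
func shouldEndVoice(recent: [Float], t: VADThresholds) -> Bool {
    guard recent.count >= t.voiceEndFrameCount else { return false }
    let window = recent.suffix(t.voiceEndFrameCount)
    let misses = window.filter { $0 < t.vadEndDetectionProbability }.count
    return Float(misses) / Float(window.count) >= t.voiceEndVadFalseRatio
}
```

Under this reading, raising a ratio or a frame count makes the corresponding transition stricter, while lowering a probability makes individual frames count as voice more easily.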

- **Silero v5 Performance**:

The performance of Silero model v5 may vary, and adjusting the thresholds might be necessary to achieve optimal results. There are also discussions on this topic, such as [this one](https://github.com/SYSTRAN/faster-whisper/issues/934#issuecomment-2439340290).

## Algorithm Explanation

### ONNX Runtime for Silero VAD

This library leverages **ONNX Runtime (C++)** to run the Silero VAD models efficiently. By utilizing ONNX Runtime, the library achieves high-performance inference across different platforms (iOS/macOS), ensuring fast and accurate voice activity detection.

### Why Use WebRTC's Audio Processing Module (APM)?

This library utilizes WebRTC's APM for several key reasons:

- **High-pass Filtering**: Removes low-frequency noise.

- **Noise Suppression**: Reduces background noise for clearer voice detection.

- **Gain Control**: Adaptive digital gain control enhances audio levels.

- **Sample Rate Conversion**: Silero VAD requires a sample rate of 16 kHz, so APM converts input at other sample rates (8, 24, or 48 kHz) to 16 kHz.
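WebRTC's APM processes audio in 10 ms frames, so the number of samples per frame depends on the input sample rate. The sketch below shows that arithmetic and one way to cut a buffer into complete frames. The function names are hypothetical, and whether this library splits its input exactly this way is an assumption.

```swift
/// Samples in one 10 ms frame at the given sample rate (Hz).
func samplesPer10ms(sampleRate: Int) -> Int {
    return sampleRate / 100
}

/// Splits `samples` into complete 10 ms frames and returns the leftover
/// samples, which a caller would carry over to the next incoming buffer.
func splitInto10msFrames(_ samples: [Float], sampleRate: Int) -> (frames: [[Float]], remainder: [Float]) {
    let size = samplesPer10ms(sampleRate: sampleRate)
    precondition(size > 0, "sample rate must be at least 100 Hz")

    var frames: [[Float]] = []
    var index = 0
    while index + size <= samples.count {
        frames.append(Array(samples[index..<index + size]))
        index += size
    }
    return (frames, Array(samples[index...]))
}
```

At the supported rates of 8, 16, 24, and 48 kHz, a 10 ms frame holds 80, 160, 240, and 480 samples respectively.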

### Audio Processing Workflow

1. **Input Audio Configuration**: The library supports sample rates of 8 kHz, 16 kHz, 24 kHz, and 48 kHz.

2. **Audio Preprocessing**:

   - The audio is split into chunks based on the sample rate.

   - APM processes these chunks with filters and gain adjustments.

   - Audio is converted to 16 kHz for Silero VAD compatibility.

3. **Voice Activity Detection**:

   - The processed audio chunks are passed to Silero VAD.

   - VAD outputs a probability score indicating voice activity.

4. **Algorithm for Voice Detection**:

   - **Voice Start Detection**: A pre-buffer retains recent audio frames so that, when the VAD probability exceeds the threshold, the onset of speech is captured.

   - **Voice End Detection**: Once silence is detected over a set number of frames, recording stops, and the audio is output as WAV data.

5. **Output**:

   - The resulting audio data is provided as WAV with a sample rate of 16 kHz.
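To illustrate what "WAV data with a header" means in step 5, here is a minimal sketch that wraps mono samples in a canonical 44-byte PCM WAV header at 16 kHz. It is not the library's code: `makeWAV` is a hypothetical helper, and the choice of 16-bit PCM is an assumption, since the bit depth the library emits is not stated here.

```swift
import Foundation

/// Wraps mono Float samples (nominally -1...1) in a 16-bit PCM WAV container.
func makeWAV(samples: [Float], sampleRate: UInt32 = 16_000) -> Data {
    let channels: UInt16 = 1
    let bitsPerSample: UInt16 = 16 // assumed bit depth
    let blockAlign = channels * bitsPerSample / 8
    let byteRate = sampleRate * UInt32(blockAlign)
    let dataSize = UInt32(samples.count) * UInt32(blockAlign)

    var data = Data()
    // WAV stores all integer fields in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: "RIFF".utf8)
    append(UInt32(36) + dataSize)        // remaining file size after this field
    data.append(contentsOf: "WAVE".utf8)
    data.append(contentsOf: "fmt ".utf8)
    append(UInt32(16))                   // size of the fmt chunk body
    append(UInt16(1))                    // audio format 1 = linear PCM
    append(channels)
    append(sampleRate)
    append(byteRate)
    append(blockAlign)
    append(bitsPerSample)
    data.append(contentsOf: "data".utf8)
    append(dataSize)

    for sample in samples {
        // Clamp to the valid range, then scale to a signed 16-bit integer.
        let clamped = max(-1, min(1, sample))
        append(Int16(clamped * 32767))
    }
    return data
}
```

The header fields are what let a player or a speech-to-text service interpret the raw samples without being told the sample rate or channel count separately.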

### WebRTC APM Configuration

The following configurations are applied to optimize voice detection:

```cpp
config.gain_controller1.enabled = true;
config.gain_controller1.mode = webrtc::AudioProcessing::Config::GainController1::kAdaptiveDigital;
config.gain_controller2.enabled = true;
config.high_pass_filter.enabled = true;
config.noise_suppression.enabled = true;
config.transient_suppression.enabled = true;
config.voice_detection.enabled = false;
```

---

## Additional Resources

- [RealTimeCutVADCXXLibrary](https://github.com/helloooideeeeea/RealTimeCutVADCXXLibrary)

---

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
