# NeuroVocal RNN-RBM: A Hybrid Deep Architecture for Temporal Sequence Modeling of Audio Spectrograms

## Abstract

This work presents a novel hybrid deep learning architecture that integrates **Recurrent Neural Networks (RNNs)** with **Restricted Boltzmann Machines (RBMs)** for unsupervised temporal modeling of audio spectrograms. The proposed RNN-RBM framework enables robust feature learning, temporal dependency modeling, and generative synthesis of vocalization patterns. By conditioning RBM parameters on temporal context through multi-layer gated RNNs, the model captures both short-term acoustic features and long-term structural patterns in audio data.

## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Mathematical Foundations](#mathematical-foundations)
3. [Implementation Details](#implementation-details)
4. [Experimental Setup](#experimental-setup)
5. [Results & Analysis](#results--analysis)
6. [Applications](#applications)
7. [Usage](#usage)
8. [Future Work](#future-work)

## Architecture Overview

### Hybrid RNN-RBM Framework

The core innovation lies in the tight coupling of temporal modeling (RNN) and generative modeling (RBM) components:

```
Input Spectrograms
→ [RNN Temporal Encoder]
→ Dynamic RBM Parameters
→ [Conditional RBM Decoder]
→ Generated Sequences
```

### Component Specifications

- **Spectrogram Input**: `(time_steps × freq_bins)` where `freq_bins = 129` (from 256-point FFT)
- **RBM Hidden Layers**: Scalable architecture with `n_hidden = 1.7 × freq_bins`
- **RNN Context Encoding**: Multi-layer gated units with `n_recurrent = 1.3 × freq_bins`
- **Parameter Conditioning**: Real-time adaptation of RBM biases based on temporal context
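
As a quick sanity check, the layer sizes above follow directly from the FFT resolution (illustrative arithmetic only):

```python
freq_bins = 256 // 2 + 1                    # 129 bins from a 256-point FFT
n_hidden = int(1.7 * freq_bins)             # 219 RBM hidden units
n_hidden_recurrent = int(1.3 * freq_bins)   # 167 recurrent units
```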

## Mathematical Foundations

### 1. Conditional RBM Formulation

Each frame $v^{(t)}$ is modeled by an RBM whose energy takes the standard bilinear form,

$$E\left(v^{(t)}, h^{(t)}\right) = -\,{v^{(t)}}^{\top} b_v^{(t)} - {h^{(t)}}^{\top} b_h^{(t)} - {v^{(t)}}^{\top} W\, h^{(t)},$$

with the weight matrix $W$ shared across time and the biases conditioned on the RNN context from the previous step:

$$b_v^{(t)} = b_v + W_{uv}\, u^{(t-1)}, \qquad b_h^{(t)} = b_h + W_{uh}\, u^{(t-1)}.$$

This is the standard RNN-RBM parameterization: temporal structure enters the otherwise static generative model only through the time-varying biases.
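
For intuition, here is a minimal NumPy sketch of one block-Gibbs step under this parameterization, assuming binary visible and hidden units (the repository itself uses Theano, and real-valued spectrogram frames would call for Gaussian visible units; `gibbs_step` and its argument names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, bv_t, bh_t, rng):
    """One v -> h -> v block-Gibbs step with time-dependent biases."""
    h_prob = sigmoid(v @ W + bh_t)                 # P(h = 1 | v)
    h = (rng.random(h_prob.shape) < h_prob) * 1.0  # sample hidden layer
    v_prob = sigmoid(h @ W.T + bv_t)               # P(v = 1 | h)
    v = (rng.random(v_prob.shape) < v_prob) * 1.0  # sample visible layer
    return v, h
```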

### 2. Multi-Layer Gated RNN Architecture

The temporal context `u_t` is computed through a three-layer hierarchical RNN:

#### Layer 1: Input Processing
```python
import theano.tensor as T

def build_rnn_layer_1(v_t, u1_tm1, params):
    """GRU-style update: frame v_t + previous state u1_tm1 -> new state u1_t."""
    (W_in_update, W_hidden_update, b_update,
     W_in_reset, W_hidden_reset, b_reset,
     W_in_hidden, W_reset_hidden, b_hidden) = params

    # Sigmoid keeps the gates in (0, 1), so the final update is a convex
    # combination of the previous state and the candidate state.
    update_gate = T.nnet.sigmoid(T.dot(v_t, W_in_update) + T.dot(u1_tm1, W_hidden_update) + b_update)
    reset_gate = T.nnet.sigmoid(T.dot(v_t, W_in_reset) + T.dot(u1_tm1, W_hidden_reset) + b_reset)

    # Candidate state; the reset gate controls how much past context leaks in.
    u1_t_temp = T.tanh(T.dot(v_t, W_in_hidden) + T.dot(u1_tm1 * reset_gate, W_reset_hidden) + b_hidden)
    u1_t = (1 - update_gate) * u1_t_temp + update_gate * u1_tm1
    return u1_t
```

#### Layers 2 & 3: Context Refinement
Similar gated mechanisms process the output of previous layers, enabling multi-scale temporal representation learning.
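
A hedged sketch of the stacking, reusing the layer-1 update from above (the parameter tuples and state names are illustrative):

```python
# Each layer applies the same gated update to the output of the layer below.
u1_t = build_rnn_layer_1(v_t,  u1_tm1, params_layer1)
u2_t = build_rnn_layer_1(u1_t, u2_tm1, params_layer2)
u3_t = build_rnn_layer_1(u2_t, u3_tm1, params_layer3)
u_t = u3_t  # final temporal context that conditions the RBM biases
```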

### 3. Training Objective

Training maximizes the log-likelihood of the observed sequence, i.e. minimizes

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P\left(v^{(t)} \mid v^{(1)}, \ldots, v^{(t-1)}\right),$$

where each conditional term is the likelihood of the frame under the RBM with time-dependent biases. The intractable likelihood gradient is approximated with $k$-step contrastive divergence ($k = 15$ during training; see the model configuration below).
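
Continuing the NumPy sketch from the conditional RBM section, the contrastive-divergence negative sample comes from running the Gibbs chain for $k$ steps starting at a data frame:

```python
def cd_k(v0, W, bv_t, bh_t, k, rng):
    """k-step Gibbs chain from the data frame v0 (illustrative sketch)."""
    v = v0
    for _ in range(k):                          # k = 15 during training
        v, h = gibbs_step(v, W, bv_t, bh_t, rng)
    return v  # negative sample for the CD gradient estimate
```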

## Implementation Details

### Spectrogram Preprocessing Pipeline (`parser.py`)

#### Adaptive Vocalization Detection
```python
import numpy as np
from scipy.signal import spectrogram

def parse_segments(data, rate, threshold=90, buffer=5, min_length=15, max_length=500):
    # Durations (buffer, min_length, max_length) are interpreted as
    # milliseconds and converted to samples (assumed convention).
    to_samples = lambda ms: int(ms * rate / 1000)
    window_length = to_samples(4)  # 4 ms smoothing window

    # 1. Signal conditioning: rectify, then smooth the amplitude envelope
    #    (linear_smooth is the module's moving-average smoothing helper).
    rectified = np.abs(data)
    smoothed = linear_smooth(rectified, window_length)

    # 2. Percentile-based thresholding
    threshold_value = np.percentile(smoothed, threshold)
    indices = smoothed >= threshold_value

    # 3. Locate contiguous above-threshold runs
    bounded = np.hstack(([0], indices, [0]))
    diffs = np.diff(bounded)
    run_starts = np.where(diffs > 0)[0]
    run_ends = np.where(diffs < 0)[0]

    # 4. Keep runs of valid duration and compute their spectrograms
    pad = to_samples(buffer)
    for start, end in zip(run_starts, run_ends):
        if to_samples(min_length) <= end - start <= to_samples(max_length):
            f, t, spec = spectrogram(data[max(start - pad, 0):end + pad],
                                     rate, nperseg=256, noverlap=128)
            yield f, t, spec
```
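
A hypothetical usage sketch (the filename is illustrative); each yielded spectrogram has 129 frequency bins, matching `n_visible`:

```python
from scipy.io import wavfile

rate, data = wavfile.read("calls.wav")   # hypothetical input recording
for f, t, spec in parse_segments(data, rate):
    print(spec.shape)                    # (129, time_steps) per segment
```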

#### Key Parameters
- **Window Length**: 4 ms, converted to samples
- **Minimum Spacing**: 2 ms between segments
- **Spectrogram**: 256-point FFT with 128-sample overlap
- **Frequency Range**: 0 Hz to the Nyquist frequency (`rate/2`)
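
These settings fix the model's input resolution; as plain arithmetic (not repo code):

```python
nperseg, noverlap = 256, 128
freq_bins = nperseg // 2 + 1   # 129, matching n_visible below
hop = nperseg - noverlap       # 128 samples between successive frames
# e.g. at rate = 44100 Hz the hop is 128 / 44100 ≈ 2.9 ms per frame
```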

### Advanced Training Techniques (`rbm.py`)

#### 1. Robust Optimization
```python
import theano.tensor as T

# Gradient conditioning with NaN/Inf protection: non-finite entries are
# replaced by a small multiple of the parameter instead of propagating.
not_finite = T.or_(T.isnan(gradient), T.isinf(gradient))
gradient = T.switch(not_finite, 0.1 * param, gradient)

# RMSProp: keep a running average of squared gradients and scale the
# step by its root (the 1e-6 term guards against division by zero).
accu_new = 0.9 * accu + 0.1 * gradient ** 2
param_update = lr * gradient / T.sqrt(accu_new + 1e-6)
```
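
Falling back to `0.1 * param` rather than zero means a transient numerical blow-up produces a mild weight-decay-like step on the affected parameters (assuming the update is subtracted in the usual gradient-descent sense) instead of stalling the update entirely.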

#### 2. Multi-phase Learning Schedule
Training follows a staged, coarse-to-fine learning-rate schedule (a driver sketch follows the list):
- Phase 1: `lr = 3e-4` - Rapid feature acquisition
- Phase 2: `lr = 1e-4` - Refinement learning
- Phase 3: `lr = 5e-5` to `1e-5` - Fine-tuning
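
A hedged sketch of driving such a schedule (the `train_step` callback and `epochs_per_phase` value are illustrative, not repo API):

```python
def run_schedule(train_step, learning_rates=(3e-4, 1e-4, 5e-5, 3e-5, 1e-5),
                 epochs_per_phase=50):
    """Run one training phase per learning rate, coarse to fine."""
    for lr in learning_rates:
        for _ in range(epochs_per_phase):
            train_step(lr)  # one epoch at the current rate
```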

#### 3. Regularization Strategy
- **L1 Regularization**: `λ₁ = 1e-4` for feature selection (see the sketch below)
- **L2 Regularization**: disabled by default (`λ₂ = None`, configurable)
- **Dropout**: implicit through stochastic hidden units
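
A minimal Theano sketch of how the L1 term enters the cost (`W` and `cd_cost` are illustrative names for the RBM weights and the contrastive-divergence objective):

```python
import theano.tensor as T

l1_lambda = 1e-4
l1_cost = l1_lambda * T.sum(T.abs_(W))  # sparsity-inducing penalty on the weights
cost = cd_cost + l1_cost                # total objective that gets differentiated
```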

## Experimental Setup

### Dataset Specifications
- **Input Format**: Raw WAV files with variable sampling rates
- **Preprocessing**: Automatic segmentation, normalization, spectrogram computation
- **Training/Validation**: Temporal cross-validation within sequences

### Model Configuration
```python
model_config = {
    'n_visible': 129,            # fixed by spectrogram resolution (256-pt FFT)
    'n_hidden': 219,             # 1.7 × n_visible
    'n_hidden_recurrent': 167,   # 1.3 × n_visible
    'learning_rates': [3e-4, 1e-4, 5e-5, 3e-5, 1e-5],
    'batch_size': 20,
    'gibbs_steps': {
        'training': 15,
        'generation': 20,
    },
}
```

### Evaluation Metrics
1. **Training Convergence**: Negative log-likelihood bounds
2. **Generation Quality**: Visual inspection of spectrogram coherence
3. **Temporal Consistency**: Long-range dependency modeling
4. **Feature Learning**: Hidden unit activation patterns

## Results & Analysis

### Training Behavior
- **Stable Convergence**: Protected gradients prevent training divergence
- **Multi-timescale Learning**: RNN captures both frame-level and sequence-level patterns
- **Regularization Efficacy**: L1 norm promotes sparse, interpretable features

### Generation Capabilities
- **Temporal Coherence**: Generated sequences maintain structural consistency beyond training length
- **Multi-scale Patterns**: Captures both fine-grained spectral features and broader temporal contours
- **Mode Coverage**: Diverse outputs via temperature-scaled sampling of the stochastic units

## Applications

### 1. Bioacoustic Research
- **Animal Vocalization Analysis**: Unsupervised discovery of call types and sequences
- **Species Identification**: Learning distinctive acoustic signatures
- **Behavioral Studies**: Temporal pattern analysis in communication

### 2. Speech Technology
- **Unsupervised Phoneme Learning**: Discovering speech units from raw audio
- **Prosody Modeling**: Capturing rhythm and intonation patterns
- **Pathological Speech Analysis**: Identifying atypical temporal patterns

### 3. Audio Synthesis
- **Generative Sound Design**: Creating novel audio textures and sequences
- **Music Information Retrieval**: Learning musical structure and style
- **Voice Conversion**: Modeling speaker characteristics

### 4. Neuroscience Applications
- **Neural Coding**: Modeling temporal dependencies in neural recordings
- **Sensory Processing**: Understanding hierarchical feature extraction
- **Motor Sequence Generation**: Modeling complex temporal behaviors

## Usage

### Customization Guide
```python
# For custom datasets, modify:
config = {
    'threshold': 85,          # detection sensitivity (percentile)
    'min_length': 10,         # minimum segment length (ms)
    'max_length': 1000,       # maximum segment length (ms)
    'hidden_scalar': 2.0,     # RBM hidden unit scaling
    'recurrent_scalar': 1.5,  # RNN hidden unit scaling
}
```
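
A hedged sketch of wiring these values into the components described earlier (129 is the fixed spectrogram bin count):

```python
segments = parse_segments(data, rate,
                          threshold=config['threshold'],
                          min_length=config['min_length'],
                          max_length=config['max_length'])

n_visible = 129
n_hidden = int(config['hidden_scalar'] * n_visible)               # 258
n_hidden_recurrent = int(config['recurrent_scalar'] * n_visible)  # 193
```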

## Future Work

### Architectural Extensions
1. **Attention Mechanisms**: Content-based temporal focusing
2. **Hierarchical RNNs**: Multi-resolution temporal processing
3. **Variational Extensions**: Explicit latent variable modeling

### Algorithmic Improvements
1. **Advanced Sampling**: Parallel tempering for better mixing
2. **Structured Regularization**: Temporal smoothness constraints
3. **Multi-modal Learning**: Joint audio-text representation learning

### Applications Development
1. **Real-time Synthesis**: Streaming audio generation
2. **Transfer Learning**: Pre-trained models for new domains
3. **Interpretability Tools**: Visualization of learned features

---


## ✨ Author

**Saad Abdur Razzaq**  
Machine Learning Engineer | Effixly AI

LinkedIn · Email · Website · GitHub
---