An open API service indexing awesome lists of open source software.

https://github.com/Saganaki22/ComfyUI-AudioSR

ComfyUI node for AudioSR - Versatile Audio Super Resolution upscales audio to 48kHz using latent diffusion
https://github.com/Saganaki22/ComfyUI-AudioSR

audio audio-processing comfyui comfyui-custom-node comfyui-nodes spectrogram upscale upscale-mp3

Last synced: 5 days ago
JSON representation

ComfyUI node for AudioSR - Versatile Audio Super Resolution upscales audio to 48kHz using latent diffusion

Awesome Lists containing this project

README

          

ComfyUI AudioSR (Versatile Audio Super Resolution)













Native ComfyUI node for **AudioSR (Versatile Audio Super Resolution)** - Upscale any audio to 48kHz using state-of-the-art latent diffusion.

**Based on the original AudioSR implementation by [Haohe Liu](https://github.com/haoheliu) et al.**

## 🎯 Key Features

- **🎧 Audio Super Resolution**: Upsample low-quality audio to 48kHz with enhanced high frequencies
- **🎛️ Native ComfyUI Integration**: Works seamlessly with Load Audio, Preview Audio, and Save Audio nodes
- **📊 Built-in Spectrogram Visualization**: Before/after comparison with time and frequency axes
- **🔄 Automatic Sample Rate Handling**: Accepts any input sample rate (8kHz - 48kHz)
- **🧩 Stereo Support**: Processes both mono and stereo audio with independent channel handling
- **📏 Long Audio Support**: Smart chunking with overlap for unlimited audio length
- **⚡ Model Caching**: Model stays in memory for fast subsequent generations
- **🚀 torch.compile Optimization**: Optional PyTorch compilation for 20-30% speed boost (FP32 models)
- **🎛️ VRAM Management**: Optional model unloading to free GPU memory between runs
- **⏸️ Interruptible**: Cancel processing mid-run through ComfyUI's interrupt button
- **📈 Progress Reporting**: Real-time progress bar shows chunk processing status

---

## ⚡ Performance Optimizations

### Attention Backends

The node supports multiple attention computation backends for optimal performance:

| Backend | Speed | Requirements | Best For |
|---------|-------|--------------|----------|
| **sdpa** | Fast | PyTorch 2.0+ | Default, most compatible |
| **sageattn** | Fastest | fp16/bf16 dtype, `pip install sageattention` | RTX 30/40 series, maximum speed |
| **eager** | Slowest | None | Debugging, maximum compatibility |

**SageAttention**: GPU-optimized attention kernels with automatic architecture detection (SM80+ required). Falls back gracefully if not installed or incompatible.

### Compute Dtypes

| Dtype | VRAM | Speed | Notes |
|-------|------|-------|-------|
| **fp32** | Higher | Baseline | Default, most accurate |
| **fp16** | Lower | Faster | Requires GPU with good FP16 support |
| **bf16** | Lower | Fastest | Best on RTX 30/40 series (Ampere+) |

**Note**: SageAttention requires fp16 or bf16 dtype. Selecting fp32 with SageAttention auto-falls back to sdpa.

### torch.compile Mode

The node includes an optional `use_torch_compile` toggle that applies PyTorch's `torch.compile()` optimization to the model for faster inference.

**Speed Boost**: After an initial compilation overhead (~10-30 seconds on first run), you'll see:
- **20-30% faster** inference for subsequent generations
- Best performance with **FP32 models** (recommended for torch.compile)
- Grows more valuable with repeated processing (cached compiled model)

**When to Use**:
- Processing multiple audio files in a session
- Longer audio requiring multiple chunks
- When speed is critical and you can wait for the initial compilation
- **Recommended**: Enable if using FP32 models and processing multiple clips

**Caveats**:
- Only works with FP32 models (will skip compilation for FP16/FP8 models)
- First generation takes longer due to compilation overhead
- Not recommended for one-off processing

### Optimized Defaults

The node has been tuned with performance-focused default values:

| Parameter | Default | Previous | Performance Impact |
|-----------|---------|----------|-------------------|
| **chunk_size** | 15.0s | 5.12s | Fewer chunks = ~60% faster for long audio |
| **overlap** | 0.0s | 0.04s | No overlap = faster (smoother audio with 2.0-3.0s overlap) |
| **attention_backend** | sdpa | sdpa | PyTorch native attention (fastest available) |

**Optimizations Applied**:
- Removed unnecessary tensor conversions (torch.from_numpy→numpy)
- Smart model caching with automatic recompilation detection
- Improved dtype detection for quantized models (FP8 → FP16 conversion)
- Safe division handling to prevent numerical errors
- Memory-efficient overlap processing with optional crossfade

### Security Improvements

- `weights_only=True` for safe model loading (prevents arbitrary code execution)
- Validates tensor dtypes before model conversion

---

## Requirements

All Python dependencies are installed automatically. No external tools required.

**Minimum: 6GB VRAM, 12GB RAM recommended**

---

image


ComfyUI_temp_bildo_00002_

## Audio Examples

| Original | AudioSR |
|----------|----------|
| [speech_up_4.wav](https://huggingface.co/drbaph/AudioSR/resolve/main/samples/speech_up_4.wav) | [speech_audiosr_4.wav](https://huggingface.co/drbaph/AudioSR/resolve/main/samples/speech_audiosr_4.wav) |
| [event_up_2.wav](https://huggingface.co/drbaph/AudioSR/resolve/main/samples/event_up_2.wav) | [event_audiosr_2.wav](https://huggingface.co/drbaph/AudioSR/resolve/main/samples/event_audiosr_2.wav) |

---

## 📦 Installation

### Method 1: ComfyUI Manager (Recommended)

1. Open ComfyUI Manager
2. Search for **"AudioSR"**
3. Click **Install**
4. Restart ComfyUI

That's it! All dependencies are installed automatically.

### Method 2: Manual Installation

**Standard Python:**
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR.git
cd ComfyUI-AudioSR
pip install -r requirements.txt
```

**ComfyUI Portable (Windows with embedded Python):**
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR
cd ComfyUI-AudioSR
..\python_embeded\python.exe -s -m pip install -r requirements.txt
```

### 📥 Download Models

**Important:** Models must be placed in `ComfyUI/models/AudioSR/`

**Download from HuggingFace:**
https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR

Download one or both models and place them in your ComfyUI models directory:

**Required Folder Structure:**
```
ComfyUI/
└── models/
└── AudioSR/
├── audiosr_basic_fp32.safetensors (for general audio)
└── audiosr_speech_fp32.safetensors (for voice content)
```

**Available Models (FP32):**
- `audiosr_basic_fp32.safetensors` - General purpose (music, sound effects, etc.)
- `audiosr_speech_fp32.safetensors` - Optimized for voice/speech

---

## 💻 VRAM Requirements

| Configuration | VRAM Usage | Recommended For |
|---------------|------------|-----------------|
| Standard | ~6GB | RTX 3060+ (6GB+) |
| With unload_model enabled | ~0.5GB (when idle) | Systems with limited VRAM |

**Minimum: 6GB VRAM required, 8GB+ recommended**

---

## 🎚️ AudioSR Node

📖 Click to expand: Overview & Parameters

### Overview

Upscale audio to 48kHz using the AudioSR latent diffusion model. The model analyzes low-quality audio and generates enhanced high-frequency details for a cleaner, fuller sound.

**What it does:**
- Resamples audio to 48kHz (if needed)
- Enhances high frequencies and adds clarity
- Reduces artifacts from low-bitrate compression
- Works on any audio: music, speech, sound effects

**Use Cases:**
- Upsample old/low-quality audio recordings
- Enhance compressed audio (MP3, low-bitrate streams)
- Improve audio for video production
- Restore archived audio content

### Parameters

#### Required Inputs

| Parameter | Type | Range | Default | Description |
|-----------|------|-------|---------|-------------|
| **audio** | AUDIO | - | - | Audio input from Load Audio node |
| **ddim_steps** | INT | 10-500 | 50 | Number of denoising steps (higher = better quality, slower) |
| **guidance_scale** | FLOAT | 1.0-20.0 | 3.5 | CFG scale - higher = more faithful to input |
| **seed** | INT | 0-4.29B | 0 | Random seed (0 = random) |

#### Optional Inputs

| Parameter | Type | Range | Default | Description |
|-----------|------|-------|---------|-------------|
| **model** | COMBO | - | basic | Model file from `ComfyUI/models/AudioSR/` (supports .bin, .safetensors) |
| **chunk_size** | FLOAT | 2.56-30.0 | 15.0 | Chunk duration in seconds (for audio >10.24s) |
| **overlap** | FLOAT | 0.0-5.0 | 0.0 | Overlap in seconds between chunks (helps smooth transitions, 2.0-3.0 recommended for long audio) |
| **unload_model** | BOOLEAN | - | False | Free VRAM after generation (slower next run) |
| **show_spectrogram** | BOOLEAN | - | True | Generate before/after spectrogram image |
| **attention_backend** | COMBO | - | sdpa | Attention: sdpa (fast), sageattn (fastest, requires fp16/bf16), eager (compatible) |
| **dtype** | COMBO | - | fp32 | Compute dtype: fp32 (default), fp16 (faster), bf16 (RTX 30/40 series) |
| **use_torch_compile** | BOOLEAN | - | False | Use torch.compile() for 20-30% speed boost (FP32 only, requires warmup) |

### Outputs

| Output | Type | Description |
|--------|------|-------------|
| **audio** | AUDIO | Upscaled audio at 48kHz (connect to Preview/Save) |
| **spectrogram** | IMAGE | Before/after spectrogram comparison (optional) |

---

## 🎨 Workflow Examples

📖 Click to expand: Workflow Examples

### Basic Audio Upscaling

```
Load Audio → AudioSR → Preview Audio / Save Audio
```

1. Add **Load Audio** node and select your audio file
2. Add **AudioSR** node
3. Connect audio output to AudioSR input
4. Set `ddim_steps: 50` (default)
5. Set `guidance_scale: 3.5` (default)
6. Connect AudioSR audio output to **Preview Audio** or **Save Audio**
7. Queue and generate!

### High Quality Upscaling

```
Settings:
- ddim_steps: 100
- guidance_scale: 5.0
- model: speech (for voice content)
- show_spectrogram: True
```

### Low VRAM Mode

```
Settings:
- unload_model: True
- ddim_steps: 50 (default)
```

This frees VRAM after each generation but makes subsequent runs slower.

### Long Audio Processing

For audio longer than ~10 seconds, the node automatically:
- Splits audio into chunks
- Processes each chunk
- Crossfades overlap regions
- Stitches into seamless output

**Defaults from main repo** (recommended):
```
- chunk_size: 15.0 (seconds per chunk)
- overlap: 2.0 (seconds overlap between chunks)
```

For faster processing with more VRAM:
```
- chunk_size: 20.0-30.0 (fewer chunks = faster)
- overlap: 2.0-3.0 (smoother transitions)
```

---

## ⚙️ Parameter Guide

📖 Click to expand: Detailed Parameter Guide

### ddim_steps (10-500)

Number of denoising steps during generation.

| Value | Quality | Speed | Use Case |
|-------|---------|-------|----------|
| 10-30 | Lower | Fast | Quick previews |
| 50 | Good | Medium | **Default recommendation** |
| 100 | Better | Slow | High-quality output |
| 200+ | Best | Very Slow | Maximum quality |

### guidance_scale (1.0-20.0)

Classifier-free guidance scale. Controls how closely the output follows the input.

| Value | Effect |
|-------|--------|
| 1.0-2.0 | More creative/variant |
| 3.0-4.0 | **Balanced** (default) |
| 5.0-8.0 | More faithful to input |
| 10.0+ | Very conservative (may sound artificial) |

### chunk_size (2.56-30.0)

Chunk duration in seconds for long audio processing.

| Value | Effect |
|-------|--------|
| 2.56-5.0 | More chunks, slower, less memory per chunk |
| 15.0 | **Default** (from main repo, balanced) |
| 20.0-30.0 | Fewer chunks, faster, more memory per chunk |

### overlap (0.0-5.0)

Overlap duration in seconds between chunks.

| Value | Effect |
|-------|--------|
| 0.0 | No overlap (may have seams) |
| 2.0 | **Recommended** (smoother transitions) |
| 3.0-5.0 | Smoothest stitching, slower processing |

### use_torch_compile (BOOLEAN)

Enable PyTorch's torch.compile() optimization for faster inference.

| Value | Effect |
|-------|--------|
| False | **Default** - Standard inference |
| True | 20-30% faster after compilation (FP32 models only) |

**Notes**:
- First run takes ~10-30s longer for compilation
- Only effective with FP32 models (recommended: `audiosr_basic_fp32.safetensors`, `audiosr_speech_fp32.safetensors`)
- Best for batch processing or repeated use
- Model auto-recompiles if toggle state changes

---

## 🐛 Troubleshooting

📖 Click to expand: Troubleshooting

### Model Not Found Error

**Symptom**: "Model not found. Please download the AudioSR model..."

**Solution**:
1. Download model from: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR
2. Place in `ComfyUI/models/AudioSR/`
3. Restart ComfyUI

### CUDA Out of Memory

**Symptom**: RuntimeError: CUDA out of memory

**Solutions**:
1. Enable `unload_model: True`
2. Close other GPU-intensive applications
3. Reduce `chunk_size` (try 2.56)
4. Use a smaller audio file

### Poor Audio Quality

**Symptom**: Output sounds distorted or artificial

**Solutions**:
1. Lower `guidance_scale` (try 2.5-3.0)
2. Reduce `ddim_steps` (too high can sound artificial)
3. Try the other model variant (basic ↔ speech)
4. Ensure input audio is reasonably clean

### No Output Audio

**Symptom**: Node completes but no audio is produced

**Solutions**:
1. Check ComfyUI console for error messages
2. Verify audio input is properly connected
3. Ensure output is connected to Preview/Save Audio node
4. Try a different audio file

### Spectrogram Not Showing

**Symptom**: Spectrogram output is empty or shows noise

**Solutions**:
1. Ensure `show_spectrogram: True`
2. Install matplotlib: `pip install matplotlib`
3. Check ComfyUI console for errors

### Slow Processing

**Symptom**: Node takes very long to process

**Solutions**:
1. Disable `unload_model` (keeps model cached)
2. Enable `use_torch_compile` (20-30% faster after warmup, FP32 models only)
3. Increase `chunk_size` (fewer chunks = faster)
4. Ensure GPU is being used (not CPU)
5. Try `ddim_steps: 30` for faster processing

### torch.compile Not Working

**Symptom**: `use_torch_compile` enabled but no speedup or message says compilation failed

**Solutions**:
1. Ensure using FP32 model (`*_fp32.safetensors`)
2. Check PyTorch version (2.0+ required for torch.compile)
3. First run always includes compilation overhead - try second run
4. Check console for error messages (some operations not supported)

### SageAttention Not Working

**Symptom**: `sageattn` selected but falling back to sdpa

**Solutions**:
1. Install SageAttention: `pip install sageattention`
2. Ensure `dtype` is set to `fp16` or `bf16` (fp32 auto-falls back)
3. Check GPU architecture (requires SM80+ = RTX 30 series or newer)
4. Check console message for specific fallback reason

### Model Reloaded Repeatedly

**Symptom**: Model recompiles every time, defeating performance gains

**Causes**:
- Switching `use_torch_compile` on/off forces model reload
- Switching between FP32 and FP16 models
- ComfyUI restart (model cache cleared)

### Volume Drops at Chunk Boundaries

**Symptom**: Periodic volume drops or glitches in long audio (every 10-30 seconds depending on chunk_size)

**Causes**:
- The model internally pads audio to 5.12s multiples, causing output length to differ from input
- Improper chunk positioning when stitching outputs together

**Solutions** (v1.0.6+):
- This issue is now fixed automatically
- If using older version, update to v1.0.6 or later
- For smoothest transitions, use `overlap: 2.0-3.0` when processing long audio

---

## 📊 Spectrogram Visualization

📖 Click to expand: Spectrogram Details

The node generates a side-by-side spectrogram comparison when `show_spectrogram: True`:

**Top panel**: Input audio (before) - Shows limited high frequencies
**Bottom panel**: Output audio (after) - Shows enhanced frequency content

The spectrogram uses the **magma** colormap:
- **Purple/Black**: Low energy (silence/quiet)
- **Red/Orange**: Medium energy
- **Yellow**: High energy (loud frequencies)

**Axes**:
- **X-axis**: Time in seconds
- **Y-axis**: Frequency in Hz (0-24kHz visible range)

---

## 🔗 Links

- [AudioSR Paper (arXiv)](https://arxiv.org/abs/2309.07314)
- [Project Page](https://audioldm.github.io/audiosr/)
- [Original Repository](https://github.com/haoheliu/versatile_audio_super_resolution)
- [Models (HuggingFace)](https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR)
- [ComfyUI](https://github.com/comfyanonymous/ComfyUI)

---

## 💡 Tips for Best Results

- **Input Quality**: The model can't create what isn't there - extremely low-quality audio may still sound artificial
- **Guidance Scale**: Start at 3.5 and adjust based on results
- **Steps**: 50 steps is usually sufficient; 100+ for critical applications
- **Speech Audio**: Use the `audiosr_speech_fp32.safetensors` model for voice content
- **Music/General**: Use the `audiosr_basic_fp32.safetensors` model for music and sound effects
- **Long Audio**: Let the auto-chunking handle files >10 seconds
- **VRAM**: Enable `unload_model` if you need GPU memory for other tasks
- **Speed Optimization**: Enable `use_torch_compile` when processing multiple files or long audio (FP32 models only)
- **Maximum Speed**: Use `attention_backend: sageattn` + `dtype: bf16` on RTX 30/40 series (install: `pip install sageattention`)
- **Smooth Transitions**: Use `overlap: 2.0` for seamless chunk stitching on long audio
- **Chunk Size**: Default 15s is optimized; increase to 20-30s for fewer chunks if VRAM allows

---

## 📚 Credits & License

**Original Research**: [AudioSR: Versatile Audio Super-Resolution](https://arxiv.org/abs/2309.07314) by [Haohe Liu](https://github.com/haoheliu) et al.

**Original Implementation**: [versatile_audio_super_resolution](https://github.com/haoheliu/versatile_audio_super_resolution) by Haohe Liu

**ComfyUI Integration**: This custom node implementation

**License**: MIT (same as original AudioSR project)

---

## 📝 Changelog

### Version 1.1.1

- ✅ **Fixed tensor dimension mismatch**: Small chunks (<5.12s) now get padded to minimum size before processing, then trimmed to original length. Fixes "Expected size 64 but got size 63" errors on audio with uneven chunk splits.

### Version 1.1.0

- ✅ **SageAttention support**: Added `sageattn` attention backend with GPU-arch auto-detection (SM80+). Falls back to sdpa if not installed or incompatible.
- ✅ **Dtype selector**: New `dtype` dropdown (fp32/fp16/bf16) to control compute precision. SageAttention auto-falls back to sdpa when fp32 is selected.
- ✅ Added `.gitignore` for `__pycache__` and common Python artifacts

### Version 1.0.6

- ✅ **Fixed chunk positioning bug**: Model internally pads audio to 5.12s multiples, causing output length to differ from input. Fixed by positioning chunks based on INPUT boundaries rather than model output length. This eliminates volume drops/glitches at chunk boundaries (30s, 60s, etc.) in long audio files.
- ✅ **Fixed overlap-add normalization**: Improved chunk stitching with proper weight tracking for overlap regions. Ensures consistent amplitude across chunk boundaries when using `overlap > 0`.
- ✅ Improved handling of stereo audio processing with independent channel reconstruction.

### Version 1.0.5

- ✅ Model directory and file scanning now cached (no repeated folder scans on browser refresh/reload)
- ✅ Removed verbose console messages on tab refresh ("Checking path", "Found model", etc.)
- ✅ Improved startup performance by avoiding redundant file system operations

### Version 1.0.4

- ✅ Minor fixes and improvements

### Version 1.0.3

- ✅ Added `use_torch_compile` toggle for 20-30% speed boost (FP32 models)
- ✅ Optimized default `chunk_size` to 15.0s (was 5.12s) for faster long audio processing
- ✅ Optimized default `overlap` to 0.0s (was 0.04s) with configurable 2-3s recommended for smooth transitions
- ✅ Removed unnecessary tensor conversions for faster inference
- ✅ Improved dtype detection for quantized model support (FP8 → FP16 auto-conversion)
- ✅ Security fix: Added `weights_only=True` for safe model loading
- ✅ Smart model caching with automatic recompilation detection
- ✅ Safe division handling in overlap normalization

### Version 1.0.2

- ✅ Added HuggingFace model link for easy access
- ✅ Improved documentation and examples

### Version 1.0.1

- ✅ Native ComfyUI AUDIO type support
- ✅ Automatic sample rate conversion (any input rate → 48kHz)
- ✅ Stereo audio processing
- ✅ Longer audio support with smart chunking
- ✅ Before/after spectrogram visualization
- ✅ Progress reporting and interrupt support
- ✅ Model caching and optional VRAM unloading
- ✅ Time and frequency axes on spectrograms