WaveSpeed Caching

Overview

WaveSpeed is the project's caching-oriented optimization layer for reusing work across denoising steps. In the current codebase, the integrated path is DeepCache for UNet-based models, and the repository also contains groundwork for a Flux-oriented First Block Cache path.

LightDiffusion-Next contains two WaveSpeed-related implementations:

  1. DeepCache: integrated for UNet-based models (SD1.5, SDXL)
  2. First Block Cache (FBCache): Flux-oriented cache machinery present in the codebase

Both are training-free. DeepCache is the user-facing path today; First Block Cache is codebase groundwork for a more specialized transformer caching path.

How It Works

Core Insight

Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:

  • High-level features (semantic structure, composition) change slowly across steps
  • Low-level features (fine details, textures) require frequent updates

WaveSpeed aims to reduce repeated computation across nearby denoising steps by reusing information from earlier steps where practical.

DeepCache (UNet Models)

DeepCache is the integrated WaveSpeed path for UNet models.

Cache step (every N steps):

  1. Run the full denoiser path
  2. Store the output for later reuse

Reuse step (intermediate steps):

  1. Reuse the cached denoiser output
  2. Skip the full model recomputation for that step
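The cache/reuse pattern above can be sketched as a simple schedule. This is a minimal illustration that assumes a step is a cache step whenever its index is a multiple of the interval, as in the simplified pseudocode later in this document; `deepcache_schedule` is a hypothetical helper, not a function from the repository.

```python
def deepcache_schedule(num_steps: int, interval: int) -> list[str]:
    """Label each denoising step as a full 'cache' step or a 'reuse' step."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    # Step 0 is always a cache step, so a reuse step never sees an empty cache.
    return ["cache" if step % interval == 0 else "reuse" for step in range(num_steps)]


schedule = deepcache_schedule(num_steps=8, interval=3)
print(schedule)
# ['cache', 'reuse', 'reuse', 'cache', 'reuse', 'reuse', 'cache', 'reuse']
print(schedule.count("cache"), "full passes out of", len(schedule))
```

With interval=1 every step is a cache step, which is why that setting gives no speedup.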

Speedup: ~50-70% time saved per reuse step → 2-3x total speedup with interval=3
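To relate the per-step saving to the total speedup, here is a back-of-the-envelope model. It assumes one full-cost step per window of `interval` steps, with every other step costing `1 - reuse_saving` of a full step. This is an idealized assumption, not a measurement, and `ideal_speedup` is a hypothetical helper.

```python
def ideal_speedup(interval: int, reuse_saving: float) -> float:
    """Idealized speedup: 1 full step per window, the rest at reduced cost."""
    if interval < 1 or not 0.0 <= reuse_saving <= 1.0:
        raise ValueError("need interval >= 1 and 0 <= reuse_saving <= 1")
    # Cost of one window, measured in units of a full denoiser pass.
    cost_per_window = 1.0 + (interval - 1) * (1.0 - reuse_saving)
    return interval / cost_per_window


for saving in (0.5, 0.7, 1.0):
    print(f"reuse saving {saving:.0%}: {ideal_speedup(3, saving):.2f}x at interval=3")
```

Under this simplified model, a 50-70% saving per reuse step gives roughly 1.5x to 1.9x at interval=3. Reaching 2-3x would need reuse steps to save about 75-100% of a full pass, so treat the quoted figures as approximate and measure on your own hardware.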

First Block Cache (Flux Models)

Flux uses Transformer blocks instead of UNet convolutions. The repository includes a First Block Cache implementation for this architecture family:

┌─────────────────────────────────────────┐
│ First Transformer Block (always run)    │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar)    │ ← FBCache caching zone
└─────────────────────────────────────────┘

Cache decision logic:

  1. Run first Transformer block
  2. Compare output to previous step's output
  3. If difference < threshold: reuse cached remaining blocks
  4. If difference ≥ threshold: run all blocks and update cache
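The decision rule can be written compactly. The symbols here are introduced for illustration only: $f_t$ is the first-block output at step $t$ and $\tau$ is the threshold. The normalization follows the residual comparison pseudocode in the Technical Details section, so read it as a sketch of that pseudocode rather than a specification of the repository code.

```latex
d_t = \frac{\operatorname{mean}\left(\lvert f_t - f_{t-1} \rvert\right)}
           {\operatorname{mean}\left(\lvert f_t \rvert\right)},
\qquad
\begin{cases}
d_t < \tau & \text{reuse the cached remaining blocks} \\
d_t \ge \tau & \text{run all blocks and update the cache}
\end{cases}
```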

In the current project structure, this cache path is implementation groundwork rather than a standard generation toggle like DeepCache.

DeepCache Configuration

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_interval | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| cache_depth | int | 2 | UNet depth for caching (0-12, higher = more aggressive) |
| start_step | int | 0 | Timestep to start caching (0-1000) |
| end_step | int | 1000 | Timestep to stop caching (0-1000) |

Streamlit UI

Enable in the ⚡ DeepCache Acceleration expander:

  1. Check Enable DeepCache
  2. Adjust sliders:
    • Cache Interval: 1-10 (default: 3)
    • Cache Depth: 0-12 (default: 2)
    • Start/End Steps: 0-1000 (default: 0/1000)
  3. Generate images; caching applies transparently
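The defaults and slider ranges listed above can be checked in code before a job is submitted. This is a minimal sketch: `DeepCacheSettings` is a hypothetical class, not part of the repository, and the requirement that `start_step` not exceed `end_step` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepCacheSettings:
    """DeepCache parameters with the defaults listed in this document."""

    cache_interval: int = 3
    cache_depth: int = 2
    start_step: int = 0
    end_step: int = 1000

    def __post_init__(self) -> None:
        # Ranges mirror the UI sliders described above.
        if not 1 <= self.cache_interval <= 10:
            raise ValueError("cache_interval must be in 1-10")
        if not 0 <= self.cache_depth <= 12:
            raise ValueError("cache_depth must be in 0-12")
        # Assumption: the caching window must not be inverted.
        if not 0 <= self.start_step <= self.end_step <= 1000:
            raise ValueError("need 0 <= start_step <= end_step <= 1000")


print(DeepCacheSettings())                      # defaults
print(DeepCacheSettings(cache_interval=5, cache_depth=3))
```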

REST API

curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "a misty forest at twilight",
        "width": 768,
        "height": 512,
        "deepcache_enabled": true,
        "deepcache_interval": 3,
        "deepcache_depth": 2
      }'
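The same request can be built from Python with the standard library. The sketch below reuses only the endpoint and fields shown in the curl example. `build_generate_request` is a hypothetical helper, and the response format is not documented here, so the send step is left commented out.

```python
import json
import urllib.request

API_URL = "http://localhost:7861/api/generate"  # same endpoint as the curl example


def build_generate_request(prompt: str, width: int = 768, height: int = 512,
                           interval: int = 3, depth: int = 2) -> urllib.request.Request:
    """Build (but do not send) a POST request with DeepCache enabled."""
    payload = {
        "prompt": prompt,
        "width": width,
        "height": height,
        "deepcache_enabled": True,
        "deepcache_interval": interval,
        "deepcache_depth": depth,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("a misty forest at twilight")
print(request.get_method(), request.full_url)
# Sending requires a running server:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```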

Recommended Presets

Balanced (Default)

cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
  • Speedup: 2-2.3x
  • Quality loss: Very slight (1-2%)
  • Use case: Everyday generation

Maximum Speed

cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
  • Speedup: 2.5-3x
  • Quality loss: Noticeable (5-7%)
  • Use case: Rapid prototyping, batch jobs

Maximum Quality

cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
  • Speedup: 1.5-2x
  • Quality loss: Minimal (<1%)
  • Use case: Final renders, client work

Partial Caching (Skip Critical Steps)

cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
  • Speedup: 1.8-2.2x
  • Quality loss: Minimal
  • Use case: Preserve early structure, late details
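For scripted use, the four presets can be kept in one table and converted to request fields. This is a sketch: the `deepcache_*` names follow the REST example and the troubleshooting snippets in this document, but whether the API accepts the start/end step fields is an assumption, and `PRESETS` and `preset_to_payload` are hypothetical names.

```python
PRESETS = {
    "balanced":    {"cache_interval": 3, "cache_depth": 2, "start_step": 0,   "end_step": 1000},
    "max_speed":   {"cache_interval": 5, "cache_depth": 3, "start_step": 0,   "end_step": 1000},
    "max_quality": {"cache_interval": 2, "cache_depth": 1, "start_step": 0,   "end_step": 1000},
    "partial":     {"cache_interval": 3, "cache_depth": 2, "start_step": 200, "end_step": 800},
}


def preset_to_payload(name: str) -> dict:
    """Translate a named preset into deepcache_* request fields."""
    preset = PRESETS[name]  # raises KeyError for an unknown preset name
    return {
        "deepcache_enabled": True,
        "deepcache_interval": preset["cache_interval"],
        "deepcache_depth": preset["cache_depth"],
        # Assumption: these two field names are accepted by the API.
        "deepcache_start_step": preset["start_step"],
        "deepcache_end_step": preset["end_step"],
    }


print(preset_to_payload("partial"))
```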

First Block Cache (FBCache) Configuration

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| residual_diff_threshold | float | 0.05 | Max feature difference to trigger cache reuse (0.0-1.0) |

Usage

First Block Cache is not currently exposed as a standard per-generation toggle. The implementation is available in the codebase for specialized integration work:

# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to a Flux-style model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...

Tuning Threshold

  • Lower threshold (0.01-0.03): Stricter caching, recomputes more often, higher quality
  • Higher threshold (0.05-0.1): Looser caching, reuses more often, higher speedup
  • Recommended: 0.05 (balances quality and speed)
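The trade-off can be illustrated by counting how many steps would fall under a given threshold. The per-step differences below are invented for illustration, not measured from any model, and `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(relative_diffs: list[float], threshold: float) -> float:
    """Fraction of steps whose first-block change is below the threshold."""
    if not relative_diffs:
        return 0.0
    hits = sum(1 for diff in relative_diffs if diff < threshold)
    return hits / len(relative_diffs)


# Hypothetical per-step relative differences (not real measurements).
diffs = [0.20, 0.08, 0.04, 0.03, 0.02, 0.06, 0.01, 0.09]

for threshold in (0.03, 0.05, 0.10):
    print(f"threshold={threshold}: hit rate {cache_hit_rate(diffs, threshold):.3f}")
```

On this made-up sequence the hit rate rises from 0.25 to 0.5 to 0.875 as the threshold increases, which is the "reuses more often" effect described above.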

Performance

Speedup Guidance

Speedup scales with cache interval and depth:

| Model | Cache Interval | Expected Behavior |
|---|---|---|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux-style caching paths | implementation-specific | Depends on the integration path |

Performance varies based on:

  • GPU architecture
  • Model size
  • Resolution
  • Sampler choice
  • Number of steps

Recommendation: Start with interval=3 and adjust based on your quality requirements.

VRAM Impact

Caching increases VRAM usage slightly (50-200 MB depending on resolution):

| Model | Baseline VRAM | With caching | Increase |
|---|---|---|---|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |

Stacking with Other Optimizations

DeepCache is compatible with SageAttention, SpargeAttn and Stable-Fast (FBCache is not compatible with Stable-Fast; see Compatibility):

DeepCache + SageAttention

deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected

Result: 2.2x (DeepCache) × 1.15 (SageAttention) = ~2.5x total speedup

DeepCache + SpargeAttn

deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected

Result: Enhanced speedup from caching and sparse attention

DeepCache + Stable-Fast + SpargeAttn

stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected

Result: Maximum combined speedup (all optimizations active, batch operations only)

Compatibility

DeepCache Compatible With

  • ✅ Stable Diffusion 1.5
  • ✅ Stable Diffusion 2.1
  • ✅ SDXL
  • ✅ All samplers (Euler, DPM++, etc.)
  • ✅ LoRA adapters
  • ✅ Textual inversion embeddings
  • ✅ HiresFix
  • ✅ ADetailer
  • ✅ Multi-scale diffusion
  • ✅ SageAttention/SpargeAttn
  • ✅ Stable-Fast

DeepCache NOT Compatible With

  • ❌ Flux models (use FBCache instead)
  • ❌ Img2Img mode (can cause artifacts)

FBCache Compatible With

  • ✅ Flux models
  • ✅ SageAttention/SpargeAttn
  • ✅ All Flux-compatible features

FBCache NOT Compatible With

  • ❌ SD1.5/SDXL (use DeepCache instead)
  • ❌ Stable-Fast (Flux not supported by Stable-Fast)

Troubleshooting

No Speedup Observed

Causes:

  1. DeepCache disabled or not applied to correct model type
  2. Cache interval too low (interval=1 provides no caching)
  3. Model loaded incorrectly

Fixes:

# Check logs for DeepCache activation ("cache" also matches "deepcache")
grep -i "cache" logs/server.log

# Verify the toggle is enabled
# Streamlit: check the "Enable DeepCache" checkbox
# API: ensure "deepcache_enabled": true is in the payload

# Try a higher interval
deepcache_interval: 3  # Instead of 1 or 2
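Where grep is not available, the same log check can be done from Python. This is a small sketch that reuses the `logs/server.log` path from the command above. `find_cache_lines` is a hypothetical helper, and the wording of the log messages is not documented here, so it only does a case-insensitive substring match.

```python
from pathlib import Path


def find_cache_lines(lines):
    """Return the lines that mention 'cache', ignoring case."""
    return [line.rstrip("\n") for line in lines if "cache" in line.lower()]


log_path = Path("logs/server.log")  # path used elsewhere in this document
if log_path.exists():
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        for line in find_cache_lines(handle):
            print(line)
else:
    print(f"{log_path} not found; run this from the project root")
```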

Quality Degradation

Symptoms:

  • Blurry details
  • Smoothed textures
  • Loss of fine patterns

Causes:

  1. Cache interval too high
  2. Cache depth too aggressive
  3. Wrong model type (Flux using DeepCache)

Fixes:

# Reduce cache interval
deepcache_interval: 2  # Down from 5

# Reduce cache depth
deepcache_depth: 1  # Down from 3

# Disable caching for critical phases
deepcache_start_step: 200  # Skip early structure formation
deepcache_end_step: 800    # Skip late detail refinement

Artifacts in Img2Img

Symptom: Visible seams, inconsistent styles when using DeepCache with Img2Img.

Cause: Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.

Fix: Disable DeepCache for Img2Img:

deepcache_enabled: false  # When img2img_enabled: true
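When settings are assembled programmatically, this rule can be enforced with a small guard. The sketch below assumes the settings are a plain dictionary using the `img2img_enabled` and `deepcache_enabled` keys shown above; `apply_img2img_guard` is a hypothetical helper.

```python
def apply_img2img_guard(settings: dict) -> dict:
    """Return a copy of the settings with DeepCache off when Img2Img is on."""
    guarded = dict(settings)  # leave the caller's dictionary untouched
    if guarded.get("img2img_enabled", False):
        guarded["deepcache_enabled"] = False
    return guarded


print(apply_img2img_guard({"img2img_enabled": True, "deepcache_enabled": True}))
# {'img2img_enabled': True, 'deepcache_enabled': False}
```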

VRAM Increase

Symptom: OOM errors after enabling DeepCache.

Cause: Cached features consume additional VRAM.

Fixes:

  1. Reduce batch size
  2. Lower resolution
  3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
  4. Use lower cache depth:
    deepcache_depth: 1  # Minimal caching
    

Flux FBCache Not Working

Symptom: No speedup with Flux generation.

Cause: FBCache only reuses work when the first-block output changes by less than the threshold, so the speedup depends on the cache hit rate. Check the logs for it.

Debugging:

# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Check cache statistics
grep "cache" logs/server.log

If there are no cache hits, try raising the threshold:

# In src/user/pipeline.py
residual_diff_threshold=0.1  # Increase from 0.05 for more cache reuse

Quality Comparison

Visual impact of different cache intervals:

| Interval | Speed | Visual Difference |
|---|---|---|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |

Recommendation: Start with interval=3 and adjust based on visual results.

Technical Details

DeepCache Implementation

Simplified pseudocode:

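class DeepCacheWrapper:
    """Reuse the denoiser output between periodic full passes (simplified)."""

    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval    # run a full pass every `interval` steps
        self.depth = depth          # stored for reference; unused in this sketch
        self.cached_output = None
        self.current_step = 0

    def forward(self, x, timestep):
        # Step 0 is always a cache step, so the cache is filled before any reuse.
        is_cache_step = (self.current_step % self.interval == 0)

        if is_cache_step:
            # Run the full model and cache a copy of its output
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse the cached output (skip the expensive computation)
            output = self.cached_output

        self.current_step += 1
        return output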

Actual implementation in src/WaveSpeed/deepcache_nodes.py includes:

  • Proper timestep tracking
  • Cache invalidation on batch changes
  • Error handling and fallback to full forward
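To make the effect of the pseudocode concrete, the toy below counts how many full passes a stand-in model receives. It is not the code in `src/WaveSpeed/deepcache_nodes.py`: `CountingModel` and `OutputCache` are hypothetical names, inputs are plain lists instead of tensors, and the reset on a changed batch size is one assumed way to express "cache invalidation on batch changes".

```python
class CountingModel:
    """Stand-in denoiser that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, timestep):
        self.calls += 1
        return [value * 0.9 for value in x]


class OutputCache:
    """Toy output cache with a reset when the batch size changes."""

    def __init__(self, model, interval):
        self.model = model
        self.interval = interval
        self.cached_output = None
        self.cached_len = None
        self.current_step = 0

    def forward(self, x, timestep):
        batch_changed = self.cached_len != len(x)
        if batch_changed or self.current_step % self.interval == 0:
            # Full pass: refresh the cache
            self.cached_output = self.model(x, timestep)
            self.cached_len = len(x)
        self.current_step += 1
        return list(self.cached_output)  # copy so callers cannot mutate the cache


model = CountingModel()
cache = OutputCache(model, interval=3)
for step in range(9):
    cache.forward([1.0, 2.0], timestep=1000 - step * 100)
print(model.calls)  # 3 full passes for 9 steps
```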

FBCache Residual Comparison

# Compute first block output
first_output = first_transformer_block(hidden_states)

# Compare to previous step
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small: reuse cached blocks
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large: recompute all blocks
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)
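The pseudocode above leaves its helpers undefined. The following runnable toy fills them in under stated assumptions: hidden states are plain lists of floats, the "residual" is the difference between the full output and the first-block output, and the comparison is always against the previous step's first-block output. `FirstBlockCacheToy` and `relative_change` are hypothetical names and do not mirror the repository's `fbcache_nodes` module.

```python
def relative_change(current, previous):
    """Mean absolute difference, normalized by the mean magnitude of `current`."""
    diff = sum(abs(c - p) for c, p in zip(current, previous)) / len(current)
    scale = sum(abs(c) for c in current) / len(current)
    return diff / scale if scale > 0 else float("inf")


class FirstBlockCacheToy:
    def __init__(self, first_block, remaining_blocks, threshold=0.05):
        self.first_block = first_block
        self.remaining_blocks = remaining_blocks
        self.threshold = threshold
        self.previous_first = None
        self.cached_residual = None
        self.hits = 0
        self.misses = 0

    def forward(self, hidden_states):
        first = self.first_block(hidden_states)  # always computed
        can_reuse = (
            self.cached_residual is not None
            and relative_change(first, self.previous_first) < self.threshold
        )
        if can_reuse:
            # Small change: add the cached residual instead of running the rest
            self.hits += 1
            output = [f + r for f, r in zip(first, self.cached_residual)]
        else:
            # Large change (or empty cache): run everything and refresh the cache
            self.misses += 1
            output = self.remaining_blocks(first)
            self.cached_residual = [o - f for o, f in zip(output, first)]
        self.previous_first = first
        return output


toy = FirstBlockCacheToy(
    first_block=lambda h: [v * 2 for v in h],
    remaining_blocks=lambda f: [v + 1 for v in f],
)
for states in ([1.0, 1.0], [1.01, 1.01], [2.0, 2.0]):
    toy.forward(states)
print(toy.hits, toy.misses)  # 1 2
```

The second input differs from the first by about 1%, so it is served from the cache; the third doubles the input and forces a recompute.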

Best Practices

For Everyday Use

  1. Enable DeepCache with default settings (interval=3, depth=2)
  2. Stack with SageAttention for 2.5x+ total speedup
  3. Disable for final client renders if absolute quality is critical

For Batch Processing

  1. Use aggressive caching (interval=5, depth=3)
  2. Pre-generate previews at high speed, re-render winners at full quality
  3. Disable TAESD previews to avoid overhead (set enable_preview=false)

For Low VRAM

  1. Use conservative caching (interval=2, depth=1)
  2. Avoid stacking with Stable-Fast CUDA graphs
  3. Monitor VRAM via /api/telemetry endpoint

Citation

If you use WaveSpeed/DeepCache in your work:

@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}

Resources