# WaveSpeed Caching

## Overview

WaveSpeed is the project's caching-oriented optimization layer for reusing work across denoising steps. In the current codebase, the integrated path is DeepCache for UNet-based models; the repository also contains groundwork for a Flux-oriented First Block Cache path.

LightDiffusion-Next contains two WaveSpeed-related implementations:

1. **DeepCache** — integrated for UNet-based models (SD1.5, SDXL)
2. **First Block Cache (FBCache)** — Flux-oriented cache machinery present in the codebase

Both are training-free. DeepCache is the user-facing path today; First Block Cache is codebase groundwork for a more specialized transformer caching path.

## How It Works

### Core Insight

Diffusion models denoise images iteratively over 20-50 steps. Researchers observed that:

- **High-level features** (semantic structure, composition) change slowly across steps
- **Low-level features** (fine details, textures) require frequent updates

WaveSpeed aims to reduce repeated computation across nearby denoising steps by reusing information from earlier steps where practical.

### DeepCache (UNet Models) {#deepcache}

DeepCache is the integrated WaveSpeed path for UNet models.

**Cache step (every N steps):**

1. Run the full denoiser path
2. Store the output for later reuse

**Reuse step (intermediate steps):**

1. Reuse the cached denoiser output
2. Skip the full model recomputation for that step

**Speedup:** ~50-70% time saved per reuse step → 2-3x total speedup with `interval=3`

### First Block Cache (Flux Models)

Flux uses Transformer blocks instead of UNet convolutions. The repository includes a First Block Cache implementation for this architecture family:

```
┌─────────────────────────────────────────┐
│ First Transformer Block (always run)    │ ← Computes initial features
├─────────────────────────────────────────┤
│ Remaining Blocks (cached if similar)    │ ← FBCache caching zone
└─────────────────────────────────────────┘
```

**Cache decision logic:**

1. Run the first Transformer block
2. Compare its output to the previous step's output
3. If the difference < threshold: reuse the cached remaining blocks
4. If the difference ≥ threshold: run all blocks and update the cache

In the current project structure, this cache path is implementation groundwork rather than a standard generation toggle like DeepCache.

## DeepCache Configuration

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_interval` | int | 3 | Steps between cache updates (higher = faster, lower quality) |
| `cache_depth` | int | 2 | UNet depth for caching (0-12; higher = more aggressive) |
| `start_step` | int | 0 | Timestep to start caching (0-1000) |
| `end_step` | int | 1000 | Timestep to stop caching (0-1000) |

### Streamlit UI

Enable in the **⚡ DeepCache Acceleration** expander:

1. Check **Enable DeepCache**
2. Adjust the sliders:
   - **Cache Interval**: 1-10 (default: 3)
   - **Cache Depth**: 0-12 (default: 2)
   - **Start/End Steps**: 0-1000 (default: 0/1000)
3. Generate images — caching applies transparently

### REST API

```bash
curl -X POST http://localhost:7861/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a misty forest at twilight",
    "width": 768,
    "height": 512,
    "deepcache_enabled": true,
    "deepcache_interval": 3,
    "deepcache_depth": 2
  }'
```

### Recommended Presets

#### Balanced (Default)

```yaml
cache_interval: 3
cache_depth: 2
start_step: 0
end_step: 1000
```

- **Speedup:** 2-2.3x
- **Quality loss:** Very slight (1-2%)
- **Use case:** Everyday generation

#### Maximum Speed

```yaml
cache_interval: 5
cache_depth: 3
start_step: 0
end_step: 1000
```

- **Speedup:** 2.5-3x
- **Quality loss:** Noticeable (5-7%)
- **Use case:** Rapid prototyping, batch jobs

#### Maximum Quality

```yaml
cache_interval: 2
cache_depth: 1
start_step: 0
end_step: 1000
```

- **Speedup:** 1.5-2x
- **Quality loss:** Minimal (<1%)
- **Use case:** Final renders, client work

#### Partial Caching (Critical Steps Only)

```yaml
cache_interval: 3
cache_depth: 2
start_step: 200
end_step: 800
```

- **Speedup:** 1.8-2.2x
- **Quality loss:** Minimal
- **Use case:** Preserve early structure and late details

## First Block Cache (FBCache) Configuration

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `residual_diff_threshold` | float | 0.05 | Max feature difference to trigger cache reuse (0.0-1.0) |

### Usage

First Block Cache is not currently exposed as a standard per-generation toggle. The implementation is available in the codebase for specialized integration work:

```python
# In src/user/pipeline.py
from src.WaveSpeed import fbcache_nodes

# Create cache context
cache_context = fbcache_nodes.create_cache_context()

# Apply caching to a Flux-style model
with fbcache_nodes.cache_context(cache_context):
    patched_model = fbcache_nodes.create_patch_flux_forward_orig(
        flux_model,
        residual_diff_threshold=0.05,  # Lower = stricter caching
    )
    # Generate images...
```

### Tuning Threshold

- **Lower threshold (0.01-0.03)**: stricter caching; recomputes more often, higher quality
- **Higher threshold (0.05-0.1)**: looser caching; reuses more often, higher speedup
- **Recommended:** 0.05 (balances quality and speed)

## Performance

### Speedup Guidance

Speedup scales with cache interval and depth:

| Model | Cache Interval | Expected Behavior |
|-------|---------------|-------------------|
| SD1.5 | 2 | Moderate speedup, minimal quality loss |
| SD1.5 | 3 | Good speedup, slight quality loss |
| SD1.5 | 5 | High speedup, noticeable quality loss |
| SDXL | 3 | Good speedup, slight quality loss |
| Flux-style caching paths | implementation-specific | Depends on the integration path |

**Performance varies with:**

- GPU architecture
- Model size
- Resolution
- Sampler choice
- Number of steps

**Recommendation:** Start with `interval=3` and adjust based on your quality requirements.

### VRAM Impact

Caching increases VRAM usage slightly (50-200 MB depending on resolution):

| Model | Baseline VRAM | + DeepCache | Increase |
|-------|--------------|-------------|----------|
| SD1.5 (768×512) | 3.2 GB | 3.4 GB | +200 MB |
| SDXL (1024×1024) | 6.8 GB | 7.0 GB | +200 MB |
| Flux (832×1216) | 12.5 GB | 12.6 GB | +100 MB |

## Stacking with Other Optimizations

WaveSpeed is **fully compatible** with SageAttention, SpargeAttn, and Stable-Fast:

### DeepCache + SageAttention

```yaml
deepcache_enabled: true
deepcache_interval: 3
# SageAttention auto-detected
```

**Result:** 2.2x (DeepCache) × 1.15 (SageAttention) = **~2.5x total speedup**

### DeepCache + SpargeAttn

```yaml
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```

**Result:** Enhanced speedup from caching plus sparse attention

### DeepCache + Stable-Fast + SpargeAttn

```yaml
stable_fast: true
deepcache_enabled: true
deepcache_interval: 3
# SpargeAttn auto-detected
```

**Result:** Maximum combined speedup (all optimizations active; batch operations only)

## Compatibility

### DeepCache Compatible With

- ✅ Stable Diffusion 1.5
- ✅ Stable Diffusion 2.1
- ✅ SDXL
- ✅ All samplers (Euler, DPM++, etc.)
- ✅ LoRA adapters
- ✅ Textual inversion embeddings
- ✅ HiresFix
- ✅ ADetailer
- ✅ Multi-scale diffusion
- ✅ SageAttention/SpargeAttn
- ✅ Stable-Fast

### DeepCache NOT Compatible With

- ❌ Flux models (use FBCache instead)
- ❌ Img2Img mode (can cause artifacts)

### FBCache Compatible With

- ✅ Flux models
- ✅ SageAttention/SpargeAttn
- ✅ All Flux-compatible features

### FBCache NOT Compatible With

- ❌ SD1.5/SDXL (use DeepCache instead)
- ❌ Stable-Fast (Flux is not supported by Stable-Fast)

## Troubleshooting

### No Speedup Observed

**Causes:**

1. DeepCache disabled or not applied to the correct model type
2. Cache interval too low (`interval=1` provides no caching)
3. Model loaded incorrectly

**Fixes:**

```bash
# Check logs for DeepCache activation
grep -i "deepcache\|cache" logs/server.log

# Verify the UI toggle is enabled
# Streamlit: check the "Enable DeepCache" checkbox
# API: ensure "deepcache_enabled": true in the payload
```

```yaml
# Try a higher interval
deepcache_interval: 3  # Instead of 1 or 2
```

### Quality Degradation

**Symptoms:**

- Blurry details
- Smoothed textures
- Loss of fine patterns

**Causes:**

1. Cache interval too high
2. Cache depth too aggressive
3. Wrong model type (Flux run through DeepCache)

**Fixes:**

```yaml
# Reduce cache interval
deepcache_interval: 2  # Down from 5

# Reduce cache depth
deepcache_depth: 1  # Down from 3

# Disable caching for critical phases
deepcache_start_step: 200  # Skip early structure formation
deepcache_end_step: 800    # Skip late detail refinement
```

### Artifacts in Img2Img

**Symptom:** Visible seams or inconsistent styles when using DeepCache with Img2Img.

**Cause:** Img2Img starts from a noisy input image, which violates DeepCache's assumptions about feature consistency.
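One way to avoid this in practice is a client-side guard that strips the DeepCache flag from img2img requests before they are sent. A minimal sketch (the `sanitize_payload` helper is hypothetical; the field names follow the API examples in this document):

```python
# Hypothetical client-side guard: DeepCache's cached features assume a
# consistent denoising trajectory from pure noise, so drop the flag
# whenever the request starts from an input image (img2img).

def sanitize_payload(payload: dict) -> dict:
    cleaned = dict(payload)  # shallow copy: don't mutate the caller's payload
    if cleaned.get("img2img_enabled") and cleaned.get("deepcache_enabled"):
        cleaned["deepcache_enabled"] = False  # avoid seams and style drift
    return cleaned
```

Text2img requests pass through unchanged; only the img2img + DeepCache combination is rewritten.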
**Fix:** Disable DeepCache for Img2Img:

```yaml
deepcache_enabled: false  # When img2img_enabled: true
```

### VRAM Increase

**Symptom:** OOM errors after enabling DeepCache.

**Cause:** Cached features consume additional VRAM.

**Fixes:**

1. Reduce the batch size
2. Lower the resolution
3. Disable other VRAM-heavy features (Stable-Fast CUDA graphs)
4. Use a lower cache depth:

   ```yaml
   deepcache_depth: 1  # Minimal caching
   ```

### Flux FBCache Not Working

**Symptom:** No speedup with Flux generation.

**Cause:** FBCache reuse depends on the residual threshold; check the logs for the cache hit rate.

**Debugging:**

```bash
# Enable debug logging
export LD_SERVER_LOGLEVEL=DEBUG

# Check cache statistics
grep "cache" logs/server.log
```

If there are no cache hits, try adjusting the threshold:

```python
# In pipeline.py
residual_diff_threshold=0.1  # Increase from 0.05 for more cache reuse
```

## Quality Comparison

Visual impact of different cache intervals:

| Interval | Speed | Visual Difference |
|----------|-------|-------------------|
| Disabled | Baseline | Baseline (100% quality) |
| 2 | Faster | Virtually identical |
| 3 | Much faster | Very subtle smoothing |
| 5 | Very fast | Noticeable detail loss |
| 7+ | Fastest | Obvious quality degradation |

**Recommendation:** Start with `interval=3` and adjust based on visual results.
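The interval/speed relationship in the table above can be approximated with a simple cost model: one full denoiser pass per interval, with reuse steps costing only a small fraction of a full pass. A rough sketch (the ~10% reuse-step cost is an assumption, not a measured value):

```python
def estimated_speedup(total_steps: int, interval: int, reuse_cost: float = 0.1) -> float:
    """Upper-bound speedup if a full step costs 1.0 and a reuse step `reuse_cost`."""
    full_steps = -(-total_steps // interval)  # ceil division: cache-refresh steps
    reuse_steps = total_steps - full_steps    # steps served from the cache
    return total_steps / (full_steps + reuse_steps * reuse_cost)

for interval in (2, 3, 5):
    print(f"interval={interval}: ~{estimated_speedup(30, interval):.1f}x")
```

With 30 steps this predicts roughly 1.8x, 2.5x, and 3.6x for intervals 2, 3, and 5; real speedups land a bit lower once per-step overhead and deeper cache bookkeeping are accounted for, consistent with the 1.5-3x preset ranges.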
## Technical Details

### DeepCache Implementation

Simplified pseudocode:

```python
class DeepCacheWrapper:
    def __init__(self, model, interval, depth):
        self.model = model
        self.interval = interval
        self.depth = depth
        self.cached_output = None
        self.current_step = 0

    def forward(self, x, timestep):
        is_cache_step = (self.current_step % self.interval == 0)

        if is_cache_step:
            # Run the full model and cache the output
            output = self.model(x, timestep)
            self.cached_output = output.clone()
        else:
            # Reuse the cached output (skip the expensive computation)
            output = self.cached_output

        self.current_step += 1
        return output
```

The actual implementation in `src/WaveSpeed/deepcache_nodes.py` also includes:

- Proper timestep tracking
- Cache invalidation on batch changes
- Error handling with fallback to a full forward pass

### FBCache Residual Comparison

```python
# Compute the first block's output
first_output = first_transformer_block(hidden_states)

# Compare it to the previous step's output
residual = first_output - previous_first_output
residual_norm = residual.abs().mean() / first_output.abs().mean()

if residual_norm < threshold:
    # Feature change is small — reuse the cached blocks
    hidden_states = apply_cached_residual(first_output)
else:
    # Feature change is large — recompute all blocks
    hidden_states = run_remaining_blocks(first_output)
    cache_residual(hidden_states)
```

## Best Practices

### For Everyday Use

1. **Enable DeepCache** with default settings (`interval=3`, `depth=2`)
2. **Stack with SageAttention** for 2.5x+ total speedup
3. **Disable it for final client renders** if absolute quality is critical

### For Batch Processing

1. **Use aggressive caching** (`interval=5`, `depth=3`)
2. **Pre-generate previews** at high speed, then re-render the winners at full quality
3. **Disable TAESD previews** to avoid overhead (set `enable_preview=false`)

### For Low VRAM

1. **Use conservative caching** (`interval=2`, `depth=1`)
2. **Avoid stacking** with Stable-Fast CUDA graphs
3. **Monitor VRAM** via the `/api/telemetry` endpoint

## Citation

If you use WaveSpeed/DeepCache in your work:

```bibtex
@inproceedings{ma2023deepcache,
  title={DeepCache: Accelerating Diffusion Models for Free},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  booktitle={CVPR},
  year={2024}
}
```

## Resources

- [DeepCache Paper](https://arxiv.org/abs/2312.00858)
- [DeepCache Repository](https://github.com/horseee/DeepCache)
- [ComfyUI DeepCache Implementation](https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86) (reference for LightDiffusion-Next)
- [First Block Cache Discussion](https://github.com/comfyanonymous/ComfyUI/discussions/3491)