# Performance Optimizations
LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.
For a detailed source-based report on what is implemented today, including server-side throughput optimizations and practical implementation notes, see the [Implemented Optimizations Report](implemented-optimizations-report.md).
## Optimization Stack Overview
The pipeline orchestrates six primary acceleration paths:
| Technique | Type | Speedup | Quality Impact | Requirements |
|-----------|------|---------|----------------|---------------|
| [AYS Scheduler](#ays-scheduler) | Sampling schedule | ~2x | None/Better | All models |
| [Prompt Caching](#prompt-caching) | Embedding cache | 5-15% | None | All models |
| [SageAttention](#sageattention--spargeattn) | Attention kernel | Moderate | None | All CUDA GPUs |
| [SpargeAttn](#sageattention--spargeattn) | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| [Stable-Fast](#stable-fast) | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| [WaveSpeed](#wavespeed-caching) | Feature caching | High | Tunable | All models |
*Speedup depends heavily on batch size and generation count
These optimizations **work together**: enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.
## Quick Comparison
### AYS Scheduler
**What it does:** Uses research-backed optimal timestep distributions that allow equivalent quality in approximately half the steps. Instead of uniform sigma spacing, AYS concentrates samples on noise levels that contribute most to image formation.
**When to use:**
- Always recommended for SD1.5, SDXL, and Flux models
- Txt2Img generation
- Production workflows where speed matters
- Any scenario where you'd normally use 20+ steps
**Trade-offs:** Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.
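To make the spacing idea concrete, here is a minimal sketch of an AYS-style schedule builder. The anchor sigmas below are illustrative placeholders, not the published AYS tables, and `uniform_sigmas` is only a baseline for comparison:
```python
import numpy as np

# Illustrative anchor sigmas only -- NOT the published AYS table.
AYS_ANCHORS = np.array([14.61, 6.47, 3.67, 2.18, 1.34, 0.86, 0.55, 0.34, 0.18, 0.03])

def uniform_sigmas(n, sigma_max=14.61, sigma_min=0.03):
    """Baseline: evenly spaced noise levels."""
    return np.linspace(sigma_max, sigma_min, n)

def ays_sigmas(n):
    """AYS-style: log-linear interpolation of research-derived anchors,
    concentrating steps on the noise levels that matter most."""
    xs = np.linspace(0, len(AYS_ANCHORS) - 1, n)
    return np.exp(np.interp(xs, np.arange(len(AYS_ANCHORS)), np.log(AYS_ANCHORS)))
```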
[β†’ Full AYS Scheduler guide](ays-scheduler.md)
---
### Prompt Caching
**What it does:** Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from cache instead of being recomputed.
**When to use:**
- Batch generation with same prompt
- Testing different seeds or settings
- Iterative prompt refinement
- Any workflow with repeated prompts
**Trade-offs:** None; minimal memory overhead (~50-200MB), negligible CPU cost, automatically enabled by default.
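Conceptually the cache is just a map from prompt text to its encoded embedding. A minimal sketch, using a hypothetical `clip.encode` call in place of the real text-encoder invocation:
```python
import hashlib
import torch

_embedding_cache: dict[str, torch.Tensor] = {}

def encode_prompt_cached(clip, prompt: str) -> torch.Tensor:
    # Hash the prompt so long strings make cheap dictionary keys.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = clip.encode(prompt)  # the expensive text-encoder pass
    return _embedding_cache[key]
```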
[β†’ Full Prompt Caching guide](prompt-caching.md)
---
### SageAttention & SpargeAttn {#sageattention--spargeattn}
**What it does:** Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention quantizes the query and key matrices to INT8 while keeping the value path in higher precision. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations.
**When to use:**
- Always enable SageAttention if available (no quality loss, pure speed gain)
- SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
- Both work seamlessly with all samplers, LoRAs and post-processing stages
**Trade-offs:** None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (default is conservative).
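Integration is a drop-in substitution for SDPA. A minimal sketch of the swap, assuming the `sageattention` package's `sageattn` entry point (the exact signature may vary by version):
```python
import torch.nn.functional as F

try:
    from sageattention import sageattn  # assumed entry point; see the full guide
    _HAVE_SAGE = True
except ImportError:
    _HAVE_SAGE = False

def attention(q, k, v):
    # Same (batch, heads, seq, dim) tensors either way; SageAttention simply
    # swaps in a quantized kernel when the package is installed.
    if _HAVE_SAGE:
        return sageattn(q, k, v, is_causal=False)
    return F.scaled_dot_product_attention(q, k, v)
```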
[β†’ Full SageAttention/SpargeAttn guide](sageattention.md)
---
### CFG++ Samplers {#cfg-samplers}
CFG++ Samplers are advanced sampling algorithms that incorporate Classifier-Free Guidance directly into the sampling process, providing better quality and stability compared to standard CFG.
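As a rough sketch of the mechanism (simplified, not the project's actual sampler code), a CFG++-style Euler step uses the guided prediction to form the denoised estimate but renoises along the unconditional direction:
```python
def euler_cfgpp_step(model, x, sigma, sigma_next, cond, uncond, guidance=1.0):
    # Sketch of one CFG++ Euler step: guidance shapes the x0 estimate, but the
    # move to the next noise level follows the *unconditional* noise prediction,
    # which is what stabilizes sampling relative to standard CFG.
    eps_cond = model(x, sigma, cond)
    eps_uncond = model(x, sigma, uncond)
    eps_guided = eps_uncond + guidance * (eps_cond - eps_uncond)
    denoised = x - sigma * eps_guided          # guided x0 estimate
    return denoised + sigma_next * eps_uncond  # renoise along the uncond direction
```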
---
### Multi-Scale Diffusion {#multi-scale}
Multi-Scale Diffusion improves performance by running part of the denoising process at reduced resolution, cutting the amount of computation that has to happen at full resolution.
**When to use:**
- High-resolution generation (>1024px)
- When memory is limited
- For faster previews
**Trade-offs:** May reduce detail in fine areas.
**Note:** In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while still providing a small speedup; the intermediate upsampling pass accounts for the quality gain.
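Reduced to a sketch (the hypothetical `denoise` callable stands in for the real sampling loop, and the split point is arbitrary), the mechanism looks like this:
```python
import torch.nn.functional as F

def multiscale_generate(denoise, latents, sigmas, scale=0.5, switch=0.6):
    # Run the early, structure-forming steps on a downscaled latent...
    cut = int(len(sigmas) * switch)
    small = F.interpolate(latents, scale_factor=scale, mode="bilinear")
    small = denoise(small, sigmas[:cut])
    # ...then upsample and spend the remaining steps refining at full size.
    full = F.interpolate(small, size=latents.shape[-2:], mode="bilinear")
    return denoise(full, sigmas[cut:])
```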
---
### Stable-Fast
**What it does:** JIT-compiles the UNet diffusion model into optimized TorchScript, optionally capturing CUDA graphs. The first forward pass traces execution, fuses operators, and caches kernel launches so that subsequent passes run with reduced overhead.
**When to use:**
- **Systems with >8GB VRAM** (preferably 12GB+)
- Batch jobs or workflows generating 50+ images with identical settings
- Long-running operations where 30-60s compilation amortizes over time
- Fixed resolutions and batch sizes
**When NOT to use:**
- Normal 20-step single image generation (compilation overhead > speedup gains)
- Systems with <8GB VRAM
- Flux workflows (different architecture)
- Quick prototyping or frequent model/resolution changes
**Trade-offs:** Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.
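Enabling it looks roughly like the snippet below, assuming the stable-fast (`sfast`) package's diffusers-style pipeline compiler; the exact module path can differ between versions:
```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

config = CompilationConfig.Default()
config.enable_cuda_graph = True  # capture/replay kernel launches (fixed shapes only)
pipe = compile(pipe, config)     # first forward pass traces and compiles

# The first generation pays the 30-60s compilation cost; subsequent
# identical-shape requests reuse the compiled graph.
image = pipe("a lighthouse at dawn", num_inference_steps=20).images[0]
```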
[β†’ Full Stable-Fast guide](stablefast.md)
---
### WaveSpeed Caching
**What it does:** Exploits temporal redundancy in diffusion processes by reusing work across denoising steps. In the current project stack this primarily means DeepCache on supported UNet models, with additional Flux-oriented cache groundwork present in the codebase.
1. **DeepCache**: Reuses prior denoiser outputs on selected steps in UNet models (SD1.5, SDXL)
2. **First Block Cache (FBCache)**: Flux-oriented cache machinery available for specialized integration work
**When to use:**
- Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
- Combine with conservative cache intervals (2-3) for minimal quality loss
- Works alongside SageAttention and Stable-Fast
**Trade-offs:** Reduced fine detail if interval is too high, slight VRAM increase for cached tensors.
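The core DeepCache trick is structural: the shallow blocks run every step, while the expensive deep blocks run only every `interval` steps and their output is reused in between. A minimal sketch with hypothetical block names (`down`, `mid`, `up`):
```python
class DeepCachedUNet:
    """Sketch only: `down`, `mid`, and `up` stand in for the real UNet blocks."""

    def __init__(self, unet, interval=3):
        self.unet = unet
        self.interval = interval
        self._cached_deep = None

    def forward(self, x, t, step):
        shallow = self.unet.down(x, t)  # cheap blocks run every step
        if step % self.interval == 0 or self._cached_deep is None:
            self._cached_deep = self.unet.mid(shallow, t)  # expensive deep pass
        return self.unet.up(shallow, self._cached_deep, t)  # reuse on skipped steps
```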
[β†’ Full WaveSpeed guide](wavespeed.md)
---
## Priority & Fallback System
LightDiffusion-Next automatically selects the best available attention backend at runtime:
```
SpargeAttn > SageAttention > xformers > PyTorch SDPA
```
If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting `LD_DISABLE_SAGE_ATTENTION=1` for debugging.
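A simplified version of that selection logic is sketched below; the module names are assumptions, and the real probes also check GPU architecture and head dimensions:
```python
import os

def pick_attention_backend() -> str:
    # Honor the documented override before probing anything.
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return "pytorch_sdpa"
    # Walk the priority chain, falling through on missing packages.
    for module, backend in (("spas_sage_attn", "spargeattn"),
                            ("sageattention", "sageattention"),
                            ("xformers", "xformers")):
        try:
            __import__(module)
            return backend
        except ImportError:
            continue
    return "pytorch_sdpa"
```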
Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.
## Recommended Configurations
### Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)
```yaml
stable_fast: true # Only for batch operations
sageattention: auto # or spargeattn if available
deepcache:
  enabled: true
  interval: 3
  depth: 2
```
**Expected:** Maximum speedup for batch operations, some quality loss
**Note:** Disable stable_fast for single 20-step generations
### Balanced - Quick Generation (SD1.5, any VRAM)
```yaml
scheduler: ays # NEW: Use AYS for 2x speedup
steps: 10 # Reduced from 20 (same quality with AYS)
stable_fast: false # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true # Enabled by default
deepcache:
  enabled: true
  interval: 2
  depth: 1
```
**Expected:** ~2-3x speedup with minimal quality loss
**Note:** AYS scheduler provides the main speedup; enable stable_fast only for batch jobs (50+ images)
### Quality-First (Flux)
```yaml
scheduler: ays_flux # NEW: Optimized for Flux models
steps: 10 # Reduced from 15 (same quality with AYS)
stable_fast: false # not supported
sageattention: auto
prompt_cache_enabled: true
deepcache:
  enabled: true
  interval: 2
```
**Expected:** ~2x speedup with minimal quality impact
### Production API - High Volume (>8GB VRAM)
```yaml
stable_fast: true # Only for sustained high-volume APIs
sageattention: auto
deepcache:
  enabled: false # avoid variability across batch sizes
keep_models_loaded: true
```
**Expected:** Consistent latency for repeated identical requests
**Note:** For low-volume or single-shot APIs, use `stable_fast: false`
## Hardware-Specific Tips
### RTX 30xx / 40xx (Ampere/Ada)
- Enable SpargeAttn for best results
- Stable-Fast only for batch jobs (disable for quick 20-step generations)
- Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
- Watch VRAM: Stable-Fast graphs consume ~500MB
### RTX 50xx (Blackwell)
- SageAttention only (SpargeAttn support pending)
- Stable-Fast works but recompiles for new CUDA arch
- DeepCache is your best additional speedup
### A100 / H100 (Datacenter)
- SpargeAttn + Stable-Fast + aggressive WaveSpeed
- Prefer larger batch sizes to amortize kernel overhead
- Use CUDA graphs (`enable_cuda_graph=True` in Stable-Fast config)
### Low VRAM (<8GB)
- **Always disable Stable-Fast** (requires >8GB VRAM)
- Use SageAttention (minimal overhead)
- Enable DeepCache with conservative intervals
- Set `vae_on_cpu=True` for HiRes workflows
## Debugging & Profiling
Check which optimizations are active:
```bash
# View startup logs
grep -iE "using|enabled" logs/server.log
# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2
```
Monitor telemetry:
```bash
curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'
```
Disable individual optimizations to isolate issues:
```bash
export LD_DISABLE_SAGE_ATTENTION=1 # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1 # Skips compilation
export LD_DISABLE_WAVESPEED=1 # Disables all caching
```
## Further Reading
- [AYS Scheduler Deep Dive](ays-scheduler.md): Theory, implementation, quality tuning
- [Prompt Caching Deep Dive](prompt-caching.md): Implementation details, cache management, performance impact
- [SageAttention & SpargeAttn Deep Dive](sageattention.md): Installation, technical details, head dimension handling
- [Stable-Fast Compilation Guide](stablefast.md): Configuration, CUDA graphs, troubleshooting
- [WaveSpeed Caching Strategies](wavespeed.md): DeepCache vs FBCache, tuning parameters, compatibility matrix
- [Performance Tuning](quirks.md): VRAM management, slow first runs, recompilation fixes
---
Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.