# Performance Optimizations
LightDiffusion-Next achieves its industry-leading inference speed through a layered stack of training-free optimizations that can be selectively enabled based on your hardware and quality requirements. This page provides an overview of each acceleration technique and links to detailed guides.
For a detailed source-based report on what is implemented today, including server-side throughput optimizations and practical implementation notes, see the [Implemented Optimizations Report](implemented-optimizations-report.md).
## Optimization Stack Overview
The pipeline orchestrates six primary acceleration paths:
| Technique | Type | Speedup | Quality Impact | Requirements |
|-----------|------|---------|----------------|---------------|
| [AYS Scheduler](#ays-scheduler) | Sampling schedule | ~2x | None/Better | All models |
| [Prompt Caching](#prompt-caching) | Embedding cache | 5-15% | None | All models |
| [SageAttention](#sageattention--spargeattn) | Attention kernel | Moderate | None | All CUDA GPUs |
| [SpargeAttn](#sageattention--spargeattn) | Sparse attention | Significant | Minimal | Compute 8.0-9.0 |
| [Stable-Fast](#stable-fast) | Graph compilation | Significant* | None | >8GB VRAM, batch jobs |
| [WaveSpeed](#wavespeed-caching) | Feature caching | High | Tunable | All models |
*Speedup depends heavily on batch size and generation count
These optimizations **work together**: enabling multiple techniques simultaneously can provide substantial cumulative speedup with tunable quality trade-offs.
## Quick Comparison
### AYS Scheduler
**What it does:** Uses research-backed optimal timestep distributions that allow equivalent quality in approximately half the steps. Instead of uniform sigma spacing, AYS concentrates samples on noise levels that contribute most to image formation.
**When to use:**
- Always recommended for SD1.5, SDXL, and Flux models
- Txt2Img generation
- Production workflows where speed matters
- Any scenario where you'd normally use 20+ steps
**Trade-offs:** Images will differ slightly from standard schedulers (different sampling path), but quality is equivalent or better. Not ideal when exact reproduction of old results is required.
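To make the spacing idea concrete, here is a minimal sketch of an AYS-style schedule builder. The anchor sigmas below are illustrative placeholders, not the published AYS tables, and `uniform_sigmas` is only a baseline for comparison:
```python
import numpy as np

# Illustrative anchor sigmas only -- NOT the published AYS table.
AYS_ANCHORS = np.array([14.61, 6.47, 3.67, 2.18, 1.34, 0.86, 0.55, 0.34, 0.18, 0.03])

def uniform_sigmas(n, sigma_max=14.61, sigma_min=0.03):
    """Baseline: evenly spaced noise levels."""
    return np.linspace(sigma_max, sigma_min, n)

def ays_sigmas(n):
    """AYS-style: log-linear interpolation of research-derived anchors,
    concentrating steps on the noise levels that matter most."""
    xs = np.linspace(0, len(AYS_ANCHORS) - 1, n)
    return np.exp(np.interp(xs, np.arange(len(AYS_ANCHORS)), np.log(AYS_ANCHORS)))
```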
[β†’ Full AYS Scheduler guide](ays-scheduler.md)
---
### Prompt Caching
**What it does:** Caches CLIP text embeddings for prompts that have been encoded before. When generating multiple images with the same or similar prompts, embeddings are retrieved from cache instead of being recomputed.
**When to use:**
- Batch generation with same prompt
- Testing different seeds or settings
- Iterative prompt refinement
- Any workflow with repeated prompts
**Trade-offs:** None; minimal memory overhead (~50-200MB), negligible CPU cost, automatically enabled by default.
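Conceptually the cache is just a map from prompt text to its encoded embedding. A minimal sketch, using a hypothetical `clip.encode` call in place of the real text-encoder invocation:
```python
import hashlib
import torch

_embedding_cache: dict[str, torch.Tensor] = {}

def encode_prompt_cached(clip, prompt: str) -> torch.Tensor:
    # Hash the prompt so long strings make cheap dictionary keys.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = clip.encode(prompt)  # the expensive text-encoder pass
    return _embedding_cache[key]
```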
[β†’ Full Prompt Caching guide](prompt-caching.md)
---
### SageAttention & SpargeAttn {#sageattention--spargeattn}
**What it does:** Replaces PyTorch's default scaled dot-product attention with highly optimized CUDA kernels. SageAttention quantizes the query and key matrices to INT8 while keeping the value path in higher precision. SpargeAttn extends this with dynamic sparsity pruning, skipping redundant attention computations.
**When to use:**
- Always enable SageAttention if available (no quality loss, pure speed gain)
- SpargeAttn for maximum speed on supported hardware (RTX 30xx/40xx, A100, H100)
- Both work seamlessly with all samplers, LoRAs and post-processing stages
**Trade-offs:** None for SageAttention. SpargeAttn may introduce subtle texture variations at very high sparsity thresholds (default is conservative).
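Integration is a drop-in substitution for SDPA. A minimal sketch of the swap, assuming the `sageattention` package's `sageattn` entry point (the exact signature may vary by version):
```python
import torch.nn.functional as F

try:
    from sageattention import sageattn  # assumed entry point; see the full guide
    _HAVE_SAGE = True
except ImportError:
    _HAVE_SAGE = False

def attention(q, k, v):
    # Same (batch, heads, seq, dim) tensors either way; SageAttention simply
    # swaps in a quantized kernel when the package is installed.
    if _HAVE_SAGE:
        return sageattn(q, k, v, is_causal=False)
    return F.scaled_dot_product_attention(q, k, v)
```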
[β†’ Full SageAttention/SpargeAttn guide](sageattention.md)
---
### CFG++ Samplers {#cfg-samplers}
CFG++ Samplers are advanced sampling algorithms that incorporate Classifier-Free Guidance directly into the sampling process, providing better quality and stability compared to standard CFG.
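As a rough sketch of the mechanism (simplified, not the project's actual sampler code), a CFG++-style Euler step uses the guided prediction to form the denoised estimate but renoises along the unconditional direction:
```python
def euler_cfgpp_step(model, x, sigma, sigma_next, cond, uncond, guidance=1.0):
    # Sketch of one CFG++ Euler step: guidance shapes the x0 estimate, but the
    # move to the next noise level follows the *unconditional* noise prediction,
    # which is what stabilizes sampling relative to standard CFG.
    eps_cond = model(x, sigma, cond)
    eps_uncond = model(x, sigma, uncond)
    eps_guided = eps_uncond + guidance * (eps_cond - eps_uncond)
    denoised = x - sigma * eps_guided          # guided x0 estimate
    return denoised + sigma_next * eps_uncond  # renoise along the uncond direction
```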
---
### Multi-Scale Diffusion {#multi-scale}
Multi-Scale Diffusion improves performance by running part of the denoising process at reduced resolution, cutting the amount of computation that has to happen at full resolution.
**When to use:**
- High-resolution generation (>1024px)
- When memory is limited
- For faster previews
**Trade-offs:** May reduce detail in fine areas.
**Note:** In most cases, Multi-Scale Diffusion in quality mode gives better results than standard diffusion while still providing a small speedup; the intermediate upsampling pass accounts for the quality gain.
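Reduced to a sketch (the hypothetical `denoise` callable stands in for the real sampling loop, and the split point is arbitrary), the mechanism looks like this:
```python
import torch.nn.functional as F

def multiscale_generate(denoise, latents, sigmas, scale=0.5, switch=0.6):
    # Run the early, structure-forming steps on a downscaled latent...
    cut = int(len(sigmas) * switch)
    small = F.interpolate(latents, scale_factor=scale, mode="bilinear")
    small = denoise(small, sigmas[:cut])
    # ...then upsample and spend the remaining steps refining at full size.
    full = F.interpolate(small, size=latents.shape[-2:], mode="bilinear")
    return denoise(full, sigmas[cut:])
```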
---
### Stable-Fast
**What it does:** JIT-compiles the UNet diffusion model into optimized TorchScript, optionally capturing CUDA graphs. The first forward pass traces execution, fuses operators, and caches kernel launches so that subsequent passes run with reduced overhead.
**When to use:**
- **Systems with >8GB VRAM** (preferably 12GB+)
- Batch jobs or workflows generating 50+ images with identical settings
- Long-running operations where 30-60s compilation amortizes over time
- Fixed resolutions and batch sizes
**When NOT to use:**
- Normal 20-step single image generation (compilation overhead > speedup gains)
- Systems with <8GB VRAM
- Flux workflows (different architecture)
- Quick prototyping or frequent model/resolution changes
**Trade-offs:** Compilation time on first run (30-60s), VRAM overhead (~500MB), reduced flexibility for dynamic shapes.
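Enabling it looks roughly like the snippet below, assuming the stable-fast (`sfast`) package's diffusers-style pipeline compiler; the exact module path can differ between versions:
```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import CompilationConfig, compile

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

config = CompilationConfig.Default()
config.enable_cuda_graph = True  # capture/replay kernel launches (fixed shapes only)
pipe = compile(pipe, config)     # first forward pass traces and compiles

# The first generation pays the 30-60s compilation cost; subsequent
# identical-shape requests reuse the compiled graph.
image = pipe("a lighthouse at dawn", num_inference_steps=20).images[0]
```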
[β†’ Full Stable-Fast guide](stablefast.md)
---
### WaveSpeed Caching
**What it does:** Exploits temporal redundancy in diffusion processes by reusing work across denoising steps. In the current project stack this primarily means DeepCache on supported UNet models, with additional Flux-oriented cache groundwork present in the codebase.
1. **DeepCache**: Reuses prior denoiser outputs on selected steps in UNet models (SD1.5, SDXL)
2. **First Block Cache (FBCache)**: Flux-oriented cache machinery available for specialized integration work
**When to use:**
- Any workflow where you can tolerate slight smoothing in exchange for 2-3x speedup
- Combine with conservative cache intervals (2-3) for minimal quality loss
- Works alongside SageAttention and Stable-Fast
**Trade-offs:** Reduced fine detail if interval is too high, slight VRAM increase for cached tensors.
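The core DeepCache trick is structural: the shallow blocks run every step, while the expensive deep blocks run only every `interval` steps and their output is reused in between. A minimal sketch with hypothetical block names (`down`, `mid`, `up`):
```python
class DeepCachedUNet:
    """Sketch only: `down`, `mid`, and `up` stand in for the real UNet blocks."""

    def __init__(self, unet, interval=3):
        self.unet = unet
        self.interval = interval
        self._cached_deep = None

    def forward(self, x, t, step):
        shallow = self.unet.down(x, t)  # cheap blocks run every step
        if step % self.interval == 0 or self._cached_deep is None:
            self._cached_deep = self.unet.mid(shallow, t)  # expensive deep pass
        return self.unet.up(shallow, self._cached_deep, t)  # reuse on skipped steps
```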
[β†’ Full WaveSpeed guide](wavespeed.md)
---
## Priority & Fallback System
LightDiffusion-Next automatically selects the best available attention backend at runtime:
```
SpargeAttn > SageAttention > xformers > PyTorch SDPA
```
If a kernel fails (e.g., unsupported head dimension), the system gracefully falls back to the next option. You can force PyTorch SDPA by setting `LD_DISABLE_SAGE_ATTENTION=1` for debugging.
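A simplified version of that selection logic is sketched below; the module names are assumptions, and the real probes also check GPU architecture and head dimensions:
```python
import os

def pick_attention_backend() -> str:
    # Honor the documented override before probing anything.
    if os.environ.get("LD_DISABLE_SAGE_ATTENTION") == "1":
        return "pytorch_sdpa"
    # Walk the priority chain, falling through on missing packages.
    for module, backend in (("spas_sage_attn", "spargeattn"),
                            ("sageattention", "sageattention"),
                            ("xformers", "xformers")):
        try:
            __import__(module)
            return backend
        except ImportError:
            continue
    return "pytorch_sdpa"
```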
Stable-Fast and WaveSpeed are opt-in toggles controlled via the UI or REST API.
## Recommended Configurations
### Maximum Speed - Batch Jobs (SD1.5, >8GB VRAM, 50+ images)
```yaml
stable_fast: true # Only for batch operations
sageattention: auto # or spargeattn if available
deepcache:
  enabled: true
  interval: 3
  depth: 2
```
**Expected:** Maximum speedup for batch operations, some quality loss
**Note:** Disable stable_fast for single 20-step generations
### Balanced - Quick Generation (SD1.5, any VRAM)
```yaml
scheduler: ays # NEW: Use AYS for 2x speedup
steps: 10 # Reduced from 20 (same quality with AYS)
stable_fast: false # Disabled for normal generations
sageattention: auto
prompt_cache_enabled: true # Enabled by default
deepcache:
  enabled: true
  interval: 2
  depth: 1
```
**Expected:** ~2-3x speedup with minimal quality loss
**Note:** AYS scheduler provides the main speedup; enable stable_fast only for batch jobs (50+ images)
### Quality-First (Flux)
```yaml
scheduler: ays_flux # NEW: Optimized for Flux models
steps: 10 # Reduced from 15 (same quality with AYS)
stable_fast: false # not supported
sageattention: auto
prompt_cache_enabled: true
deepcache:
  enabled: true
  interval: 2
```
**Expected:** ~2x speedup with minimal quality impact
### Production API - High Volume (>8GB VRAM)
```yaml
stable_fast: true # Only for sustained high-volume APIs
sageattention: auto
deepcache:
  enabled: false # avoid variability across batch sizes
keep_models_loaded: true
```
**Expected:** Consistent latency for repeated identical requests
**Note:** For low-volume or single-shot APIs, use `stable_fast: false`
## Hardware-Specific Tips
### RTX 30xx / 40xx (Ampere/Ada)
- Enable SpargeAttn for best results
- Stable-Fast only for batch jobs (disable for quick 20-step generations)
- Stable-Fast + SpargeAttn + DeepCache stacks well for long operations
- Watch VRAM: Stable-Fast graphs consume ~500MB
### RTX 50xx (Blackwell)
- SageAttention only (SpargeAttn support pending)
- Stable-Fast works but recompiles for new CUDA arch
- DeepCache is your best additional speedup
### A100 / H100 (Datacenter)
- SpargeAttn + Stable-Fast + aggressive WaveSpeed
- Prefer larger batch sizes to amortize kernel overhead
- Use CUDA graphs (`enable_cuda_graph=True` in Stable-Fast config)
### Low VRAM (<8GB)
- **Always disable Stable-Fast** (requires >8GB VRAM)
- Use SageAttention (minimal overhead)
- Enable DeepCache with conservative intervals
- Set `vae_on_cpu=True` for HiRes workflows
## Debugging & Profiling
Check which optimizations are active:
```bash
# View startup logs
grep -iE "using|enabled" logs/server.log
# Sample output:
# Using SpargeAttn (Sparse + SageAttention) cross attention
# Using SpargeAttn (Sparse + SageAttention) in VAE
# Stable-Fast compilation enabled
# DeepCache active: interval=3, depth=2
```
Monitor telemetry:
```bash
curl http://localhost:7861/api/telemetry | jq '.vram_usage_mb, .average_latency_ms'
```
Disable individual optimizations to isolate issues:
```bash
export LD_DISABLE_SAGE_ATTENTION=1 # Forces PyTorch SDPA
export LD_DISABLE_STABLE_FAST=1 # Skips compilation
export LD_DISABLE_WAVESPEED=1 # Disables all caching
```
## Further Reading
- [AYS Scheduler Deep Dive](ays-scheduler.md): Theory, implementation, quality tuning
- [Prompt Caching Deep Dive](prompt-caching.md): Implementation details, cache management, performance impact
- [SageAttention & SpargeAttn Deep Dive](sageattention.md): Installation, technical details, head dimension handling
- [Stable-Fast Compilation Guide](stablefast.md): Configuration, CUDA graphs, troubleshooting
- [WaveSpeed Caching Strategies](wavespeed.md): DeepCache vs FBCache, tuning parameters, compatibility matrix
- [Performance Tuning](quirks.md): VRAM management, slow first runs, recompilation fixes
---
Armed with this overview, dive into the technique-specific guides or experiment directly in the UI to find your optimal speed/quality balance.