---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
  - music-generation
  - ace-step
  - gguf
  - lora
  - training
  - cpu
  - mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
  - ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---
# ACE-Step 1.5 XL Music Generation (CPU)

GGUF inference + LoRA training on free CPU Spaces. Powered by acestep.cpp.
## Features
- Music Generation -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- LoRA Training -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
- Auto-Captioning -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
- Multiple LM Sizes -- 0.6B / 1.7B / 4B language models (on-demand download)
- Cancel + Download -- cancel training mid-epoch, download trained LoRA adapter
## Music Generation

1. Enter a music description
2. Enter lyrics, or check **Instrumental**
3. Adjust BPM, duration, steps, and seed
4. Select a LoRA adapter, if you have trained one
5. Click **Generate Music**
**Timing:** ~270 s to generate 10 s of audio (1.7B LM, 8 steps, CPU) -- roughly 27x real time.
## LoRA Training

1. Upload audio files (any length; the VAE auto-tiles them into 30 s chunks)
2. Set the LoRA name, epochs, learning rate, and rank
3. Click **Train** -- ace-server stops during training and restarts afterward
4. Use **Cancel** to stop early (a checkpoint is saved)
5. Download the trained adapter file
6. The trained adapter appears in the LoRA dropdown
**Timing:** ~170 s preprocessing, then ~11 s/epoch on CPU (~1.4 s/epoch on GPU).
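The figures above give a rough rule of thumb for total wall-clock training time: preprocessing plus a per-epoch cost. A sketch (the constants are the approximate timings quoted above, not guarantees):

```python
# Rough training-time estimate from the approximate timings above.
PREPROCESS_S = 170   # ~170 s preprocessing (CPU)
EPOCH_CPU_S = 11     # ~11 s/epoch on CPU
EPOCH_GPU_S = 1.4    # ~1.4 s/epoch on GPU


def estimate_training_seconds(epochs: int, gpu: bool = False) -> float:
    """Estimate total wall-clock training time in seconds."""
    per_epoch = EPOCH_GPU_S if gpu else EPOCH_CPU_S
    return PREPROCESS_S + epochs * per_epoch


# 200 epochs on CPU: 170 + 200 * 11 = 2370 s, i.e. about 40 minutes.
print(estimate_training_seconds(200))
```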
**Limits:** 30 min of total audio across all files (excess is truncated with a warning), 50 files max, 8 h training timeout.
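The limits above imply budget logic along these lines (a hypothetical sketch; the function and variable names are illustrative, not the Space's actual code):

```python
MAX_TOTAL_S = 30 * 60   # 30 min total audio budget
MAX_FILES = 50          # file count cap


def apply_audio_budget(durations_s):
    """Clip a list of per-file durations (seconds) to the training limits.

    Returns (kept_durations, warnings): files past the 30-minute budget
    are truncated, files beyond the 50-file limit are dropped.
    """
    warnings = []
    kept = []
    remaining = MAX_TOTAL_S
    for i, dur in enumerate(durations_s[:MAX_FILES]):
        if remaining <= 0:
            warnings.append(f"file {i}: skipped, 30 min budget used up")
            continue
        if dur > remaining:
            warnings.append(f"file {i}: truncated from {dur:.0f}s to {remaining:.0f}s")
            dur = remaining
        kept.append(dur)
        remaining -= dur
    if len(durations_s) > MAX_FILES:
        warnings.append(f"{len(durations_s) - MAX_FILES} files ignored (50 file limit)")
    return kept, warnings
```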
Settings (per the Side-Step author's recommendations):

- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory footprint)
- Variant: standard turbo (not XL -- the XL model hits swap on 18 GB RAM)
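A note on the rank/alpha pair above: by the usual LoRA convention the adapter update is scaled by alpha/rank, so rank 32 with alpha 64 applies the learned delta at 2x unit strength (this is the general LoRA formula, not something specific to this Space):

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA convention: the adapter delta is scaled by alpha/rank."""
    return alpha / rank


print(lora_scaling(64, 32))  # recommended settings -> scaling of 2.0
```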
## Captioning Pipeline

Training audio is auto-captioned before preprocessing:
| Method | What it extracts | Speed |
|---|---|---|
| librosa | BPM, key, time signature | ~3s/file |
| LM understand (GPU) | Rich caption + lyrics + metadata | ~52s/file |
| ace-server /understand (Space) | Same as LM, via GGUF | ~30s/file |
| .txt/.json sidecar | User-provided caption (if present) | instant |
On the Space, training uses ace-server /understand; locally, it uses the PyTorch LM understand path.
## Models
| Component | GGUF | Size | Purpose |
|---|---|---|---|
| DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
| Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
| VAE | vae-BF16 | 0.32 GB | Audio encode/decode |
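One consequence of the table: LoRA generation routes through the smaller standard-turbo DiT, while plain generation uses XL turbo. A sketch of that selection (filenames are from the table; the function name is illustrative):

```python
def select_dit_gguf(use_lora: bool) -> str:
    """Pick the DiT GGUF per the model table: the XL turbo model is not
    used with LoRA here, so any LoRA request routes to standard turbo."""
    if use_lora:
        return "acestep-v15-turbo-Q4_K_M"    # 1.1 GB, used with LoRA
    return "acestep-v15-xl-turbo-Q4_K_M"     # 2.8 GB, no LoRA
```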
## API

### Generate Music

```python
from gradio_client import Client

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    caption="upbeat electronic dance music",
    lyrics="[Instrumental]",
    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
    lora_select="None (no LoRA)",
    lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
    api_name="/generate",
)
```
### Train LoRA

```python
from gradio_client import Client, handle_file

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    audio_files=[handle_file("song.mp3")],
    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
    api_name="/train_lora",
)
```
## MCP (Model Context Protocol)

```json
{
  "mcpServers": {
    "ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
  }
}
```
## CLI

```shell
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```
## Architecture

- Inference: GGUF via acestep.cpp
- Training: PyTorch, ported from Side-Step (commit ecd13bd)
- Captioning: librosa + LM understand (PyTorch or ace-server /understand)
- Training stops ace-server to free RAM, then restarts it with any newly trained adapters
- Inference is blocked during training, with a clear message to the user
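The last two points describe a mutual-exclusion pattern: training takes exclusive ownership of the (stopped) server, and inference requests are rejected with a message rather than queued while it holds it. A minimal sketch with a lock (hypothetical; the Space's actual mechanism is not shown in this README):

```python
import threading


class ServerGate:
    """Block inference while a training run owns the stopped ace-server."""

    def __init__(self):
        self._training = threading.Lock()

    def train(self, run_training):
        with self._training:        # server is stopped here to free RAM
            run_training()
        # server restarts here; new adapters become visible to inference

    def generate(self, run_inference):
        if self._training.locked():  # reject with a clear message, don't queue
            return "Training in progress -- generation is disabled until it finishes."
        return run_inference()
```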