Nekochu committed
Commit a5741b1 · 1 parent: ff9f4ad

update README with final state, full pipeline inference, LM generation step

Files changed (1): README.md (+60, -82)
README.md CHANGED
@@ -26,100 +26,98 @@ startup_duration_timeout: 2h
 
 ## Features
 
- - **Music Generation** - Text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- - **LoRA Training** - Fine-tune on your own audio (Side-Step engine, Adafactor optimizer)
- - **Multiple LM Sizes** - 0.6B / 1.7B / 4B language models (on-demand download)
- - **CPU Only** - Runs on free HuggingFace Spaces (2 vCPU, 18GB RAM)
 
 ## Music Generation
 
- 1. Enter a music description (e.g. "upbeat electronic dance music")
 2. Enter lyrics or check **Instrumental**
 3. Adjust BPM, duration, steps, seed
- 4. Select LM model (1.7B default, fastest on CPU)
- 5. Select LoRA adapter if trained
- 6. Click **Generate Music**
 
- **Timing:** ~270s for 10s audio with 1.7B LM, 8 steps.
 
 ## LoRA Training
 
- 1. Go to **Train LoRA** tab
- 2. Upload audio files (WAV/MP3, max 240s each)
- 3. Set LoRA name, epochs (1-10), rank (default 16)
- 4. Click **Train** - ace-server stops during training, restarts after
- 5. Use **Cancel** to stop early (saves checkpoint)
- 6. Trained adapter appears in the LoRA dropdown for inference
 
- **Timing:** ~170s preprocessing + ~10s/epoch on CPU.
 
- ## Models
 
- | Component | GGUF | Size |
- |-----------|------|------|
- | DiT (music) | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB |
- | LM (captions) | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB |
- | Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB |
- | VAE | vae-BF16 | 0.32 GB |
 
- LM alternatives (on-demand download): 0.6B Q8_0 (slow), 4B Q5_K_M (best quality, ~515s).
 
- ---
 
 ## API
 
- ### Python Client - Generate Music
 
 ```python
 from gradio_client import Client
 
 client = Client("WeReCooking/ACE-Step-CPU")
-
 result = client.predict(
     caption="upbeat electronic dance music",
     lyrics="[Instrumental]",
-    instrumental=True,
-    bpm=120,
-    duration=10,
-    seed=-1,    # -1 = random
-    steps=8,    # 1-32, fewer = faster
-    lora_select="None (no LoRA)",  # or trained adapter name
     lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
     api_name="/generate"
 )
- print(result)  # (audio_path, status_message)
 ```
 
- ### Python Client - Train LoRA
 
 ```python
 from gradio_client import Client, handle_file
 
 client = Client("WeReCooking/ACE-Step-CPU")
-
 result = client.predict(
     audio_files=[handle_file("song.mp3")],
-    lora_name="my-style",
-    epochs=3,
-    lr=0.0001,
-    rank=16,
     api_name="/train_lora"
 )
- print(result)  # (log_text, train_btn, cancel_btn)
- ```
-
- ### Python Client - Server Status
-
- ```python
- result = client.predict(api_name="/server_status")
- print(result)  # JSON with model info
 ```
 
 ### MCP (Model Context Protocol)
 
- This Space supports MCP for AI assistants (Claude Desktop, Cursor, VS Code).
-
- **MCP Config:**
 ```json
 {
     "mcpServers": {
@@ -128,44 +126,24 @@ This Space supports MCP for AI assistants (Claude Desktop, Cursor, VS Code).
 }
 ```
 
- ---
-
- ## CLI Usage
 
 ```bash
- # Generate music
- python app.py "upbeat electronic dance music" --duration 10 --steps 8 --format mp3
-
- # With lyrics
- python app.py "pop ballad" --lyrics "Hello world\nThis is a test" -d 30
-
- # With LoRA adapter
 python app.py "jazz piano" --adapter my-style --seed 42
-
- # Custom server URL
- python app.py "ambient" --server http://localhost:8085
 ```
 
- ---
-
 ## Architecture
 
- ```
- ace-server (C++ GGUF)        Gradio UI (Python)
-   /lm     -> LM generate       app.py
-   /synth  -> DiT + VAE         train_engine.py (Side-Step)
-   /health                        |
-   /props                         +-- preprocess_audio()
-   /job                           +-- train_lora_generator()
- ```
-
- - **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp) HTTP API
- - **Training:** PyTorch via ported [Side-Step](https://github.com/koda-dernet/Side-Step) engine
- - Training stops ace-server (free RAM), restarts after with new adapters
 
 ## Credits
 
- - [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5) - Model architecture
- - [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp) - GGUF inference engine
- - [Side-Step](https://github.com/koda-dernet/Side-Step) - Training engine (ported)
- - [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF) - Quantized models
 
 
 ## Features
 
+ - **Music Generation** -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
+ - **LoRA Training** -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
+ - **Auto-Captioning** -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
+ - **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (on-demand download)
+ - **Cancel + Download** -- cancel training mid-epoch, download the trained LoRA adapter
 
 ## Music Generation
 
+ 1. Enter a music description
 2. Enter lyrics or check **Instrumental**
 3. Adjust BPM, duration, steps, seed
+ 4. Select a LoRA adapter if trained
+ 5. Click **Generate Music**
 
+ **Timing:** ~270s for 10s of audio with the 1.7B LM, 8 steps, on CPU.
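For scale, the timing above works out to roughly a 27x real-time factor; a quick back-of-the-envelope check (editor's sketch, not app code):

```python
# Figures from the Timing line above: ~270 s of compute for 10 s of audio.
generation_seconds = 270.0
audio_seconds = 10.0

# Real-time factor: seconds of compute per second of audio produced.
real_time_factor = generation_seconds / audio_seconds
print(real_time_factor)  # 27.0 -> CPU generation runs ~27x slower than real time
```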
 
 ## LoRA Training
 
+ 1. Upload audio files (any length; auto-tiled into 30s chunks by the VAE)
+ 2. Set LoRA name, epochs, learning rate, rank
+ 3. Click **Train** -- ace-server stops during training and restarts after
+ 4. Use **Cancel** to stop early (saves a checkpoint)
+ 5. **Download** the trained adapter file
+ 6. The trained adapter appears in the LoRA dropdown
 
+ **Timing:** ~170s preprocessing + ~11s/epoch on CPU; ~1.4s/epoch on GPU.
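Putting the two numbers together gives a rough wall-clock estimate for a CPU run; `estimated_training_seconds` is a hypothetical helper, not part of the app:

```python
def estimated_training_seconds(epochs: int,
                               preprocess_s: float = 170.0,
                               per_epoch_s: float = 11.0) -> float:
    """Rough CPU wall-clock estimate from the Timing figures above."""
    return preprocess_s + per_epoch_s * epochs

# 200 epochs (the recommended lower bound) ~= 2370 s, about 40 minutes.
print(estimated_training_seconds(200))  # 2370.0
```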
 
+ **Limits:** 30 minutes of total audio across all files (files exceeding the cap are truncated with a warning), 50 files max, 8h training timeout.
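The limits above can be pre-checked client-side before uploading; a minimal sketch assuming per-file durations are already known (the function name and truncation behavior are illustrative, not the app's actual code):

```python
MAX_FILES = 50
MAX_TOTAL_SECONDS = 30 * 60  # 30 min total-audio cap

def clamp_upload(durations_s):
    """Apply the file-count and total-duration limits, truncating the
    file that crosses the cap (mirroring the truncate-with-warning rule)."""
    if len(durations_s) > MAX_FILES:
        raise ValueError(f"{len(durations_s)} files exceeds the {MAX_FILES}-file limit")
    kept, total = [], 0.0
    for d in durations_s:
        if total + d > MAX_TOTAL_SECONDS:
            kept.append(MAX_TOTAL_SECONDS - total)  # truncated tail
            break
        kept.append(d)
        total += d
    return kept

# Three 20-minute files: the second is truncated to fit the 30-minute cap.
print(sum(clamp_upload([1200.0, 1200.0, 1200.0])))  # 1800.0
```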
 
+ **Settings (per the Side-Step author's recommendations):**
+ - LR: 3e-4
+ - Rank: 32, Alpha: 64
+ - Epochs: 200-500 for 3-10 files
+ - Optimizer: Adafactor (minimal memory)
+ - Variant: standard turbo (not XL -- XL swaps on 18GB RAM)
 
+ ## Captioning Pipeline
 
+ Training audio is auto-captioned before preprocessing:
+
+ | Method | What it extracts | Speed |
+ |--------|------------------|-------|
+ | **librosa** | BPM, key, time signature | ~3s/file |
+ | **LM understand** (GPU) | Rich caption + lyrics + metadata | ~52s/file |
+ | **ace-server /understand** (Space) | Same as LM, via GGUF | ~30s/file |
+ | **.txt/.json sidecar** | User-provided caption (if present) | instant |
+
+ On the Space, ace-server /understand runs before training; locally, the PyTorch LM understand mode is used.
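To illustrate the librosa step conceptually: tempo can be recovered from detected beat onsets. This toy estimator skips real onset detection entirely and is not the app's implementation:

```python
import numpy as np

def bpm_from_onsets(onset_times_s):
    """Median inter-onset interval -> BPM (a stand-in for librosa's beat tracker)."""
    intervals = np.diff(np.asarray(onset_times_s, dtype=float))
    return 60.0 / float(np.median(intervals))

# Clicks every 0.5 s correspond to 120 BPM.
print(bpm_from_onsets(np.arange(0.0, 10.0, 0.5)))  # 120.0
```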
+
+ ## Models
+
+ | Component | GGUF | Size | Purpose |
+ |-----------|------|------|---------|
+ | DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
+ | DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
+ | LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
+ | Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
+ | VAE | vae-BF16 | 0.32 GB | Audio encode/decode |
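Summing the Size column gives the worst-case disk footprint when every listed GGUF is present at once (an editor's quick check, not from the README):

```python
# Sizes in GB, copied from the Models table above.
sizes_gb = {
    "acestep-v15-xl-turbo-Q4_K_M": 2.8,
    "acestep-v15-turbo-Q4_K_M": 1.1,
    "acestep-5Hz-lm-1.7B-Q8_0": 1.7,
    "Qwen3-Embedding-0.6B-Q8_0": 0.75,
    "vae-BF16": 0.32,
}
total_gb = round(sum(sizes_gb.values()), 2)
print(total_gb)  # 6.67 GB if all listed models are downloaded
```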
 
 ## API
 
+ ### Generate Music
 
 ```python
 from gradio_client import Client
 
 client = Client("WeReCooking/ACE-Step-CPU")
 result = client.predict(
     caption="upbeat electronic dance music",
     lyrics="[Instrumental]",
+    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
+    lora_select="None (no LoRA)",
     lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
     api_name="/generate"
 )
 ```
 
+ ### Train LoRA
 
 ```python
 from gradio_client import Client, handle_file
 
 client = Client("WeReCooking/ACE-Step-CPU")
 result = client.predict(
     audio_files=[handle_file("song.mp3")],
+    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
     api_name="/train_lora"
 )
 ```
 
 ### MCP (Model Context Protocol)
 
 ```json
 {
     "mcpServers": {
@@ -128,44 +126,24 @@
 }
 ```
 
+ ## CLI
 
 ```bash
+ python app.py "upbeat electronic dance music" --duration 10 --steps 8
 python app.py "jazz piano" --adapter my-style --seed 42
 ```
 
 ## Architecture
 
+ - **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
+ - **Training:** PyTorch, ported from [Side-Step](https://github.com/koda-dernet/Side-Step) (commit ecd13bd)
+ - **Captioning:** librosa + LM understand (PyTorch or ace-server /understand)
+ - Training stops ace-server to free RAM and restarts it afterwards with the new adapters
+ - Inference is blocked during training with a clear message
 
 ## Credits
 
+ - [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5)
+ - [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
+ - [Side-Step](https://github.com/koda-dernet/Side-Step)
+ - [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF)