Locke committed
Commit 341db92 · 1 parent: 3e934f1

set model_type; add nmm_infer/config.json
README.md CHANGED

@@ -268,9 +268,10 @@ messages = [

 ```python
 # Simply replace the messages in the main example with the messages below.
+# Suffix user content with '<longcat_img_start>' to force image generation.
 messages = [
     {"role": "system", "content": ""},
-    {"role": "user", "content": "A small kitten sitting naturally on a moss-covered forest floor, centered in the frame, holding a rectangular wooden sign gently with its front paws resting over the top edge. The kitten has soft, fluffy fur, a natural relaxed posture, and a calm, curious expression with a slightly open mouth (not exaggerated), looking directly at the camera.\n\nThe sign is positioned firmly in front of the kitten\'s chest, supported by its paws, with realistic contact and no floating effect. The board reads \"LongCat-Next: When Modalities Internalize as Multilingual Tokens\" in clean, sharp black text, perfectly legible.\n\nThe environment is a lush forest with tall trees, ferns, and soft green foliage. The ground is covered with moss and small plants. Background softly blurred with natural depth of field. Lighting is soft, diffused sunlight filtering through the trees, creating gentle highlights and shadows. Realistic photography style, natural colors, high detail, no cartoonish exaggeration.<longcat_img_start>"}
+    {"role": "user", "content": "A small kitten sitting naturally on a moss-covered forest floor, centered in the frame, holding a rectangular wooden sign gently with its front paws resting over the top edge. The kitten has soft, fluffy fur, a natural relaxed posture, and a calm, curious expression with a slightly open mouth (not exaggerated), looking directly at the camera.\n\nThe sign is positioned firmly in front of the kitten\'s chest, supported by its paws, with realistic contact and no floating effect. The board reads \"LongCat-Next: Lexicalizing Modalities as Discrete Tokens\" in clean, sharp black text, perfectly legible.\n\nThe environment is a lush forest with tall trees, ferns, and soft green foliage. The ground is covered with moss and small plants. Background softly blurred with natural depth of field. Lighting is soft, diffused sunlight filtering through the trees, creating gentle highlights and shadows. Realistic photography style, natural colors, high detail, no cartoonish exaggeration.<longcat_img_start>"}
 ]
 ```

@@ -295,6 +296,7 @@ messages = [

 ```python
 # Simply replace the messages in the main example with the messages below.
+# Suffix user content with '<longcat_audiogen_start>' to force audio generation.
 messages = [
     {"role": "system", "content": "Replicate the voice in the audio clip to formulate an answer:<longcat_audio_start>./assets/system_audio.wav<longcat_audio_end>"},
     {"role": "user", "content": "<longcat_audio_start>./assets/math1.wav<longcat_audio_end><longcat_audiogen_start>"}

@@ -308,6 +310,7 @@ messages = [

 ```python
 # Simply replace the messages in the main example with the messages below.
+# Suffix user content with '<longcat_audiogen_start>' to force audio generation.
 messages = [
     {"role": "system", "content": "Replicate the voice in the audio clip to formulate an answer:<longcat_audio_start>./assets/vc_zh3.wav<longcat_audio_end>"},
     {"role": "user", "content": "用这个声音合成以下内容:明天的meeting在三楼的Conference Room举行。<longcat_audiogen_start>"}

@@ -317,8 +320,7 @@ messages = [
 </details>


-<!-- > [!Tip] -->
-
+> [!Tip]
 > We recommend using the following set of sampling parameters for generation:
 >
 > - Text: `{"max_new_tokens":2048,"do_sample":false}`

@@ -333,30 +335,8 @@ messages = [

 ## Deployment

-We have implemented basic adaptations in SGLang(Code is being uploaded) to support the deployment of LongCat-Next.
-
-```shell
-git clone [TBU]
-cd nmm_infer
-git checkout master
-sh setup.sh
-```
-
-```shell
-# Require CUDA >= 12.9
-
-# Setup environment
-source create_env.sh
-source set_env.sh
-
-# Run tests
-python3 demo.py \
-  --model-path meituan-longcat/LongCat-Next \
-  --sequential \
-  --output-dir output \
-  --tasks vis_gen vis_und aud_qa spk_syn
-```
+We have implemented basic adaptations in SGLang to support the deployment of LongCat-Next. Please refer to this repository for more information: [meituan-longcat/LongCat-Next-inference](https://github.com/meituan-longcat/LongCat-Next-inference)


 ## License Agreement
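The README hunks above all follow one convention: the output modality is selected by suffixing the final user turn with a trigger token (`<longcat_img_start>` for image generation, `<longcat_audiogen_start>` for audio generation). A minimal sketch of that convention; the `force_modality` helper and `TRIGGER_TOKENS` dict are ours for illustration, not part of the LongCat-Next repo:

```python
# Hypothetical helper (not from the LongCat-Next repo): appends the modality
# trigger token that the README examples place at the end of the user turn.
TRIGGER_TOKENS = {
    "image": "<longcat_img_start>",
    "audio": "<longcat_audiogen_start>",
}

def force_modality(messages, modality):
    """Return a copy of `messages` whose last user turn ends with the trigger token."""
    token = TRIGGER_TOKENS[modality]
    out = [dict(m) for m in messages]
    for m in reversed(out):
        if m["role"] == "user":
            # Append the trigger only once, even if called repeatedly.
            if not m["content"].endswith(token):
                m["content"] += token
            break
    return out

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "A small kitten holding a wooden sign."},
]
forced = force_modality(messages, "image")
```

The copy-then-mutate pattern keeps the original `messages` list reusable across modalities.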
assets/evaluation.png CHANGED

Git LFS Details (before)
  • SHA256: 82bc8ab1a053e71f1328241be1739f3b9d0f0c0f84501500070c2e8a49542759
  • Pointer size: 131 Bytes
  • Size of remote file: 267 kB

Git LFS Details (after)
  • SHA256: a4ad74ef615c6524fad0d8aa959cc85ebfb5cbb543ba1dc4fbb142801709d50e
  • Pointer size: 131 Bytes
  • Size of remote file: 267 kB
config.json CHANGED

@@ -38,6 +38,7 @@
 "emb_split_num": 4,
 "torch_dtype": "bfloat16",
 "transformers_version": "4.57.6",
+"model_type": "longcat_next",

 "text_vocab_size": 131072,
 "text_vocab_plus_multimodal_special_token_size": 131125,
configuration_longcat_next.py CHANGED

@@ -5,6 +5,7 @@ from transformers.models.whisper.configuration_whisper import WhisperConfig
 from .configuration_longcat_ngram import LongcatFlashNgramConfig

 class LongcatNextConfig(LongcatFlashNgramConfig):
+    model_type = "longcat_next"
     def __init__(
         self,
         vocab_size=131072,
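Why the class-level `model_type` attribute matters: the Hugging Face auto classes dispatch on the `model_type` string stored in config.json, mapping it to the matching config class. A simplified, self-contained stand-in for that registry mechanism (illustration only, not the actual transformers code):

```python
# Simplified stand-in for transformers' AutoConfig dispatch (illustration only).
CONFIG_REGISTRY = {}

def register_config(cls):
    """Map cls.model_type -> cls, in the spirit of AutoConfig.register()."""
    CONFIG_REGISTRY[cls.model_type] = cls
    return cls

@register_config
class LongcatNextConfigSketch:
    model_type = "longcat_next"

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

def auto_config_from_dict(config_dict):
    """Pick the config class named by the "model_type" key, as AutoConfig does."""
    cls = CONFIG_REGISTRY[config_dict["model_type"]]
    return cls(**{k: v for k, v in config_dict.items() if k != "model_type"})

cfg = auto_config_from_dict({"model_type": "longcat_next", "vocab_size": 131072})
```

Without the `model_type` attribute (and the matching key in config.json), this lookup has nothing to dispatch on, which is what the commit fixes.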
nmm_infer/config.json ADDED
@@ -0,0 +1,1153 @@
{
  "architectures": [
    "LongcatCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_omni.LongcatConfig"
  },
  "vocab_size": 282624,
  "hidden_size": 3072,
  "ffn_hidden_size": 6144,
  "expert_ffn_hidden_size": 1024,
  "num_layers": 14,
  "num_attention_heads": 32,
  "kv_lora_rank": 512,
  "q_lora_rank": 1536,
  "qk_rope_head_dim": 64,
  "v_head_dim": 128,
  "qk_nope_head_dim": 128,
  "mla_scale_q_lora": true,
  "mla_scale_kv_lora": true,
  "routed_scaling_factor": 6.0,
  "n_routed_experts": 256,
  "max_position_embeddings": 131072,
  "rms_norm_eps": 1e-5,
  "use_cache": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "rope_theta": 10000000,
  "attention_method": "MLA",
  "zero_expert_num": 128,
  "zero_expert_type": "identity",
  "moe_topk": 12,
  "use_mla": 1,
  "moe_switch_token_num": 1024,
  "moe_impl": "mix",
  "oe_vocab_size_ratio": 78,
  "oe_neighbor_num": 4,
  "oe_split_num": 4,
  "embP": 16,
  "visual_offset": 150581,
  "vocab_size_text": 131072,
  "vocab_size_special": 131125,
  "audio_offset": 131125,
  "audio_config": {
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_cross_attention": false,
    "apply_spec_augment": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "audio_codebook_loss_weights": [1.0, 0.5, 0.25, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125],
    "audio_delim_token_id": 131116,
    "audio_end_token_id": 131104,
    "audio_head_transformer_dims": 3072,
    "audio_head_transformer_ffn_scale": 16,
    "audio_head_transformer_layers": 4,
    "audio_loss_weight": 1.0,
    "audio_pad_token_id": 131105,
    "audio_start_token_id": 131103,
    "audio_text_delay": 1,
    "audiogen_end_token_id": 131124,
    "audiogen_start_token_id": 131123,
    "audiotext_end_token_id": 131121,
    "audiotext_loss_weight": 1.0,
    "audiotext_pad_token_id": 131122,
    "audiotext_start_token_id": 131120,
    "avg_pooler": 4,
    "bad_words_ids": null,
    "begin_suppress_tokens": [220, 50256],
    "bos_token_id": 50256,
    "chunk_size_feed_forward": 0,
    "classifier_proj_size": 256,
    "cross_attention_hidden_size": null,
    "d_model": 1280,
    "decoder_attention_heads": 20,
    "decoder_ffn_dim": 5120,
    "decoder_kernel_size": 3,
    "decoder_layerdrop": 0.0,
    "decoder_layers": 8,
    "decoder_start_token_id": 50257,
    "decoder_stride_size": 2,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "enable": true,
    "encoder_attention_heads": 20,
    "encoder_ffn_dim": 5120,
    "encoder_layerdrop": 0.0,
    "encoder_layers": 32,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 50256,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hop_length": 160,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "init_std": 0.02,
    "is_decoder": false,
    "is_encoder_decoder": true,
    "kernel_size": 3,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "length_penalty": 1.0,
    "mask_feature_length": 10,
    "mask_feature_min_masks": 0,
    "mask_feature_prob": 0.0,
    "mask_time_length": 10,
    "mask_time_min_masks": 2,
    "mask_time_prob": 0.05,
    "max_audio_seconds": 30,
    "max_length": 20,
    "max_source_positions": 1500,
    "max_target_positions": 448,
    "median_filter_width": 7,
    "min_length": 0,
    "model_type": "whisper",
    "n_fft": 400,
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_mel_bins": 128,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 50256,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sampling_rate": 16000,
    "scale_embedding": false,
    "sep_token_id": null,
    "split_overlap": 0.0,
    "stride_size": 2,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "use_weighted_layer_sum": false,
    "vocab_size": 51865,
    "vq_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "codebook_sizes": [8192, 4096, 2048, 1024, 1024, 1024, 1024, 1024],
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "enable": true,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {"LABEL_0": 0, "LABEL_1": 1},
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false
    }
  },
  "flow_matching_config": {
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "act_fn": "gelu",
    "add_cross_attention": false,
    "architectures": null,
    "attention_head_dim": 64,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "cal_mel_mae": true,
    "cfm_params": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
      "inference_cfg_rate": 0.7,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {"LABEL_0": 0, "LABEL_1": 1},
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "reg_loss_type": "l1",
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "sigma_min": 1e-06,
      "solver": "euler",
      "suppress_tokens": null,
      "t_scheduler": "cosine",
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "training_cfg_rate": 0.2,
      "typical_p": 1.0,
      "use_bfloat16": false
    },
    "channels": [256],
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diffusion_steps": 10,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "enable": true,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hop_length": 480,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "in_channels": 80,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "length_penalty": 1.0,
    "loss_weight": 1.0,
    "max_audio_seconds": 30,
    "max_length": 20,
    "min_length": 0,
    "model_type": "",
    "n_blocks": 4,
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_heads": 8,
    "num_mid_blocks": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "prenet_activation_function": "gelu",
    "prenet_attention_heads": 8,
    "prenet_d_model": 512,
    "prenet_ffn_dim": 2048,
    "prenet_in_dim": 1280,
    "prenet_loss_weight": 1.0,
    "prenet_max_source_positions": 5000,
    "prenet_nlayers": 12,
    "prenet_out_dim": 80,
    "prenet_target_mel_length_scale_ratio": 1.0,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sampling_rate": 24000,
    "sep_token_id": null,
    "spk_emb_dim": 0,
    "split_overlap": 0.1,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "unet_use_omni_attn": false,
    "use_bfloat16": false,
    "use_hidden_states_before_dconv2": true,
    "use_hires_mel": true
  },
  "head_dim": 128,
  "hidden_act": "silu",
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_window_layers": 28,
  "model_type": "omni",
  "multimodal": ["image", "video", "audio", "audiogen"],
  "multimodal_special_token_list": [
    131072, 131073, 131074, 131075, 131076, 131077, 131078, 131079,
    131080, 131081, 131082, 131083, 131084, 131085, 131086, 131087,
    131088, 131089, 131090, 131091, 131092, 131093, 131094, 131095,
    131096, 131097, 131098, 131099, 131100, 131101, 131102, 131103,
    131104, 131105, 131106, 131107, 131108, 131109, 131110, 131111,
    131112, 131113, 131114, 131115, 131116, 131117, 131118, 131119,
    131120, 131121, 131122, 131123, 131124
  ],
  "num_hidden_layers": 14,
  "num_key_value_heads": 4,
  "omni_tokenizer_type": "auto",
  "pad_token_id": 0,
  "position_embedding_type": "rope",
  "rope_scaling": {
    "mrope_section": [16, 24, 24],
    "type": "mrope"
  },
  "sliding_window": 131072,
  "sparse_attention_heads": null,
  "sparse_attention_layers": [],
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "train_multimodal_special_tokens_only": false,
  "transformers_version": "4.51.3",
  "use_norm_head": false,
  "use_sliding_window": false,
  "video_config": {
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decode_way": "1fps",
    "decoder_start_token_id": null,
    "depth": 32,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "enable": true,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "frame_pattern": "<frame><TIMEIDX>",
    "hidden_act": "quick_gelu",
    "hidden_size": 3584,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "image_delimiter_token_id": 131115,
    "image_end_token_id": 131107,
    "image_line_token_id": 131109,
    "image_mean": [0.48145466, 0.4578275, 0.40821073],
    "image_pad_token_id": 131108,
    "image_size": 224,
    "image_start_token_id": 131106,
    "image_std": [0.26862954, 0.26130258, 0.27577711],
    "in_channels": 3,
    "in_chans": 3,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3420,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_frame_num": 128,
    "max_length": 20,
    "max_pixels": 150528,
    "merge_size": 2,
    "min_length": 0,
    "min_pixels": 78400,
    "mlp_ratio": 4,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_heads": 16,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 512,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "split_video": true,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "temporal_patch_size": 2,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "video_end_token_id": 131119,
    "video_place_token_id": 131117,
    "video_start_token_id": 131118
  },
  "visual_config": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "depth": 32,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "enable": true,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "freeze": true,
    "fullatt_block_indexes": [7, 15, 23, 31],
    "hidden_act": "silu",
    "hidden_size": 1280,
    "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
    "image_delimiter_token_id": 131115,
    "image_end_token_id": 131107,
    "image_head_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "enable": true,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
      "image_head_transformer_dims": 2048,
      "image_head_transformer_ffn_scale": 16,
      "image_head_transformer_layers": 4,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {"LABEL_0": 0, "LABEL_1": 1},
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false,
      "visual_codebook_loss_weights": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    },
    "image_line_token_id": 131109,
    "image_mean": [0.48145466, 0.4578275, 0.40821073],
    "image_pad_token_id": 131108,
    "image_size": 224,
    "image_start_token_id": 131106,
    "image_std": [0.26862954, 0.26130258, 0.27577711],
    "in_channels": 3,
    "in_chans": 3,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3420,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {"LABEL_0": 0, "LABEL_1": 1},
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_pixels": 3211264,
    "merge_size": 2,
    "min_length": 0,
    "min_pixels": 50176,
    "mlp_ratio": 4,
    "model_type": "clip_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_heads": 16,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "out_hidden_size": 3584,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 512,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "temporal_patch_size": 2,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "tokens_per_second": 2,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "visual_loss_weight": 0.125,
    "window_size": 112
  },
  "visual_quantizer_config": {
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "codebook_dim": 3584,
    "codebook_size": 16384,
    "codebook_sizes": [16384, 16384, 16384, 16384, 16384, 16384, 16384, 16384],
    "commit_loss_ratio": 0.25,
    "cross_attention_hidden_size": null,
    "decay": 0.99,
    "decoder_start_token_id": null,
    "depth": 8,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "enable": true,
    "encoder_no_repeat_ngram_size": 0,
    "entropy_loss_ratio": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "feature_decoder_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "depth": 3,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "enable": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "fullatt_block_indexes": [0, 1, 2],
      "hidden_act": "silu",
      "hidden_size": 1280,
      "id2label": {"0": "LABEL_0", "1": "LABEL_1"},
      "in_channels": 1280,
      "intermediate_size": 3420,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {"LABEL_0": 0, "LABEL_1": 1},
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_heads": 16,
      "num_return_sequences": 1,
+ "out_channels": 3584,
903
+ "output_attentions": false,
904
+ "output_hidden_states": false,
905
+ "output_scores": false,
906
+ "pad_token_id": null,
907
+ "patch_size": 1,
908
+ "post_quant_conv": true,
909
+ "prefix": null,
910
+ "problem_type": null,
911
+ "pruned_heads": {},
912
+ "remove_invalid_values": false,
913
+ "repetition_penalty": 1.0,
914
+ "return_dict": true,
915
+ "return_dict_in_generate": false,
916
+ "sep_token_id": null,
917
+ "spatial_merge_size": 1,
918
+ "suppress_tokens": null,
919
+ "task_specific_params": null,
920
+ "temperature": 1.0,
921
+ "temporal_patch_size": 1,
922
+ "tf_legacy_loss": false,
923
+ "tie_encoder_decoder": false,
924
+ "tie_word_embeddings": true,
925
+ "tokenizer_class": null,
926
+ "top_k": 50,
927
+ "top_p": 1.0,
928
+ "torch_dtype": null,
929
+ "torchscript": false,
930
+ "typical_p": 1.0,
931
+ "use_bfloat16": false,
932
+ "use_sliding_window": false,
933
+ "window_size": -1
934
+ },
935
+ "feature_reconstruction_ratio": 1.0,
936
+ "finetuning_task": null,
937
+ "forced_bos_token_id": null,
938
+ "forced_eos_token_id": null,
939
+ "id2label": {
940
+ "0": "LABEL_0",
941
+ "1": "LABEL_1"
942
+ },
943
+ "in_channels": 3584,
944
+ "is_decoder": false,
945
+ "is_encoder_decoder": false,
946
+ "label2id": {
947
+ "LABEL_0": 0,
948
+ "LABEL_1": 1
949
+ },
950
+ "length_penalty": 1.0,
951
+ "max_length": 20,
952
+ "min_length": 0,
953
+ "model_type": "",
954
+ "no_repeat_ngram_size": 0,
955
+ "num_beam_groups": 1,
956
+ "num_beams": 1,
957
+ "num_return_sequences": 1,
958
+ "output_attentions": false,
959
+ "output_hidden_states": false,
960
+ "output_scores": false,
961
+ "pad_token_id": null,
962
+ "prefix": null,
963
+ "problem_type": null,
964
+ "pruned_heads": {},
965
+ "quant_conv": true,
966
+ "quantizer_type": "rq",
967
+ "remove_invalid_values": false,
968
+ "repetition_penalty": 1.0,
969
+ "restart_unused_codes": true,
970
+ "return_dict": true,
971
+ "return_dict_in_generate": false,
972
+ "sep_token_id": null,
973
+ "shared_codebook": true,
974
+ "suppress_tokens": null,
975
+ "task_specific_params": null,
976
+ "temperature": 1.0,
977
+ "tf_legacy_loss": false,
978
+ "tie_encoder_decoder": false,
979
+ "tie_word_embeddings": true,
980
+ "tokenizer_class": null,
981
+ "top_k": 50,
982
+ "top_p": 1.0,
983
+ "torch_dtype": null,
984
+ "torchscript": false,
985
+ "typical_p": 1.0,
986
+ "use_bfloat16": false,
987
+ "vq_loss_ratio": 0
988
+ },
989
+ "vocoder_config": {
990
+ "_attn_implementation_autoset": false,
991
+ "_name_or_path": "",
992
+ "add_cross_attention": false,
993
+ "architectures": null,
994
+ "bad_words_ids": null,
995
+ "begin_suppress_tokens": null,
996
+ "bos_token_id": null,
997
+ "channels": [
998
+ 256,
999
+ 256,
1000
+ 256,
1001
+ 256,
1002
+ 256
1003
+ ],
1004
+ "chunk_size_feed_forward": 0,
1005
+ "cross_attention_hidden_size": null,
1006
+ "decoder_start_token_id": null,
1007
+ "diversity_penalty": 0.0,
1008
+ "do_sample": false,
1009
+ "early_stopping": false,
1010
+ "enable": true,
1011
+ "enable_multi_scale": true,
1012
+ "encoder_no_repeat_ngram_size": 0,
1013
+ "eos_token_id": null,
1014
+ "exponential_decay_length_penalty": null,
1015
+ "finetuning_task": null,
1016
+ "forced_bos_token_id": null,
1017
+ "forced_eos_token_id": null,
1018
+ "hop_length": 256,
1019
+ "id2label": {
1020
+ "0": "LABEL_0",
1021
+ "1": "LABEL_1"
1022
+ },
1023
+ "is_decoder": false,
1024
+ "is_encoder_decoder": false,
1025
+ "label2id": {
1026
+ "LABEL_0": 0,
1027
+ "LABEL_1": 1
1028
+ },
1029
+ "length_penalty": 1.0,
1030
+ "max_audio_seconds": 30,
1031
+ "max_length": 20,
1032
+ "min_length": 0,
1033
+ "model_type": "",
1034
+ "n_fft": 1024,
1035
+ "no_repeat_ngram_size": 0,
1036
+ "num_beam_groups": 1,
1037
+ "num_beams": 1,
1038
+ "num_mel_bins": 80,
1039
+ "num_return_sequences": 1,
1040
+ "output_attentions": false,
1041
+ "output_hidden_states": false,
1042
+ "output_scores": false,
1043
+ "pad_token_id": null,
1044
+ "prefix": null,
1045
+ "problem_type": null,
1046
+ "pruned_heads": {},
1047
+ "remove_invalid_values": false,
1048
+ "repetition_penalty": 1.0,
1049
+ "return_dict": true,
1050
+ "return_dict_in_generate": false,
1051
+ "sampling_rate": 16000,
1052
+ "sep_token_id": null,
1053
+ "split_overlap": 0.0,
1054
+ "suppress_tokens": null,
1055
+ "task_specific_params": null,
1056
+ "temperature": 1.0,
1057
+ "tf_legacy_loss": false,
1058
+ "tie_encoder_decoder": false,
1059
+ "tie_word_embeddings": true,
1060
+ "tokenizer_class": null,
1061
+ "top_k": 50,
1062
+ "top_p": 1.0,
1063
+ "torch_dtype": null,
1064
+ "torchscript": false,
1065
+ "typical_p": 1.0,
1066
+ "use_bfloat16": false
1067
+ },
1068
+ "visual_decoder_config": {
1069
+ "codebook_dim": 3584,
1070
+
1071
+ "image_decoder_config": {
1072
+ "attention_dropout": 0.0,
1073
+ "codebook_dim": 3584,
1074
+ "distill_taps": [
1075
+ 3,
1076
+ 7,
1077
+ 15,
1078
+ 23
1079
+ ],
1080
+ "hidden_act": "gelu",
1081
+ "hidden_size": 1024,
1082
+ "intermediate_size": 2730,
1083
+ "k_bias": false,
1084
+ "layer_norm_eps": 1e-06,
1085
+ "num_attention_heads": 16,
1086
+ "num_hidden_layers": 32,
1087
+ "patch_size": 14,
1088
+ "q_bias": true,
1089
+ "spatial_merge_size": 2,
1090
+ "subln": true,
1091
+ "swiglu": true,
1092
+ "teacher_dims": {
1093
+ "15": 1280,
1094
+ "23": 1280,
1095
+ "3": 1280,
1096
+ "7": 1280
1097
+ },
1098
+ "temporal_patch_size": 2,
1099
+ "v_bias": true
1100
+ },
1101
+
1102
+ "transformer_config": {
1103
+ "patch_size": 2,
1104
+ "in_channels": 16,
1105
+ "hidden_size": 2520,
1106
+ "num_layers": 32,
1107
+ "num_refiner_layers": 2,
1108
+ "num_attention_heads": 21,
1109
+ "num_kv_heads": 7,
1110
+ "multiple_of": 256,
1111
+ "norm_eps": 1e-5,
1112
+ "axes_dim_rope": [40, 40, 40],
1113
+ "axes_lens": [10000, 10000, 10000],
1114
+ "text_feat_dim": 2048,
1115
+ "timestep_scale": 1000.0
1116
+ },
1117
+
1118
+ "vae_config": {
1119
+ "act_fn": "silu",
1120
+ "block_out_channels": [128, 256, 512, 512],
1121
+ "down_block_types": [
1122
+ "DownEncoderBlock2D",
1123
+ "DownEncoderBlock2D",
1124
+ "DownEncoderBlock2D",
1125
+ "DownEncoderBlock2D"
1126
+ ],
1127
+ "in_channels": 3,
1128
+ "latent_channels": 16,
1129
+ "layers_per_block": 2,
1130
+ "mid_block_add_attention": true,
1131
+ "norm_num_groups": 32,
1132
+ "out_channels": 3,
1133
+ "sample_size": 1024,
1134
+ "scaling_factor": 0.3611,
1135
+ "shift_factor": 0.1159,
1136
+ "up_block_types": [
1137
+ "UpDecoderBlock2D",
1138
+ "UpDecoderBlock2D",
1139
+ "UpDecoderBlock2D",
1140
+ "UpDecoderBlock2D"
1141
+ ],
1142
+ "use_post_quant_conv": false,
1143
+ "use_quant_conv": false,
1144
+ "force_upcast": true
1145
+ },
1146
+
1147
+ "scheduler_config": {
1148
+ "num_train_timesteps": 1000,
1149
+ "dynamic_time_shift": true
1150
+ }
1151
+ },
1152
+ "ignored_token_ids": "131072:131125"
1153
+ }
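Note that the trailing `"ignored_token_ids"` entry encodes a token-id range as a `"start:end"` string rather than a JSON list. A minimal sketch of how such a spec could be expanded, assuming a Python-style half-open range (the actual loading code defines the real semantics, and `parse_ignored_token_ids` is a hypothetical helper, not part of this repo):

```python
import json

def parse_ignored_token_ids(spec: str) -> range:
    # "start:end" is assumed to be a half-open range of token ids,
    # mirroring Python's range() convention.
    start, end = (int(part) for part in spec.split(":"))
    return range(start, end)

# Minimal fragment of the config above, for illustration only.
fragment = json.loads('{"ignored_token_ids": "131072:131125"}')
ids = parse_ignored_token_ids(fragment["ignored_token_ids"])
print(len(ids), ids[0], ids[-1])  # 53 ids: 131072 .. 131124
```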
nmm_infer/configuration_omni.py ADDED
@@ -0,0 +1,248 @@
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.modeling_rope_utils import rope_config_validation
+ from transformers.utils import logging
+ from transformers import WhisperConfig
+ from transformers import CLIPVisionConfig
+
+ logger = logging.get_logger(__name__)
+
+ LONGCAT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+
+
+ class LongcatConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`LongcatModel`]. It is used to instantiate a
+     Longcat model according to the specified arguments, defining the model architecture. Instantiating a
+     configuration with the defaults will yield a similar configuration to that of the Longcat.
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 282624):
+             Vocabulary size of the Longcat model. Defines the number of different tokens that can be represented by
+             the `inputs_ids` passed when calling [`LongcatModel`].
+         hidden_size (`int`, *optional*, defaults to 3072):
+             Dimension of the hidden representations.
+         ffn_hidden_size (`int`, *optional*, defaults to 6144):
+             Dimension of the MLP representations.
+         expert_ffn_hidden_size (`int`, *optional*, defaults to 1024):
+             Dimension of the MoE representations.
+         num_layers (`int`, *optional*, defaults to 14):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 32):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         num_key_value_heads (`int`, *optional*, defaults to 4):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if
+             `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be
+             constructed by meanpooling all the original heads within that group. For more details check out [this
+             paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+             `num_attention_heads`.
+         n_routed_experts (`int`, *optional*, defaults to 256):
+             Number of routed experts.
+         routed_scaling_factor (`float`, *optional*, defaults to 6.0):
+             Scaling factor for routed experts.
+         kv_lora_rank (`int`, *optional*, defaults to 512):
+             Rank of the LoRA matrices for key and value projections.
+         q_lora_rank (`int`, *optional*, defaults to 1536):
+             Rank of the LoRA matrices for query projections.
+         qk_rope_head_dim (`int`, *optional*, defaults to 64):
+             Dimension of the query/key heads that use rotary position embeddings.
+         v_head_dim (`int`, *optional*, defaults to 128):
+             Dimension of the value heads.
+         qk_nope_head_dim (`int`, *optional*, defaults to 128):
+             Dimension of the query/key heads that don't use rotary position embeddings.
+         norm_topk_prob (`bool`, *optional*, defaults to `False`):
+             Whether to normalize the weights of the routed experts.
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 8192):
+             The maximum sequence length that this model might ever be used with.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         pad_token_id (`int`, *optional*):
+             Padding token id.
+         bos_token_id (`int`, *optional*, defaults to 1):
+             Beginning of stream token id.
+         eos_token_id (`int`, *optional*, defaults to 2):
+             End of stream token id.
+         tie_word_embeddings (`bool`, *optional*, defaults to `True`):
+             Whether to tie weight embeddings.
+         rope_theta (`float`, *optional*, defaults to 1000000):
+             The base period of the RoPE embeddings.
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use a bias in the query, key, value and output projection layers during self-attention.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+
+     ```python
+     >>> from transformers import LongcatModel, LongcatConfig
+
+     >>> # Initializing a Longcat style configuration
+     >>> configuration = LongcatConfig()
+
+     >>> # Initializing a model from the configuration
+     >>> model = LongcatModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "longcat"
+     keys_to_ignore_at_inference = ["past_key_values"]
+     base_model_tp_plan = {  # TODO: only replicate attention layers when > first_k_dense_replace
+         "layers.*.self_attn.k_proj": "colwise",
+         "layers.*.self_attn.v_proj": "colwise",
+         "layers.*.self_attn.o_proj": "rowwise",
+         "layers.*.mlp.experts.*.gate_proj": "local_colwise",
+         "layers.*.mlp.experts.*.up_proj": "local_colwise",
+         "layers.*.mlp.experts.*.down_proj": "local_rowwise",
+         "layers.*.mlps.*.gate_proj": "local_colwise",
+         "layers.*.mlps.*.up_proj": "local_colwise",
+         "layers.*.mlps.*.down_proj": "local_rowwise",
+     }
+     base_model_pp_plan = {
+         "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+         "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+         "norm": (["hidden_states"], ["hidden_states"]),
+     }
+
+     def __init__(
+         self,
+         vocab_size=282624,
+         hidden_size=3072,
+         ffn_hidden_size=6144,
+         expert_ffn_hidden_size=1024,
+         num_layers=14,
+         num_attention_heads=32,
+         num_key_value_heads=4,
+         n_routed_experts=256,
+         routed_scaling_factor=6.0,
+         kv_lora_rank=512,
+         q_lora_rank=1536,
+         qk_rope_head_dim=64,
+         v_head_dim=128,
+         head_dim=128,
+         qk_nope_head_dim=128,
+         mla_scale_q_lora=True,
+         mla_scale_kv_lora=True,
+         moe_topk=12,
+         norm_topk_prob=False,
+         hidden_act="silu",
+         max_position_embeddings=8192,
+         rms_norm_eps=1e-5,
+         use_cache=True,
+         pad_token_id=None,
+         bos_token_id=1,
+         eos_token_id=2,
+         tie_word_embeddings=True,
+         rope_theta=1000000,
+         attention_bias=False,
+         attention_dropout=0.0,
+         attention_method='MLA',
+         initializer_range=0.02,
+         router_bias=False,
+         zero_expert_num=None,
+         zero_expert_type=None,
+         oe_vocab_size_ratio=None,
+         oe_neighbor_num=None,
+         oe_split_num=None,
+         embP=None,
+         audio_config=None,
+         visual_config=None,
+         video_config=None,
+         vocoder_config=None,
+         flow_matching_config=None,
+         visual_quantizer_config=None,
+         visual_decoder_config=None,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.ffn_hidden_size = ffn_hidden_size
+         self.expert_ffn_hidden_size = expert_ffn_hidden_size
+         self.num_layers = num_layers
+         self.num_attention_heads = num_attention_heads
+         self.n_routed_experts = n_routed_experts
+         self.routed_scaling_factor = routed_scaling_factor
+         self.kv_lora_rank = kv_lora_rank
+         self.q_lora_rank = q_lora_rank
+         self.qk_rope_head_dim = qk_rope_head_dim
+         self.v_head_dim = v_head_dim
+         self.qk_nope_head_dim = qk_nope_head_dim
+         self.qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
+         self.moe_topk = moe_topk
+         self.norm_topk_prob = norm_topk_prob
+         self.mla_scale_q_lora = mla_scale_q_lora
+         self.mla_scale_kv_lora = mla_scale_kv_lora
+         self.attention_method = attention_method
+         self.initializer_range = initializer_range
+         self.router_bias = router_bias
+         self.zero_expert_num = zero_expert_num
+         self.zero_expert_type = zero_expert_type
+         self.oe_vocab_size_ratio = oe_vocab_size_ratio
+         self.oe_neighbor_num = oe_neighbor_num
+         self.oe_split_num = oe_split_num
+         self.embP = embP
+
+         if self.attention_method == "GQA":
+             self.head_dim = head_dim
+         elif self.attention_method == "MLA":
+             self.head_dim = qk_rope_head_dim
+         else:
+             raise ValueError('attention_method should be one of ["GQA", "MLA"]')
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+         # Validate the correctness of rotary position embeddings parameters
+         # BC: if there is a 'type' field, copy it to 'rope_type'.
+         rope_config_validation(self)
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+         if audio_config is not None:
+             self.audio_config = WhisperConfig(**audio_config)
+             if self.audio_config.vq_config is not None:
+                 self.audio_config.vq_config = PretrainedConfig(**self.audio_config.vq_config)
+         if vocoder_config is not None:
+             self.vocoder_config = PretrainedConfig(**vocoder_config)
+         if flow_matching_config is not None:
+             self.flow_matching_config = PretrainedConfig(**flow_matching_config)
+             self.flow_matching_config.cfm_params = PretrainedConfig(**self.flow_matching_config.cfm_params)
+         if visual_config is not None:
+             self.visual_config = CLIPVisionConfig(**visual_config)
+             if hasattr(self.visual_config, 'vq_config') and self.visual_config.vq_config is not None:
+                 self.visual_config.vq_config = PretrainedConfig(**self.visual_config.vq_config)
+             if hasattr(self.visual_config, 'image_head_config') and self.visual_config.image_head_config is not None:
+                 self.visual_config.image_head_config = PretrainedConfig(**self.visual_config.image_head_config)
+         if video_config is not None:
+             self.video_config = CLIPVisionConfig(**video_config)
+         if visual_quantizer_config is not None:
+             self.visual_quantizer_config = PretrainedConfig(**visual_quantizer_config)
+             self.visual_quantizer_config.feature_decoder_config = PretrainedConfig(
+                 **self.visual_quantizer_config.feature_decoder_config)
+         if visual_decoder_config is not None:
+             self.visual_decoder_config = PretrainedConfig(**visual_decoder_config)
+             self.visual_decoder_config.image_decoder_config = PretrainedConfig(
+                 **getattr(self.visual_decoder_config, "image_decoder_config", {}))
+             self.visual_decoder_config.transformer_config = PretrainedConfig(
+                 **getattr(self.visual_decoder_config, "transformer_config", {}))
+             self.visual_decoder_config.vae_config = PretrainedConfig(
+                 **getattr(self.visual_decoder_config, "vae_config", {}))
+             self.visual_decoder_config.scheduler_config = PretrainedConfig(
+                 **getattr(self.visual_decoder_config, "scheduler_config", {}))
+
+
+ __all__ = ["LongcatConfig"]
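For reference, the head-dimension bookkeeping in `LongcatConfig.__init__` can be exercised without a `transformers` install. This standalone sketch mirrors the defaults above (`qk_nope_head_dim=128`, `qk_rope_head_dim=64`, `head_dim=128`); `head_dims` is a hypothetical helper for illustration, not part of the module:

```python
# Mirrors LongcatConfig.__init__: qk_head_dim is the sum of the no-RoPE and
# RoPE components, while head_dim depends on the attention method.
def head_dims(attention_method="MLA", qk_nope_head_dim=128,
              qk_rope_head_dim=64, head_dim=128):
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    if attention_method == "GQA":
        return qk_head_dim, head_dim          # GQA keeps head_dim verbatim
    if attention_method == "MLA":
        return qk_head_dim, qk_rope_head_dim  # MLA reuses the RoPE dim
    raise ValueError('attention_method should be one of ["GQA", "MLA"]')

print(head_dims())       # (192, 64)
print(head_dims("GQA"))  # (192, 128)
```

This also makes the bug fixed above visible: without the `raise`, an unknown `attention_method` would fall through silently and leave `head_dim` unset.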