Anyone get this working on 2x rtx 6000 pro?

#10
by Quadrapole - opened

Would love to see how you got it running, and whether you're using Q3.

Or how it runs at Q3.

Thanks

2x RTX PRO 6000 = 192 GB

This model = 854 GB

So nope, not possible.
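To make the arithmetic explicit, here is a minimal sketch using only the two figures above (96 GB per card, implied by 2x = 192 GB, and 854 GB for the model). The function names are made up for illustration, and the check counts weights only; KV cache, activations, and runtime overhead would need extra room on top.

```python
import math

GPU_VRAM_GB = 96      # per RTX PRO 6000, implied by 2x = 192 GB in this thread
NUM_GPUS = 2
MODEL_SIZE_GB = 854   # model size quoted in this thread


def fits_in_vram(model_gb, num_gpus, vram_per_gpu_gb):
    """True if the weights alone fit in the combined VRAM (no overhead counted)."""
    return model_gb <= num_gpus * vram_per_gpu_gb


def min_gpus_for_weights(model_gb, vram_per_gpu_gb):
    """Smallest number of cards whose combined VRAM holds the weights alone."""
    return math.ceil(model_gb / vram_per_gpu_gb)


total_vram_gb = NUM_GPUS * GPU_VRAM_GB
print(f"Total VRAM: {total_vram_gb} GB")
print(f"Fits: {fits_in_vram(MODEL_SIZE_GB, NUM_GPUS, GPU_VRAM_GB)}")
print(f"Shortfall: {MODEL_SIZE_GB - total_vram_gb} GB")
print(f"Cards needed for weights alone: {min_gpus_for_weights(MODEL_SIZE_GB, GPU_VRAM_GB)}")
```

This is a lower bound: a real deployment needs headroom beyond the weights, so the actual card count would be higher.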

About Q3: I assume you mean the Q3 GGUFs (Q3_K_L, Q3_K_M, Q3_K_S). The issue is that GGUF support is not ready yet. It's still a work in progress, see here:

https://github.com/ggml-org/llama.cpp/pull/24523

You might be wondering how Unsloth posted GGUFs, then, here:

https://huggingface.co/unsloth/MiniMax-M3-GGUF

That's because Unsloth released EXPERIMENTAL GGUFs built on the partial, preliminary support available from PR 24523. Since that support is only preliminary, many things don't work yet: some features are unsupported and others need fixes.

One example is the vision projector. If you look at the Unsloth GGUFs, there is no vision projector in there. Another is sparse attention (MiniMaxM3SparseForConditionalGeneration), which does not work either, so inference falls back to dense attention instead. Unsloth says so on their model card: "Note: MiniMax Sparse Attention is not supported yet, so inference falls back to dense attention."

So basically the full multimodal model doesn't have GGUF support yet. It's text-only, with no support for sparse attention, and tool calling needs fixes.
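If you want to check a GGUF repo for a vision projector yourself, here is a rough sketch. It assumes the common llama.cpp convention of shipping the projector as a separate `.gguf` file with `mmproj` in its name. The file names in the example are hypothetical, not the actual contents of the Unsloth repo.

```python
def has_vision_projector(filenames):
    """Return True if any file looks like a llama.cpp vision projector.

    Assumption: projectors follow the usual naming convention, i.e. a
    .gguf file whose name contains "mmproj".
    """
    for name in filenames:
        lowered = name.lower()
        if "mmproj" in lowered and lowered.endswith(".gguf"):
            return True
    return False


# Hypothetical file listings, for illustration only.
text_only_repo = ["README.md", "model-Q3_K_M-00001-of-00002.gguf"]
multimodal_repo = ["README.md", "model-Q3_K_M.gguf", "mmproj-F16.gguf"]

print(has_vision_projector(text_only_repo))
print(has_vision_projector(multimodal_repo))
```

A missing projector file only tells you the upload is text-only. It says nothing about sparse attention or tool calling, which depend on the llama.cpp side.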
