Note: For full-duplex (real-time) inference, use the 8-bit variant instead. 4-bit quantization significantly degrades PersonaPlex response quality: INT8 is both ~30% faster (112 ms vs 158 ms per step) and produces coherent responses where INT4 generates garbled output.
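For reference, the quoted "~30% faster" follows directly from the two per-step times above:

```python
# Sanity-check the INT8 vs INT4 per-step latency figures quoted above.
int8_ms = 112.0  # INT8 time per step (ms)
int4_ms = 158.0  # INT4 time per step (ms)

# Relative reduction in step time when switching from INT4 to INT8
reduction = (int4_ms - int8_ms) / int4_ms
print(f"INT8 is {reduction:.0%} faster per step")  # -> 29%, i.e. ~30%
```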
# PersonaPlex-7B MLX 4-bit

PersonaPlex 7B full-duplex speech-to-speech model converted to MLX safetensors with 4-bit quantization for Apple Silicon.

Converted from nvidia/personaplex-7b-v1 (based on the Kyutai Moshi architecture).

- Swift inference: soniqo/speech-swift
- Library docs: soniqo.audio
- Blog: PersonaPlex on Apple Silicon: Full-Duplex Speech-to-Speech in Native Swift with MLX
## Model Details
| Component | Architecture | Size |
|---|---|---|
| Temporal Transformer | 32-layer, 4096d, 32 heads (7B params) | ~3.5 GB (4-bit) |
| Depformer | 6-layer, 1024d, 16 heads, per-codebook weights | ~50 MB (fp16) |
| Mimi Codec | SEANet encoder/decoder + 8L transformer + 16 RVQ codebooks | ~370 MB (fp16) |
| Embeddings | Text + 16 audio embeddings + output heads | ~940 MB (fp16) |
| Total | | ~4.9 GB |
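As a rough check on the table, the dominant sizes follow from parameter count and bit width (a back-of-envelope sketch; it ignores the small per-group scale/bias overhead that group quantization adds):

```python
# Back-of-envelope size check for the components table above.
GB = 1e9

# 7B params at 4 bits/weight = 0.5 bytes/weight
temporal_gb = 7e9 * 0.5 / GB                 # ~3.5 GB
# Add the fp16 components: depformer, Mimi codec, embeddings
total_gb = temporal_gb + 0.05 + 0.37 + 0.94  # ~4.9 GB

print(f"temporal ~ {temporal_gb:.1f} GB, total ~ {total_gb:.1f} GB")
```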
## Architecture
```
[User Audio 24kHz] → [Mimi Encoder] → 16 codebook tokens @ 12.5Hz
                                            ↓
          [Temporal Transformer: 32L, dim=4096, 7B params]
          17 streams: text + 8 user audio + 8 agent audio
                                            ↓
          [Depformer: 6L, dim=1024, per-codebook weights]
          16 sequential steps → agent audio codebook tokens
                                            ↓
[Agent Audio 24kHz] ← [Mimi Decoder] ← codebook tokens @ 12.5Hz
```
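The rates in the diagram imply a few useful constants; a small sketch using only the numbers stated above:

```python
# Token-rate arithmetic implied by the architecture diagram.
sample_rate = 24_000  # Hz, Mimi operates on 24 kHz audio
frame_rate = 12.5     # Hz, codec frame rate
codebooks = 16        # RVQ codebooks per audio stream
streams = 1 + 8 + 8   # text + 8 user audio + 8 agent audio

samples_per_frame = int(sample_rate / frame_rate)  # audio samples per codec frame
tokens_per_second = int(codebooks * frame_rate)    # audio tokens/s per direction

print(samples_per_frame, tokens_per_second, streams)  # 1920 200 17
```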
## Voices
18 voice presets available:
| Category | Voices |
|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 |
| Natural Male | NATM0, NATM1, NATM2, NATM3 |
| Variety Female | VARF0, VARF1, VARF2, VARF3, VARF4 |
| Variety Male | VARM0, VARM1, VARM2, VARM3, VARM4 |
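The preset IDs follow a regular pattern (category prefix plus zero-based index), so the full list can be generated programmatically; a small sketch:

```python
# Enumerate the 18 voice preset IDs from the naming pattern in the table above.
categories = {"NATF": 4, "NATM": 4, "VARF": 5, "VARM": 5}
voices = [f"{prefix}{i}" for prefix, n in categories.items() for i in range(n)]
print(len(voices), voices[:4])  # 18 ['NATF0', 'NATF1', 'NATF2', 'NATF3']
```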
## Files
- `temporal.safetensors` - Temporal transformer (4-bit quantized, group_size=64)
- `depformer.safetensors` - Depformer layers + input projections (fp16)
- `embeddings.safetensors` - Text/audio embeddings + output heads (fp16)
- `mimi.safetensors` - Mimi neural audio codec (fp16)
- `voices/*.safetensors` - Voice preset embeddings
- `tokenizer_spm_32k_3.model` - SentencePiece tokenizer
- `config.json` - Model configuration
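Each `.safetensors` file is self-describing: the first 8 bytes are a little-endian header length, followed by a JSON header mapping tensor names to dtype, shape, and byte offsets. A minimal sketch of inspecting a file without loading its weights (it builds a tiny in-memory example rather than assuming a local checkpoint; `dummy.weight` is a made-up tensor name):

```python
import io
import json
import struct

def read_safetensors_header(f):
    """Parse the JSON header of a safetensors file (8-byte LE length + JSON)."""
    (header_len,) = struct.unpack("<Q", f.read(8))
    return json.loads(f.read(header_len))

# Build a tiny in-memory safetensors file with one fp16 tensor of shape [2, 2].
data = bytes(8)  # 4 fp16 values = 8 bytes of (zeroed) tensor data
header = json.dumps({
    "dummy.weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}
}).encode()
blob = struct.pack("<Q", len(header)) + header + data

meta = read_safetensors_header(io.BytesIO(blob))
print(meta["dummy.weight"]["shape"])  # [2, 2]
```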
## Quantization
- Temporal transformer attention (Q/K/V output projections) and FFN: 4-bit with group_size=64
- Attention input projection (`in_proj`): kept fp16 (packed Q+K+V format)
- Depformer: kept fp16 (~50 MB, not worth quantizing)
- Mimi codec: kept fp16 (audio quality sensitive)
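Group-wise 4-bit quantization stores, for every group of 64 consecutive weights, a scale and offset so that each weight is represented by a 4-bit integer. A minimal numpy sketch of the idea (not MLX's exact scheme, which packs nibbles into uint32 words):

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Asymmetric 4-bit quantization with a per-group scale and zero-point."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0  # 4 bits -> 16 quantization levels
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
w = rng.standard_normal(4096 * 64).astype(np.float32)
q, scale, lo = quantize_4bit(w)
err = np.abs(dequantize(q, scale, lo).ravel() - w).max()
print(f"max abs error: {err:.4f}")  # bounded by (group range) / 30
```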
## Usage
```swift
import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let response = model.respond(
    userAudio: audioSamples,  // [Float], 24 kHz mono
    voice: .NATM0,
    maxSteps: 500
)
```
### CLI
```bash
swift run personaplex-cli --input question.wav --output response.wav --voice NATM0
```
See soniqo/speech-swift for build instructions.
## License
CC-BY-NC-4.0 (same as upstream PersonaPlex)
## Citation
```bibtex
@article{nguyen2025personaplex,
  title={PersonaPlex: Enhancing Human-Centric AI Through Full-Duplex Multi-Turn Conversations With Persona-Conditioned Voice Responses},
  author={Nguyen, Tu Anh and others},
  journal={arXiv preprint arXiv:2504.07966},
  year={2025}
}
```