Qwen3.5-35B-A3B APEX GGUF -- A Novel MoE-Aware Mixed-Precision Quantization Technique
Brought to you by the LocalAI team, the creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.
APEX Technical Report | GitHub Repository | LocalAI
APEX (Adaptive Precision for EXpert Models) is a novel quantization technique for Mixture-of-Experts language models. Unlike uniform quantization methods that apply the same precision to every tensor, APEX introduces a layer-wise precision gradient combined with MoE-aware tensor classification and diverse imatrix calibration to achieve Q8_0-level quality at a fraction of the size. The method was discovered through systematic human-driven, AI-assisted research across 25+ quantization strategies. APEX matches or outperforms Unsloth Dynamic 2.0 (UD) quantizations on accuracy benchmarks at roughly half the size of UD-Q8_K_XL (21.3 GB vs 45.3 GB).
This repository contains seven APEX GGUF files plus a vision projector (mmproj), covering every deployment scenario from maximum accuracy to consumer GPU inference. The best configuration (APEX Quality) beats both Q8_0 and F16 on perplexity while being 38% smaller than Q8_0. I-variants use a diverse imatrix (chat, code, reasoning, tool-calling -- no Wikipedia) that trades tiny perplexity increases for significant accuracy gains and lower KL divergence.
For the full technical details, method description, and reproduction scripts, see the APEX GitHub repository.
Available Files
| File | Configuration | Size | PPL | Speed (tg128) | Best for |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-APEX-Quality.gguf | APEX Quality | 21.3 GB | 6.527 | 62.3 t/s | Lowest perplexity of any quantization |
| Qwen3.5-35B-A3B-APEX-I-Quality.gguf | APEX I-Quality | 21.3 GB | 6.552 | 63.1 t/s | Best accuracy across benchmarks |
| Qwen3.5-35B-A3B-APEX-Balanced.gguf | APEX Balanced | 23.6 GB | 6.533 | 60.8 t/s | Interactive use, serving, general purpose |
| Qwen3.5-35B-A3B-APEX-I-Balanced.gguf | APEX I-Balanced | 23.6 GB | 6.548 | 61.4 t/s | All-round with lower KL divergence |
| Qwen3.5-35B-A3B-APEX-Compact.gguf | APEX Compact | 16.1 GB | 6.783 | 69.8 t/s | Consumer 24 GB GPUs |
| Qwen3.5-35B-A3B-APEX-I-Compact.gguf | APEX I-Compact | 16.1 GB | 6.669 | 69.8 t/s | 16 GB GPUs, best accuracy at this size |
| Qwen3.5-35B-A3B-APEX-Mini.gguf | APEX Mini | 12.2 GB | 7.088 | 74.4 t/s | Consumer 16 GB VRAM, smallest viable |
| mmproj-F16.gguf | Vision Projector | 899 MB | -- | -- | Required for vision/multimodal tasks |
APEX Quality uses a 3-tier layer-wise precision gradient (Q6_K/Q5_K/IQ4_XS) with Q8_0 shared experts. It achieves the lowest perplexity of any quantization tested -- beating even F16 (6.527 vs 6.537).
APEX I-Quality uses the same architecture as Quality but with a diverse imatrix (chat, code, reasoning, tool-calling -- no Wikipedia). It achieves the highest HellaSwag (83.5%), matches Q8_0 on ARC (57.9%), and posts the best TruthfulQA (38.4%) of any model tested.
APEX Balanced uses a 2-tier gradient (Q6_K edges, Q5_K middle) with Q8_0 shared experts. It matches Q8_0 perplexity exactly (6.533) while being 31% smaller and 16% faster. Recommended for general-purpose use.
APEX I-Balanced uses the same architecture as Balanced with a diverse imatrix. KL divergence drops 11% (mean 0.0078 vs 0.0088) and KL max drops from 6.03 to 5.77.
APEX Compact uses Q4_K edge layers, Q3_K middle layers, and Q6_K shared experts. At 16.1 GB it fits consumer 24 GB GPUs with room for KV cache.
APEX I-Compact is the biggest imatrix winner: PPL drops from 6.783 to 6.669 (-0.114), KL max from 7.56 to 5.50, and MMLU rises from 40.9% to 41.7%. The diverse imatrix has the largest impact on aggressively quantized tiers.
APEX Mini combines the layer-wise precision gradient with IQ2_S middle-layer experts and a diverse imatrix, pushing to 12.2 GB. It beats bartowski IQ2_M (11.3 GB) on every metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%. Fits consumer 16 GB VRAM GPUs with room for context.
Benchmark Results
All measurements on Qwen3.5-35B-A3B, NVIDIA DGX Spark (GB10, 122 GB VRAM). Perplexity measured on wikitext-2-raw, context 2048. Accuracy benchmarks (HellaSwag, Winogrande, MMLU, ARC-Challenge, TruthfulQA) evaluated via llama.cpp using 400 tasks where applicable.
Core Metrics
| Quantization | Size (GB) | PPL | KL mean | KL max | HS | WG | MMLU | ARC | TQA | tg128 (t/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| F16 | 64.6 | 6.537 | -- | -- | 82.5% | 74.5% | 41.5% | 56.9% | 37.2% | 30.4 |
| Q8_0 | 34.4 | 6.533 | 0.0046 | 14.71 | 83.0% | 75.3% | 41.2% | 57.9% | 37.7% | 52.5 |
| APEX Quality | 21.3 | 6.527 | 0.0114 | 5.85 | 83.0% | 74.5% | 41.2% | 56.2% | 37.7% | 62.3 |
| APEX I-Quality | 21.3 | 6.552 | 0.0102 | 5.59 | 83.5% | 74.5% | 41.4% | 57.9% | 38.4% | 63.1 |
| APEX Balanced | 23.6 | 6.533 | 0.0088 | 6.03 | 83.0% | 74.5% | 41.3% | 56.9% | 36.8% | 60.8 |
| APEX I-Balanced | 23.6 | 6.548 | 0.0078 | 5.77 | 83.0% | 73.3% | 41.0% | 57.5% | 37.5% | 61.4 |
| APEX Compact | 16.1 | 6.783 | 0.0469 | 7.56 | 82.5% | 73.3% | 40.9% | 55.2% | 36.5% | 69.8 |
| APEX I-Compact | 16.1 | 6.669 | 0.0332 | 5.50 | 81.8% | 75.0% | 41.7% | 55.5% | 37.9% | 69.8 |
| APEX Mini | 12.2 | 7.088 | 0.0870 | 5.57 | 81.0% | 75.5% | 41.3% | 57.2% | 36.7% | 74.4 |
| Unsloth UD-Q8_K_XL | 45.3 | 6.536 | 0.0025 | 4.36 | 82.5% | 74.8% | 41.3% | 57.9% | 38.1% | 36.4 |
| Unsloth UD-Q4_K_L | 18.8 | 6.586 | 0.0151 | 5.98 | 82.3% | 75.8% | 41.1% | 59.2% | 37.3% | 65.5 |
| bartowski IQ2_M | 11.3 | 7.303 | 0.1113 | 6.07 | 80.3% | 74.0% | 39.6% | 56.2% | 35.0% | 76.2 |
| bartowski Q3_K_M | 15.1 | 6.730 | 0.0420 | 5.56 | 82.0% | 75.0% | 41.5% | 57.5% | 38.8% | 60.6 |
Accuracy Benchmarks
| Benchmark | F16 | Q8_0 | Quality | I-Quality | Balanced | I-Balanced | Compact | I-Compact | Mini | Q8_K_XL | Q4_K_L | IQ2_M | Q3_K_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HellaSwag | 82.5% | 83.0% | 83.0% | 83.5% | 83.0% | 83.0% | 82.5% | 81.8% | 81.0% | 82.5% | 82.3% | 80.3% | 82.0% |
| Winogrande | 74.5% | 75.3% | 74.5% | 74.5% | 74.5% | 73.3% | 73.3% | 75.0% | 75.5% | 74.8% | 75.8% | 74.0% | 75.0% |
| MMLU | 41.5% | 41.2% | 41.2% | 41.4% | 41.3% | 41.0% | 40.9% | 41.7% | 41.3% | 41.3% | 41.1% | 39.6% | 41.5% |
| ARC | 56.9% | 57.9% | 56.2% | 57.9% | 56.9% | 57.5% | 55.2% | 55.5% | 57.2% | 57.9% | 59.2% | 56.2% | 57.5% |
| TruthfulQA | 37.2% | 37.7% | 37.7% | 38.4% | 36.8% | 37.5% | 36.5% | 37.9% | 36.7% | 38.1% | 37.3% | 35.0% | 38.8% |
Key Takeaways
- APEX Quality has the best perplexity of any quantization (6.527, beats even F16's 6.537) at just 21.3 GB.
- I-variants trade tiny PPL increases for significant accuracy gains. I-Quality achieves 83.5% HellaSwag (best of any model), 57.9% ARC, and 38.4% TruthfulQA. KL divergence is consistently 10-30% lower across all I-variants.
- I-Compact is the biggest imatrix winner: PPL drops from 6.783 to 6.669 (-0.114), KL max from 7.56 to 5.50, MMLU from 40.9% to 41.7%.
- APEX Mini (12.2 GB) beats bartowski IQ2_M (11.3 GB) on every metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%. Layer gradient + IQ2_S with diverse imatrix outperforms uniform IQ2_M.
- At similar size (18.8 vs 21.3 GB), APEX Quality beats Unsloth UD-Q4_K_L on perplexity (6.527 vs 6.586), KL mean (0.011 vs 0.015), and HellaSwag (83.0% vs 82.3%).
- APEX Compact (16.1 GB) is 14% smaller than Unsloth UD-Q4_K_L (18.8 GB) and 7% faster (69.8 vs 65.5 t/s).
- Unsloth UD-Q8_K_XL wins on KL divergence (best mean 0.0025, best max 4.36) but at 2-3x the size of APEX tiers.
- Q8_0 has the worst outlier divergence of all models tested (KL max 14.71), despite a low KL mean.
- All APEX tiers match or beat Unsloth on accuracy benchmarks within noise, at a fraction of the size.
How to Download and Use
APEX I-Quality (21.3 GB) -- Best accuracy
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Quality.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Quality.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Quality.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~22 GB VRAM for full GPU offload. Uses diverse imatrix calibration for best accuracy across benchmarks. Recommended when downstream task performance matters more than raw perplexity.
APEX Quality (21.3 GB) -- Best perplexity
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Quality.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Quality.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Quality.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~22 GB VRAM for full GPU offload. Uses IQ4_XS for middle-layer experts, so llama.cpp b5460 or later is recommended.
APEX I-Balanced (23.6 GB) -- All-round with lower KL
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Balanced.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Balanced.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Balanced.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~24 GB VRAM for full GPU offload. Uses diverse imatrix calibration with standard K-quant formats for lower KL divergence.
APEX Balanced (23.6 GB) -- Best all-rounder
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Balanced.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Balanced.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Balanced.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~24 GB VRAM for full GPU offload. Uses only standard K-quant formats (Q6_K/Q5_K) with optimized dequantization kernels.
APEX I-Compact (16.1 GB) -- Best accuracy at 16 GB
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-I-Compact.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-I-Compact.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-I-Compact.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~17 GB VRAM for full GPU offload. The biggest imatrix winner -- PPL drops 0.114 vs standard Compact, MMLU rises from 40.9% to 41.7%.
APEX Compact (16.1 GB) -- Consumer GPUs
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Compact.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Compact.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Compact.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~17 GB VRAM for full GPU offload. Fits consumer 24 GB GPUs (RTX 4090, RTX 5090) with room for KV cache and context.
APEX Mini (12.2 GB) -- Consumer 16 GB VRAM
```bash
# Download
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF \
  Qwen3.5-35B-A3B-APEX-Mini.gguf --local-dir ./model

# Interactive chat
llama-cli -m ./model/Qwen3.5-35B-A3B-APEX-Mini.gguf \
  --conversation -ngl 99

# Server mode
llama-server -m ./model/Qwen3.5-35B-A3B-APEX-Mini.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Requires ~13 GB VRAM for full GPU offload. Fits consumer 16 GB VRAM GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti) with room for context. Beats bartowski IQ2_M on every metric despite being only 0.9 GB larger.
Download all files
```bash
huggingface-cli download mudler/Qwen3.5-35B-A3B-APEX-GGUF --local-dir ./model
```
About the Base Model
Qwen3.5-35B-A3B is a Mixture-of-Experts language model with 35 billion total parameters but only 3 billion active per token. It uses 256 experts per MoE layer, routing 8 experts plus 1 shared expert per token across 40 transformer layers. This sparse activation pattern means 97% of expert weights are idle for any given token, creating an opportunity for differentiated quantization.
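The sparsity figure is easy to verify with a quick back-of-envelope check (illustrative only, not part of the release):

```python
# 8 of 256 routed experts fire per token, so almost all routed
# expert weights sit idle for any given forward pass.
experts_per_layer = 256
routed_active = 8
idle_fraction = 1 - routed_active / experts_per_layer
print(idle_fraction)  # 0.96875, i.e. ~97% of routed experts idle per token
```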
Quantization Methodology
APEX exploits the structure of MoE models to achieve near-lossless compression, combining three components:
1. MoE-aware tensor classification
Not all tensors in an MoE model are equal. APEX classifies them into three categories with different precision requirements:
- Routed expert weights (gate/up/down projections): These make up the bulk of model parameters but only 8 out of 256 experts are active per token. The 97% sparsity means these tolerate aggressive quantization -- the routing decision uses full-precision gate weights, so quantization noise in inactive experts never affects output.
- Shared expert weights: Always active for every token and exhibit heavy-tailed weight distributions (kurtosis 13.10 vs 3.41 for routed experts). These need high precision (Q8_0) to preserve outlier values.
- Attention and SSM weights: Dense layers that contribute few parameters but matter for generation quality. Kept at Q6_K uniformly in the Quality and Balanced tiers.
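As a rough sketch, this classification reduces to a pattern match on GGUF tensor names. The `_exps` / `_shexp` suffixes below follow llama.cpp's naming for Qwen-style MoE tensors; treat this as an illustration rather than the actual APEX implementation:

```python
# Illustrative tensor classifier (assumed name patterns, loosely
# following llama.cpp's GGUF naming for Qwen-style MoE models).
def classify(name: str) -> str:
    if "_exps" in name:    # routed expert projections (gate/up/down)
        return "routed_expert"
    if "_shexp" in name:   # shared expert, always active
        return "shared_expert"
    return "dense"         # attention, norms, embeddings, etc.

print(classify("blk.12.ffn_gate_exps.weight"))  # routed_expert
print(classify("blk.12.ffn_up_shexp.weight"))   # shared_expert
print(classify("blk.12.attn_q.weight"))         # dense
```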
2. Layer-wise precision gradient
Edge transformer layers (the first and last 5) handle input embedding alignment and output logit generation. They are significantly more sensitive to quantization than the middle layers, which perform more redundant intermediate processing. APEX assigns higher precision to the edges and lower precision to the middle.
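A minimal sketch of such a gradient for a 40-layer model, assuming a 5-layer edge band (from the text) and a 5-layer near-edge band (our assumption; the real APEX assignment lives in the repository's scripts):

```python
# Hypothetical 3-tier precision gradient (Quality-style tiers).
def precision(layer: int, n_layers: int = 40,
              edge: int = 5, near: int = 5) -> str:
    d = min(layer, n_layers - 1 - layer)  # distance from nearest edge
    if d < edge:
        return "Q6_K"    # sensitive edge layers
    if d < edge + near:
        return "Q5_K"    # near-edge layers
    return "IQ4_XS"      # redundant middle layers

print([precision(i) for i in (0, 4, 5, 9, 20, 39)])
# → ['Q6_K', 'Q6_K', 'Q5_K', 'Q5_K', 'IQ4_XS', 'Q6_K']
```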
3. Four tiers, seven configurations
| Configuration | Size | Expert strategy | Shared expert | Attention | Best for |
|---|---|---|---|---|---|
| APEX I-Quality | 21.3 GB | Q6_K edges, Q5_K near-edges, IQ4_XS middle, diverse imatrix | Q8_0 | Q6_K | Best accuracy |
| APEX Quality | 21.3 GB | Q6_K edges, Q5_K near-edges, IQ4_XS middle | Q8_0 | Q6_K | Lowest perplexity |
| APEX I-Balanced | 23.6 GB | Q6_K edges, Q5_K middle, diverse imatrix | Q8_0 | Q6_K | All-round, lower KL |
| APEX Balanced | 23.6 GB | Q6_K edges, Q5_K middle | Q8_0 | Q6_K | General purpose |
| APEX I-Compact | 16.1 GB | Q4_K edges, Q3_K middle, diverse imatrix | Q6_K | Q4_K | Best accuracy at 16 GB |
| APEX Compact | 16.1 GB | Q4_K edges, Q3_K middle | Q6_K | Q4_K | Consumer 24 GB GPUs |
| APEX Mini | 12.2 GB | Layer gradient with IQ2_S middle, diverse imatrix | Q6_K | Q4_K | Consumer 16 GB VRAM |
I-variants: diverse imatrix calibration
Standard imatrix calibration uses Wikipedia text, which biases quantization toward encyclopedic prose. APEX I-variants use a diverse calibration dataset spanning chat, code, reasoning, and tool-calling -- no Wikipedia. This produces a different optimization tradeoff: I-variants trade a tiny perplexity increase on wikitext (itself Wikipedia-derived text) for significant gains on real-world accuracy benchmarks and consistently lower KL divergence.
The effect is most dramatic on aggressive quantizations. I-Compact drops perplexity from 6.783 to 6.669 (-0.114), reduces KL max from 7.56 to 5.50, and lifts MMLU from 40.9% to 41.7%. At the Quality tier, I-Quality achieves the highest HellaSwag score of any model tested (83.5%), matches Q8_0 on ARC (57.9%), and posts the best TruthfulQA (38.4%).
APEX Mini: the 12 GB tier
APEX Mini combines the layer-wise precision gradient with IQ2_S middle-layer experts and a diverse imatrix to push MoE quantization to 12.2 GB. At this size it fits consumer 16 GB VRAM GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti) with room for context. It beats bartowski IQ2_M (11.3 GB) on every single metric: PPL 7.088 vs 7.303, HellaSwag 81.0% vs 80.3%, MMLU 41.3% vs 39.6%, ARC 57.2% vs 56.2%. The layer gradient + diverse imatrix combination outperforms uniform quantization even at extreme compression ratios.
Key findings from 25+ experiments
- Q6_K is the sweet spot for routed experts. Going from Q6_K to Q8_0 on expert weights wastes 7.5 GB for zero perplexity improvement. Going below Q5_K causes measurable degradation.
- Layer position matters more than uniform bit-width. A 2-tier layer gradient (Q6_K edges, Q5_K middle) matches Q8_0 quality. A uniform Q5_K assignment does not.
- Shared expert precision is critical. The shared expert's heavy-tailed weight distribution (kurtosis 13.10) makes it the most sensitive component.
- IQ formats underperform K-quants for MoE experts. IQ3_S gives worse perplexity than Q3_K on routed expert tensors despite similar bit rates.
- Diverse imatrix calibration improves real-world accuracy. A calibration dataset spanning chat, code, reasoning, and tool-calling (no Wikipedia) trades tiny wikitext perplexity increases for significant gains on downstream benchmarks and consistently lower KL divergence. The effect is strongest on aggressive quantizations.
- Stock llama.cpp quantization algorithms are already optimal. Five novel C-level modifications all showed zero improvement. Gains come from better precision allocation, not algorithm changes.
The APEX method and code will be published soon.
Evaluation Methodology
Information-theoretic metrics: Perplexity is measured on wikitext-2-raw (context 2048, full dataset). KL Divergence measures the divergence between quantized and full-precision logit distributions, reported as mean, max, 99.9th percentile, and median. Lower values indicate the quantized model's predictions more closely match the original.
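For readers unfamiliar with the metric, here is a toy per-token KL computation between full-precision and quantized logits (illustrative values, stdlib only):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(P || Q): divergence of quantized distribution q from reference p
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits for a single token position (invented numbers, not model output)
full = softmax([2.0, 1.0, 0.1])
quant = softmax([1.9, 1.1, 0.1])
d = kl(full, quant)
print(f"{d:.6f}")  # small positive value; 0 would mean identical predictions
```

In the benchmark tables, this quantity is computed per token over the evaluation set and summarized as mean, max, 99.9th percentile, and median.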
Downstream accuracy benchmarks: HellaSwag (commonsense reasoning), Winogrande (coreference resolution), MMLU (multitask language understanding), ARC-Challenge (science QA), and TruthfulQA (truthful generation) are evaluated via llama.cpp with 400 tasks where applicable.
Note: Evaluations on hybrid MoE models were enabled by our upstream fix to llama.cpp's hybrid memory path for recurrent architectures (PR-ready).
Hardware
All benchmarks were measured on an NVIDIA DGX Spark:
- GPU: NVIDIA GB10, 122 GB unified VRAM
- CUDA: 13.0, compute capability 12.1
- Benchmark: wikitext-2-raw test set, context length 2048, full dataset evaluation
- Inference speed: measured with llama-perplexity (prompt processing throughput)
Technical Details
- Quantization tool: llama.cpp `llama-quantize` with `--tensor-type-file` for per-layer precision assignments
- Layer count: 40 transformer layers
- Expert count: 256 per MoE layer (8 routed + 1 shared active per token)
- Weight distributions: Routed experts are near-Gaussian (kurtosis 3.41); shared expert is heavy-tailed (kurtosis 13.10)
- Compatibility: Stock llama.cpp, no patches or custom builds required
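Conceptually, the tensor-type file maps tensor-name patterns to quant types. The fragment below illustrates the idea only; the exact syntax expected by `--tensor-type-file` is defined in the APEX repository:

```
# Illustrative per-tensor overrides (format assumed, not the real syntax)
blk.0.ffn_gate_exps.weight=Q6_K     # edge layer: high precision
blk.20.ffn_gate_exps.weight=IQ4_XS  # middle layer: aggressive
blk.20.ffn_gate_shexp.weight=Q8_0   # shared expert: always Q8_0
```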
Run locally with LocalAI
These APEX quantized models work out of the box with LocalAI -- a free, open-source OpenAI-compatible API that runs locally. Load any APEX GGUF and get an instant API server with chat completions, embeddings, and more:
```bash
# Run APEX Balanced with LocalAI
local-ai run mudler/Qwen3.5-35B-A3B-APEX-GGUF@Qwen3.5-35B-A3B-APEX-Balanced.gguf
```
LocalAI supports GPU acceleration, multiple model loading, and function calling. See the LocalAI documentation for more.
TurboQuant KV Cache Compression (Optional)
For additional memory savings and faster prompt processing, APEX models can be combined with KV cache compression via TurboQuant+, a fork of llama.cpp that adds turbo quantization types for the KV cache. This is separate from weight quantization -- TurboQuant compresses the KV cache 4.6x, allowing longer contexts in less VRAM.
This requires the feature/turboquant-kv-cache branch of the TurboQuant+ fork:
```bash
# Build (same as llama.cpp, but clone the fork)
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```
Recommended configuration: `-ctk q8_0 -ctv turbo3 -fa on`
```bash
# Example: APEX Mini with TurboQuant KV cache compression
./build/bin/llama-server -m Qwen3.5-35B-A3B-APEX-Mini.gguf \
  -ctk q8_0 -ctv turbo3 -fa on \
  --host 0.0.0.0 --port 8080 -ngl 99
```
Prompt Processing Speedup at 8K Context
| Model | pp8192 baseline | pp8192 turbo3 | Speedup | tg128 delta |
|---|---|---|---|---|
| APEX I-Quality | 1,752 t/s | 2,003 t/s | +14.3% | <1% |
| APEX I-Balanced | 1,695 t/s | 1,927 t/s | +13.7% | <1% |
| APEX I-Compact | 1,714 t/s | 1,959 t/s | +14.3% | <1% |
| APEX Mini | 1,696 t/s | 1,938 t/s | +14.3% | <1% |
TurboQuant delivers 13-14% prompt processing speedup at 8K context with negligible impact on token generation speed (<1% delta on tg128). The KV cache compression is orthogonal to weight quantization, so all quality metrics (perplexity, accuracy, KL divergence) remain unchanged.
APEX Mini + TurboQuant enables running a 35B MoE model at 12 GB with 8K+ context on 16 GB VRAM GPUs.
Credits
APEX is brought to you by the LocalAI team -- the creators of the free, open-source OpenAI-compatible API for running AI locally.
Developed through human-driven, AI-assisted research to systematically explore MoE quantization strategies across 25+ experiments. Built on llama.cpp by Georgi Gerganov and contributors. Inspired by karpathy/autoresearch.
Citation
If you use APEX quantized models in your research, please cite:
```bibtex
@misc{apex-quant-2026,
  title = {APEX: Adaptive Precision for Expert Models -- MoE-Aware Mixed-Precision Quantization},
  author = {Di Giacinto, Ettore and {LocalAI Team}},
  year = {2026},
  url = {https://github.com/mudler/apex-quant},
  note = {Layer-wise precision gradient quantization for Mixture-of-Experts models using llama.cpp}
}

@misc{localai,
  title = {LocalAI: the free, Open Source OpenAI alternative},
  author = {Di Giacinto, Ettore and {LocalAI Contributors}},
  year = {2023},
  url = {https://github.com/mudler/LocalAI}
}
```