> **Note:** If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending): https://github.com/Gadflyii/vllm/tree/main
# GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP in BF16)

This is a mixed-precision NVFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version preserves the MTP (Multi-Token Prediction) layers in BF16 for speculative decoding compatibility.
## What's Different from GLM-4.7-Flash-NVFP4?

| Feature | GLM-4.7-Flash-NVFP4 | This Model |
|---|---|---|
| MTP Layers | NVFP4 | BF16 |
| Calibration Samples | 128 | 512 |
| Calibration Seq Length | 2048 | 4096 |
| MMLU-Pro Accuracy | 23.56% | 23.91% |
## Quantization Strategy

This model uses mixed precision to preserve accuracy and MTP functionality:

| Component | Precision | Rationale |
|---|---|---|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First-layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive to quantization |
| MTP Layers | BF16 | `eh_proj`, `shared_head.head` kept for speculative decoding |
| Norms, Gates, Embeddings | BF16 | Standard practice |
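
To make the split concrete, here is a minimal, purely illustrative sketch of a name-based precision map that mirrors the table above. The glob patterns and parameter names are assumptions for illustration, not the actual quantization recipe or the checkpoint's real keys.

```python
# Illustrative sketch of the mixed-precision mapping described in the table above.
# Patterns and parameter names are assumptions, not the real recipe.
from fnmatch import fnmatch

KEEP_BF16 = [
    "*self_attn*",      # MLA attention projections stay in BF16
    "*eh_proj*",        # MTP projection
    "*shared_head*",    # MTP head used during speculative decoding
    "*norm*",           # layer norms
    "*.mlp.gate.*",     # MoE router gates
    "*embed_tokens*",   # input embeddings
    "lm_head*",         # output head
]

def target_precision(param_name: str) -> str:
    """Return the intended precision for a parameter, per the strategy table."""
    if any(fnmatch(param_name, pat) for pat in KEEP_BF16):
        return "BF16"
    return "NVFP4"  # everything else: MoE expert and dense MLP weights

# Example with hypothetical parameter names:
for name in [
    "model.layers.3.mlp.experts.17.down_proj.weight",
    "model.layers.3.self_attn.kv_a_proj_with_mqa.weight",
    "model.layers.46.eh_proj.weight",
]:
    print(name, "->", target_precision(name))
```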
## Performance

| Metric | BF16 | NVFP4 | This Model |
|---|---|---|---|
| MMLU-Pro | 24.83% | 23.56% | 23.91% |
| Size | 62.4 GB | 20.4 GB | 20.9 GB |
| Compression | 1x | 3.1x | 3.0x |
| Accuracy Loss | - | -1.27% | -0.92% |
### MTP Acceptance Rate

| Model | Acceptance Rate | Mean Accepted Length |
|---|---|---|
| BF16 (baseline) | 60% | 1.60 |
| This Model | 63% | 1.63 |

MTP quality is preserved (in fact, slightly improved) after quantization.
### MTP Performance Note

MTP speculative decoding currently adds overhead rather than a speedup because vLLM does not yet support torch.compile for the MTP drafter model. For best throughput, run without MTP enabled until this is resolved upstream.

| Configuration | Throughput (tok/s) |
|---|---|
| Without MTP | 78.1 |
| With MTP (1 token) | 64.7 |
| With MTP (2 tokens) | 56.8 |
| With MTP (4 tokens) | 44.5 |
## Usage

### Requirements

- vLLM: 0.8.0+ (for compressed-tensors NVFP4 support)
- transformers: 5.0.0+ (for the `glm4_moe_lite` architecture)
- GPU: NVIDIA GPU supported by vLLM's NVFP4 path (native FP4 tensor cores on Blackwell; Hopper and Ada Lovelace via fallback kernels)
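
A quick hardware sanity check, assuming PyTorch with CUDA is already installed (compute capability 10.x / 12.x corresponds to Blackwell):

```python
# Check whether the visible GPU has native FP4 tensor cores (Blackwell).
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
if major >= 10:
    print("Blackwell-class GPU: native FP4 tensor cores available.")
else:
    # Hopper is 9.0, Ada Lovelace is 8.9; these rely on vLLM's fallback kernels.
    print("Pre-Blackwell GPU: NVFP4 weights run via vLLM's fallback kernels.")
```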
### Installation

```bash
pip install "vllm>=0.8.0"
pip install git+https://github.com/huggingface/transformers.git
```
### Inference with vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; a single FP4-capable GPU is sufficient.
model = LLM(
    "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```
### Serving with vLLM

Standard serving:

```bash
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
```

With MTP speculative decoding enabled (see the performance note above):

```bash
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```
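
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch, assuming the default host/port (http://localhost:8000) and the `openai` Python package:

```python
# Query the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```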
## Model Details

- Base Model: zai-org/GLM-4.7-Flash
- Architecture: `Glm4MoeLiteForCausalLM`
- Parameters: 30B total, 3B active per token (30B-A3B)
- MoE Configuration: 64 routed experts, 4 active per token, 1 shared expert (see the routing sketch below)
- Layers: 47 (with 1 MTP layer)
- Context Length: 202,752 tokens (max)
- Languages: English, Chinese
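
As a rough illustration of the MoE configuration above (not the model's actual implementation), the sketch below routes a single token through the top 4 of 64 experts and adds the shared expert. The hidden size, renormalization choice, and expert modules are assumptions for illustration.

```python
# Illustrative top-4-of-64 MoE routing with one shared expert (toy dimensions).
import torch

num_experts, top_k, hidden = 64, 4, 8
x = torch.randn(1, hidden)                       # one token's hidden state
router_logits = torch.randn(1, num_experts)      # router output for that token

weights = torch.softmax(router_logits, dim=-1)
topk_w, topk_idx = weights.topk(top_k, dim=-1)   # pick the 4 highest-scoring experts
topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize over the selected 4

experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
shared_expert = torch.nn.Linear(hidden, hidden)

routed_out = sum(topk_w[0, i] * experts[int(topk_idx[0, i])](x) for i in range(top_k))
output = routed_out + shared_expert(x)           # shared expert is applied to every token
print(output.shape)                              # torch.Size([1, 8])
```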
## Quantization Details

- Format: compressed-tensors (NVFP4)
- Block Size: 16 values per block (see the sketch below)
- Scale Format: FP8 (E4M3) per-block scales
- Calibration: 512 samples from the wikitext dataset
- Calibration Sequence Length: 4096
- Full Expert Calibration: all 64 experts calibrated per sample
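
For intuition about the format, here is a minimal NumPy sketch of block-wise E2M1 quantization with one scale per 16-value block. It deliberately ignores how NVFP4 encodes the block scale in FP8 (E4M3) and the additional tensor-level scale, so treat it as an illustration rather than the real kernel.

```python
# Block-wise FP4 (E2M1) quantization sketch: 16 values per block, one scale per block.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes

def quantize_block(block):
    """Quantize one 16-value block to the nearest signed E2M1 value."""
    scale = max(float(np.abs(block).max()) / 6.0, 1e-12)  # map the largest magnitude to 6.0
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

def dequantize_block(q, scale):
    return q * scale

block = np.random.randn(16).astype(np.float32)
q, scale = quantize_block(block)
print("max abs error:", float(np.abs(dequantize_block(q, scale) - block).max()))
```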
### Tensors by Precision

| Precision | Count | Description |
|---|---|---|
| NVFP4 | 9,168 | MLP/FFN weights |
| BF16 | 240 | Attention (MLA) weights |
| BF16 | 2 | MTP layers (`eh_proj`, `shared_head.head`) |
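
These counts can be checked roughly against a local download of the checkpoint. The sketch below assumes the shards sit in a local directory as `*.safetensors` files and that packed FP4 weights appear as low-bit integer tensors alongside their `*_scale` tensors; exact tensor names and dtypes depend on the compressed-tensors version.

```python
# Count tensors per on-disk dtype in a locally downloaded copy of this repo.
import glob
from collections import Counter
from safetensors import safe_open

counts = Counter()
for shard in glob.glob("GLM-4.7-Flash-MTP-NVFP4/*.safetensors"):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            counts[str(f.get_tensor(name).dtype)] += 1  # loads each tensor; slow but simple

for dtype, n in counts.most_common():
    print(f"{dtype}: {n}")
```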
## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
|---|---|---|---|
| BF16 (baseline) | 24.83% | 2,988 | 12,032 |
| NVFP4-v1 | 23.56% | 2,835 | 12,032 |
| This Model | 23.91% | 2,877 | 12,032 |
### MMLU-Pro by Category

| Category | BF16 | This Model | Difference |
|---|---|---|---|
| Social Sciences | 32.70% | 31.26% | -1.44% |
| Other | 31.57% | 29.85% | -1.72% |
| Humanities | 23.78% | 22.82% | -0.96% |
| STEM | 19.94% | 19.48% | -0.46% |
### MMLU-Pro by Subject

| Subject | BF16 | This Model | Difference |
|---|---|---|---|
| Biology | 50.35% | 48.12% | -2.23% |
| Psychology | 44.99% | 41.23% | -3.76% |
| History | 33.60% | 34.12% | +0.52% |
| Health | 35.21% | 34.11% | -1.10% |
| Economics | 36.37% | 33.06% | -3.31% |
| Philosophy | 31.46% | 29.26% | -2.20% |
| Other | 28.35% | 26.08% | -2.27% |
| Computer Science | 26.10% | 21.95% | -4.15% |
| Business | 16.35% | 19.26% | +2.91% |
| Law | 16.89% | 15.99% | -0.90% |
| Math | 14.06% | 14.73% | +0.67% |
| Physics | 15.32% | 15.24% | -0.08% |
| Engineering | 16.00% | 14.96% | -1.04% |
| Chemistry | 14.13% | 14.84% | +0.71% |
## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```
## License
This model inherits the Apache 2.0 license from the base model.