How DeepSeek R1T‑Chimera Cuts Tokens by 40% Without Fine‑Tuning
DeepSeek‑R1T‑Chimera merges DeepSeek‑R1’s reasoning ability into the DeepSeek‑V3‑0324 architecture: it reuses most V3 weights and swaps in only R1’s routed experts. The result matches R1’s intelligence while producing about 40% fewer output tokens and running faster, with no fine‑tuning or distillation.
Overview
The DeepSeek‑R1T‑Chimera model is built by directly merging weights from the DeepSeek‑R1 checkpoint into the DeepSeek‑V3‑0324 checkpoint. No additional fine‑tuning, distillation, or post‑training is performed. The resulting model retains R1’s reasoning capability while running faster and producing roughly 40% fewer output tokens.
Model repository: https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
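As a quick sanity check that the artifact is a standard DeepSeek‑style checkpoint, it should load through the usual transformers path. A minimal loading sketch, assuming a multi‑GPU node with enough memory for the full model (roughly 671B parameters, matching V3/R1) and that the repo’s remote code is trusted:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tngtech/DeepSeek-R1T-Chimera"

    # trust_remote_code is needed for DeepSeek's custom model code;
    # device_map="auto" requires the accelerate package.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        torch_dtype="auto",
        device_map="auto",
    )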
Construction Method
The merge keeps V3’s shared experts and replaces the routed experts with those from R1. All other components (embeddings, dense blocks, and the attention sub‑layers inside MoE blocks) remain unchanged. The result is a hybrid MoE in which R1’s routed experts sit inside an otherwise intact V3 architecture; a code sketch of this rule follows the list below.
Parameter Reuse Details
Embedding layers are taken directly from V3.
The first three dense blocks are copied unchanged from V3.
All attention sub‑layers inside MoE blocks are reused from V3.
Shared experts (the "shared_experts" sub‑module) are kept exactly as in V3.
Routed experts (the "experts" sub‑module) are replaced entirely by the corresponding routed experts from R1.
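To make the rule concrete, here is a minimal merge sketch. It assumes v3_state and r1_state are complete {name: tensor} state dicts for the two source models; the regex, function name, and in‑memory approach are illustrative only (the real checkpoints are sharded safetensors files, so a practical merge would stream shard by shard).

    import re

    # Routed-expert parameters look like
    # "model.layers.5.mlp.experts.17.up_proj.weight"; shared experts
    # ("mlp.shared_experts.") deliberately do not match.
    ROUTED_EXPERT = re.compile(r"model\.layers\.\d+\.mlp\.experts\.\d+\.")

    def merge_chimera(v3_state, r1_state):
        # Start from V3 (embeddings, dense blocks, attention, gates, shared
        # experts) and overwrite only the routed experts with R1's weights.
        return {
            name: (r1_state[name] if ROUTED_EXPERT.match(name) else tensor)
            for name, tensor in v3_state.items()
        }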
Verification Script
The following Python snippet checks tensor equality between the original V3 checkpoint, the R1 checkpoint, and the merged Chimera checkpoint. For every layer it compares the router gate weight, the shared‑expert projection weights, and each routed expert’s projection weights, printing whether the tensors match within an absolute tolerance of 1e‑5.
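The script assumes three in‑memory dictionaries, v3_tensors, r1_tensors, and r1t_tensors, mapping parameter names to tensors. One way to build them, sketched under the assumption that each checkpoint’s safetensors shards sit in a local directory (the paths are hypothetical):

    import glob
    from safetensors.torch import load_file

    def load_tensors(ckpt_dir):
        # Merge every shard of a checkpoint into one {name: tensor} dict.
        # Caution: holding a full checkpoint of this size in RAM is
        # impractical; a real run would process one shard (or key) at a time.
        tensors = {}
        for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
            tensors.update(load_file(shard))
        return tensors

    v3_tensors = load_tensors("DeepSeek-V3-0324")
    r1_tensors = load_tensors("DeepSeek-R1")
    r1t_tensors = load_tensors("DeepSeek-R1T-Chimera")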
import torch

# v3_tensors, r1_tensors, and r1t_tensors are the {name: tensor} dicts
# built above.
for li in range(62):             # transformer layers
    for i in range(-2, 256):     # -2: router gate, -1: shared experts, 0-255: routed experts
        print('=' * 100)
        print(f"Expert: {i}")
        if i == -2:
            K = f"model.layers.{li}.mlp.gate.weight"
            print(K)
        for mlp_name in ("up", "gate", "down"):
            if i == -1:
                K = f"model.layers.{li}.mlp.shared_experts.{mlp_name}_proj.weight"
                print(f"Shared MLP Name: {mlp_name}")
            if i >= 0:
                K = f"model.layers.{li}.mlp.experts.{i}.{mlp_name}_proj.weight"
                print(f"Expert MLP Name: {mlp_name}")
            if K not in v3_tensors:
                continue  # the first dense blocks have no MoE sub-modules
            print("v3 r1t", end=" ")
            print(torch.allclose(v3_tensors[K].to(torch.float32), r1t_tensors[K].to(torch.float32), atol=1e-5))
            print("v3 r1", end=" ")
            print(torch.allclose(v3_tensors[K].to(torch.float32), r1_tensors[K].to(torch.float32), atol=1e-5))
            print("r1 r1t", end=" ")
            print(torch.allclose(r1t_tensors[K].to(torch.float32), r1_tensors[K].to(torch.float32), atol=1e-5))
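If the merge follows the reuse rules above, the expected pattern is that the router gate and shared‑expert weights print True for "v3 r1t" (Chimera matches V3), while the routed‑expert weights print True for "r1 r1t" (Chimera matches R1).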
Caveats and Recommendations
Because the approach has not been described in a formal paper, extensive empirical validation is advised. In particular, the MoE router continues to use V3’s gating mechanism to select among R1’s routed experts, and it is unclear whether this combination is robust across all tasks. Researchers interested in post‑training model manipulation should experiment with this technique and report any observed limitations or performance variations.