Bridging Tokenizer Gaps: Cross-Tokenizer Knowledge Distillation at AAAI 2026

This paper introduces SeDi, a semantics‑ and distribution‑aware cross‑tokenizer knowledge distillation framework that aligns teacher and student token spaces via bipartite graph components and top‑K re‑encoding, achieving state‑of‑the‑art performance and lower exposure bias on multiple LLM benchmarks.


Research Motivation

Cross‑tokenizer knowledge distillation aims to transfer knowledge from a teacher model to a student model that uses a different tokenizer. Existing methods struggle to convey the rich token‑level semantics encoded in the teacher’s output distribution and often introduce tokenizer‑specific noise, limiting distillation effectiveness.

Existing Approaches and Their Limitations

Optimal‑transport‑based methods: use Wasserstein distance to align output distributions globally, but ignore fine‑grained character‑level semantic correspondences.

Explicit vocabulary mapping: builds static token mappings via edit distance or other string‑similarity metrics and attempts direct token conversion.

Projection‑based approaches: introduce additional projection modules that map teacher and student embeddings into a shared semantic space.

These techniques still suffer from clear drawbacks: cross‑boundary misalignment, output duplication, teacher‑bias injection, and lazy learning that discards the teacher’s distributional information.

Proposed Method: SeDi

The authors propose SeDi (Semantics and Distribution‑aware Cross‑tokenizer Distillation), which consists of two core strategies:

Semantic‑preserving token migration: token‑level alignment is modeled as a bipartite graph in which the teacher and student token sets form disjoint node groups. Edges are created based on character‑span overlap, and connected‑component analysis yields alignment groups, eliminating cross‑boundary and duplicate alignments.

Distribution‑aware entropy alignment: retain both the teacher’s top‑K candidate tokens (factual knowledge) and the overall confidence distribution (entropy). The student’s pseudo‑label distribution is constructed by re‑encoding the selected teacher tokens into the student’s vocabulary and aligning its entropy with the teacher’s.
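As a rough illustration of this idea, the sketch below re‑encodes the teacher’s top‑K candidate tokens into the student’s vocabulary using Hugging Face‑style decode/encode calls and adds a simple entropy‑matching penalty. The function names, the uniform splitting of teacher mass across re‑encoded tokens, and the absolute‑difference entropy penalty are assumptions made for illustration; they are not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def build_student_pseudo_labels(teacher_probs, teacher_tokenizer, student_tokenizer,
                                student_vocab_size, k=8):
    """Re-encode the teacher's top-K candidate tokens into the student's
    vocabulary and accumulate their probability mass into a pseudo-label
    distribution (illustrative sketch, not the paper's exact procedure)."""
    topk_probs, topk_ids = teacher_probs.topk(k)
    pseudo = torch.zeros(student_vocab_size)
    for p, tid in zip(topk_probs.tolist(), topk_ids.tolist()):
        text = teacher_tokenizer.decode([tid])                 # teacher token -> surface string
        sids = student_tokenizer.encode(text, add_special_tokens=False)
        if not sids:
            continue
        for sid in sids:                                       # split the teacher mass uniformly
            pseudo[sid] += p / len(sids)                       # (a simplifying assumption)
    return pseudo / pseudo.sum().clamp_min(1e-12)              # renormalise

def entropy_aligned_loss(student_logits, pseudo, teacher_entropy):
    """Cross-entropy against the pseudo-labels plus a penalty pulling the
    student's predictive entropy toward the teacher's."""
    log_q = F.log_softmax(student_logits, dim=-1)
    ce = -(pseudo * log_q).sum()
    student_entropy = -(log_q.exp() * log_q).sum()
    return ce + (student_entropy - teacher_entropy).abs()
```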

Algorithmically, the token‑migration step proceeds in three stages: node construction, edge construction based on character overlap, and connected‑component grouping to obtain aligned token groups.
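A minimal sketch of these three stages follows, assuming each tokenizer can report character offsets for its tokens (for example, the offset_mapping of a Hugging Face fast tokenizer). The union‑find grouping here is an illustrative reconstruction, not the authors’ reference implementation.

```python
def align_token_groups(teacher_spans, student_spans):
    """Group teacher and student tokens by character-span overlap.

    teacher_spans / student_spans: lists of (start, end) character offsets,
    e.g. the offset_mapping of a Hugging Face fast tokenizer.
    Returns a list of (teacher_indices, student_indices) alignment groups.
    """
    n_t, n_s = len(teacher_spans), len(student_spans)
    parent = list(range(n_t + n_s))          # union-find over all nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Edge construction: connect tokens whose character spans overlap.
    for i, (ts, te) in enumerate(teacher_spans):
        for j, (ss, se) in enumerate(student_spans):
            if max(ts, ss) < min(te, se):    # non-empty overlap
                union(i, n_t + j)

    # Connected-component grouping.
    groups = {}
    for node in range(n_t + n_s):
        groups.setdefault(find(node), []).append(node)

    result = []
    for members in groups.values():
        t_idx = [m for m in members if m < n_t]
        s_idx = [m - n_t for m in members if m >= n_t]
        if t_idx and s_idx:                  # keep components covering both sides
            result.append((t_idx, s_idx))
    return result
```

Because each character position belongs to exactly one group, this construction avoids the cross‑boundary and duplicate alignments that static edit‑distance mappings can produce.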

Experiments

Three emergence‑focused tasks are used: instruction following (Dolly), code generation (CodeM), and mathematical reasoning (MetaMath). Experiments cover five teacher‑student pairs with varying vocabulary sizes and six mainstream cross‑tokenizer distillation baselines.

Results show that SeDi consistently outperforms all baselines:

Instruction following: +12.21% over the best baseline.

Mathematical reasoning: +19.80%.

Code generation: +8.70%.

On the ORCA dataset, Pass@1 improves by +19.8% compared to SFT.

Even when teacher and student differ by 41.9× in parameter count (Deepseek‑Coder‑6.7B → Pythia‑160M), SeDi yields gains of up to +8.70%.

Exposure Bias Reduction

During inference, the student must generate tokens without ground‑truth labels, leading to exposure bias. SeDi incorporates the student’s own output distribution into the distillation loss, resulting in consistently lower exposure bias as generation length increases.
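One simple way to fold the student’s own distribution into the loss is sketched below. The interpolation weight alpha and the KL formulation are hypothetical choices for illustration, not necessarily how SeDi defines its training objective.

```python
import torch.nn.functional as F

def mixed_distillation_loss(student_logits, pseudo_labels, alpha=0.7):
    """Blend teacher-derived pseudo-labels with the student's own (detached)
    output distribution before computing the KL term.
    The weight ``alpha`` is an assumed value used purely for illustration."""
    log_q = F.log_softmax(student_logits, dim=-1)
    own = log_q.detach().exp()                       # student's current beliefs
    target = alpha * pseudo_labels + (1.0 - alpha) * own
    return F.kl_div(log_q, target, reduction="sum")  # KL(target || student)
```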

Computational Overhead

Training time per step and peak memory usage of SeDi are only marginally higher than several baselines, staying within the same order of magnitude. This demonstrates that SeDi improves distillation quality without imposing prohibitive computational costs.

Conclusion

SeDi identifies key limitations of prior cross‑tokenizer distillation methods and offers a systematic, multi‑level alignment solution that jointly leverages semantic and distributional information. The framework achieves superior performance, robustness, and efficiency, providing a solid foundation for future multi‑granularity knowledge transfer research.

Code repository: https://github.com/MaybeLizzy/SEDI

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

AI research, knowledge distillation, language models, cross-tokenizer distillation, entropy alignment, semantic alignment
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
