
Open‑Source Qwen3.6‑35B‑A3B Runs at 162 tok/s on a Single RTX 5090

This article introduces the open‑source Qwen3.6‑35B‑A3B model, explains its MoE architecture and three‑stage LoRA fine‑tuning, reports benchmark results of 161.9 tok/s on an RTX 5090 (2.6× faster than a dense 27B counterpart), and covers deployment tips, the quantized GGUF release, and known compatibility pitfalls.

Old Zhang's AI Learning

May 11, 2026

Model Overview

Qwen3.6‑35B‑A3B (also called Qwopus3.6‑35B‑A3B‑v1) is a 35‑billion‑parameter mixture‑of‑experts (MoE) model that activates only about 3 billion parameters per token during inference. It contains 256 experts, supports a 262 k token context window, and combines Gated DeltaNet linear attention with standard gated attention. The model targets agentic coding, deep reasoning, and multimodal tasks. On a single consumer‑grade RTX 5090 it averages 161.9 tok/s, which is 2.6× faster than a dense 27 B model of the same family.
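To make the "active parameters" idea concrete, here is a minimal sketch of top‑k expert routing. The article only gives the expert count and the total and active parameter sizes; the value of TOP_K, the random router scores, and the function names are assumptions made for illustration, not the model's actual router.

```python
import math
import random

NUM_EXPERTS = 256        # from the article
TOTAL_PARAMS_B = 35.0    # from the article
ACTIVE_PARAMS_B = 3.0    # from the article
TOP_K = 8                # ASSUMPTION: the article does not state k


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, k=TOP_K):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical router logits for one token.
    scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = route_token(scores)
    print(f"experts used for this token: {sorted(chosen)}")
    print(f"active share of parameters: {ACTIVE_PARAMS_B / TOTAL_PARAMS_B:.1%}")
```

Only the selected experts run for a given token, which is why per‑token compute tracks the 3 B active parameters rather than the full 35 B.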

Fine‑tuning Process

Jackrong, who published the fine‑tune, applied a three‑stage curriculum‑learning SFT (supervised fine‑tuning) using LoRA, training only 9 % of the parameters:

Stage 1 – Format Establishment: short‑to‑medium‑length samples stabilize the output format and basic reasoning paths, preventing the base model’s style from being corrupted.

Stage 2 – Complexity Increase + Multi‑Teacher Distillation: the proportion of complex reasoning samples is raised gradually, with distillation from a 27 B teacher model chosen for stylistic similarity to avoid a large capability gap.

Stage 3 – Long‑Context Reinforcement + Anti‑Drift: long‑context inference is strengthened while 10 % short‑sample replay is retained to avoid catastrophic forgetting of basic instruction following.
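The Stage 3 replay idea can be sketched as a simple data‑mixing step. This is an illustration under stated assumptions, not the author's training code: the function name, the sample representation, and the choice to measure the 10 % against the final mixed set are all hypothetical.

```python
import random


def mix_with_replay(long_samples, short_samples, replay_ratio=0.10, seed=0):
    """Return a shuffled list in which about `replay_ratio` of items are short replays.

    ASSUMPTION: the ratio is measured against the final mixed set.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_short / (n_long + n_short) = replay_ratio for n_short.
    n_short = round(len(long_samples) * replay_ratio / (1.0 - replay_ratio))
    # Never ask for more replay samples than are available.
    n_short = min(n_short, len(short_samples))
    mixed = list(long_samples) + rng.sample(list(short_samples), n_short)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    long_ctx = [("long", i) for i in range(90)]
    short_ctx = [("short", i) for i in range(50)]
    batch = mix_with_replay(long_ctx, short_ctx)
    n_replay = sum(1 for kind, _ in batch if kind == "short")
    print(f"{n_replay} of {len(batch)} samples are short replays")
```

Keeping a fixed share of earlier‑stage data in the mix is a common way to limit forgetting while the model is pushed toward longer contexts.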

The author notes that a 9 % trainable ratio is risky in MoE architectures because it can increase training instability and weight‑merge conflicts.

Evaluation Results

Benchmark screenshots in the original post illustrate the model’s speed advantage. Key numbers:

RTX 5090 single‑card average throughput: 161.9 tok/s

Compared to a dense 27 B model: 2.6× faster

This throughput is notable for a 35 B‑scale model running on a single consumer‑grade GPU.
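The two reported figures imply a baseline that the article does not state directly. The short calculation below derives it, assuming the 2.6× factor compares average decode throughput on the same card; the function names and the 2,000‑token response length are hypothetical.

```python
MOE_TOK_PER_S = 161.9   # from the article
SPEEDUP = 2.6           # from the article


def implied_dense_throughput(moe_tok_per_s=MOE_TOK_PER_S, speedup=SPEEDUP):
    """Throughput of the dense 27 B baseline implied by the reported speedup."""
    return moe_tok_per_s / speedup


def seconds_to_generate(num_tokens, tok_per_s):
    """Decode time only; prompt processing is ignored in this sketch."""
    return num_tokens / tok_per_s


if __name__ == "__main__":
    dense = implied_dense_throughput()
    print(f"implied dense baseline: {dense:.1f} tok/s")
    tokens = 2000  # hypothetical response length
    print(f"MoE:   {seconds_to_generate(tokens, MOE_TOK_PER_S):.1f} s for {tokens} tokens")
    print(f"dense: {seconds_to_generate(tokens, dense):.1f} s for {tokens} tokens")
```

Under that assumption the dense model would run at roughly 62 tok/s, so a long agent response that takes about 12 seconds on the MoE would take about 32 seconds on the dense baseline.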

Typical Use Cases

One‑click HTML/CSS generation: evaluated as one of the strongest open‑source one‑shot front‑end generators, producing complete pages with complex interactions.

Complex reasoning + long‑context JSON extraction: fixes earlier “thinking starvation” issues, yielding more stable multi‑step agent planning outputs.

Vision + Tool Calling: requires placing mmproj.gguf alongside the main .gguf file.

262 k context with stable memory usage: thanks to Gated DeltaNet linear attention, memory consumption does not explode as sequence length grows.
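The memory claim in the last item can be illustrated with a rough comparison between a per‑token key/value cache and a fixed‑size linear‑attention state. Every architecture number below (layer counts, head counts, dimensions) is a hypothetical placeholder chosen to show the shape of the growth; none of them comes from the article or from the model's configuration.

```python
# ASSUMPTION: all architecture numbers here are hypothetical placeholders.
FULL_ATTN_LAYERS = 10
LINEAR_ATTN_LAYERS = 30


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Standard attention: keys and values are stored for every past token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def linear_state_bytes(n_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    """Linear attention: one fixed-size state matrix per head, independent of length."""
    return n_layers * n_heads * key_dim * value_dim * bytes_per_elem


if __name__ == "__main__":
    state = linear_state_bytes(LINEAR_ATTN_LAYERS, n_heads=16, key_dim=128, value_dim=128)
    for seq_len in (8_192, 65_536, 262_144):
        kv = kv_cache_bytes(seq_len, FULL_ATTN_LAYERS, n_kv_heads=2, head_dim=256)
        print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:8.1f} MiB, "
              f"linear state {state / 2**20:6.1f} MiB")
```

Because the model is described as a hybrid, the standard‑attention layers still keep a cache that grows with sequence length; the saving comes from the linear‑attention layers, whose state stays the same size at 8 k or 262 k tokens.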

Quantized GGUF Release

A GGUF‑quantized version is provided for easy local execution. Repository address:

Jackrong/Qwopus3.6-35B-A3B-v1-GGUF
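Once the files from this repository are on disk, a small check can confirm that a projector file sits next to the main model, as the vision use case requires. This is a sketch using only the standard library; the model file name in the usage example is hypothetical, and matching any .gguf file whose name contains "mmproj" is an assumption about how the projector is named.

```python
from pathlib import Path


def find_mmproj(model_path):
    """Return the first *.gguf file containing 'mmproj' next to `model_path`, or None."""
    model = Path(model_path)
    candidates = sorted(
        p for p in model.parent.glob("*.gguf")
        if p != model and "mmproj" in p.name.lower()
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # Hypothetical local path; substitute the file actually downloaded.
    model_file = Path("models/qwopus-q4.gguf")
    projector = find_mmproj(model_file)
    if projector is None:
        print("No mmproj file found next to the model; vision input will not work.")
    else:
        print(f"Found projector: {projector}")
```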

Compatibility Warning

When fine‑tuning locally with LoRA, be aware of a known incompatibility among PEFT/LoRA, Transformers 5.x, and the Unsloth patch. Merging LoRA weights may raise errors such as:

ModuleNotFoundError: Could not import module 'Qwen3_5MoeForConditionalGeneration'

The MoE expert‑layer weight structure differs significantly from dense models, often causing structural mismatches. Users should be prepared to apply manual patches or downgrade specific library versions.
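Before attempting a merge, it can help to record which versions of the three libraries are installed. The sketch below uses only the standard library's importlib.metadata; the helper names are hypothetical, and since the article does not say which version combinations work, the script only reports versions and flags a 5.x Transformers install rather than recommending a pin.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "peft", "unsloth")):
    """Map each package name to its installed version string, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def is_transformers_5x(version_string):
    """True when the version string has major version 5."""
    return version_string is not None and version_string.split(".")[0] == "5"


if __name__ == "__main__":
    versions = installed_versions()
    for name, ver in versions.items():
        print(f"{name}: {ver or 'not installed'}")
    if is_transformers_5x(versions["transformers"]):
        print("Transformers 5.x detected: the article reports LoRA merge "
              "problems with this series when combined with PEFT and Unsloth.")
```

Saving this output alongside a failing merge makes it easier to tell whether a later downgrade or manual patch actually changed the environment.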

Author’s Assessment

The model’s main value lies in delivering near‑professional‑grade throughput for a 35 B‑scale MoE on a consumer‑grade single GPU. Developers working on UI generation, agent orchestration, or long‑context reasoning may find it worthwhile, as its fine‑tuning quality combined with MoE speed makes it stand out among community models.
