Old Zhang's AI Learning
Apr 19, 2026 · Artificial Intelligence

From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide

This guide shows how to fine‑tune Qwen3.5 models—from 0.8B to 122B—using Unsloth Studio or pure code, covering text SFT, vision fine‑tuning, MoE models, reinforcement‑learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.
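
As a taste of the workflow the guide covers, here is a minimal LoRA SFT sketch in the style of Unsloth's Colab notebooks; the model repo id and dataset are placeholders, not the guide's exact choices.

```python
# Minimal LoRA SFT sketch in the Unsloth notebook style.
# The repo id and dataset are placeholders, not the guide's exact choices.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-9B",   # hypothetical repo id
    max_seq_length=4096,
    load_in_4bit=True,                 # 4-bit base weights to fit modest GPUs
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Any instruction dataset works; flatten it into a single "text" column.
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")
dataset = dataset.map(lambda ex: {
    "text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```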

Fine-tuning · LLM · Qwen3.5
12 min read
Old Zhang's AI Learning
Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on single‑GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.
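
For background, draft-then-verify decoding lets a cheap drafter propose a block of tokens that the target model checks in a single forward pass; DFlash swaps the autoregressive drafter for a block diffusion model. A toy sketch of the greedy accept/reject rule (illustrative only, not the DFlash API):

```python
# Toy greedy accept/reject rule for draft-then-verify decoding.
# `target_preds` stands in for one batched target forward pass: the target's
# argmax token at each draft position plus one extra position. Illustrative
# names only; this is not the DFlash implementation.
def accept_block(draft, target_preds):
    accepted = []
    for tok, pred in zip(draft, target_preds):
        if tok == pred:
            accepted.append(tok)   # draft token matches the target: keep it
        else:
            accepted.append(pred)  # first mismatch: take the target's token, stop
            return accepted
    accepted.append(target_preds[len(draft)])  # whole block accepted: bonus token
    return accepted

# Example: the drafter proposed 4 tokens, the target agrees with the first 2.
print(accept_block([5, 9, 3, 7], [5, 9, 4, 7, 1]))  # -> [5, 9, 4]
```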

DFlash · Inference Acceleration · Qwen3.5
12 min read
Old Zhang's AI Learning
Apr 10, 2026 · Artificial Intelligence

How a 9B‑Parameter Qwen3.5 Model Achieves Fully Automated Data Analysis on a Consumer GPU

The open‑source CoPaw‑Flash‑9B‑DataAnalyst‑LoRA model, fine‑tuned via LoRA, can autonomously load, explore, statistically analyze, visualize, and generate structured reports for CSV/Excel/JSON datasets, achieving a 90% success rate with an average of 26 iteration rounds, and it runs on a single consumer‑grade GPU using vLLM and the Data Analyst framework.
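
For context, loading a LoRA adapter on top of its base model with vLLM's offline API looks roughly like this; both repo ids are placeholders, and the Data Analyst agent loop itself is not shown.

```python
# Sketch: serving a LoRA adapter with vLLM's offline API.
# Both repo ids are placeholders; the agent loop from the article is not shown.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen3.5-9B", enable_lora=True)  # hypothetical base repo id
out = llm.generate(
    "Load sales.csv and report the top 5 products by revenue.",
    SamplingParams(max_tokens=512, temperature=0.0),
    lora_request=LoRARequest("data-analyst", 1, "CoPaw-Flash-9B-DataAnalyst-LoRA"),
)
print(out[0].outputs[0].text)
```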

Agent · Data Analyst · GPU
10 min read
Old Zhang's AI Learning
Apr 3, 2026 · Artificial Intelligence

Qwopus3.5‑v3: From Reason‑Then‑Act to Act‑Then‑Refine – Claude‑Opus Distillation Turns Qwen3.5 into a Tool‑Using Agent

The newly released Qwopus3.5‑v3 model combines higher‑quality reasoning chains, dedicated tool‑calling reinforcement learning, and an act‑then‑refine paradigm, delivering a 5‑point HumanEval boost, a 1.43‑point MMLU‑Pro gain, 31.7% faster inference and 24% lower token cost, while remaining runnable on a 3090 or a 16 GB MacBook, with easy deployment via GGUF, LM Studio, Ollama or llama.cpp.
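
As a taste of what "tool-using agent" means in practice, one tool-call round trip through Ollama's Python client might look like the sketch below; the model tag and the weather tool are illustrative, not from the article.

```python
# Sketch: one tool-calling round trip via the ollama Python package.
# The model tag "qwopus3.5-v3" and the weather tool are illustrative placeholders.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = ollama.chat(
    model="qwopus3.5-v3",  # placeholder tag
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
# An act-then-refine agent would execute these calls and feed results back.
for call in resp.message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```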

Claude Opus · Distillation · HumanEval
12 min read
AI Engineering
Apr 1, 2026 · Artificial Intelligence

Holo3 AI Model Beats GPT‑5.4 at One‑Tenth the Cost for Computer Use

H Company’s new Holo3 series delivers a visual language model that outperforms GPT‑5.4 on the OSWorld‑Verified benchmark with a 78.85% score while costing only about one‑tenth as much, offering both a flagship API‑only version and an open‑source lightweight variant optimized for GUI agents.

AI benchmark · GUI agent · Holo3
4 min read
Old Zhang's AI Learning
Mar 31, 2026 · Artificial Intelligence

Turning a Bluetooth Speaker into a Smart Assistant with Qwen 3.5‑Omni

The author demonstrates a proof‑of‑concept that combines Qwen 3.5‑Omni's real‑time internet search and audio output with a locally hosted voice‑wake‑up model to transform a Bluetooth speaker into an always‑on smart assistant, while noting latency challenges and the potential of a sub‑10B open‑source alternative.

AI integration · Bluetooth · Large Language Model
2 min read
Old Zhang's AI Learning
Mar 28, 2026 · Artificial Intelligence

Qwen3.5-27B Outperforms the 397B Model in Tool Calling – Q6 Quantization Is Optimal

Using the open‑source ToolCall‑15 benchmark, the author shows that the 27‑billion‑parameter Qwen3.5 model consistently scores full marks while the 397‑billion‑parameter version fails on several tasks, and that the Q6 quantized variant offers the best trade‑off between size and tool‑calling accuracy.
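
To reproduce this kind of check yourself, a single tool-calling probe against a local OpenAI-compatible endpoint looks roughly like this; the endpoint, model name, and tool schema are illustrative, not the ToolCall-15 harness itself.

```python
# Sketch: one tool-calling probe against a local OpenAI-compatible server.
# Endpoint, model name, and tool schema are illustrative; the ToolCall-15
# harness and its scoring are not reproduced here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="qwen3.5-27b-q6",  # placeholder model name
    messages=[{"role": "user", "content": "Convert 100 USD to EUR."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "convert_currency",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "from": {"type": "string"},
                    "to": {"type": "string"},
                },
                "required": ["amount", "from", "to"],
            },
        },
    }],
)
# A "pass" means the model emitted a well-formed call with the right arguments.
print(resp.choices[0].message.tool_calls)
```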

AI · LLM benchmark · Qwen3.5
7 min read
Old Zhang's AI Learning
Mar 19, 2026 · Artificial Intelligence

Testing the Trending oMLX on Mac: Claude‑Opus‑4.6 Distilled and Qwen3.5‑9B Performance Review

The article evaluates oMLX, a Mac‑only LLM runtime built on Apple Silicon and MLX, walking through installation, UI features, memory usage, single‑request speed, benchmark results for Claude‑Opus‑4.6 and Qwen3.5‑9B, continuous‑batching gains, Claude Code optimizations, multi‑model support, and why it fails to run a 27B model.
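
oMLX sits on top of MLX; independent of the app itself, loading and generating with the underlying mlx-lm package looks like this (the repo id is a placeholder, and oMLX's own APIs are not shown).

```python
# Sketch: bare MLX inference via the mlx-lm package (pip install mlx-lm),
# independent of the oMLX app. The repo id is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-9B-4bit")  # hypothetical repo id
text = generate(model, tokenizer,
                prompt="Explain KV caching in one paragraph.",
                max_tokens=200)
print(text)
```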

Apple Silicon · Claude Opus · MLX
9 min read
Old Zhang's AI Learning
Mar 18, 2026 · Artificial Intelligence

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.
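
To get a comparable tokens-per-second number without the full Docker setup, a quick probe via llama-cpp-python might look like this; the GGUF path is a placeholder, and the article benchmarks the llama.cpp server directly.

```python
# Sketch: quick tokens/s probe with llama-cpp-python (pip install llama-cpp-python).
# The GGUF path is a placeholder; the article's numbers come from the llama.cpp server.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-27b-opus-distill.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=65536,       # the article reports a 64K context window
)

t0 = time.time()
out = llm("Write a haiku about GPUs.", max_tokens=128)
n = out["usage"]["completion_tokens"]
print(f"{n / (time.time() - t0):.1f} tokens/s")
```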

Claude Opus · Docker · LLM inference
5 min read
Old Zhang's AI Learning
Mar 16, 2026 · Artificial Intelligence

Testing Claude‑Opus‑4.6 Distilled Qwen3.5 9B Model Locally via LM Studio and Claude Code

The article evaluates the GGUF‑quantized Claude‑Opus‑4.6 distilled Qwen3.5 9B model on a 16 GB Mac Mini M4 using LM Studio, detailing model sizes, performance metrics, deployment steps, API integration with Claude Code, and concluding that while the 9B version is usable, its capabilities remain limited compared to larger models.
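
LM Studio exposes an OpenAI-compatible server (port 1234 by default), so any OpenAI client can script against the local model; the model identifier below is a placeholder for whatever LM Studio shows in its UI.

```python
# Sketch: querying LM Studio's local OpenAI-compatible server.
# Port 1234 is LM Studio's default; the model identifier is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
resp = client.chat.completions.create(
    model="qwen3.5-9b-opus-distill-gguf",  # placeholder identifier
    messages=[{"role": "user", "content": "Summarize what GGUF quantization does."}],
)
print(resp.choices[0].message.content)
```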

Claude Opus · GGUF · LM Studio
12 min read
Old Zhang's AI Learning
Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.
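
If the Anthropic compatibility works as summarized, the stock anthropic Python client should be able to target a local vLLM server; the base URL and model id below are assumptions based on this summary, not flags verified against the release.

```python
# Sketch: pointing the stock anthropic client at a local vLLM server,
# assuming the Anthropic-compatible endpoint described above.
# Base URL and model id are assumptions, not verified against the release.
import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="none")
msg = client.messages.create(
    model="Qwen/Qwen3.5-27B",  # placeholder repo id
    max_tokens=256,
    messages=[{"role": "user", "content": "What's new in vLLM 0.17.0?"}],
)
print(msg.content[0].text)
```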

ASR · Anthropic API · FlashAttention
12 min read
Old Zhang's AI Learning
Mar 3, 2026 · Artificial Intelligence

How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.
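
The export step at the end of that Colab workflow is essentially a one-liner in Unsloth; the checkpoint directory and quantization preset below are examples, not the guide's exact values.

```python
# Sketch: loading a fine-tuned checkpoint and exporting it to GGUF with Unsloth.
# The checkpoint directory and quantization preset are examples.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("outputs")  # fine-tuned checkpoint dir
model.save_pretrained_gguf(
    "qwen3.5-4b-gguf",             # output directory (example)
    tokenizer,
    quantization_method="q4_k_m",  # one of llama.cpp's K-quant presets
)
```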

AI Models · Fine-tuning · GGUF
19 min read
AI Engineering
Mar 3, 2026 · Artificial Intelligence

Alibaba Qwen‑3.5 Small Models: 0.8B Parameters Enable Video on Edge Devices

Alibaba released four Qwen‑3.5 models (0.8B‑9B) that use a Gated DeltaNet hybrid‑attention architecture and native multimodal training to achieve 262k‑token contexts, outperform larger rivals on visual, reasoning, and math benchmarks, and run video analysis on phones and laptops, though they still demand significant VRAM.

Edge AI · Gated DeltaNet · Multimodal AI
6 min read
Old Zhang's AI Learning
Mar 2, 2026 · Artificial Intelligence

Qwen3.5 Small Models Unveiled: From 0.8B to 9B with Full Capabilities

The article introduces the newly released Qwen3.5 small model series (0.8B, 2B, 4B, 9B), explains their shared Gated Delta Networks architecture, early multimodal token fusion, 201‑language support and up to 1 million‑token context, and presents benchmark data that show the 9B model rivaling much larger LLMs, followed by practical guidance on model selection and deployment.

Gated Delta Networks · Local Deployment · Qwen3.5
10 min read
Old Zhang's AI Learning
Mar 2, 2026 · Artificial Intelligence

Why the Qwen3.5 Series Makes Qwen3.5-27B the No‑Brainer Choice

The author reviews the Qwen3.5 model family, showing that the 27‑billion‑parameter dense Qwen3.5-27B offers the best balance of size, stability, low‑cost local deployment, and comprehensive capabilities, making it the default pick for most users.

AI benchmarking · Large Language Model · Local Deployment
6 min read