Old Zhang's AI Learning
Author

AI practitioner focused on large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, publishing original technical articles daily.

141 Articles · 0 Likes · 3 Views · 0 Comments
Recent Articles

Latest from Old Zhang's AI Learning

Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to a 5× speed‑up for Qwen3.5‑27B on a single GPU and 1.5–1.9× under high‑concurrency workloads while preserving output quality.

DFlash · Inference Acceleration · Qwen3.5
0 likes · 12 min read
Apr 13, 2026 · Artificial Intelligence

How Harness Engineering Makes or Breaks AI Agents – Lessons from Hsu’s 2026 Lecture

The article explains Harness Engineering, the set of tools that shape an AI agent's cognitive framework, capability boundaries, and behavior flow, and uses concrete examples, benchmarks, and research citations to show how a well‑designed harness can turn a modest model into a high‑performing agent while a poor one causes failures.

AI Agent · Context Engineering · Harness Engineering
0 likes · 12 min read
Apr 13, 2026 · Artificial Intelligence

Fine‑Tune Any Large Model on Apple Silicon with mlx‑tune

The article introduces mlx‑tune, a community project that wraps the MLX library with Unsloth's API to enable local fine‑tuning of large language, vision, TTS, STT, OCR, and embedding models on Apple Silicon Macs; it outlines the prototype‑to‑cloud workflow, provides installation steps and code examples, and discusses the project's capabilities and limitations.

Apple Silicon · Unsloth API · large language models
0 likes · 9 min read
Apr 12, 2026 · Artificial Intelligence

How to Deploy MiniMax-M2.7 Quantized Models Locally on macOS and Linux

This guide explains the 22 GGUF quantized versions of MiniMax-M2.7 released by Unsloth, compares their accuracy and size, recommends the UD‑Q4_K_XL build for the best quality‑to‑size trade‑off, and gives step‑by‑step instructions for local deployment via Unsloth Studio, llama.cpp, an API server, or the native MLX path, along with important pitfalls and performance‑tuning tips (a minimal loading sketch follows this entry).

Dynamic 2.0 · Local Deployment · MLX
0 likes · 14 min read
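
As a rough companion to the entry above (not code from the guide itself), here is a minimal sketch of loading a local GGUF quant with the llama-cpp-python bindings; the file name, context size, and GPU-layer setting are placeholders rather than the guide's recommended configuration.

```python
# Minimal sketch: loading a local GGUF quant with llama-cpp-python.
# The model path and settings are placeholders, not the guide's
# recommended configuration for MiniMax-M2.7 UD-Q4_K_XL.
from llama_cpp import Llama

llm = Llama(
    model_path="./MiniMax-M2.7-UD-Q4_K_XL.gguf",  # placeholder local file
    n_ctx=8192,        # context window; adjust to available memory
    n_gpu_layers=-1,   # offload all layers to GPU/Metal if possible
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize KV caching in one sentence."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```

Swapping in a different quant only means pointing `model_path` at another file; memory use scales with the quant size and `n_ctx`.
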
Apr 12, 2026 · Artificial Intelligence

Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide

MiniMax‑M2.7, the newly open‑sourced 230‑billion‑parameter MoE model, offers self‑evolution along with professional software‑engineering and agent capabilities, and can be deployed locally with Ollama, vLLM, SGLang, or Docker on 4–8 H200 GPUs; the article details the hardware requirements, performance gains, and tool‑calling/Thinking features (a short client‑side query sketch follows this entry).

GPU · LLM · MiniMax M2.7
0 likes · 11 min read
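
Not taken from the article, but as a quick illustration of what querying such a local deployment tends to look like: servers like vLLM and SGLang expose an OpenAI-compatible endpoint, so a standard openai client call works. The base URL, API key, and model name below are placeholders.

```python
# Minimal sketch: querying a locally served model through an
# OpenAI-compatible endpoint (as exposed by vLLM or SGLang).
# base_url, api_key, and model name are placeholders, not values
# prescribed by the article.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="MiniMax-M2.7",  # must match the name the local server registered
    messages=[{"role": "user", "content": "Write a haiku about MoE routing."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```
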
Apr 11, 2026 · Artificial Intelligence

Mastering SGLang: KV Cache and RadixAttention for Faster LLM Inference

This article reviews the DeepLearning.ai short course on SGLang, explains why large‑language‑model inference is slow, details how the KV cache cuts generation‑time computation from O(n²) to O(n), introduces RadixAttention for cross‑request cache reuse, and presents code examples and benchmark results showing up to a 10× speedup in real‑world RAG scenarios (a conceptual KV‑cache sketch follows this entry).

KV cache · LLM inference · Performance optimization
0 likes · 13 min read
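
A conceptual sketch, not SGLang code and not taken from the course, of the KV-cache idea the article covers: each decoding step appends one new key/value row to a cache instead of recomputing keys and values for the whole prefix, which is where the reduction in recomputation from O(n²) to O(n) comes from. Dimensions and weights are illustrative.

```python
# Conceptual KV-cache sketch: append one K/V row per decoding step
# instead of recomputing keys/values for the entire prefix.
import numpy as np

d = 64                                     # head dimension (illustrative)
rng = np.random.default_rng(0)
W_k, W_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def attend(q, K, V):
    """Attention for a single query over cached keys/values."""
    scores = q @ K.T / np.sqrt(d)          # one dot product per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

K_cache = np.empty((0, d))                 # grows by one row per generated token
V_cache = np.empty((0, d))

for step in range(8):                      # toy decode loop
    x = rng.standard_normal(d)             # hidden state of the newest token
    K_cache = np.vstack([K_cache, x @ W_k])  # append instead of recomputing
    V_cache = np.vstack([V_cache, x @ W_v])  # all previous keys/values
    out = attend(x, K_cache, V_cache)
```
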
Apr 10, 2026 · Artificial Intelligence

How a 9B‑Parameter Qwen3.5 Model Achieves Fully Automated Data Analysis on a Consumer GPU

The open‑source CoPaw‑Flash‑9B‑DataAnalyst‑LoRA model, fine‑tuned via LoRA, can autonomously load, explore, statistically analyze, visualize, and generate structured reports from CSV/Excel/JSON datasets, achieving a 90% success rate at an average of 26 iterations, and it runs on a single consumer‑grade GPU using vLLM and the Data Analyst framework.

Agent · Data Analyst · GPU
0 likes · 10 min read
Apr 9, 2026 · Artificial Intelligence

2026: The Real Turning Point for AI Coding Agents – Harness Explained

In 2026, the decisive factor for AI coding agents shifts from model size to the quality of their harness: experiments show that redesigning the edit tool can boost success rates tenfold, while a growing open‑source harness ecosystem and Anthropic's managed agents illustrate the emerging competitive landscape.

AI agents · Harness · benchmark
0 likes · 17 min read
Apr 7, 2026 · Industry Insights

Why OpenAI Has Lost Its Mission: A Deep Dive into Recent Decisions

The article analyzes OpenAI's recent strategic shifts, contrasting its declining product focus and safety commitments with Anthropic's focused growth, using revenue data, internal memos, and industry reports to argue that the company is now driven by deal‑making rather than its original AI mission.

AI · Anthropic · OpenAI
0 likes · 16 min read