Tagged articles

llama.cpp

26 articles

Jun 19, 2026 · Artificial Intelligence

Gemma‑4‑12B‑v2 (Fable 5 Clone) Achieves 3.5× Telecom Benchmark Boost

The author reproduces Anthropic’s Fable 5 with Gemma‑4‑12B‑v2 and reports a 3.5× improvement on the telecom tau2‑bench over the base model. The article details the agentic, coding, and general training data, compares quantization sizes, provides llama.cpp launch commands, and notes both the speed gains from speculative MTP decoding and the current limitations.

Agentic AI · Fable 5 · Gemma-4-12B

0 likes · 9 min read

Old Zhang's AI Learning

Jun 14, 2026 · Artificial Intelligence

How Unsloth Packs Google’s DiffusionGemma into 18 GB and Achieves 2000+ Tokens/s on a Single GPU

Unsloth quantizes Google’s DiffusionGemma into five GGUF variants, the smallest fitting a 24 GB GPU, adds a dedicated llama‑diffusion‑cli, and demonstrates over 2000 tokens per second on an RTX 6000, while outlining usage steps, model‑size trade‑offs, and limitations.

DiffusionGemma · GGUF · GPU

0 likes · 11 min read

Old Zhang's AI Learning

Jun 2, 2026 · Artificial Intelligence

Turn Local LLMs into Actionable Agents – Unsloth Opens the MCP Path

Unsloth now lets locally‑run large language models act as real agents by exposing a Model Context Protocol (MCP) interface through a no‑code Studio UI or a llama.cpp + mcp‑cli command line, supporting tool calling, file access, web search, and multi‑model connections with detailed setup steps, hardware guidance, and security cautions.

AI agents · MCP · Model Context Protocol

0 likes · 17 min read

Old Zhang's AI Learning

May 24, 2026 · Artificial Intelligence

LM Studio Adds MTP Support, Boosting Qwen3.6‑35B to ~130 Tokens/s

LM Studio 0.4.14+ now implements Multi‑Token Prediction (MTP) speculative decoding, which removes the need for a separate draft model and roughly doubles token throughput; for example, Qwen3.6‑35B reaches about 130 tokens/s on an RTX 3090. The article also provides a six‑step activation guide and a list of known pitfalls.

LM Studio · MTP · Qwen3.6

0 likes · 6 min read

Old Zhang's AI Learning

May 14, 2026 · Artificial Intelligence

Boost Qwen3.6 with MTP: 1.5× Faster Local Deployment for Claude Code

The article explains how to enable Multi‑Token Prediction (MTP) in Qwen3.6 using a specific llama.cpp PR, achieving up to 1.5× faster local inference, details compilation steps, optimal parameters, memory requirements, and how to integrate the accelerated model with Claude Code while avoiding common pitfalls.

Claude Code · LLM Acceleration · MTP

0 likes · 11 min read

Old Zhang's AI Learning

May 12, 2026 · Artificial Intelligence

How Unsloth’s MTP Boosts Qwen3.6 Inference Speed on Consumer GPUs

Unsloth adds MTP to Qwen3.6‑27B and 35B‑A3B models, delivering 1.5‑2× decoding speed gains on consumer‑grade GPUs, with ~80% draft acceptance, while providing installation steps, usage parameters, benchmark results, and guidance on suitable scenarios.

GGUF · GPU · MTP

0 likes · 9 min read

Lao Guo's Learning Space

May 12, 2026 · Artificial Intelligence

Which Inference Framework Maximizes Your GPU Performance in 2026?

This article compares six popular LLM inference frameworks—vLLM, TensorRT‑LLM, llama.cpp, ds4.c, Ollama, and Omlx—across performance, ease of use, and hardware compatibility, then provides a practical matrix to help users select the best fit for their GPU.

Apple Silicon · GPU performance · LLM Inference

0 likes · 10 min read

Geek Labs

May 7, 2026 · Artificial Intelligence

Running Large Language Models Locally on RTX 3090: Two Open‑Source Solutions

This article introduces two recent GitHub projects—club‑3090, which enables single‑ or dual‑RTX 3090 inference of 27‑billion‑parameter models with detailed performance benchmarks, and library‑skills, a tool that keeps AI agents synchronized with the latest official library APIs—explaining their configurations, usage steps, hardware requirements, and target audiences.

AI agents · Docker · RTX 3090

0 likes · 7 min read

Old Zhang's AI Learning

Apr 26, 2026 · Artificial Intelligence

Distilling Claude Opus into Qwen3.6-27B – GGUF Lets You Run Locally on Consumer GPUs

The preview model Qwopus3.6-27B‑v1 is distilled from Claude Opus onto Qwen3.6‑27B via SFT with the Unsloth stack and a curated set of 12K high‑quality reasoning samples. It is evaluated on agentic reasoning, front‑end design, and Canvas/WebGL tasks with an RTX 5090, and can be deployed locally via llama.cpp GGUF quantizations with detailed memory guidelines.

Apache 2.0 · Claude Opus · GGUF

0 likes · 7 min read

DevOps Coach

Apr 23, 2026 · Artificial Intelligence

Can Gemma 4 on a MacBook Pro or NVIDIA Blackwell Replace Cloud LLMs? A Hands‑On Performance Study

The author benchmarks Gemma 4 locally on a 24 GB M4 Pro MacBook Pro (llama.cpp) and on a Dell GB10 with an NVIDIA Blackwell GPU (Ollama), comparing token speed, tool‑call reliability, and task completion against cloud GPT‑5.4. The Mac generates tokens faster, but the Blackwell system achieves higher first‑pass success with fewer retries, and the jump from Gemma 3 to Gemma 4 dramatically improves agentic coding viability.

Gemma 4 · MacBook Pro · NVIDIA Blackwell

0 likes · 15 min read

AI Algorithm Path

Apr 21, 2026 · Artificial Intelligence

Run Claude Code Locally or in the Cloud in 5 Minutes with Ollama, LM Studio, llama.cpp, and OpenRouter

This guide shows how to configure Claude Code to run on local or cloud models within five minutes, covering hardware requirements, recommended models, step‑by‑step installation for Ollama, llama.cpp, LM Studio, and cloud‑based options, plus performance and cost comparisons.

AI model deployment · Claude Code · LM Studio

0 likes · 12 min read

Lao Guo's Learning Space

Apr 19, 2026 · Artificial Intelligence

Which Framework Wins for Running Large Models? vLLM vs llama.cpp vs MLX (2026 Deep Comparison)

The article provides a 2026 deep comparative analysis of three major large‑model inference frameworks—vLLM, llama.cpp, and MLX—detailing their core designs, recent updates, benchmark results on various hardware, deployment complexity, and recommended use cases to help developers choose the right tool.

MLX · benchmark · framework comparison

0 likes · 15 min read

TonyBai

Apr 18, 2026 · Industry Insights

Why Ollama Fell From Open‑Source Hero to Community Villain

The article revisits Ollama’s rise as a user‑friendly local LLM runner, then details the community backlash over its omission of llama.cpp credit, the introduction of a private model format, performance regressions, and a VC‑driven commercialization pattern, while presenting open‑source alternatives.

Ollama · VC trap · community backlash

0 likes · 9 min read

Old Zhang's AI Learning

Apr 12, 2026 · Artificial Intelligence

How to Deploy MiniMax-M2.7 Quantized Models Locally on macOS and Linux

This guide explains the 22 GGUF quantized versions of MiniMax-M2.7 released by Unsloth, compares their accuracy and size, recommends the UD‑Q4_K_XL model for best quality‑to‑size trade‑off, and provides step‑by‑step instructions for local deployment via Unsloth Studio, llama.cpp, API server, or the MLX native solution, along with important pitfalls and performance‑tuning tips.

Dynamic 2.0 · MLX · MiniMax M2.7

0 likes · 14 min read

Old Zhang's AI Learning

Apr 4, 2026 · Artificial Intelligence

Deploy Gemma 4 Locally: Ollama, llama.cpp, MLX, vLLM + TurboQuant Optimization

The article reviews the four Gemma 4 model variants, analyzes their architecture and benchmark results versus Qwen3.5, and provides step‑by‑step instructions for local deployment using Ollama, llama.cpp, MLX and vLLM, while highlighting TurboQuant memory and weight compression techniques.

AI benchmarking · Gemma 4 · MLX

0 likes · 15 min read

Old Zhang's AI Learning

Mar 28, 2026 · Artificial Intelligence

vLLM, llama.cpp, and MLX Embrace Google’s TurboQuant: 8× Memory Savings for Local LLMs

The article reviews how the leading LLM inference frameworks—oMLX, mlx‑vlm, llama.cpp, and vLLM—are integrating Google’s TurboQuant compression, showing up to 79% KV‑cache memory reduction, near‑full‑precision decoding speed, and detailed integration steps for each project.

KV cache · LLM Inference · TurboQuant

0 likes · 8 min read

Old Zhang's AI Learning

Mar 20, 2026 · Artificial Intelligence

Auto‑Detect Which LLMs Your PC Can Run and Launch a Coding Agent

This article shows how the HF‑agent plugin uses llmfit to analyze your hardware, recommends runnable large language models, starts a llama.cpp server, and automatically launches the Pi coding agent, with step‑by‑step commands and a real‑world test on an M2 MacBook Air.

HF-agent · coding agent · llama.cpp

0 likes · 5 min read

Old Zhang's AI Learning

Mar 18, 2026 · Artificial Intelligence

Running Claude‑Opus‑4.6‑Distilled Qwen3.5 27B on a Single RTX 4090 with llama.cpp: 46 tokens/s Performance

The article details a hands‑on test of the Claude‑Opus‑4.6‑distilled Qwen3.5 27B model running on a single RTX 4090 via llama.cpp, showing a steady 46 tokens per second generation speed, a 64K context window, and a step‑by‑step Docker‑based setup while comparing it to GLM‑4.7‑Flash‑AWQ‑4bit and discussing llama.cpp’s limitations for multi‑GPU inference.

Claude Opus · Docker · LLM Inference

0 likes · 5 min read

AI Engineering

Mar 11, 2026 · Artificial Intelligence

Run Claude Code Locally with Qwen 3.5 to Skip Anthropic API Costs

This guide shows how to replace Anthropic's API by running a local Qwen 3.5 model with llama.cpp, configuring Claude Code via ANTHROPIC_BASE_URL, and includes hardware checks, build steps, model download, server launch, speed‑fix tips, and usage instructions for secure, cost‑free development.

Anthropic API · Claude Code · GPU Acceleration

0 likes · 8 min read

Old Zhang's AI Learning

Mar 4, 2026 · Artificial Intelligence

How to Turn Thinking Mode On or Off for Qwen3.5 Models in Ollama, LM Studio, llama.cpp, and vLLM

This guide shows step‑by‑step how to enable or disable the thinking mode of Qwen3.5 series large language models across Ollama, LM Studio (GGUF and MLX), llama.cpp, and vLLM/SGLang using command‑line flags, custom model YAML files, and API parameters.

LM Studio · Ollama · Qwen3.5

0 likes · 4 min read

Old Zhang's AI Learning

Mar 3, 2026 · Artificial Intelligence

How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.

AI models · GGUF · Qwen3.5

0 likes · 19 min read

Old Zhang's AI Learning

Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

Large Language Model · MoE · Quantization

0 likes · 14 min read

Old Zhang's AI Learning

Feb 17, 2026 · Artificial Intelligence

Running Qwen3.5 Locally: Step‑by‑Step Guide with Unsloth Dynamic Quantization

This article explains how to run the 397B Qwen3.5 model on a Mac by using Unsloth Dynamic 2.0 quantization (2‑bit, 3‑bit, or 4‑bit), outlines hardware requirements, provides compilation and download commands for llama.cpp, shows how to launch inference in thinking and non‑thinking modes, and compares several deployment options such as llama‑server, Transformers, SGLang/vLLM, and MLX.

Dynamic Quantization · GGUF · LLM deployment

0 likes · 14 min read

Old Zhang's AI Learning

Feb 5, 2026 · Artificial Intelligence

Distilling GLM‑4.7‑Flash with Claude‑Opus‑4.5 for Easy Consumer‑GPU Deployment

The article explains how TeichAI used Claude‑Opus‑4.5 to generate a high‑quality 250‑sample reasoning dataset and distill the GLM‑4.7‑Flash model into a compact GGUF version that runs on a single consumer‑grade GPU via llama.cpp, detailing the workflow, quantization options, and practical considerations.

AI datasets · GGUF · Unsloth

0 likes · 6 min read

Architect's Alchemy Furnace

May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article compares seven popular large‑language‑model inference engines, including Transformers, vLLM, Llama.cpp, SGLang, MLX, and Ollama, detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use cases, and adds practical installation guidance for Xinference.

LLM · MLX · SGLang

0 likes · 17 min read

System Architect Go

Oct 15, 2024 · Artificial Intelligence

Overview of Ollama: Architecture, Storage Structure, and Dialogue Process

This article provides a comprehensive overview of Ollama, a lightweight tool for running large language models, detailing its client‑server architecture, local storage layout, and the step‑by‑step workflow of user interactions with the model.

AI tools · LLM · Ollama

0 likes · 7 min read
