Old Zhang's AI Learning
Apr 25, 2026 · Artificial Intelligence

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each), covering model download, Docker image preparation, launch-script tweaks, and memory savings via FP8 quantization and expert parallelism, and reporting the observed concurrency limits and tokens-per-second throughput, including a test with the model's thinking mode disabled.

DeepSeek V4 · Docker · FP8 quantization
0 likes · 6 min read
Alibaba Cloud Observability
Jun 16, 2025 · Artificial Intelligence

Mastering AI Application Observability: From Metrics to Full‑Stack Tracing

This article explains why cost and performance are critical in the AI era, outlines the three main pain points of AI application development, and details a full-stack observability solution integrated into Alibaba Cloud CloudMonitor 2.0, covering architecture layers, key metrics such as TTFT (time to first token) and TPOT (time per output token), OpenTelemetry tracing, and practical tips for frameworks such as Dify.

AI Observability · AI application monitoring · LLM performance
0 likes · 21 min read
Architect's Alchemy Furnace
Mar 31, 2025 · Artificial Intelligence

Which Model Quantization Wins? Deep Dive into q4_0, q5_K_M, and q8_0

An in-depth technical analysis compares the popular model quantization schemes q4_0, q5_K_M, and q8_0, detailing their precision trade-offs, memory savings, inference speed, hardware compatibility, and ideal use cases, complemented by performance benchmarks on Llama-3-8B and practical selection guidelines.

AI Optimization · Inference Speed · LLM performance
0 likes · 7 min read
Baidu Intelligent Cloud Tech Hub
Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large-model inference engines, key performance metrics such as TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK optimizations, including multi-process scheduling, static slots, and asynchronous execution, which sharply reduce token-interval latency and increase GPU utilization.

AIAK · GPU utilization · LLM performance
0 likes · 10 min read