Old Zhang's AI Learning
Apr 25, 2026 · Artificial Intelligence

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each), covering model download, Docker image preparation, launch-script tweaks, and memory savings via FP8 quantization and expert parallelism, and reporting the observed concurrency limits and tokens-per-second throughput, including a test with the model's thinking mode disabled.

DeepSeek V4 · Docker · FP8 quantization
0 likes · 6 min read
Alibaba Cloud Observability
Jun 16, 2025 · Artificial Intelligence

Mastering AI Application Observability: From Metrics to Full‑Stack Tracing

This article explains why cost and performance are critical in the AI era, outlines the three main pain points of AI application development, and details a full-stack observability solution integrated into Alibaba Cloud CloudMonitor 2.0, covering architecture layers, key metrics such as TTFT (time to first token) and TPOT (time per output token), OpenTelemetry tracing, and practical tips for frameworks such as Dify.

AI Observability · AI application monitoring · LLM performance
0 likes · 21 min read
Architect's Alchemy Furnace
Mar 31, 2025 · Artificial Intelligence

Which Model Quantization Wins? Deep Dive into q4_0, q5_K_M, and q8_0

An in-depth technical analysis compares the popular model quantization schemes q4_0, q5_K_M, and q8_0, detailing their precision trade-offs, memory savings, inference speed, hardware compatibility, and ideal use cases, complemented by performance benchmarks on Llama-3-8B and practical selection guidelines.

AI Optimization · Inference Speed · LLM performance
0 likes · 7 min read
Baidu Intelligent Cloud Tech Hub
Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large-model inference engines, key performance metrics such as TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK optimizations, including multi-process scheduling, static slots, and asynchronous execution, which sharply reduce token-interval latency and increase GPU utilization.

AIAK · GPU utilization · LLM performance
0 likes · 10 min read