Tagged articles

GPU deployment

7 articles · Page 1 of 1

Apr 25, 2026 · Artificial Intelligence

Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each). It covers model download, Docker image preparation, launch script tweaks, and reducing memory use via FP8 and expert parallelism, then reports the observed concurrency limits and tokens‑per‑second speeds, including a test with the model's thinking mode disabled.

DeepSeek-V4 · Docker · FP8 quantization

0 likes · 6 min read

Ops Development Stories

Jun 15, 2025 · Artificial Intelligence

How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide

This article walks through deploying vLLM, a high‑performance LLM inference framework, covering installation of the GPU and CPU backends, environment setup, offline and online serving, API usage, and a performance comparison that highlights a ten‑fold speed advantage for GPU over CPU.

CPU deployment · GPU deployment · LLM Inference

0 likes · 38 min read

Baidu Intelligent Cloud Tech Hub

Mar 7, 2025 · Artificial Intelligence

Deploy DeepSeek R1 with Prefill‑Decode Separation on Baidu Baige

This guide explains how to set up Baidu Baige's PD‑separated deployment for the DeepSeek R1 large language model, covering resource preparation, data acquisition, Prefill and Decode service configuration, and API invocation to achieve lower latency and higher throughput.

Baidu Baige · DeepSeek · GPU deployment

0 likes · 7 min read

Meituan Technology Team

Mar 6, 2025 · Artificial Intelligence

INT8 Quantization and Inference Optimization of DeepSeek R1 Model

Meituan’s search and recommendation team converted the FP8‑only DeepSeek‑R1 model to INT8 by first casting the weights to BF16 and then applying block‑wise or channel‑wise quantization. The result preserves GSM8K and MMLU accuracy while delivering 33% to 50% higher throughput on A100‑80G GPUs. The team publicly released the SGLang‑based inference scripts and quantized weights, enabling deployment on older NVIDIA hardware.

DeepSeek-R1 · GPU deployment · INT8 Quantization

0 likes · 11 min read

Tencent Cloud Developer

Apr 20, 2023 · Artificial Intelligence

Master Stable Diffusion: From Hardware Setup to Advanced Prompt Engineering

This comprehensive guide walks you through the hardware requirements, environment deployment, key parameters, prompt techniques, ControlNet integration, model download and installation, as well as style and character training for Stable Diffusion, providing practical code snippets and visual examples for each step.

AI image generation · ControlNet · GPU deployment

0 likes · 38 min read

DataFunSummit

Apr 18, 2023 · Artificial Intelligence

Best Practices for Deploying Speech AI on GPUs with Triton and TensorRT

This article presents comprehensive best‑practice guidelines for deploying conversational speech AI—including ASR and TTS pipelines—on GPU servers using NVIDIA Triton Inference Server and TensorRT, covering workflow overview, performance optimizations, streaming inference, and real‑world deployment tips.

ASR · Conversational AI · GPU deployment

0 likes · 14 min read

Airbnb Technology Team

Nov 11, 2021 · Artificial Intelligence

Airbnb’s Task‑Oriented Dialogue System for Mutual Cancellation: Architecture, Data Collection, Modeling, and Deployment

Airbnb’s ATIS task‑oriented dialogue system for Mutual Cancellation combines hierarchical domain classification, Q&A‑style intent annotation, large‑scale RoBERTa pre‑training with multilingual fine‑tuning, multi‑turn context handling, GPU‑accelerated inference, and contextual‑bandit reinforcement learning to deliver a scalable, efficient customer‑support solution.

AI · GPU deployment · multilingual

0 likes · 22 min read
