Tagged articles

latency optimization

9 articles

May 26, 2026 · Artificial Intelligence

How Prompt Caching Works in LLMs and How to Write More Efficient Prompts

The article explains that LLM prompt caching reuses internal KV states rather than full answers, compares provider implementations, quantifies cost and latency savings, and provides concrete guidelines for structuring prompts to maximize cache hits, along with monitoring signals and a practical evaluation checklist.

AI inference · LLM · Prompt Engineering

13 min read

Lobster Programming

May 11, 2026 · Backend Development

Designing Effective Ad Mixing in Short‑Video Feed Streams

The article examines common pitfalls of naïve ad insertion in short‑video feeds, explains how cursor‑based pagination prevents duplicate ads, and outlines a client‑side mixing architecture that isolates services, meets strict latency requirements, and ensures accurate ad billing.

Cursor Pagination · ad mixing · backend design

4 min read

Tencent Advertising Technology

Jul 17, 2025 · Artificial Intelligence

LEADRE: Knowledge‑Enhanced LLMs Supercharge Display Ad Recommendations

The paper introduces LEADRE, a multi‑faceted knowledge‑enhanced large language model‑driven display advertisement recommender that tackles user interest modeling, knowledge alignment, and low‑latency deployment, achieving significant GMV gains in Tencent’s ad platforms through innovative prompt engineering, semantic alignment, and TensorRT‑accelerated inference.

Knowledge Alignment · LLM · Prompt Engineering

16 min read

Bilibili Tech

Apr 29, 2025 · Cloud Computing

Bilibili Live Streaming Technology for the Spring Festival Gala: Experience Enhancement and Interactive Features

Bilibili's R&D built a cloud-based broadcast console for the 2024 CCTV Spring Festival Gala, delivering 4K HDR streaming, AI SDR-to-HDR conversion, low latency, bandwidth‑aware transcoding, and a synchronized “send bullet screen” interactive feature using custom SEI timestamps for hundreds of millions of viewers.

HDR · Live Streaming · SEI

15 min read

DeWu Technology

Apr 14, 2023 · Backend Development

Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level

Async‑fork shifts the costly page‑table copying from Redis’s parent process to its child, allowing the parent to resume handling queries instantly and cutting snapshot‑induced latency spikes by over 98%, thereby dramatically improving tail latency during AOF rewrites, RDB backups, and master‑slave synchronizations.

Async-fork · Backend Development · Page Table

21 min read

OPPO Kernel Craftsman

Jul 1, 2022 · Operations

Linux Kernel Performance Profiling: A Comprehensive Guide to On-CPU and Off-CPU Analysis

This guide explains Linux kernel performance profiling, both on‑CPU and off‑CPU. It stresses the need to target the critical 3% of code and covers throughput, latency, and power metrics, scalability laws, flame‑graph visualizations, perf and eBPF tools, lock‑contention analysis, and further reading recommendations.

Linux kernel · Throughput · eBPF

27 min read

HaoDF Tech Team

Nov 8, 2021 · Operations

Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop

This talk recounts how the Good Doctor platform tackled severe online incidents by launching the DOA project, then a service risk governance initiative that identifies, quantifies, and mitigates latency‑related risks through metrics‑driven development, dependency analysis, middleware reliability, and a dedicated risk‑management platform.

Microservices · Risk Governance · SRE

16 min read

vivo Internet Technology

Oct 27, 2021 · Backend Development

JVM Garbage Collection Tuning for a Video Service to Reduce P99 Latency

By replacing the default Parallel GC with a ParNew‑CMS collector, enlarging the Young generation, fixing Metaspace settings, and tuning CMS occupancy thresholds, the video service cut Young and Full GC pauses dramatically, lowered Full GC count by over 80%, and achieved more than 30% P99 latency reduction, with some APIs improving up to 80%.

CMS · Garbage Collection · JVM

16 min read

iQIYI Technical Product Team

Nov 27, 2020 · Artificial Intelligence

Optimizing TensorFlow Serving Model Hot‑Update to Eliminate Latency Spikes in CTR Recommendation Systems

By adding model warm‑up files, separating load/unload threads, switching to the Jemalloc allocator, and isolating TensorFlow’s parameter memory from RPC request buffers, iQIYI’s engineers reduced TensorFlow Serving hot‑update latency spikes in high‑throughput CTR recommendation services from over 120 ms to about 2 ms, eliminating jitter.

Model Hot Update · TensorFlow Serving · Warmup

11 min read
