
Performance Optimization Practices in Vivo Push Recommendation Service

This article details the performance tuning of Vivo's Java-based push recommendation service, covering background challenges, measurement metrics, hotspot code and JVM GC optimizations, custom split utilities, map key improvements, off‑heap caching, and the resulting throughput and latency gains.


The article introduces a performance optimization case study for Vivo's push recommendation service, a CPU‑intensive Java backend that processes user events from Kafka and selects articles for delivery. As traffic grew, the service faced severe throughput bottlenecks and Kafka backlog, prompting a systematic performance improvement effort.

Performance is measured by throughput (TPS), defined as concurrent requests divided by average response time (RT). Since CPU utilization was already above 80%, the team focused on reducing RT rather than increasing concurrency.
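To make the formula concrete, here is a small sketch with hypothetical numbers (not the article's production figures). It shows why reducing RT was the right lever: at a fixed concurrency, halving RT doubles TPS.

```java
public class TpsExample {
    /** TPS = concurrent requests / average response time (in seconds). */
    static double tps(int concurrency, double avgRtMillis) {
        return concurrency / (avgRtMillis / 1000.0);
    }

    public static void main(String[] args) {
        // With CPU already above 80%, raising concurrency is not an option,
        // but halving RT doubles throughput at the same concurrency.
        System.out.println(tps(100, 50)); // 2000.0 TPS
        System.out.println(tps(100, 25)); // 4000.0 TPS
    }
}
```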

Hotspot Code Optimization

Using the Arthas tool, a flame graph revealed that java.lang.String.split consumed about 13% of CPU time. The team identified three inefficiencies: (1) split uses regular expressions for multi‑character delimiters, which is slow; (2) for single‑character delimiters, internal conversions and list‑to‑array transformations add overhead; (3) many calls only need the first token, yet the code always splits the whole string.

To address these, a custom SplitUtils class was created, providing splitFirst (returns only the first token) and split (returns a List&lt;String&gt; without intermediate array conversion). The implementation avoids regex, uses indexOf and substring, and eliminates unnecessary list-to-array conversions.

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
/**
 * Custom split utility
 */
public class SplitUtils {
    /** Return the first part before the delimiter */
    public static String splitFirst(final String str, final String delim) {
        if (str == null || StringUtils.isEmpty(delim)) {
            return str;
        }
        int index = str.indexOf(delim);
        if (index < 0) return str;
        if (index == 0) return "";
        return str.substring(0, index);
    }

    /** Split the whole string into a list */
    public static List<String> split(String str, final String delim) {
        if (str == null) return new ArrayList<>(0);
        if (StringUtils.isEmpty(delim)) {
            List<String> result = new ArrayList<>(1);
            result.add(str);
            return result;
        }
        List<String> stringList = new ArrayList<>();
        while (true) {
            int index = str.indexOf(delim);
            if (index < 0) {
                stringList.add(str);
                break;
            }
            stringList.add(str.substring(0, index));
            str = str.substring(index + delim.length());
        }
        return stringList;
    }
}
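The behavioral difference is easy to demonstrate. The snippet below restates the splitFirst logic from above without the commons-lang3 dependency so it stands alone, and contrasts it with the JDK's regex-based split on a hypothetical feature-record string:

```java
import java.util.Arrays;
import java.util.List;

public class SplitUtilsDemo {
    // Minimal restatement of splitFirst above, without commons-lang3,
    // so this snippet is self-contained.
    static String splitFirst(String str, String delim) {
        if (str == null || delim == null || delim.isEmpty()) {
            return str;
        }
        int index = str.indexOf(delim);
        return index < 0 ? str : str.substring(0, index);
    }

    public static void main(String[] args) {
        String record = "article_123|sports|0.87"; // hypothetical record format

        // The JDK's split compiles a regex and materializes every token...
        List<String> all = Arrays.asList(record.split("\\|"));
        System.out.println(all); // [article_123, sports, 0.87]

        // ...while splitFirst stops scanning at the first delimiter.
        System.out.println(splitFirst(record, "|"));     // article_123
        System.out.println(splitFirst("no-delim", "|")); // no-delim
    }
}
```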

Micro‑benchmarks using JMH showed the custom implementation improves split performance by ~50% for multi‑character delimiters and up to 2‑5× for single‑character delimiters when only the first token is needed.

Map Lookup Optimization

The flame graph also highlighted HashMap.getOrDefault consuming ~20% of CPU due to large feature‑weight maps (over 10 million entries) and long string keys (average length >20). To reduce key comparison cost, the team switched keys from strings to long values using a custom hash, decreasing collision probability and speeding up lookups.
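The article does not show the hashing code, so the sketch below illustrates the idea with a hypothetical 64-bit FNV-1a hash and a plain HashMap&lt;Long, Float&gt;; in practice a primitive-keyed map (e.g., from fastutil) would additionally avoid boxing:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LongKeyWeights {
    // Hypothetical 64-bit FNV-1a hash; the article only says keys
    // became long values via a custom hash, not which hash was used.
    static long fnv1a64(String key) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        Map<Long, Float> weights = new HashMap<>();
        // Hashing once at load time replaces repeated equality checks on
        // long string keys with a single long comparison per lookup.
        weights.put(fnv1a64("user_ctr_x_article_category_sports"), 0.42f);

        float w = weights.getOrDefault(
                fnv1a64("user_ctr_x_article_category_sports"), 0.0f);
        System.out.println(w); // 0.42
    }
}
```

A 64-bit hash over ~10 million keys keeps the collision probability negligible, which is why the team could drop the string keys entirely rather than keeping them for verification.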

JVM GC Optimization

GC overhead was significant on a 64-core, 256 GB machine, with young-generation GC (YGC) accumulating about 10 seconds of pause time per minute. Two major heap objects were identified: a local cache (Caffeine) and the weight map. To alleviate heap pressure, the cache was moved off-heap using the open-source Off-Heap Cache (OHC) library.

OHC stores data outside the Java heap, keeping only minimal metadata on‑heap, thus avoiding GC impact. Example usage:

OHCache<Key, Value> ohCache = OHCacheBuilder.<Key, Value>newBuilder()
        .keySerializer(yourKeySerializer)
        .valueSerializer(yourValueSerializer)
        .capacity(12L * 1024 * 1024 * 1024) // 12 GB off-heap, per the service config
        .segmentCount(1024)
        .build();

Configuration in the service set a 12 GB capacity with 1024 segments and Kryo serialization. After migration, YGC time dropped to ~800 ms per minute, and overall throughput increased by ~20%.

Finally, the weight map itself was moved out of the JVM heap by re‑implementing the inference engine in C++ and exposing it via a native .so library, further reducing GC pressure.

Results

Combined optimizations (hotspot code, map key conversion, off‑heap caching, native weight map) reduced end‑to‑end latency by 31.77%, increased throughput by 45.24%, and halved the split method’s CPU share. The service now handles roughly double the previous load.

The article concludes that performance tuning is an ongoing process and encourages readers to adopt a systematic approach: identify bottlenecks, apply targeted optimizations, and validate gains with micro‑benchmarks and production metrics.

Tags: backend, Java, JVM, Performance Optimization, GC, Microbenchmark, Off-Heap Cache
Written by

High Availability Architecture

Official account for High Availability Architecture.
