How OSP Proxy Integration Supercharges Janus Throughput—and What Causes the Memory Leak

This report details how integrating OSP Proxy as an SDK into the Netty‑based Janus gateway dramatically improves throughput yet introduces a memory leak triggered by payloads of roughly 100 KB, analyzes the root causes, and evaluates mitigations such as connection and worker‑thread tuning.

Vipshop Quality Engineering

Introduction

To improve Janus forwarding performance, we integrated OSP Proxy as an SDK. Initial tests showed that when response data exceeds 100 KB, the Netty threads on the OSP Service side cannot keep up, leading to data backlog and off‑heap memory overflow.
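The backlog mechanism can be illustrated with a minimal, self‑contained sketch (plain Java, no Netty; all names and numbers are illustrative): a producer enqueues ~100 KB response chunks faster than a slow consumer drains them, so the buffered bytes grow without bound, which is what eventually exhausts off‑heap memory in the real gateway.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model of the backlog: reads arrive faster than writes drain.
public class BacklogSketch {
    public static void main(String[] args) {
        Deque<byte[]> pending = new ArrayDeque<>(); // stands in for the outbound buffer
        long buffered = 0;
        int chunk = 100 * 1024;      // ~100 KB response chunks
        int producedPerTick = 4;     // upstream reads per scheduling tick
        int drainedPerTick = 1;      // slow consumer: writes flushed per tick

        for (int tick = 0; tick < 100; tick++) {
            for (int i = 0; i < producedPerTick; i++) {
                pending.addLast(new byte[chunk]);
                buffered += chunk;
            }
            for (int i = 0; i < drainedPerTick && !pending.isEmpty(); i++) {
                buffered -= pending.removeFirst().length;
            }
        }
        // Net growth of 3 chunks per tick: the backlog keeps climbing.
        System.out.println("buffered bytes after 100 ticks: " + buffered);
    }
}
```

Because production never catches up to consumption, no steady state exists; the real fix must slow the producer (backpressure) or speed the consumer, not just enlarge the buffer.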

Background

Janus is a Netty‑based HTTP gateway in the Venus system, providing unified access to OSP, REST services and governance features. Previously, Janus forwarded requests to an OSP Local Proxy process on the same machine, which then forwarded to OSP Service.

The communication between Janus and OSP Local Proxy relied on a TCP connection, limiting performance.

With help from the OSP team, we replaced the TCP‑based proxy with the OSP comm‑client SDK, allowing Janus Netty worker threads to hand off requests directly to the SDK’s Netty workers.

Performance tests showed up to a 50 % increase in TPS in some scenarios.
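The architectural change can be sketched as follows (plain Java; `Forwarder` and `InProcessSdkForwarder` are hypothetical names, not the real OSP API): the per‑request hop over a loopback TCP connection to the local proxy process is replaced by a direct method call into the in‑process SDK client, which then submits the request to its own Netty workers.

```java
// Hypothetical sketch of the change; none of these types are the real OSP API.
interface Forwarder {
    String forward(String request);
}

// Before: every request was serialized and written over a loopback TCP
// connection to a separate OSP Local Proxy process, which forwarded it on.
// (Real socket I/O omitted to keep the sketch runnable.)

// After: the comm-client SDK lives inside the Janus process, so a Janus
// Netty worker hands the request off with a plain method call.
class InProcessSdkForwarder implements Forwarder {
    @Override
    public String forward(String request) {
        // In the real SDK this would enqueue onto the SDK's own Netty
        // worker threads; here we just echo to keep the sketch runnable.
        return "forwarded:" + request;
    }
}

public class HandoffSketch {
    public static void main(String[] args) {
        Forwarder f = new InProcessSdkForwarder();
        System.out.println(f.forward("GET /service"));
    }
}
```

Removing the serialize‑and‑copy step across the loopback socket is what the reported TPS gain comes from, at the cost of coupling the two Netty thread pools inside one process.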

Bug Localization and Analysis

During testing, payloads around 100 KB caused unstable behavior and occasional off‑heap memory spikes.

Bug Location

Monitoring charts contrasted normal off‑heap memory usage with the abnormal growth seen when the bug occurred.

Increasing concurrency made the bug reproducible.

We observed off‑heap memory overflow; adjusting Netty’s io.netty.leakDetectionLevel did not reveal leaks.

Leak detection levels:

DISABLED – no leak detection.

SIMPLE – samples about 1 % of ByteBufs and reports each detected leak once; this is the default.

ADVANCED – like SIMPLE, but also reports where leaked ByteBufs were recently accessed, at noticeably higher cost.

PARANOID – tracks every ByteBuf; heavy performance cost, suitable only for testing.
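The level is typically set via the JVM system property cited above; a minimal config fragment (the jar name is illustrative):

```shell
# Enable Netty's strictest leak tracking in a test environment.
# PARANOID tracks every ByteBuf and is too expensive for production.
java -Dio.netty.leakDetectionLevel=paranoid -jar janus-gateway.jar
```

Note that leak detection only catches ByteBufs that are garbage‑collected without being released; buffers that are still referenced in a growing backlog, as in this bug, never trigger a leak report, which is consistent with the detector finding nothing here.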

Bug Fix

Increasing the number of connections and Netty worker threads for each backend service can mitigate the issue, but it also raises CPU contention and overall latency. Simulated tests with two OSP Service instances and 512 connections showed about a 20 % TPS drop.

In production, we tune three key parameters case by case, with support from the OSP team.
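The trade‑off can be sketched with simple arithmetic (plain Java; the counts and names are illustrative, not actual OSP parameters): total worker threads grow with the number of backends, and once they substantially exceed the available cores, added connections buy throughput at the cost of context switching and CPU contention.

```java
// Illustrative arithmetic for the thread/connection trade-off; not real OSP config.
public class TuningSketch {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int backends = 2;                // simulated OSP Service instances
        int connectionsPerBackend = 512; // matches the simulated test above
        int workersPerBackend = 8;       // hypothetical SDK worker-thread setting

        int totalWorkers = backends * workersPerBackend;
        int totalConnections = backends * connectionsPerBackend;

        System.out.println("cores=" + cores
                + " workers=" + totalWorkers
                + " connections=" + totalConnections
                + " oversubscription=" + (double) totalWorkers / cores);
        // When oversubscription is well above 1, extra workers mostly add
        // scheduling overhead rather than throughput, consistent with the
        // ~20% TPS drop observed at 512 connections.
    }
}
```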

Conclusion

Thorough performance testing under realistic extreme scenarios and continuous monitoring of both upstream and downstream services are essential for a robust framework.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Backend Development, Netty, Memory Leak, Janus, OSP Proxy
Written by Vipshop Quality Engineering, technology exchange and sharing for quality engineering.