Scaling Netty RPC with Protostuff: 400k Objects/sec per Server

An in‑depth guide demonstrates how to combine Netty with Protostuff for high‑performance RPC, detailing serialization setup, custom encoder/decoder implementation, multi‑threading pitfalls, and stress‑testing results that achieve up to 400,000 objects per second per server in a clustered environment.

ITFLY8 Architecture Home
ITFLY8 Architecture Home
ITFLY8 Architecture Home
Scaling Netty RPC with Protostuff: 400k Objects/sec per Server

Many Netty‑Protostuff RPC demos online follow the same template; the author initially copied one that worked locally without errors.

When deploying to a pre‑release environment and performing stress tests, numerous errors appeared. The test sent 20,000 objects every 100 ms from each of 40 client machines to two Netty servers, resulting in an average of 400,000 objects per server per second, while business logic could only process about 350,000 objects per second.

Protostuff Serialization and Deserialization

The required Maven dependencies are straightforward:

<protostuff.version>1.7.2</protostuff.version>
<dependency>
    <groupId>io.protostuff</groupId>
    <artifactId>protostuff-core</artifactId>
    <version>${protostuff.version}</version>
</dependency>

<dependency>
    <groupId>io.protostuff</groupId>
    <artifactId>protostuff-runtime</artifactId>
    <version>${protostuff.version}</version>
</dependency>

The utility class handles serialization and deserialization while caching schemas to avoid repeated buffer allocation:

public class ProtostuffUtils {
    /** Avoid allocating a new Buffer for each serialization. */
    // private static LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
    /** Cache Schema */
    private static Map<Class<?>, Schema<?>> schemaCache = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public static <T> byte[] serialize(T obj) {
        Class<T> clazz = (Class<T>) obj.getClass();
        Schema<T> schema = getSchema(clazz);
        LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
        byte[] data;
        try {
            data = ProtobufIOUtil.toByteArray(obj, schema, buffer);
        } finally {
            buffer.clear();
        }
        return data;
    }

    /** Deserialize byte array to specified class */
    public static <T> T deserialize(byte[] data, Class<T> clazz) {
        Schema<T> schema = getSchema(clazz);
        T obj = schema.newMessage();
        ProtobufIOUtil.mergeFrom(data, obj, schema);
        return obj;
    }

    @SuppressWarnings("unchecked")
    private static <T> Schema<T> getSchema(Class<T> clazz) {
        Schema<T> schema = (Schema<T>) schemaCache.get(clazz);
        if (Objects.isNull(schema)) {
            schema = RuntimeSchema.getSchema(clazz);
            if (Objects.nonNull(schema)) {
                schemaCache.put(clazz, schema);
            }
        }
        return schema;
    }
}

A common pitfall is using a static buffer; it works in single‑threaded scenarios but causes exceptions under high concurrency because the buffer may be reused before being cleared. The performance gain from reusing the buffer is negligible.

Both calls to ProtostuffIOUtil were replaced with ProtobufIOUtil to avoid previously observed exceptions.

Custom Serialization Method

Decoder implementation:

import com.jd.platform.hotkey.common.model.HotKeyMsg;
import com.jd.platform.hotkey.common.tool.ProtostuffUtils;
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

public class MsgDecoder extends ByteToMessageDecoder {
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        try {
            byte[] body = new byte[in.readableBytes()]; // normal transmission
            in.readBytes(body);
            out.add(ProtostuffUtils.deserialize(body, HotKeyMsg.class));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Encoder implementation:

import com.jd.platform.hotkey.common.model.HotKeyMsg;
import com.jd.platform.hotkey.common.tool.Constant;
import com.jd.platform.hotkey.common.tool.ProtostuffUtils;
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.MessageToByteEncoder;

public class MsgEncoder extends MessageToByteEncoder {
    @Override
    public void encode(ChannelHandlerContext ctx, Object in, ByteBuf out) {
        if (in instanceof HotKeyMsg) {
            byte[] bytes = ProtostuffUtils.serialize(in);
            byte[] delimiter = Constant.DELIMITER.getBytes();
            byte[] total = new byte[bytes.length + delimiter.length];
            System.arraycopy(bytes, 0, total, 0, bytes.length);
            System.arraycopy(delimiter, 0, total, bytes.length, delimiter.length);
            out.writeBytes(total);
        }
    }
}

The decoder reads the raw bytes, deserializes them into HotKeyMsg, and passes the object downstream. To avoid packet‑sticking under high load, a DelimiterBasedFrameDecoder is placed before this decoder, using a custom delimiter string defined in Constant.DELIMITER.

The encoder serializes the object with ProtostuffUtils, appends the same delimiter, and writes the combined byte array to the channel.

Single‑Machine and Cluster Deployments

After implementing the encoder/decoder, a simple client‑server test shows smooth operation, handling tens of thousands of messages per second without errors when the same ProtoBuf utility class is used and buffers are not shared.

When multiple clients connect to a single server, or multiple servers run concurrently, packet‑sticking becomes evident if messages are sent without spacing. Introducing a 100 ms interval between sends or using synchronous writeAndFlush().sync() eliminates the issue.

Stress tests with 40 client machines and 2 server machines demonstrated a stable throughput of approximately 400,000 objects per server per second.

Server architecture diagram:

Client‑server interaction diagram:

Cluster deployment diagram:

Overall, the article records the modifications made to a standard Netty‑Protostuff RPC demo to achieve high‑throughput, stable operation in both single‑machine and clustered environments.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

javaRPCProtostuffserializationNetty
ITFLY8 Architecture Home
Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.