
How to Supercharge Java Backend Performance with CompletableFuture, Thread Pools, Caching, and Lock Tuning

This article analyzes Java performance bottlenecks and demonstrates how to use CompletableFuture for parallelism, fine‑tune ThreadPoolExecutor parameters, minimize transaction scope, apply cache‑line padding, object pooling, lock‑granularity techniques, copy‑on‑write collections, and reduce network payloads to achieve lower latency and higher throughput.

Architect

Oct 8, 2024


CompletableFuture for Parallel Price Queries

When a price‑query flow needs to fetch several independent configuration items (base price, discount price, merchant activity price, platform activity price, etc.), CompletableFuture can run the I/O‑bound calls in parallel. However, creating too many threads can cause contention; the thread‑pool size must match the workload characteristics.

Performance Test

private void sync(){
    long s = System.currentTimeMillis();
    a(10); b(10); c(10); d(10);
    long e = System.currentTimeMillis();
    System.out.println(e - s);
}

private void async(){
    long s = System.currentTimeMillis();
    List<CompletableFuture<?>> list = new ArrayList<>();
    list.add(CompletableFuture.runAsync(() -> a(10)));
    list.add(CompletableFuture.runAsync(() -> b(10)));
    list.add(CompletableFuture.runAsync(() -> c(10)));
    list.add(CompletableFuture.runAsync(() -> d(10)));
    CompletableFuture.allOf(list.toArray(new CompletableFuture[0])).join();
    long e = System.currentTimeMillis();
    System.out.println(e - s);
}

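With a small number of short‑lived tasks, the synchronous version can be faster because the cost of submitting work to a pool and scheduling it outweighs the work itself. Note also that runAsync without an executor argument runs on the shared ForkJoinPool.commonPool(), so these tasks compete with every other user of that pool.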

Hybrid Strategy

Run fast, lightweight tasks synchronously and submit only the relatively heavy tasks to a CompletableFuture pool. This reduces scheduling cost while still gaining parallelism for the bottleneck steps.
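
A minimal sketch of this hybrid approach, assuming two cheap lookups and two slow ones. The method names, return values, and the 50 ms pauses are placeholders invented for illustration, not the real price services. The slow calls are submitted first so that they overlap with the synchronous work on the caller thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class HybridPriceQuery {
    // Hypothetical cheap lookups: run directly on the caller thread.
    static int basePrice() { return 100; }
    static int discount() { return 10; }

    // Hypothetical slow lookups: the pause stands in for a remote call.
    static int merchantActivityDiscount() { pause(50); return 5; }
    static int platformActivityDiscount() { pause(50); return 3; }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Dedicated, bounded pool for the heavy calls (sizes are illustrative).
    static ExecutorService newPool() {
        return new ThreadPoolExecutor(2, 2, 1L, TimeUnit.MINUTES, new LinkedBlockingQueue<>(100));
    }

    static int finalPrice(ExecutorService pool) {
        // Submit the heavy tasks first so they run while the light ones execute inline.
        CompletableFuture<Integer> merchant =
                CompletableFuture.supplyAsync(HybridPriceQuery::merchantActivityDiscount, pool);
        CompletableFuture<Integer> platform =
                CompletableFuture.supplyAsync(HybridPriceQuery::platformActivityDiscount, pool);

        // Light tasks: no scheduling overhead.
        int price = basePrice() - discount();

        // Wait only for the two heavy results.
        return price - merchant.join() - platform.join();
    }

    public static void main(String[] args) {
        ExecutorService pool = newPool();
        try {
            System.out.println(finalPrice(pool)); // 100 - 10 - 5 - 3 = 82
        } finally {
            pool.shutdown();
        }
    }
}
```

With this split, total latency is roughly that of the slowest heavy call rather than the sum of all four.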

Minimizing Transaction Scope

Large transaction scopes hold database locks and connections for longer, which increases contention. A programmatic transaction wrapper lets you put only the statements that must be atomic inside the transaction:

public interface TransactionControlService {
    <T> T execute(ObjectLogicFunction<T> logic) throws Exception;
    void execute(VoidLogicFunction logic) throws Exception;
}

@Service
public class TransactionControlServiceImpl implements TransactionControlService {
    @Autowired private PlatformTransactionManager ptm;
    @Autowired private TransactionDefinition td;

    @Override
    public <T> T execute(ObjectLogicFunction<T> fn) throws Exception {
        TransactionStatus ts = ptm.getTransaction(td);
        T r;
        try {
            r = fn.logic();
        } catch (Exception | Error e) {
            // Roll back on any failure of the business logic, including Errors.
            ptm.rollback(ts);
            throw e;
        }
        // Commit outside the try block: if commit itself fails, the transaction is
        // already completed and must not be rolled back a second time.
        ptm.commit(ts);
        return r;
    }

    @Override
    public void execute(VoidLogicFunction fn) throws Exception {
        TransactionStatus ts = ptm.getTransaction(td);
        try {
            fn.logic();
        } catch (Exception | Error e) {
            ptm.rollback(ts);
            throw e;
        }
        ptm.commit(ts);
    }
}

Thread‑Pool Creation and Configuration

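Prefer constructing a ThreadPoolExecutor directly over the factory methods in Executors so that all parameters are explicit. The factories hide risky defaults: newFixedThreadPool uses an unbounded queue, and newCachedThreadPool allows an effectively unbounded number of threads.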

private static final ExecutorService EXECUTOR = new ThreadPoolExecutor(
    2,                     // corePoolSize
    4,                     // maximumPoolSize
    1L, TimeUnit.MINUTES, // keepAliveTime
    new LinkedBlockingQueue<>(100),
    new ThreadFactoryBuilder().setNameFormat("common-pool-%d").build(),
    new ThreadPoolExecutor.CallerRunsPolicy()
);

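corePoolSize : number of threads kept alive even when idle; for CPU‑bound work it is usually set close to the number of CPU cores.

maximumPoolSize : upper bound on threads, larger than core to absorb traffic spikes. Threads above the core size are created only when the work queue is full, so the pool above stays at 2 threads until its queue holds 100 waiting tasks.

keepAliveTime : idle time after which threads above the core size are terminated.

workQueue : its capacity must balance memory consumption and task latency. An unbounded queue never fills, so maximumPoolSize and the rejection policy would never take effect.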

Typical sizing formula for I/O‑bound workloads:

int cpu = Runtime.getRuntime().availableProcessors();
double blockingCoeff = 0.9; // proportion of time spent waiting
int core = (int)(cpu / (1 - blockingCoeff));
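
As a worked example of the formula above: with 8 cores and a blocking coefficient of 0.9, the result is 8 / (1 - 0.9) = 80 threads; with 0.5 it is 16. The helper below is a small sketch of the same calculation; the class and method names are made up for illustration. It rounds instead of truncating, because 1 - 0.9 is not exactly 0.1 in floating point, and it rejects a coefficient of 1.0, which would divide by zero.

```java
class PoolSizing {
    /**
     * Thread count for an I/O-bound pool: cores / (1 - blockingCoeff).
     * blockingCoeff is the fraction of time a task spends waiting, in [0, 1).
     */
    static int ioBoundThreads(int cpuCores, double blockingCoeff) {
        if (cpuCores < 1) {
            throw new IllegalArgumentException("cpuCores must be >= 1");
        }
        if (blockingCoeff < 0.0 || blockingCoeff >= 1.0) {
            throw new IllegalArgumentException("blockingCoeff must be in [0, 1)");
        }
        // Round to the nearest integer so floating-point error cannot drop a thread.
        return (int) Math.round(cpuCores / (1.0 - blockingCoeff));
    }

    public static void main(String[] args) {
        int cpu = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cpu + " threads=" + ioBoundThreads(cpu, 0.9));
    }
}
```

Treat the result as a starting point to validate under load, not as a final setting.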

Monitor the pool with Micrometer/Prometheus (active count, queue size, completed tasks) to adjust parameters at runtime.
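
The sketch below shows the plain ThreadPoolExecutor getters that such metrics are typically built from, plus a runtime resize. It uses only the JDK; wiring the values into Micrometer or Prometheus is left out, and the class and method names are hypothetical. The resize helper orders the two setter calls so that core never exceeds max at any moment, which newer JDKs reject with an IllegalArgumentException.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolStats {
    // One-line snapshot of the values worth exporting as gauges or counters.
    static String snapshot(ThreadPoolExecutor pool) {
        return "poolSize=" + pool.getPoolSize()
                + " active=" + pool.getActiveCount()
                + " queued=" + pool.getQueue().size()
                + " completed=" + pool.getCompletedTaskCount();
    }

    // Adjust the pool at runtime while keeping core <= max at every step.
    static void resize(ThreadPoolExecutor pool, int core, int max) {
        if (core < 0 || max < 1 || core > max) {
            throw new IllegalArgumentException("need 0 <= core <= max and max >= 1");
        }
        if (max >= pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(max); // growing: raise the ceiling first
            pool.setCorePoolSize(core);
        } else {
            pool.setCorePoolSize(core);   // shrinking: lower the floor first
            pool.setMaximumPoolSize(max);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 1L, TimeUnit.MINUTES, new LinkedBlockingQueue<>(100));
        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println(snapshot(pool)); // values while tasks are in flight
        resize(pool, 4, 8);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(snapshot(pool)); // values after all tasks finished
    }
}
```

The getters return approximate values because the pool keeps changing while they are read, which is acceptable for dashboards.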

Cache‑Line Alignment

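CPU caches are organized in lines, typically 64 bytes on current x86 processors. In Java an int[][] is an array of row arrays, so traversing it row by row walks contiguous memory and exploits spatial locality, while traversing it column by column jumps to a different row array on every access and causes many cache misses.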

public class CacheLineDemo {
    public static void main(String[] args) {
        int[][] arr = new int[10000][10000];
        long s = System.currentTimeMillis();
        for (int i = 0; i < arr.length; i++) {
            for (int j = 0; j < arr[i].length; j++) {
                arr[i][j] = 0; // row‑major (fast)
            }
        }
        System.out.println("row: " + (System.currentTimeMillis() - s));

        s = System.currentTimeMillis();
        for (int i = 0; i < arr.length; i++) {
            for (int j = 0; j < arr[i].length; j++) {
                arr[j][i] = 0; // column‑major (slow)
            }
        }
        System.out.println("col: " + (System.currentTimeMillis() - s));
    }
}

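False sharing occurs when two threads write to different variables that happen to sit on the same cache line, so each write invalidates the other core's copy of the line. Padding the hot field so that neighbouring instances land on different lines avoids it:

class Padding {
    // 7 × 8 bytes = 56 bytes of padding; HotSpot generally places superclass fields before subclass fields
    volatile long p1, p2, p3, p4, p5, p6, p7;
}

class Cell extends Padding {
    // 56 bytes of padding + 8 bytes here = 64 bytes of fields. Together with the object
    // header, the x fields of two adjacent Cell instances end up more than 64 bytes apart.
    public volatile long x = 0L;
}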

public class CacheLinePadding {
    static final Cell[] CELLS = {new Cell(), new Cell()};
    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 10_000_000L; i++) CELLS[0].x = i;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 10_000_000L; i++) CELLS[1].x = i;
        });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Each thread performs 10,000,000 writes concurrently, so divide by that count.
        System.out.println("ns per write: " + (System.nanoTime() - start) / 10_000_000.0);
    }
}

Reducing Object Creation

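Prefer primitive types over boxed wrappers for high‑frequency operations. In the Integer loop below, every sum++ unboxes, increments, and boxes again, allocating a new Integer once the value leaves the small cached range (-128 to 127 by default); the int loop allocates nothing. The exact gap depends on the JVM and the JIT compiler, so measure it on your own setup.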

private static void testInt(){
    int sum = 1;
    for (int i = 1; i < 50_000_000; i++) sum++;
    System.out.println(sum);
}

private static void testInteger(){
    Integer sum = 1;
    for (int i = 1; i < 50_000_000; i++) sum++;
    System.out.println(sum);
}

String is immutable, so concatenating inside a loop creates a temporary object at each step; use StringBuilder there instead. Reuse instances via static factories or enum singletons:

public enum EnumSingleton { INSTANCE; }

public class StaticSingleton {
    private StaticSingleton() {}
    private static class Holder { private static final StaticSingleton INSTANCE = new StaticSingleton(); }
    public static StaticSingleton getInstance() { return Holder.INSTANCE; }
}

Concurrent Collections and Lock Granularity

volatile : guarantees visibility but not atomicity.

CAS (compare‑and‑set): lock‑free primitive, used by Atomic* classes.

synchronized : simple monitor lock.

ReentrantReadWriteLock : many readers, exclusive writer.

Segmented locks : used by ConcurrentHashMap up to Java 7 to reduce contention; since Java 8 it uses CAS plus synchronized on individual bins instead.

CopyOnWriteArrayList / CopyOnWriteArraySet : ideal for read‑heavy, write‑light scenarios; writes copy the underlying array.
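
To make the volatile and CAS entries concrete, here is a small sketch of a counter built on a compare-and-set retry loop. AtomicInteger.incrementAndGet() already does this for you; the explicit loop is spelled out only to show the mechanism, and the class name is hypothetical. A plain volatile int with count++ would lose updates here, because the read, add, and write are three separate steps.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Lock-free increment: retry until no other thread changed the value in between.
    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }

    // Runs the given number of threads, each incrementing perThread times.
    static int runConcurrently(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(4, 10_000)); // 4 * 10,000 = 40000
    }
}
```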

Simplified add implementation of CopyOnWriteArrayList in recent JDKs (CopyOnWriteArraySet delegates to the list's addIfAbsent):

public boolean add(E e) {
    synchronized (lock) {
        Object[] elements = getArray();
        int len = elements.length;
        elements = Arrays.copyOf(elements, len + 1);
        elements[len] = e;
        setArray(elements);
        return true;
    }
}

Loop and Batch Optimizations

Replace many single‑row queries with a bulk query and cache repeated look‑ups:

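Map<String, User> userMap = userMapper.queryByIds(userIds); // one bulk query instead of one query per ID
Map<String, Role> roleCache = new HashMap<>();              // caches repeated role look-ups within this call
for (String uid : userIds) {
    User u = userMap.get(uid);
    if (u == null) continue; // the ID may have no matching row
    Role r = roleCache.computeIfAbsent(u.getRoleId(), id -> roleMapper.queryById(id));
    // process u and r
}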

Network Payload Reduction

Select only required fields in SQL (e.g., SELECT id, price FROM product).

Prefer binary protocols such as Protobuf over JSON for large payloads.

Compress payloads with GZIP or ZLIB when size matters.

StringBuilder sb = new StringBuilder();
for (int i = 0; i < 1000; i++) sb.append(i);
byte[] gz = ZipUtil.gzip(sb.toString(), CharsetUtil.UTF_8);
System.out.println("compressed size: " + gz.length);
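
The snippet above relies on ZipUtil and CharsetUtil, which look like Hutool helpers. If you would rather avoid the extra dependency, the following sketch does the same compression with the JDK's java.util.zip classes and adds the matching decompression step. The class and method names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipDemo {
    // Compress a string as UTF-8 bytes with GZIP.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Read the buffer only after the GZIP stream is closed, so the trailer is included.
        return bos.toByteArray();
    }

    // Reverse of gzip(): decompress and decode as UTF-8.
    static String gunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append(i);
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(sb.toString());
        System.out.println("raw size: " + raw.length + ", compressed size: " + gz.length);
    }
}
```

Compression costs CPU on both ends, and very small payloads can even grow because of the GZIP header, so it is worth enabling only above a size threshold.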

Service Dependency Minimization

Avoid circular calls and duplicate remote requests. Strategies include:

Data redundancy (local copy of reference data).

Result caching (Redis, local cache).

Asynchronous messaging (MQ) to decouple producers and consumers.

Thread‑Pool Pre‑warming

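Pre‑create core threads during application startup so that the first requests do not pay the thread‑creation cost. The method belongs to ThreadPoolExecutor, not ExecutorService, so a field declared as ExecutorService needs a cast (or should be declared as ThreadPoolExecutor):

((ThreadPoolExecutor) EXECUTOR).prestartAllCoreThreads();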
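
A short sketch of the effect, using a throwaway pool with illustrative sizes: a new ThreadPoolExecutor starts with zero threads, and prestartAllCoreThreads() returns the number of threads it started.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PrestartDemo {
    static ThreadPoolExecutor newPool(int core, int max) {
        return new ThreadPoolExecutor(core, max, 1L, TimeUnit.MINUTES, new LinkedBlockingQueue<>(100));
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPool(2, 4);
        try {
            // Threads are created lazily, so the pool is empty until work arrives.
            System.out.println("before: " + pool.getPoolSize());
            int started = pool.prestartAllCoreThreads();
            System.out.println("started: " + started + ", after: " + pool.getPoolSize());
        } finally {
            pool.shutdown();
        }
    }
}
```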

Object Pooling (Apache Commons Pool2)

For heavyweight objects (e.g., large buffers) reuse instances via a pool.

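@Data
@AllArgsConstructor // @Data alone does not generate the one-argument constructor used by CacheFactory
public class Cache { private byte[] data; }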

public class CacheFactory extends BasePooledObjectFactory<Cache> {
    @Override public Cache create() { return new Cache(new byte[16 * 1024 * 1024]); }
    @Override public PooledObject<Cache> wrap(Cache obj) { return new DefaultPooledObject<>(obj); }
}

public enum CachePool {
    INSTANCE;
    private final GenericObjectPool<Cache> pool;
    CachePool() {
        GenericObjectPoolConfig<Cache> cfg = new GenericObjectPoolConfig<>();
        cfg.setMaxTotal(50);
        cfg.setMinIdle(20);
        cfg.setMaxIdle(20);
        cfg.setMaxWait(Duration.ofSeconds(3));
        pool = new GenericObjectPool<>(new CacheFactory(), cfg);
    }
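    // Callers should call release(...) in a finally block so a borrowed object always returns to the pool.
    public Cache borrow() throws Exception { return pool.borrowObject(); }
    public void release(Cache c) { pool.returnObject(c); }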
}

Lock Granularity Control

volatile : lightweight visibility guarantee.

CAS : used by AtomicInteger, AtomicLong, and the other Atomic* classes.

Object/Class lock : synchronized on instance or static method.

Spin lock : busy‑wait loop, useful when wait time is expected to be very short.

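Segmented lock : lock a slice of the data instead of the whole structure. ConcurrentHashMap locked per segment up to Java 7; since Java 8 it synchronizes on the head node of a single hash bin.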

ReadWriteLock : multiple concurrent reads, exclusive write.
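
As an illustration of the spin lock entry, here is a minimal sketch built on AtomicBoolean. It is neither reentrant nor fair, and it is meant only to show the busy-wait idea; for production code, the JDK's own locks are usually the better choice. Thread.onSpinWait() requires Java 9 or later, and the class names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Busy-wait until this thread flips the flag from false to true.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // hint to the CPU that this is a spin loop
        }
    }

    void unlock() {
        locked.set(false);
    }
}

class SpinLockDemo {
    // Each thread increments a shared plain int, guarded only by the spin lock.
    static int countWithSpinLock(int threads, int perThread) {
        SpinLock lock = new SpinLock();
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter[0]++; // critical section kept very short on purpose
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithSpinLock(4, 10_000)); // 4 * 10,000 = 40000
    }
}
```

Spinning burns CPU while waiting, so it pays off only when the critical section is a handful of instructions, as it is here.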

Copy‑On‑Write Collections

Writes acquire a lock, copy the underlying array, modify the copy, and then replace the array reference. Reads are lock‑free; a reader or iterator that started before a write keeps working on the old array and does not see the new element.

public boolean add(E e) {
    synchronized (lock) {
        Object[] old = getArray();
        Object[] newArr = Arrays.copyOf(old, old.length + 1);
        newArr[old.length] = e;
        setArray(newArr);
        return true;
    }
}

Best for read‑dominant workloads; not suitable when writes are frequent because of copy overhead.
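
The snapshot behaviour can be seen in a few lines. In this sketch (class and method names are hypothetical), the loop adds to the list while iterating over it. An ArrayList would throw ConcurrentModificationException here; a CopyOnWriteArrayList keeps iterating over the array that existed when the loop started.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class CowSnapshotDemo {
    // Iterates over the list and appends a marked copy of each element while doing so.
    // Returns the elements the iterator actually visited.
    static List<String> iterateWhileAdding(List<String> list) {
        List<String> seen = new ArrayList<>();
        for (String s : list) {
            list.add(s + "'"); // each add copies the whole backing array
            seen.add(s);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> list = new CopyOnWriteArrayList<>(Arrays.asList("a", "b"));
        List<String> seen = iterateWhileAdding(list);
        System.out.println("visited: " + seen); // visited: [a, b]
        System.out.println("list:    " + list); // list:    [a, b, a', b']
    }
}
```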

Parallel Stream Example

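// Parallel streams run on the shared ForkJoinPool.commonPool(), so they suit CPU-bound work on large collections.
// mapToInt sums primitives directly instead of boxing every partial result.
int sum = numbers.parallelStream().mapToInt(Integer::intValue).sum();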

Loop Reduction and Batch Fetching

Instead of querying the database once per ID, fetch the rows for all IDs in one batch query and process them locally.

Map<String, User> users = userMapper.queryByIds(userIds);
for (String id : userIds) {
    User u = users.get(id);
    // process u
}

Result Caching in Loops

Map<String, Role> roleCache = new HashMap<>();
for (User u : users) {
    Role r = roleCache.computeIfAbsent(u.getRoleId(), rid -> roleMapper.queryById(rid));
    // use r
}

Network Size Optimization

Trim JSON fields; avoid large text columns in SELECT.

Use binary serialization (Protobuf) for high‑throughput services.

Compress payloads with GZIP/ZLIB when bandwidth is limited.

Reducing Service Dependencies

Design micro‑services to avoid circular calls, duplicate requests, and tight coupling. Techniques:

Data redundancy (local copies of reference data).

Result caching (Redis, local in‑memory cache).

Asynchronous messaging (MQ) to break call chains.
