How to Supercharge Java Backend Performance with CompletableFuture, Thread Pools, Caching, and Lock Tuning
This article analyzes Java performance bottlenecks and demonstrates how to use CompletableFuture for parallelism, fine‑tune ThreadPoolExecutor parameters, minimize transaction scope, apply cache‑line padding, object pooling, lock‑granularity techniques, copy‑on‑write collections, and reduce network payloads to achieve lower latency and higher throughput.
CompletableFuture for Parallel Price Queries
When a price‑query flow needs to fetch several independent configuration items (base price, discount price, merchant activity price, platform activity price, etc.), CompletableFuture can run the I/O‑bound calls in parallel. However, creating too many threads can cause contention; the thread‑pool size must match the workload characteristics.
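The parallel fetch described above might look like the following minimal sketch. The four lookup methods and their return values are hypothetical stand-ins for the real I/O-bound calls, and the pool parameters are illustrative only.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ParallelPriceQuery {

    // Hypothetical lookups; in a real service each would be a remote or database call.
    static long basePrice(String sku)        { return 1_000; }
    static long discount(String sku)         { return 100; }
    static long merchantActivity(String sku) { return 50; }
    static long platformActivity(String sku) { return 30; }

    /** Starts the four independent lookups in parallel, then combines the results. */
    static long finalPrice(String sku, Executor pool) {
        CompletableFuture<Long> base  = CompletableFuture.supplyAsync(() -> basePrice(sku), pool);
        CompletableFuture<Long> disc  = CompletableFuture.supplyAsync(() -> discount(sku), pool);
        CompletableFuture<Long> merch = CompletableFuture.supplyAsync(() -> merchantActivity(sku), pool);
        CompletableFuture<Long> plat  = CompletableFuture.supplyAsync(() -> platformActivity(sku), pool);

        // allOf completes once every lookup has finished, so the joins below do not block again.
        CompletableFuture.allOf(base, disc, merch, plat).join();
        return base.join() - disc.join() - merch.join() - plat.join();
    }

    public static void main(String[] args) {
        // A dedicated, bounded pool instead of the shared ForkJoinPool.commonPool().
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 1L, TimeUnit.MINUTES, new LinkedBlockingQueue<>(100));
        try {
            System.out.println(finalPrice("sku-1", pool)); // 1000 - 100 - 50 - 30 = 820
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing the executor explicitly keeps these I/O-bound calls off the common pool, which other parts of the application may share.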
Performance Test
private void sync(){
long s = System.currentTimeMillis();
a(10); b(10); c(10); d(10);
long e = System.currentTimeMillis();
System.out.println(e - s);
}
private void async(){
long s = System.currentTimeMillis();
List<CompletableFuture<?>> list = new ArrayList<>();
list.add(CompletableFuture.runAsync(() -> a(10)));
list.add(CompletableFuture.runAsync(() -> b(10)));
list.add(CompletableFuture.runAsync(() -> c(10)));
list.add(CompletableFuture.runAsync(() -> d(10)));
CompletableFuture.allOf(list.toArray(new CompletableFuture[0])).join();
long e = System.currentTimeMillis();
System.out.println(e - s);
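}
With only a few short‑lived tasks like these, the synchronous version is typically faster, because the cost of handing work to another thread and waiting for it outweighs the work itself. Note that runAsync without an executor argument runs on the shared ForkJoinPool.commonPool().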
Hybrid Strategy
Run fast, lightweight tasks synchronously and submit only the relatively heavy tasks to a CompletableFuture pool. This reduces scheduling cost while still gaining parallelism for the bottleneck steps.
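A minimal sketch of this hybrid approach follows. It assumes two cheap in-memory lookups and two slow calls; all method names, prices, and the 50 ms delay are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class HybridPricing {

    // Cheap, in-memory steps (hypothetical): not worth a thread hand-off.
    static long cachedBasePrice(String sku) { return 1_000; }
    static long cachedDiscount(String sku)  { return 100; }

    // Slow steps (hypothetical): simulated with a short sleep.
    static long slowMerchantActivity(String sku) { pause(50); return 50; }
    static long slowPlatformActivity(String sku) { pause(50); return 30; }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static long finalPrice(String sku, Executor pool) {
        // Submit the heavy calls first so they overlap with the cheap work below.
        CompletableFuture<Long> merch =
                CompletableFuture.supplyAsync(() -> slowMerchantActivity(sku), pool);
        CompletableFuture<Long> plat =
                CompletableFuture.supplyAsync(() -> slowPlatformActivity(sku), pool);

        // Lightweight steps run directly on the calling thread.
        long base = cachedBasePrice(sku);
        long disc = cachedDiscount(sku);

        return base - disc - merch.join() - plat.join();
    }
}
```

Only two tasks reach the pool here, so the scheduling cost is paid only where the parallelism actually saves time.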
Minimizing Transaction Scope
Large transaction scopes hold database locks and connections longer, which increases contention. A programmatic transaction template lets you wrap only the statements that must be atomic:
public interface TransactionControlService {
<T> T execute(ObjectLogicFunction<T> logic) throws Exception;
void execute(VoidLogicFunction logic) throws Exception;
}
@Service
public class TransactionControlServiceImpl implements TransactionControlService {
@Autowired private PlatformTransactionManager ptm;
@Autowired private TransactionDefinition td;
@Override
public <T> T execute(ObjectLogicFunction<T> fn) throws Exception {
TransactionStatus ts = ptm.getTransaction(td);
try {
T r = fn.logic();
ptm.commit(ts);
return r;
} catch (Exception e) {
ptm.rollback(ts);
throw e;
}
}
@Override
public void execute(VoidLogicFunction fn) throws Exception {
TransactionStatus ts = ptm.getTransaction(td);
try {
fn.logic();
ptm.commit(ts);
} catch (Exception e) {
ptm.rollback(ts);
throw e;
}
}
}
Thread‑Pool Creation and Configuration
Prefer a direct ThreadPoolExecutor over the factory methods in Executors so that all parameters are explicit.
private static final ExecutorService EXECUTOR = new ThreadPoolExecutor(
2, // corePoolSize
4, // maximumPoolSize
1L, TimeUnit.MINUTES, // keepAliveTime
new LinkedBlockingQueue<>(100),
new ThreadFactoryBuilder().setNameFormat("common-pool-%d").build(),
new ThreadPoolExecutor.CallerRunsPolicy()
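);
corePoolSize : threads kept alive even when idle; for CPU‑bound work this is usually close to the number of CPU cores.
maximumPoolSize : upper bound on the number of threads. Threads beyond corePoolSize are created only once the work queue is full, so this setting absorbs traffic spikes after the queue has filled.
keepAliveTime : idle time after which threads above corePoolSize are terminated.
workQueue : its capacity must balance memory consumption against task latency; with an unbounded queue, maximumPoolSize is never reached.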
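How these parameters interact can be observed with a deliberately tiny pool. The sketch below uses one core thread, a maximum of two threads, and a queue of capacity one; these numbers are chosen for the demonstration and are not a recommendation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolGrowthDemo {

    /** Returns {pool size, queued tasks, 1 if the rejected task ran on the caller else 0}. */
    static int[] observe() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 1L, TimeUnit.MINUTES,
                new LinkedBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());

        CountDownLatch release = new CountDownLatch(1);
        Runnable blocked = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        pool.execute(blocked); // 1st task: starts the single core thread
        pool.execute(blocked); // 2nd task: core thread is busy, so it waits in the queue
        pool.execute(blocked); // 3rd task: queue is full, so a second (non-core) thread starts
        int poolSize = pool.getPoolSize();
        int queued = pool.getQueue().size();

        // 4th task: queue full and maximumPoolSize reached, so CallerRunsPolicy
        // executes it on the submitting thread.
        Thread caller = Thread.currentThread();
        boolean[] ranOnCaller = {false};
        pool.execute(() -> {
            ranOnCaller[0] = Thread.currentThread() == caller;
        });

        release.countDown(); // let the blocked tasks finish
        pool.shutdown();
        return new int[] {poolSize, queued, ranOnCaller[0] ? 1 : 0};
    }

    public static void main(String[] args) {
        int[] r = observe();
        // prints: poolSize=2 queued=1 ranOnCaller=true
        System.out.println("poolSize=" + r[0] + " queued=" + r[1] + " ranOnCaller=" + (r[2] == 1));
    }
}
```

CallerRunsPolicy acts as simple back-pressure: the submitting thread is slowed down instead of the task being dropped.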
Typical sizing formula for I/O‑bound workloads:
int cpu = Runtime.getRuntime().availableProcessors();
double blockingCoeff = 0.9; // proportion of time spent waiting
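int core = (int)(cpu / (1 - blockingCoeff)); // e.g. 8 cores → about 80 threads
Monitor the pool with Micrometer/Prometheus (active count, queue size, completed tasks) and adjust the parameters based on what you observe; ThreadPoolExecutor allows setCorePoolSize and setMaximumPoolSize to be called at runtime.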
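The sizing formula and the values worth exporting as metrics can be sketched with the JDK alone. The helper names below are hypothetical, and the snapshot string merely stands in for whatever gauges a metrics library would register. The sketch rounds the result where the snippet above truncates it.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolSizing {

    /** threads = cores / (1 - blockingCoefficient), rounded to the nearest integer. */
    static int ioBoundThreads(int cores, double blockingCoeff) {
        if (blockingCoeff < 0 || blockingCoeff >= 1) {
            throw new IllegalArgumentException("blockingCoeff must be in [0, 1)");
        }
        return Math.max(1, (int) Math.round(cores / (1 - blockingCoeff)));
    }

    /** Reads the counters a monitoring system would typically scrape. */
    static String snapshot(ThreadPoolExecutor pool) {
        return "active=" + pool.getActiveCount()
                + " poolSize=" + pool.getPoolSize()
                + " queued=" + pool.getQueue().size()
                + " completed=" + pool.getCompletedTaskCount();
    }

    public static void main(String[] args) {
        System.out.println(ioBoundThreads(8, 0.9)); // 8 / 0.1 = 80
        System.out.println(ioBoundThreads(8, 0.5)); // 8 / 0.5 = 16

        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 1L, TimeUnit.MINUTES, new LinkedBlockingQueue<>(100));
        System.out.println(snapshot(pool)); // a fresh pool reports zeros everywhere
        pool.shutdown();
    }
}
```

A blocking coefficient of 0 reduces the formula to one thread per core, which matches the CPU-bound guidance.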
Cache‑Line Alignment
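On most current x86‑64 CPUs, caches are organized in 64‑byte lines. A Java int[][] is an array of row arrays, so traversing it row by row touches consecutive memory and exploits spatial locality, while column‑wise access jumps to a different row array on every step and causes many cache misses.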
public class CacheLineDemo {
public static void main(String[] args) {
int[][] arr = new int[10000][10000]; // about 400 MB of heap; raise -Xmx or shrink the array if needed
long s = System.currentTimeMillis();
for (int i = 0; i < arr.length; i++) {
for (int j = 0; j < arr[i].length; j++) {
arr[i][j] = 0; // row‑major (fast)
}
}
System.out.println("row: " + (System.currentTimeMillis() - s));
s = System.currentTimeMillis();
for (int i = 0; i < arr.length; i++) {
for (int j = 0; j < arr[i].length; j++) {
arr[j][i] = 0; // column‑major (slow)
}
}
System.out.println("col: " + (System.currentTimeMillis() - s));
}
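}
False sharing can be avoided by padding objects so that hot fields written by different threads land on different cache lines:
class Padding {
volatile long p1, p2, p3, p4, p5, p6, p7; // 7 × 8 bytes = 56 bytes of padding
}
class Cell extends Padding {
public volatile long x = 0L; // padding + x + object header make each Cell larger than 64 bytes, so the x fields of two Cells never share a cache line
}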
public class CacheLinePadding {
static final Cell[] CELLS = {new Cell(), new Cell()};
public static void main(String[] args) throws Exception {
Thread t1 = new Thread(() -> {
for (long i = 0; i < 10_000_000L; i++) CELLS[0].x = i;
});
Thread t2 = new Thread(() -> {
for (long i = 0; i < 10_000_000L; i++) CELLS[1].x = i;
});
long start = System.nanoTime();
t1.start(); t2.start();
t1.join(); t2.join();
System.out.println("ns per op: " + (System.nanoTime() - start) / 10_000_000L); // each thread performs 10,000,000 writes
}
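}
Reducing Object Creation
Prefer primitive types over boxed wrappers for high‑frequency operations. In the following comparison, the Integer loop unboxes, increments, and re-boxes on every iteration, allocating a new Integer once the value leaves the small-value cache (-128 to 127 by default). It is therefore typically noticeably slower than the int loop; measure with a harness such as JMH, since the JIT compiler can remove some of the cost.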
private static void testInt(){
int sum = 1;
for (int i = 1; i < 50_000_000; i++) sum++;
System.out.println(sum);
}
private static void testInteger(){
Integer sum = 1;
for (int i = 1; i < 50_000_000; i++) sum++;
System.out.println(sum);
}
Immutable objects (e.g., String) can be shared freely instead of copied, and builders (StringBuilder) avoid the temporary objects that repeated concatenation creates. Reuse instances via static factories or enum singletons:
public enum EnumSingleton { INSTANCE; }
public class StaticSingleton {
private StaticSingleton() {}
private static class Holder { private static final StaticSingleton INSTANCE = new StaticSingleton(); }
public static StaticSingleton getInstance() { return Holder.INSTANCE; }
}
Concurrent Collections and Lock Granularity
volatile : guarantees visibility but not atomicity.
CAS (compare‑and‑set): lock‑free primitive, used by Atomic* classes.
synchronized : simple monitor lock.
ReentrantReadWriteLock : many readers, exclusive writer.
Segmented locks : employed by ConcurrentHashMap up to Java 7 to reduce contention; since Java 8 it uses CAS plus synchronized on individual hash bins instead.
CopyOnWriteArrayList / CopyOnWriteArraySet : ideal for read‑heavy, write‑light scenarios; writes copy the underlying array.
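To make the CAS entry concrete, here is a minimal sketch of a lock-free "update if larger" operation built from a compareAndSet retry loop. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CasMaxTracker {

    private final AtomicInteger maxSeen = new AtomicInteger(Integer.MIN_VALUE);
    private final AtomicInteger hits = new AtomicInteger();

    /** Records a value without taking a lock. */
    void record(int value) {
        hits.incrementAndGet(); // atomic read-modify-write; a plain volatile ++ would lose updates
        int current;
        do {
            current = maxSeen.get();
            if (value <= current) {
                return; // another thread already stored a larger value
            }
            // compareAndSet fails if maxSeen changed since the read above; then retry.
        } while (!maxSeen.compareAndSet(current, value));
    }

    int max()  { return maxSeen.get(); }
    int hits() { return hits.get(); }

    /** Runs several writer threads and returns {hits, max}. */
    static int[] run(int threads, int perThread) {
        CasMaxTracker tracker = new CasMaxTracker();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int offset = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    tracker.record(offset + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] {tracker.hits(), tracker.max()};
    }

    public static void main(String[] args) {
        int[] r = run(4, 10_000);
        System.out.println("hits=" + r[0] + " max=" + r[1]); // hits=40000 max=39999
    }
}
```

The retry loop is what distinguishes CAS from volatile alone: visibility is guaranteed in both cases, but only the compare-and-set makes the read-check-write sequence atomic.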
Simplified CopyOnWriteArrayList.add implementation (CopyOnWriteArraySet delegates to a backing CopyOnWriteArrayList):
public boolean add(E e) {
synchronized (lock) {
Object[] elements = getArray();
int len = elements.length;
elements = Arrays.copyOf(elements, len + 1);
elements[len] = e;
setArray(elements);
return true;
}
}
Loop and Batch Optimizations
Replace many single‑row queries with a bulk query and cache repeated look‑ups:
Map<String, User> userMap = userMapper.queryByIds(userIds);
Map<String, Role> roleCache = new HashMap<>();
for (String uid : userIds) {
User u = userMap.get(uid);
Role r = roleCache.computeIfAbsent(u.getRoleId(), id -> roleMapper.queryById(id));
// process u and r
}
Network Payload Reduction
Select only required fields in SQL (e.g., SELECT id, price FROM product).
Prefer binary protocols such as Protobuf over JSON for large payloads.
Compress payloads with GZIP or ZLIB when size matters.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 1000; i++) sb.append(i);
byte[] gz = ZipUtil.gzip(sb.toString(), CharsetUtil.UTF_8);
System.out.println("compressed size: " + gz.length);
Service Dependency Minimization
Avoid circular calls and duplicate remote requests. Strategies include:
Data redundancy (local copy of reference data).
Result caching (Redis, local cache).
Asynchronous messaging (MQ) to decouple producers and consumers.
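Result caching can be sketched with a small wrapper that remembers what a remote call returned. The class below is a hypothetical illustration: it has no expiry or size limit, which a production cache (local or Redis-backed) would need.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CachedLookup<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> remote;                 // stands in for the remote service call
    private final AtomicInteger remoteCalls = new AtomicInteger();

    CachedLookup(Function<K, V> remote) {
        this.remote = remote;
    }

    /** Returns the cached value, calling the remote function at most once per key. */
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            remoteCalls.incrementAndGet();
            return remote.apply(k);
        });
    }

    int remoteCalls() {
        return remoteCalls.get();
    }

    public static void main(String[] args) {
        CachedLookup<String, String> roles = new CachedLookup<>(id -> "role-" + id);
        roles.get("1");
        roles.get("1");
        roles.get("2");
        System.out.println(roles.remoteCalls()); // 2: the second lookup of "1" is served locally
    }
}
```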
Thread‑Pool Pre‑warming
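Pre‑create the core threads during application startup so that the first requests do not pay for thread creation:
((ThreadPoolExecutor) EXECUTOR).prestartAllCoreThreads(); // cast needed because EXECUTOR is declared as ExecutorService
Object Pooling (Apache Commons Pool2)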
For heavyweight objects (e.g., large buffers) reuse instances via a pool.
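@Data
@AllArgsConstructor // Lombok: generates the Cache(byte[]) constructor that the factory calls; @Data alone does not
public class Cache { private byte[] data; }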
public class CacheFactory extends BasePooledObjectFactory<Cache> {
@Override public Cache create() { return new Cache(new byte[16 * 1024 * 1024]); }
@Override public PooledObject<Cache> wrap(Cache obj) { return new DefaultPooledObject<>(obj); }
}
public enum CachePool {
INSTANCE;
private final GenericObjectPool<Cache> pool;
CachePool() {
GenericObjectPoolConfig<Cache> cfg = new GenericObjectPoolConfig<>();
cfg.setMaxTotal(50);
cfg.setMinIdle(20);
cfg.setMaxIdle(20);
cfg.setMaxWait(Duration.ofSeconds(3));
pool = new GenericObjectPool<>(new CacheFactory(), cfg);
}
public Cache borrow() throws Exception { return pool.borrowObject(); }
public void release(Cache c) { pool.returnObject(c); }
}
Lock Granularity Control
volatile : lightweight visibility guarantee.
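CAS : used by AtomicInteger, AtomicLong, and the seed update in java.util.Random; ThreadLocalRandom avoids that contention by keeping a per‑thread seed.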
Object/Class lock : synchronized on instance or static method.
Spin lock : busy‑wait loop, useful when wait time is expected to be very short.
Segmented lock : ConcurrentHashMap locks only part of the table; since Java 8 the lock is taken on the head node of a single hash bin.
ReadWriteLock : multiple concurrent reads, exclusive write.
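As an illustration of the ReadWriteLock entry, the following minimal sketch guards a plain HashMap so that lookups can proceed concurrently while updates are exclusive. The PriceTable class is a hypothetical example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class PriceTable {

    private final Map<String, Long> prices = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Many threads may hold the read lock at the same time. */
    Long get(String sku) {
        lock.readLock().lock();
        try {
            return prices.get(sku);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** The write lock is exclusive: it waits for readers and blocks new ones. */
    void put(String sku, long price) {
        lock.writeLock().lock();
        try {
            prices.put(sku, price);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        PriceTable table = new PriceTable();
        table.put("sku-1", 820L);
        System.out.println(table.get("sku-1")); // 820
    }
}
```

Unlocking in a finally block keeps the lock from leaking if the guarded code throws.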
Copy‑On‑Write Collections
Writes acquire a lock, copy the underlying array, modify it, and replace the reference. Reads are lock‑free but see a snapshot taken before the write.
public boolean add(E e) {
synchronized (lock) {
Object[] old = getArray();
Object[] newArr = Arrays.copyOf(old, old.length + 1);
newArr[old.length] = e;
setArray(newArr);
return true;
}
}
Best for read‑dominant workloads; not suitable when writes are frequent because of copy overhead.
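The snapshot behavior of reads can be seen by modifying a CopyOnWriteArrayList while iterating over it. This is a small sketch with made-up element names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class SnapshotIteration {

    static String iterateWhileAdding() {
        List<String> listeners = new CopyOnWriteArrayList<>(Arrays.asList("a", "b"));
        List<String> seen = new ArrayList<>();

        // The iterator works on the array as it was when iteration started.
        for (String listener : listeners) {
            seen.add(listener);
            listeners.add(listener + "-late"); // an ArrayList would throw ConcurrentModificationException here
        }
        return "seen=" + seen + " final=" + listeners;
    }

    public static void main(String[] args) {
        // prints: seen=[a, b] final=[a, b, a-late, b-late]
        System.out.println(iterateWhileAdding());
    }
}
```

Each add call here copies the whole array, which is why the structure suits listener lists and configuration data rather than write-heavy queues.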
Parallel Stream Example
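Parallel streams split the work across the common ForkJoinPool, which pays off mainly for large, CPU‑bound collections:
int sum = numbers.parallelStream().reduce(0, Integer::sum); // numbers is assumed to be a List<Integer>
Loop Reduction and Batch Fetching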
Instead of querying the database per ID, fetch all IDs in one batch and process locally.
Map<String, User> users = userMapper.queryByIds(userIds);
for (String id : userIds) {
User u = users.get(id);
// process u
}
Result Caching in Loops
Map<String, Role> roleCache = new HashMap<>();
for (User u : users) {
Role r = roleCache.computeIfAbsent(u.getRoleId(), rid -> roleMapper.queryById(rid));
// use r
}
Network Size Optimization
Trim JSON fields; avoid large text columns in SELECT.
Use binary serialization (Protobuf) for high‑throughput services.
Compress payloads with GZIP/ZLIB when bandwidth is limited.
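Compression can also be done with the JDK alone. The sketch below shows a GZIP round trip using java.util.zip; the class name and the sample payload are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipPayload {

    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the bytes only afterwards.
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"id\":").append(i).append(",\"price\":100}");
        }
        String raw = sb.toString();
        byte[] gz = compress(raw);
        System.out.println("raw bytes: " + raw.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("gzip bytes: " + gz.length);
        System.out.println("round trip ok: " + raw.equals(decompress(gz)));
    }
}
```

Repetitive payloads such as lists of similar JSON records compress well; very small or already compressed payloads may not be worth the CPU cost.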
Reducing Service Dependencies
Design micro‑services to avoid circular calls, duplicate requests, and tight coupling. Techniques:
Data redundancy (local copies of reference data).
Result caching (Redis, local in‑memory cache).
Asynchronous messaging (MQ) to break call chains.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
