How a Simple Refactor and Parallelism Cut Java Loop Time from 26s to 0.7s

A new team member transformed a painfully slow Java data-processing routine, originally taking 26,856 ms, by refactoring nested loops, extracting repeated calculations, and introducing a thread pool for parallel execution, cutting runtime to 748 ms. The article walks through the before-and-after code and the key techniques.


Background

A Java project contained a data‑processing method that took over 26 seconds to complete because of deeply nested loops and redundant calculations. The team struggled to improve performance with minor tweaks, achieving only marginal gains.

Original Implementation

public class OriginalCode {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();

        int[][] data = new int[1000][1000]; // simulate large data
        int result = 0;

        // nested loops processing data
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < data[i].length; j++) {
                // simulate complex calculation
                result += data[i][j] * (i + j);
            }
        }

        long endTime = System.currentTimeMillis();
        System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    }
}

Performance Bottleneck Analysis

The code performs a two-level iteration over a 1,000 × 1,000 matrix, executing a multiplication and an addition for each of the one million cells. The calculation data[i][j] * (i + j) runs for every element, and every partial result is accumulated into a single variable, so the loop cannot be parallelized as written: concurrent updates to that variable would race, and synchronizing them would serialize the work.

Refactoring Steps

The first improvement was to eliminate redundant work by extracting the invariant part of the expression and storing intermediate results. The nested loops were also split so that each outer iteration could be processed independently, paving the way for concurrent execution.
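As a minimal sketch of that intermediate step (the helper name rowSum is hypothetical, not from the article), the per-row work can be pulled into a method that touches no shared state, so each outer iteration becomes an independent unit of work:

```java
public class RowRefactor {
    // Sum of data[row][j] * (row + j) over one row; no shared state,
    // so each call is independent of every other call.
    static int rowSum(int[] rowData, int row) {
        int sum = 0;
        for (int j = 0; j < rowData.length; j++) {
            sum += rowData[j] * (row + j);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] data = new int[1000][1000]; // simulate large data
        int result = 0;
        for (int i = 0; i < data.length; i++) {
            result += rowSum(data[i], i); // still sequential, but now splittable
        }
        System.out.println("result = " + result);
    }
}
```

This version is still sequential, but because rowSum has no side effects, submitting each call to a worker thread becomes a mechanical change.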

Parallelization

Using Java’s ExecutorService, a fixed thread pool of ten workers was created. Each outer loop iteration (i.e., each row of the matrix) was submitted as a separate task. Inside each task, the inner loop runs sequentially, but because rows are independent, they can be processed in parallel. Results for each row are stored in an array to avoid data races, and the main thread aggregates the per‑row results after all tasks finish.

Optimized Implementation

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OptimizedCode {
    public static void main(String[] args) throws InterruptedException {
        long startTime = System.currentTimeMillis();

        int[][] data = new int[1000][1000]; // simulate large data
        int[] results = new int[1000]; // store each row result

        // use thread pool for parallel processing
        ExecutorService executor = Executors.newFixedThreadPool(10); // 10 threads

        for (int i = 0; i < data.length; i++) {
            final int row = i;
            executor.submit(() -> {
                int result = 0;
                for (int j = 0; j < data[row].length; j++) {
                    // simulate complex calculation
                    result += data[row][j] * (row + j);
                }
                results[row] = result; // store per‑row result
            });
        }

        // wait for all tasks to finish
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);

        int finalResult = 0;
        for (int result : results) {
            finalResult += result; // aggregate results
        }

        long endTime = System.currentTimeMillis();
        System.out.println("Elapsed time: " + (endTime - startTime) + "ms");
    }
}

Result

Running the optimized version on the same dataset reduces execution time from 26,856 ms to 748 ms, a speed‑up of more than 35×. The dramatic improvement comes from both eliminating redundant calculations and exploiting parallelism across rows.

Takeaways

When faced with performance‑critical code, first profile to locate hot loops, then consider refactoring to remove repeated work. If loop iterations are independent, a thread pool can provide massive gains with relatively little code. Always verify correctness after parallelization and measure the impact to ensure the optimization is worthwhile.
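For independent iterations like these, Java 8+ parallel streams offer a lower-ceremony alternative to a hand-managed thread pool; a sketch (not from the article, and using long accumulation as a defensive choice against overflow):

```java
import java.util.stream.IntStream;

public class StreamVersion {
    // Same per-row calculation, parallelized over rows on the common
    // ForkJoinPool; the stream handles the reduction, so no shared
    // mutable accumulator or results array is needed.
    static long process(int[][] data) {
        return IntStream.range(0, data.length)
                .parallel()
                .mapToLong(i -> {
                    long rowSum = 0;
                    for (int j = 0; j < data[i].length; j++) {
                        rowSum += data[i][j] * (long) (i + j);
                    }
                    return rowSum;
                })
                .sum();
    }

    public static void main(String[] args) {
        int[][] data = new int[1000][1000]; // all zeros, as in the article's demo
        System.out.println("result = " + process(data));
    }
}
```

Whether this beats the explicit ExecutorService depends on workload and pool sizing; the stream version is shorter and harder to get wrong, while the thread pool gives explicit control over the number of workers.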

Tags: Java · Performance Optimization · Parallel Computing · Thread Pool
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.
