Optimizing Large‑Scale Excel Import in Java: From POI to EasyExcel with Caching, Batch Inserts, and Parallel Streams
This article details how to dramatically speed up the import of massive Excel files (over 100,000 rows) into a Java backend by switching to EasyExcel, caching database lookups, using batch VALUES inserts, and parallelizing the write operations with functional utilities.
The author shares a step‑by‑step tutorial on optimizing the import of large Excel files (100,000+ rows) containing payment records into a Java backend that persists data in MySQL.
Requirement: Users fill in an XLSX template; the application reads, validates, transforms, and stores arrears, invoices, and invoice details, producing more than 300,000 database rows for a 100,000‑row Excel file.
Key challenges include field‑length/regex checks (done in memory) and duplicate checks that require a database query for each row; the latter become the major performance bottleneck.
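The in‑memory checks can be sketched as a small validator. This is a minimal sketch; the field names, length limit, and amount pattern below are hypothetical, not the article's actual template rules:

```java
import java.util.regex.Pattern;

public class RowValidator {
    // Hypothetical rules: amounts like "120.50" (max two decimals),
    // house identifiers up to 32 characters.
    private static final Pattern AMOUNT = Pattern.compile("\\d{1,10}(\\.\\d{1,2})?");
    private static final int MAX_HOUSE_LEN = 32;

    /** Returns an error message, or null when the row passes all checks. */
    static String validate(String houseNum, String amount) {
        if (houseNum == null || houseNum.isEmpty() || houseNum.length() > MAX_HOUSE_LEN) {
            return "house identifier must be 1-" + MAX_HOUSE_LEN + " characters";
        }
        if (amount == null || !AMOUNT.matcher(amount).matches()) {
            return "amount must be a number with at most two decimals";
        }
        return null;
    }
}
```

Checks like these cost only CPU time; it is the per-row duplicate query against the database that dominates, which motivates the caching in Iteration 2.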
Iteration 1 – POI + row‑by‑row query + row‑by‑row insert: The original implementation read the file with raw POI and performed a database query and an insert for every row, leading to N round‑trips of network I/O and poor maintainability.
Iteration 2 – EasyPOI + cached database queries + batch insert: To eliminate repeated DB round‑trips, all house data needed for validation is loaded into a HashMap up front. A custom SessionMapper uses a MapResultHandler to fill the map, and MySQL's INSERT … VALUES batch syntax inserts many rows at once.
@Repository
public class SessionMapper extends SqlSessionDaoSupport {
@Resource
public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
super.setSqlSessionFactory(sqlSessionFactory);
}
    @SuppressWarnings("unchecked")
    public Map<String, Long> getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(BaseUnitMapper.class.getName() + ".getHouseMapByAreaId", areaId, handler);
        return handler.getMappedResults();
    }
}

public class MapResultHandler implements ResultHandler {
    @SuppressWarnings("rawtypes")
    private final Map mappedResults = new HashMap();

    @Override
    public void handleResult(ResultContext context) {
        @SuppressWarnings("rawtypes")
        Map map = (Map) context.getResultObject();
        mappedResults.put(map.get("key"), map.get("value"));
    }

    @SuppressWarnings("rawtypes")
    public Map getMappedResults() {
        return mappedResults;
    }
}

@Mapper
@Repository
public interface BaseUnitMapper {
    // area + building + unit + room -> houseId
    Map<String, Long> getHouseMapByAreaId(@Param("areaId") Long areaId);
}

<select id="getHouseMapByAreaId" resultMap="mapResultLong">
SELECT CONCAT(h.bulid_area_name, h.build_name, h.unit_name, h.house_num) k,
h.house_id v
FROM base_house h
WHERE h.area_id = #{areaId}
GROUP BY h.house_id
</select>
<resultMap id="mapResultLong" type="java.util.HashMap">
<result property="key" column="k" javaType="string" jdbcType="VARCHAR"/>
<result property="value" column="v" javaType="long" jdbcType="INTEGER"/>
</resultMap>

Iteration 3 – EasyExcel + cache + batch insert: For 410,000 rows, EasyExcel reads the file in about 50 seconds, far faster than EasyPOI, which ran out of memory. The same caching and batch‑insert strategy is retained.
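EasyExcel keeps memory flat by streaming rows through a listener callback instead of loading the whole workbook. A sketch of that pattern, assuming the com.alibaba:easyexcel dependency; FeeRow is a hypothetical row model, and the 1,000-row buffer mirrors the batch size chosen below:

```java
import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;
import java.util.ArrayList;
import java.util.List;

public class FeeRowListener extends AnalysisEventListener<FeeRow> {
    private static final int BATCH = 1000;
    private final List<FeeRow> buffer = new ArrayList<>(BATCH);

    @Override
    public void invoke(FeeRow row, AnalysisContext context) {
        buffer.add(row);              // rows arrive one at a time
        if (buffer.size() >= BATCH) {
            flush();                  // never hold more than one batch in memory
        }
    }

    @Override
    public void doAfterAllAnalysed(AnalysisContext context) {
        flush();                      // persist the final partial batch
    }

    private void flush() {
        // validate against the cached HashMap and batch-insert here
        buffer.clear();
    }
}

// usage: EasyExcel.read(file, FeeRow.class, new FeeRowListener()).sheet().doRead();
```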
Iteration 4 – Tuning batch size: Experiments with batch sizes (30,000, 10,000, 5,000, 3,000, 1,000, 500) showed that 1,000 rows per INSERT gave the best throughput, likely because larger statements exceed the InnoDB buffer pool and trigger disk swapping.
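A 1,000‑row batch then becomes a single multi‑row statement. A sketch of the MyBatis mapper behind a method like insertList, using the standard `<foreach>` element; the acc_fee table and its columns are hypothetical:

```xml
<!-- One INSERT … VALUES statement per 1,000-item sub-list;
     MyBatis expands the foreach into (…),(…),… value tuples. -->
<insert id="insertList" parameterType="java.util.List">
    INSERT INTO acc_fee (house_id, fee_type, amount)
    VALUES
    <foreach collection="list" item="f" separator=",">
        (#{f.houseId}, #{f.feeType}, #{f.amount})
    </foreach>
</insert>
```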
Parallel insertion utility: The author provides a functional‑style utility that splits a list into sub‑lists of 1,000 items, sets the common ForkJoinPool parallelism to 4, and executes the batch insert via a parallel stream.
/**
* Function‑style utility for fast parallel insertion.
* @author Keats
* @date 2020/7/1 9:25
*/
public class InsertConsumer {
private static final int SIZE = 1000;
static { System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4"); }
    public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.isEmpty()) {
            return;
        }
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            int j = Math.min(i + SIZE, list.size());
            streamList.add(list.subList(i, j));
        }
        streamList.parallelStream().forEach(consumer);
    }
}

Usage example:
InsertConsumer.insertData(feeList, arrearageMapper::insertList);

Other performance tips: use the INSERT … VALUES batch syntax, avoid excessive info‑level logging inside loops (logging alone can add a ten‑fold slowdown), and increase the JVM heap when necessary.
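The logging tip can be made concrete with a level guard. A minimal sketch using java.util.logging (the article does not specify its logging framework): the per‑row message is only built when the level is enabled, so the hot loop pays no string‑concatenation or I/O cost in production.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ImportLoop {
    private static final Logger LOG = Logger.getLogger(ImportLoop.class.getName());

    /** Processes the given number of rows and returns how many were handled. */
    static long process(int rows) {
        long handled = 0;
        for (int i = 0; i < rows; i++) {
            handled++;
            // Guarded per-row logging: skipped entirely unless FINE is enabled.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("imported row " + i);
            }
        }
        // One summary line instead of N per-row info lines.
        LOG.info("imported " + handled + " rows");
        return handled;
    }
}
```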
Conclusion: By adopting a faster Excel parser (EasyExcel), caching reference data, batching inserts, and parallelizing the write phase, the total import time for 100,000 rows can be reduced to under two minutes, with the database becoming the remaining bottleneck.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn