Optimizing Large‑Scale Excel Import in Java: From POI to EasyExcel with Caching, Batch Inserts, and Parallel Streams
This article details how to dramatically speed up the import of massive Excel files (over 100,000 rows) into a Java backend by switching to EasyExcel, caching database lookups, using batch VALUES inserts, and parallelizing the write operations with functional utilities.
The author shares a step‑by‑step tutorial on optimizing the import of large Excel files (100,000+ rows) containing payment records into a Java backend that persists data in MySQL.
Requirement: Users fill in an XLSX template; the application reads, validates, transforms, and stores arrears, invoices, and invoice details, producing more than 300,000 database rows for a 100,000‑row Excel file.
Key challenges include field‑length/regex checks (done in memory) and duplicate checks that require a database query for each row; the latter become the major performance bottleneck.
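The in‑memory checks can be sketched as a small validator. This is a minimal sketch; the field names, length limit, and amount pattern below are hypothetical, not the article's actual template rules:

```java
import java.util.regex.Pattern;

public class RowValidator {
    // Hypothetical rules: amounts like "120.50" (max two decimals),
    // house identifiers up to 32 characters.
    private static final Pattern AMOUNT = Pattern.compile("\\d{1,10}(\\.\\d{1,2})?");
    private static final int MAX_HOUSE_LEN = 32;

    /** Returns an error message, or null when the row passes all checks. */
    static String validate(String houseNum, String amount) {
        if (houseNum == null || houseNum.isEmpty() || houseNum.length() > MAX_HOUSE_LEN) {
            return "house identifier must be 1-" + MAX_HOUSE_LEN + " characters";
        }
        if (amount == null || !AMOUNT.matcher(amount).matches()) {
            return "amount must be a number with at most two decimals";
        }
        return null;
    }
}
```

Checks like these cost only CPU time; it is the per-row duplicate query against the database that dominates, which motivates the caching in Iteration 2.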
Iteration 1 – POI + row‑by‑row query + row‑by‑row insert: The original implementation read the file with raw POI and performed a database query and an insert for every row, leading to N round‑trips of network I/O and poor maintainability.
Iteration 2 – EasyPOI + cached database queries + batch insert: To eliminate repeated DB round‑trips, all house data needed for validation is loaded into a HashMap up front. A custom SessionMapper uses a MapResultHandler to fill the map, and MySQL's INSERT … VALUES batch syntax inserts many rows at once.
@Repository
public class SessionMapper extends SqlSessionDaoSupport {
@Resource
public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
super.setSqlSessionFactory(sqlSessionFactory);
}
    @SuppressWarnings("unchecked")
    public Map<String, Long> getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(BaseUnitMapper.class.getName() + ".getHouseMapByAreaId", areaId, handler);
        return handler.getMappedResults();
    }
}

public class MapResultHandler implements ResultHandler {
    @SuppressWarnings("rawtypes")
    private final Map mappedResults = new HashMap();

    @Override
    public void handleResult(ResultContext context) {
        @SuppressWarnings("rawtypes")
        Map map = (Map) context.getResultObject();
        mappedResults.put(map.get("key"), map.get("value"));
    }

    @SuppressWarnings("rawtypes")
    public Map getMappedResults() {
        return mappedResults;
    }
}

@Mapper
@Repository
public interface BaseUnitMapper {
    // area + building + unit + room -> houseId
    Map<String, Long> getHouseMapByAreaId(@Param("areaId") Long areaId);
}

<select id="getHouseMapByAreaId" resultMap="mapResultLong">
SELECT CONCAT(h.bulid_area_name, h.build_name, h.unit_name, h.house_num) k,
h.house_id v
FROM base_house h
WHERE h.area_id = #{areaId}
GROUP BY h.house_id
</select>
<resultMap id="mapResultLong" type="java.util.HashMap">
<result property="key" column="k" javaType="string" jdbcType="VARCHAR"/>
<result property="value" column="v" javaType="long" jdbcType="INTEGER"/>
</resultMap>

Iteration 3 – EasyExcel + cache + batch insert: For 410,000 rows, EasyExcel reads the file in about 50 seconds, far faster than EasyPOI, which ran out of memory. The same caching and batch‑insert strategy is retained.
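EasyExcel keeps memory flat by streaming rows through a listener callback instead of loading the whole workbook. A sketch of that pattern, assuming the com.alibaba:easyexcel dependency; FeeRow is a hypothetical row model, and the 1,000-row buffer mirrors the batch size chosen below:

```java
import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;
import java.util.ArrayList;
import java.util.List;

public class FeeRowListener extends AnalysisEventListener<FeeRow> {
    private static final int BATCH = 1000;
    private final List<FeeRow> buffer = new ArrayList<>(BATCH);

    @Override
    public void invoke(FeeRow row, AnalysisContext context) {
        buffer.add(row);              // rows arrive one at a time
        if (buffer.size() >= BATCH) {
            flush();                  // never hold more than one batch in memory
        }
    }

    @Override
    public void doAfterAllAnalysed(AnalysisContext context) {
        flush();                      // persist the final partial batch
    }

    private void flush() {
        // validate against the cached HashMap and batch-insert here
        buffer.clear();
    }
}

// usage: EasyExcel.read(file, FeeRow.class, new FeeRowListener()).sheet().doRead();
```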
Iteration 4 – Tuning batch size: Experiments with batch sizes (30,000, 10,000, 5,000, 3,000, 1,000, 500) showed that 1,000 rows per INSERT gave the best throughput, likely because larger statements exceed the InnoDB buffer pool and trigger disk swapping.
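A 1,000‑row batch then becomes a single multi‑row statement. A sketch of the MyBatis mapper behind a method like insertList, using the standard `<foreach>` element; the acc_fee table and its columns are hypothetical:

```xml
<!-- One INSERT … VALUES statement per 1,000-item sub-list;
     MyBatis expands the foreach into (…),(…),… value tuples. -->
<insert id="insertList" parameterType="java.util.List">
    INSERT INTO acc_fee (house_id, fee_type, amount)
    VALUES
    <foreach collection="list" item="f" separator=",">
        (#{f.houseId}, #{f.feeType}, #{f.amount})
    </foreach>
</insert>
```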
Parallel insertion utility: The author provides a functional‑style utility that splits a list into sub‑lists of 1,000 items, sets the common ForkJoinPool parallelism to 4, and executes the batch insert via a parallel stream.
/**
* Function‑style utility for fast parallel insertion.
* @author Keats
* @date 2020/7/1 9:25
*/
public class InsertConsumer {
private static final int SIZE = 1000;
static { System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4"); }
    public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.isEmpty()) {
            return;
        }
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            int j = Math.min(i + SIZE, list.size());
            streamList.add(list.subList(i, j));
        }
        streamList.parallelStream().forEach(consumer);
    }
}

Usage example:
InsertConsumer.insertData(feeList, arrearageMapper::insertList);

Other performance tips: use the INSERT … VALUES batch syntax, avoid excessive info‑level logging inside loops (logging alone can add a ten‑fold slowdown), and increase the JVM heap when necessary.
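The logging tip can be made concrete with a level guard. A minimal sketch using java.util.logging (the article does not specify its logging framework): the per‑row message is only built when the level is enabled, so the hot loop pays no string‑concatenation or I/O cost in production.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ImportLoop {
    private static final Logger LOG = Logger.getLogger(ImportLoop.class.getName());

    /** Processes the given number of rows and returns how many were handled. */
    static long process(int rows) {
        long handled = 0;
        for (int i = 0; i < rows; i++) {
            handled++;
            // Guarded per-row logging: skipped entirely unless FINE is enabled.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("imported row " + i);
            }
        }
        // One summary line instead of N per-row info lines.
        LOG.info("imported " + handled + " rows");
        return handled;
    }
}
```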
Conclusion: By adopting a faster Excel parser (EasyExcel), caching reference data, batching inserts, and parallelizing the write phase, the total import time for 100,000 rows can be reduced to under two minutes, with the database becoming the remaining bottleneck.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn