How to Speed Up Large Excel Imports in Java: From POI to EasyExcel and Parallel Insertion
This article walks through a step‑by‑step optimization of a Java‑based Excel import pipeline, covering the transition from raw POI to EasyPOI and EasyExcel, caching database lookups, batch inserts, optimal batch sizes, parallel streams, and logging reductions to achieve sub‑minute import times for hundreds of thousands of rows.
Requirement Overview
The system must import payment records from an Excel template filled by users, validate the data, transform it, and store overdue records, invoices, and invoice details into a MySQL database.
Initial Implementation (Version 1)
Used raw Apache POI to read rows into a List, performed per‑row database validation (e.g., duplicate invoice numbers), and inserted each row individually. This approach caused massive network I/O and slow performance, suitable only for tens of rows.
Problems Identified
Each row triggered a separate DB query for validation, multiplying I/O latency.
Rows were inserted one by one, repeating the same overhead.
POI code was verbose and hard to maintain.
Version 2 – EasyPOI + Cached Validation + Batch Insert
Introduced three improvements:
Cache Validation Data
All house information for a community (≤5000 entries) is loaded once into a HashMap<String, Long> where the key is a concatenated address and the value is the house ID, eliminating per‑row DB lookups.
@Repository
public class SessionMapper extends SqlSessionDaoSupport {
@Resource
public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
super.setSqlSessionFactory(sqlSessionFactory);
}
@SuppressWarnings("unchecked")
public Map<String, Long> getHouseMapByAreaId(Long areaId) {
MapResultHandler handler = new MapResultHandler();
this.getSqlSession().select(BaseUnitMapper.class.getName()+".getHouseMapByAreaId", areaId, handler);
return handler.getMappedResults();
}
} public class MapResultHandler implements ResultHandler {
private final Map mappedResults = new HashMap();
@Override
public void handleResult(ResultContext context) {
Map map = (Map)context.getResultObject();
mappedResults.put(map.get("key"), map.get("value"));
}
public Map getMappedResults() { return mappedResults; }
}Batch Insert with MyBatis
Used MySQL INSERT ... VALUES (...),(...) syntax combined with MyBatis <foreach> to insert many rows in a single statement.
<insert id="insertList">
insert into table(col1, col2)
values
<foreach collection="list" item="item" separator=",">
(#{item.col1}, #{item.col2})
</foreach>
</insert>Version 3 – EasyExcel + Cached Validation + Batch Insert
Switched to Alibaba's EasyExcel for reading large files. EasyExcel reads 410,000 rows (25 columns, 45.5 MB) in about 50 seconds, far faster than EasyPOI, which ran out of memory on the same dataset.
Version 4 – Optimizing Insert Speed
Experimented with different batch sizes for the VALUES statement. A batch of 1,000 rows yielded the best performance, likely because larger batches exceeded the InnoDB buffer pool and caused disk swapping.
Implemented parallel insertion using Java 8 parallel streams to keep the CPU busy while waiting for network I/O.
/**
* Utility for fast parallel insertion using streams.
*/
public class InsertConsumer {
private static final int SIZE = 1000;
static { System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4"); }
public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
if (list == null || list.isEmpty()) return;
List<List<T>> partitions = new ArrayList<>();
for (int i = 0; i < list.size(); i += SIZE) {
int end = Math.min(i + SIZE, list.size());
partitions.add(list.subList(i, end));
}
partitions.parallelStream().forEach(consumer);
}
}Usage example:
InsertConsumer.insertData(feeList, arrearageMapper::insertList);Other Performance Tips
Logging
Excessive info logging inside loops dramatically slows processing; reducing logs cut execution time to roughly one‑tenth of the original.
Summary of Improvements
Adopt a faster Excel parser (recommend EasyExcel).
Cache data needed for validation to avoid per‑row DB queries.
Use bulk INSERT ... VALUES statements.
Insert in batches of ~1,000 rows and leverage parallel streams for concurrency.
Suppress unnecessary logging inside processing loops.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
