Optimizing Large Excel Import Performance in Java Backend Applications
This article details a step‑by‑step optimization of a Java backend Excel import pipeline, covering requirement analysis, performance bottlenecks, caching strategies, batch insertion techniques, parallel processing, and practical code examples to reduce a 10‑minute import to under two minutes.
Requirement : Import massive Excel files containing payment records, validate them, transform into arrears data, and persist the results (arrears, bills, bill details) into a MySQL 5.7 database.
First version (POI + row‑by‑row query + row‑by‑row insert) : Read each row into a List, perform per‑row field length checks, execute a database query for each row to verify house existence, and insert each row individually. This caused O(n) database round‑trips and severe slowdown when rows reached 10⁵‑plus.
Problems identified :
Database validation performed for every row, multiplying network I/O.
Row‑by‑row inserts suffered the same issue.
Raw POI code was verbose and hard to maintain.
Second version (EasyPOI + cached DB queries + batch insert) :
Cache lookup data in a HashMap to avoid repeated queries. A custom SessionMapper and MapResultHandler load all house information for a given area into memory.
Use MySQL INSERT INTO table(col1,col2) VALUES (...),(...),... syntax (values batch) to insert many rows per statement.
Custom mapper code:
@Repository
public class SessionMapper extends SqlSessionDaoSupport {
@Resource
public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
super.setSqlSessionFactory(sqlSessionFactory);
}
// 区域楼宇单元房号 - 房屋ID
@SuppressWarnings("unchecked")
public Map<String, Long> getHouseMapByAreaId(Long areaId) {
MapResultHandler handler = new MapResultHandler();
this.getSqlSession().select(BaseUnitMapper.class.getName()+".getHouseMapByAreaId", areaId, handler);
Map<String, Long> map = handler.getMappedResults();
return map;
}
}Result handler:
public class MapResultHandler implements ResultHandler {
private final Map mappedResults = new HashMap();
@Override
public void handleResult(ResultContext context) {
@SuppressWarnings("rawtypes")
Map map = (Map)context.getResultObject();
mappedResults.put(map.get("key"), map.get("value"));
}
public Map getMappedResults() {
return mappedResults;
}
}Mapper interface:
@Mapper
@Repository
public interface BaseUnitMapper {
// 区域楼宇单元房号 - 房屋ID
Map<String, Long> getHouseMapByAreaId(@Param("areaId") Long areaId);
}Mapper XML snippet (resultMap for key/value):
<select id="getHouseMapByAreaId" resultMap="mapResultLong">
SELECT CONCAT(h.bulid_area_name, h.build_name, h.unit_name, h.house_num) k,
h.house_id v
FROM base_house h
WHERE h.area_id = #{areaId}
GROUP BY h.house_id
</select>
<resultMap id="mapResultLong" type="java.util.HashMap">
<result property="key" column="k" javaType="string" jdbcType="VARCHAR"/>
<result property="value" column="v" javaType="long" jdbcType="INTEGER"/>
</resultMap>Third version (EasyExcel + cached DB queries + batch insert) : Switched to Alibaba's EasyExcel for faster reading of very large files (e.g., 410 000 rows, 25 columns). EasyExcel uses annotation‑based mapping similar to EasyPOI, reducing memory consumption and read time to ~50 s.
Fourth version (Optimized batch insertion) :
Experimented with batch sizes; 1 000 rows per INSERT gave the best throughput, avoiding excessive SQL size that caused disk swapping.
Implemented parallel insertion using Java 8 parallel streams to keep CPU busy while waiting for network I/O.
Parallel insertion utility:
public class InsertConsumer {
private static final int SIZE = 1000;
static {
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
}
public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
if (list == null || list.size() < 1) return;
List<List<T>> streamList = new ArrayList<>();
for (int i = 0; i < list.size(); i += SIZE) {
int j = Math.min(i + SIZE, list.size());
streamList.add(list.subList(i, j));
}
streamList.parallelStream().forEach(consumer);
}
}Usage example:
InsertConsumer.insertData(feeList, arrearageMapper::insertList);Additional performance tips :
Prefer EasyExcel over POI for large files.
Cache data needed for validation to trade space for time.
Use multi‑row VALUES syntax for bulk inserts.
Leverage parallel streams or other multithreading to hide I/O latency.
Avoid excessive logging inside tight loops; it can increase runtime by an order of magnitude.
Result : With these optimizations, importing 100 000 rows, generating 420 000 arrears records, and inserting them with 16 parallel threads took only 72 seconds for insertion and 95 seconds total.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
