How to Import Millions of Excel Rows in Seconds: 4 Proven Performance Hacks
This article analyzes why traditional Excel import methods crash under massive loads and presents four practical optimization techniques—including streaming parsing, batch inserts, asynchronous processing, and parallel sharding—backed by code samples, configuration tips, and real‑world performance benchmarks for importing millions of rows efficiently.
Introduction
Many developers struggle with importing massive Excel files; a typical e‑commerce system that needs to import 200,000 product rows per day can freeze for over three hours, and a server restart wipes all progress.
1 Why Traditional Import Solutions Fail
1.1 Memory Exhaustion
Problem: POI loads the entire workbook (e.g., UserModel / XSSFWorkbook) into heap memory.
Experiment: A 50 MB file (~200k rows) can exhaust a 1 GB heap, because the in‑memory object model is many times larger than the file on disk.
Symptoms: Frequent Full GC, CPU spikes, service unresponsiveness.
1.2 Synchronous Blocking
Process: User uploads → server processes all data synchronously → returns result.
Risk: HTTP timeout (default 30 s) leads to lost tasks.
1.3 Efficiency Black Hole
Measured: MySQL single‑thread insert ≈200 rows/s → 200,000 rows need about 17 minutes (1,000 s).
Root cause: Each INSERT triggers transaction commit, index update, log write.
2 Four Performance Optimizations
2.1 Streaming Parsing
Replace DOM parsing with POI’s SAX mode to read the file piece by piece.
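// Streaming read of an .xlsx file with POI's XSSF event API (legacy .xls needs the HSSF event API)
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.util.XMLHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

try (OPCPackage pkg = OPCPackage.open(file)) {
    XSSFReader reader = new XSSFReader(pkg);
    RowHandler rowHandler = new RowHandler(); // implements XSSFSheetXMLHandler.SheetContentsHandler
    XMLReader parser = XMLHelper.newXMLReader();
    parser.setContentHandler(new XSSFSheetXMLHandler(
            reader.getStylesTable(), new ReadOnlySharedStringsTable(pkg), rowHandler, false));
    Iterator<InputStream> sheets = reader.getSheetsData();
    while (sheets.hasNext()) {
        try (InputStream stream = sheets.next()) {
            parser.parse(new InputSource(stream)); // rows are pushed to rowHandler one at a time
        }
    }
}
Pitfall guide: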
Match the API to the Excel format: HSSF for .xls, XSSF for .xlsx (SXSSF is a streaming API for writing only, not reading).
Avoid creating many objects during parsing; reuse containers.
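The "reuse containers" advice can be illustrated without POI. The sketch below uses the JDK's own SAX parser on a simplified, hypothetical sheet layout (`<row><c>value</c></row>`); real sheet XML is more involved (shared strings, cell types), and the class name `StreamingRowSketch` is an assumption made for this example. One cell list and one text buffer are reused for every row, so memory stays flat regardless of row count.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StreamingRowSketch {

    /** Streams simplified <row><c>..</c></row> XML; returns the number of rows seen. */
    static int parse(String xml, Consumer<List<String>> onRow) {
        final int[] rowCount = {0};
        DefaultHandler handler = new DefaultHandler() {
            private final List<String> cells = new ArrayList<>();   // reused for every row
            private final StringBuilder text = new StringBuilder(); // reused for every cell
            private boolean inCell;

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("row".equals(qName)) {
                    cells.clear();
                } else if ("c".equals(qName)) {
                    text.setLength(0);
                    inCell = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inCell) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("c".equals(qName)) {
                    cells.add(text.toString());
                    inCell = false;
                } else if ("row".equals(qName)) {
                    rowCount[0]++;
                    onRow.accept(cells); // the callback must copy the list if it keeps the row
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse sheet XML", e);
        }
        return rowCount[0];
    }
}
```

Because the same list instance is handed to the callback each time, a consumer that buffers rows for a later batch insert has to copy the values it needs before returning.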
2.2 Paginated Batch Inserts
Use MyBatis batch insert with connection‑pool tuning.
// Paginated batch insert (commit every 1,000 rows)
public void batchInsert(List<Product> list) {
    int pageSize = 1000;
    // try-with-resources closes the session (and returns its connection) even if a page fails
    try (SqlSession sqlSession = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
        ProductMapper mapper = sqlSession.getMapper(ProductMapper.class);
        for (int i = 0; i < list.size(); i += pageSize) {
            List<Product> subList = list.subList(i, Math.min(i + pageSize, list.size()));
            mapper.batchInsert(subList);
            sqlSession.commit();     // flush the batch and commit this page
            sqlSession.clearCache(); // drop the session-level cache to keep memory flat
        }
    }
}
Key parameters:
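# MyBatis configuration (mybatis-spring-boot-starter): use the batch executor by default.
# The batch size itself is controlled in code (pageSize = 1000 above).
mybatis.executor-type=BATCH
# Druid connection pool
spring.datasource.druid.maxActive=50
spring.datasource.druid.initialSize=10
2.3 Asynchronous Processing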
Architecture (four stages):
Frontend upload: Use chunked upload tools (e.g., WebUploader).
Server side: Generate a unique task ID and push the task into a queue (Redis Stream / RabbitMQ).
Async thread pool: Multiple workers consume the queue; progress stored in Redis.
Result notification: Notify client via WebSocket or email.
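The submit-then-poll flow above can be sketched in a single process. This is only an illustration under stated assumptions: an `ExecutorService` stands in for the message queue and its consumers, a `ConcurrentHashMap` stands in for Redis, and the names `AsyncImportSketch`, `submit`, and `progressOf` are hypothetical. Failure handling and result notification are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class AsyncImportSketch {
    // Stand-in for Redis: task ID -> percent complete (0..100).
    private final Map<String, Integer> progress = new ConcurrentHashMap<>();
    // Stand-in for the queue plus its worker consumers.
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    /** Accepts the upload, returns a task ID at once, and processes rows in the background. */
    String submit(List<String> rows, Consumer<String> rowProcessor) {
        String taskId = UUID.randomUUID().toString();
        progress.put(taskId, 0);
        workers.submit(() -> {
            for (int i = 0; i < rows.size(); i++) {
                rowProcessor.accept(rows.get(i));
                progress.put(taskId, (i + 1) * 100 / rows.size());
            }
            progress.put(taskId, 100); // also covers an empty upload
        });
        return taskId;
    }

    /** What a progress-query endpoint would return; -1 for an unknown task. */
    int progressOf(String taskId) {
        return progress.getOrDefault(taskId, -1);
    }

    /** Stops accepting tasks and waits for running ones; true if all finished in time. */
    boolean shutdownAndWait(long timeoutSeconds) {
        workers.shutdown();
        try {
            return workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The HTTP handler only calls `submit` and returns the task ID, so the request finishes in milliseconds no matter how large the file is; the client then polls `progressOf` or waits for a push notification.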
2.4 Parallel Import
For tens of millions of rows, apply a divide‑and‑conquer strategy:
Single‑thread: row‑by‑row read + insert (baseline 100%).
Paginated batch: time reduced to 5%.
Multi‑thread sharding: time reduced to 1%.
Distributed sharding (3 nodes): time reduced to 0.5%.
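A minimal single-machine sketch of the multi-thread sharding step, assuming the rows are already parsed into a list: split them into fixed-size shards and hand each shard to a worker thread that performs one batch write. The names `ShardedImportSketch`, `shard`, and `importInParallel` are hypothetical, and `batchWriter` is a placeholder for something like a batch insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ShardedImportSketch {

    /** Splits rows into consecutive shards of at most shardSize rows (views, not copies). */
    static <T> List<List<T>> shard(List<T> rows, int shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += shardSize) {
            shards.add(rows.subList(i, Math.min(i + shardSize, rows.size())));
        }
        return shards;
    }

    /** Runs batchWriter on every shard using a fixed thread pool; returns total rows written. */
    static <T> int importInParallel(List<T> rows, int shardSize, int threads,
                                    Consumer<List<T>> batchWriter) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<T> s : shard(rows, shardSize)) {
                results.add(pool.submit(() -> {
                    batchWriter.accept(s); // e.g. one batch insert per shard
                    return s.size();
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // rethrows if any shard failed
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while importing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Shard import failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count should stay below the connection pool's maximum, since each worker holds one database connection while it writes.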
3 Key Experience Beyond Code
3.1 Pre‑validation
Wrong approach – validate while inserting, which may pollute the database:
// Wrong: validate while inserting; a failure midway leaves earlier rows already written
public void validateAndInsert(Product product) {
    if (product.getPrice() < 0) {
        throw new IllegalArgumentException("Price cannot be negative");
    }
    productMapper.insert(product);
}
Correct practice:
Perform basic format and required‑field checks during streaming parsing.
Do business validation (referential integrity, uniqueness) before persisting.
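One way to apply this, sketched under assumptions: validate the whole parsed batch first, collect per-row errors, and pass only the valid rows on to persistence. The `ProductRow` type, its fields, and the three rules shown (required SKU, non-negative price, SKU unique within the file) are hypothetical examples rather than the original system's rules.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PreValidationSketch {

    /** Hypothetical row type; rowNumber, sku and price are illustrative fields. */
    static final class ProductRow {
        final int rowNumber;
        final String sku;
        final BigDecimal price;

        ProductRow(int rowNumber, String sku, BigDecimal price) {
            this.rowNumber = rowNumber;
            this.sku = sku;
            this.price = price;
        }
    }

    /** Rows that passed every check, plus readable errors for the rejected rows. */
    static final class Result {
        final List<ProductRow> valid = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
    }

    /** Checks all rows before anything is written to the database. */
    static Result validate(List<ProductRow> rows) {
        Result result = new Result();
        Set<String> seenSkus = new HashSet<>();
        for (ProductRow row : rows) {
            if (row.sku == null || row.sku.isBlank()) {
                result.errors.add("Row " + row.rowNumber + ": SKU is required");
            } else if (row.price == null || row.price.signum() < 0) {
                result.errors.add("Row " + row.rowNumber + ": price must be non-negative");
            } else if (!seenSkus.add(row.sku)) {
                result.errors.add("Row " + row.rowNumber + ": duplicate SKU " + row.sku);
            } else {
                result.valid.add(row);
            }
        }
        return result;
    }
}
```

Returning errors as data instead of throwing on the first bad row lets the import report every problem in one pass, so the user can fix the file once.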
3.2 Checkpoint‑Resume Design
Record processing status of each chunk.
On failure, resume from the last offset.
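The two rules above might look like the following sketch. It assumes an in-memory map as a stand-in for a durable checkpoint store (a database table or Redis in practice); `CheckpointSketch`, `run`, and `offsetOf` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class CheckpointSketch {
    // Stand-in for a durable store: task ID -> offset of the next unprocessed row.
    private final Map<String, Integer> offsets = new ConcurrentHashMap<>();

    /**
     * Processes rows chunk by chunk, starting from the last recorded offset.
     * The offset advances only after a chunk succeeds, so a failed chunk is
     * retried on the next call. Returns the offset reached.
     */
    <T> int run(String taskId, List<T> rows, int chunkSize, Consumer<List<T>> chunkWriter) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int start = offsets.getOrDefault(taskId, 0);
        for (int i = start; i < rows.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, rows.size());
            chunkWriter.accept(rows.subList(i, end)); // may throw; the offset then stays at i
            offsets.put(taskId, end);                 // record the checkpoint after success
        }
        return offsetOf(taskId);
    }

    /** Offset of the next unprocessed row; 0 for a task that has not started. */
    int offsetOf(String taskId) {
        return offsets.getOrDefault(taskId, 0);
    }
}
```

A chunk that fails partway may already have written some rows, so the writer should be idempotent (for example, an upsert keyed on a unique column) or run each chunk in a single transaction.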
3.3 Logging & Monitoring
Example Spring Boot Prometheus metrics configuration:
// Spring Boot Prometheus metric bean
@Bean
public MeterRegistryCustomizer<PrometheusMeterRegistry> metrics() {
    return registry -> registry.config().meterFilter(new MeterFilter() {
        @Override
        public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
            return DistributionStatisticConfig.builder()
                    .percentiles(0.5, 0.95) // median and 95th percentile
                    .build().merge(config);
        }
    });
}
4 Million‑Row Import Performance Comparison
Test environment: 4‑core 8 GB server, MySQL 8.0, 1,000,000 rows × 15 columns (~200 MB Excel).
Results:
Traditional row‑by‑row: 2.5 GB peak memory, 96 min, 173 rows/s.
Paginated batch: 500 MB peak memory, 7 min, 2,381 rows/s.
Multi‑thread sharding + async batch: 800 MB peak memory, 86 s, 11,627 rows/s.
Distributed sharding (3 nodes): 300 MB peak memory per node, 29 s, 34,482 rows/s.
Conclusion
Never load the whole file into memory: Use SAX streaming.
Avoid row‑by‑row DB operations: Leverage batch inserts.
Never make users wait: Process asynchronously with progress queries.
Horizontal scaling beats vertical tuning: Sharding and distributed processing.
Memory management is critical: Object pooling, avoid large temporary objects.
Tune connection‑pool parameters: Prevent datasource bottlenecks.
Pre‑validation is non‑negotiable: Filter dirty data at the entry point.
Comprehensive monitoring: Full‑link metrics.
Design for disaster recovery: Checkpoint‑resume and idempotent handling.
Discard single‑machine mindset: Embrace distributed system design.
Stress test extreme scenarios: Million‑row load tests are essential.
If you are frustrated by Excel import performance, the techniques above should open a new door for your system.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
