
5 Production Nightmares in an Education Mini‑Program and How to Avoid Them

The author recounts five critical production incidents that crippled an education mini‑program (Redis connection‑pool exhaustion, duplicate bookings, double refunds, misfiring no‑show jobs, and inventory oversell), detailing root causes, concrete fixes, and hard‑won lessons for building resilient backend services.

Coder Trainee

Incident 1: Redis connection‑pool exhaustion crashes the app

Symptom : On the third day after launch, during a peak of parent appointments, the mini‑program became unreachable and all API calls timed out.

Investigation : Monitoring showed Redis connections spiking to over 800 and the Tomcat thread pool becoming fully BLOCKED.

Root cause : The Lettuce pool was configured with max-wait=-1, which makes a thread wait indefinitely for a free connection. A slow KEYS * command blocked Redis, so borrowed connections were never released; the pool filled up and every request hung.

Fix :

spring.redis.lettuce.pool.max-active=100
# Fail after 1 s instead of waiting indefinitely
spring.redis.lettuce.pool.max-wait=1000ms
spring.redis.lettuce.pool.time-between-eviction-runs=30s

Added a fallback that catches RedisConnectionFailureException and reads directly from the database.
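
The fallback described above might look roughly like the following sketch. It is plain Java so that it runs without Spring: CacheUnavailableException is a hypothetical stand-in for RedisConnectionFailureException, and the two suppliers stand in for the real Redis and database reads, which the article does not show.

```java
import java.util.function.Supplier;

class CacheFallback {
    /** Hypothetical stand-in for Spring's RedisConnectionFailureException. */
    static class CacheUnavailableException extends RuntimeException {
        CacheUnavailableException(String message) {
            super(message);
        }
    }

    /** Try the cache first; on a miss or a connection failure, read from the database. */
    static <T> T readWithFallback(Supplier<T> cacheRead, Supplier<T> dbRead) {
        try {
            T cached = cacheRead.get();
            if (cached != null) {
                return cached;
            }
        } catch (CacheUnavailableException e) {
            // Degrade instead of failing the request.
            // A real service would also log the failure and emit a metric here.
        }
        return dbRead.get();
    }

    public static void main(String[] args) {
        String value = readWithFallback(
            () -> { throw new CacheUnavailableException("pool exhausted"); },
            () -> "slot list from database");
        System.out.println(value);
    }
}
```

A fallback like this shifts load onto the database when Redis is down, so it is usually worth pairing with a rate limit or a short-lived local cache.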

Lesson : All external calls (Redis, MQ, HTTP) must have a timeout and a degradation path.

Incident 2: Duplicate booking of the same time slot

Symptom : The backend recorded that a parent booked the same slot five times, and all five requests succeeded.

Investigation : The front‑end disabled the button after click, but a script sent five requests with only a 10 ms interval. The backend code only checked for an existing appointment and returned an error if found.

Root cause : Concurrent requests all saw exist=null before any insert, so they all proceeded.

Fix :

ALTER TABLE appointment ADD UNIQUE KEY uk_user_slot (user_id, slot_id);

Added a distributed lock:

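String lockKey = "lock:appoint:" + userId + ":" + slotId;
// SET NX with a 3-second expiry: only the first request acquires the key
Boolean locked = redisTemplate.opsForValue()
    .setIfAbsent(lockKey, "1", 3, TimeUnit.SECONDS);
// setIfAbsent can return null (in a pipeline or transaction), so avoid unboxing it directly
if (!Boolean.TRUE.equals(locked)) {
    return error("Please do not submit again");
}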

Lesson : Never rely solely on front‑end deduplication; enforce idempotency on the backend with a unique index as the final safeguard.
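
To illustrate why the unique index works where the check-then-insert logic did not, here is a small simulation. It is only a sketch: a concurrent set stands in for the uk_user_slot index, and the class, method names, and IDs are hypothetical. The point is that the insert itself acts as the existence check, so there is no window between "check" and "write" for a second request to slip through.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class BookingDedup {
    // Stands in for the uk_user_slot unique index: add() is atomic and returns false on a duplicate.
    private final Set<String> uniqueIndex = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first booking of (userId, slotId). */
    boolean book(long userId, long slotId) {
        // No separate "does it already exist?" query: the insert is the check.
        return uniqueIndex.add(userId + ":" + slotId);
    }

    /** Sends the same booking from several threads and counts how many succeed. */
    static int simulate(int requests) {
        BookingDedup service = new BookingDedup();
        AtomicInteger successes = new AtomicInteger();
        Thread[] threads = new Thread[requests];
        for (int i = 0; i < requests; i++) {
            threads[i] = new Thread(() -> {
                if (service.book(42L, 7L)) {
                    successes.incrementAndGet();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return successes.get();
    }

    public static void main(String[] args) {
        System.out.println("successful bookings: " + simulate(5));
    }
}
```

In the real service the equivalent step would be attempting the INSERT and treating a duplicate-key error as "already booked", rather than querying first.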

Incident 3: Refund executed twice

Symptom : Financial reconciliation found three orders that were refunded twice, costing the institution over 300 CNY.

Investigation : The refund API lacked idempotency. A user clicked “request refund” once; the front‑end timed out and retried, sending a second request.

Root cause : Both requests called the WeChat refund API, resulting in two successful refunds.

Fix :

ALTER TABLE refund_record ADD UNIQUE KEY uk_out_refund_no (out_refund_no);

Added an idempotency check before issuing a refund:

RefundRecord exist = refundRecordMapper.selectByAppointmentId(appointmentId);
if (exist != null) {
    log.info("Order already refunded, skipping");
    return;
}
String outRefundNo = generateOutRefundNo(); // e.g., R + timestamp + random
// call WeChat refund API

Lesson : Any money‑related operation must be idempotent; combine a unique constraint with a state‑machine guard.
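
One caveat with the snippet above: if out_refund_no contains a timestamp and a random part, a retry generates a different number, so the unique key on out_refund_no cannot reject it, and the select-then-refund check has the same race as Incident 2. A possible variation, sketched below under the assumption that each appointment is refunded at most once, derives the refund number from the appointment ID and claims the record before calling the payment API. The map stands in for refund_record, and the counter stands in for the real refund call; all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class RefundService {
    // Stands in for refund_record with its unique key on out_refund_no.
    private final Map<String, String> refundRecords = new ConcurrentHashMap<>();
    // Counts calls to the simulated payment gateway.
    final AtomicInteger gatewayCalls = new AtomicInteger();

    /** Deterministic refund number: a retry for the same appointment yields the same value. */
    static String outRefundNo(long appointmentId) {
        return "R" + appointmentId;
    }

    /** Returns true if this call issued the refund, false if one was already claimed. */
    boolean refund(long appointmentId) {
        String refundNo = outRefundNo(appointmentId);
        // Claim first: putIfAbsent is atomic, like an INSERT guarded by the unique key.
        if (refundRecords.putIfAbsent(refundNo, "PROCESSING") != null) {
            return false;
        }
        gatewayCalls.incrementAndGet(); // placeholder for the real refund API call
        refundRecords.put(refundNo, "SUCCESS");
        return true;
    }

    public static void main(String[] args) {
        RefundService service = new RefundService();
        service.refund(1001L);
        service.refund(1001L); // simulated front-end retry
        System.out.println("gateway calls: " + service.gatewayCalls.get());
    }
}
```

The PROCESSING and SUCCESS values are a minimal form of the state-machine guard mentioned in the lesson; a production version would also need a FAILED state and a way to resume a refund that was claimed but never completed.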

Incident 4: No‑show job mistakenly flags ongoing classes

Symptom : A parent received a no‑show notification while the child was still in class.

Investigation : The scheduled job queried appointments with status = 1 (pending) and start_time < NOW(). When a teacher rescheduled a class from 3 pm to 4 pm, the start_time in the database was not updated, so the job considered the appointment overdue and marked it as a no‑show.

Fix : Added a two‑hour buffer before treating a pending appointment as a no‑show.

SELECT * FROM appointment
WHERE status = 1
  AND start_time < DATE_SUB(NOW(), INTERVAL 2 HOUR)

Also introduced a manual review step.

Lesson : Scheduled‑job thresholds need a safety buffer, and business‑logic changes (rescheduling, leave) must synchronize related status fields.
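
The buffered rule from the query can also be expressed as a small predicate, which makes the threshold easy to unit-test. This is a sketch only; the class name and constants are hypothetical, with STATUS_PENDING mirroring status = 1 and the two-hour buffer mirroring INTERVAL 2 HOUR.

```java
import java.time.Duration;
import java.time.LocalDateTime;

class NoShowRule {
    static final int STATUS_PENDING = 1;                 // mirrors status = 1 in the query
    static final Duration BUFFER = Duration.ofHours(2);  // mirrors INTERVAL 2 HOUR

    /** True only if the appointment is still pending and started more than BUFFER ago. */
    static boolean isNoShow(int status, LocalDateTime startTime, LocalDateTime now) {
        return status == STATUS_PENDING && startTime.isBefore(now.minus(BUFFER));
    }

    public static void main(String[] args) {
        // A class rescheduled from 3 pm to 4 pm whose stored start_time is still 3 pm.
        LocalDateTime staleStart = LocalDateTime.of(2024, 1, 10, 15, 0);
        LocalDateTime jobRunsAt = LocalDateTime.of(2024, 1, 10, 16, 30);
        System.out.println("flagged as no-show: " + isNoShow(STATUS_PENDING, staleStart, jobRunsAt));
    }
}
```

The buffer only hides a stale start_time for two hours; updating start_time when a class is rescheduled remains the underlying fix.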

Incident 5: Over‑selling inventory under high concurrency

Symptom : A popular teacher’s trial class limited to 4 seats ended up with 6 successful bookings.

Investigation : The update statement

UPDATE course_slot SET booked_count = booked_count + 1 WHERE id = ?

lacked a condition to ensure the count stayed below the maximum. Two concurrent requests could both read booked_count = 3, both conclude that a seat was still free, and both run the increment, leaving booked_count at 5 for a class capped at 4.

Fix (optimistic lock):

UPDATE course_slot
SET booked_count = booked_count + 1
WHERE id = ? AND booked_count < max_students

The statement affects 0 rows when the slot is already full, so the backend can detect the failure and the front‑end can show “fully booked”.

Lesson : Inventory‑deduction SQL must include a condition that checks the upper limit.
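
The same idea, checking the limit and incrementing in one atomic step, can be simulated in memory. In the sketch below a compare-and-set loop plays the role of UPDATE ... WHERE booked_count < max_students; the class and method names are hypothetical and nothing here touches a real database.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SlotInventory {
    private final int maxStudents;
    private final AtomicInteger bookedCount = new AtomicInteger();

    SlotInventory(int maxStudents) {
        this.maxStudents = maxStudents;
    }

    /** Returns false when the slot is full, like an UPDATE that affects 0 rows. */
    boolean tryBook() {
        while (true) {
            int current = bookedCount.get();
            if (current >= maxStudents) {
                return false;
            }
            if (bookedCount.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread changed the count in between; re-read and retry.
        }
    }

    int bookedCount() {
        return bookedCount.get();
    }

    /** Sends concurrent booking requests at one slot and counts the successes. */
    static int simulate(int seats, int requests) {
        SlotInventory slot = new SlotInventory(seats);
        AtomicInteger successes = new AtomicInteger();
        Thread[] threads = new Thread[requests];
        for (int i = 0; i < requests; i++) {
            threads[i] = new Thread(() -> {
                if (slot.tryBook()) {
                    successes.incrementAndGet();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return successes.get();
    }

    public static void main(String[] args) {
        System.out.println("successful bookings: " + simulate(4, 6));
    }
}
```

In the database version no retry loop is needed, because the row lock taken by the UPDATE serializes the competing statements; the application only has to check the affected-row count.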

Key Takeaways

Set timeouts and degradation for every external dependency.

Enforce idempotency with unique constraints, distributed locks, or state checks.

Use optimistic‑lock patterns for inventory updates.

Give scheduled jobs a buffer and keep status fields in sync with business actions.

Implement robust monitoring (e.g., Alibaba Cloud ARMS, Feishu/DingTalk bots, SkyWalking) and automated daily reconciliation scripts.

Running fault‑injection drills before launch, such as cutting off Redis or MySQL or applying high‑concurrency load, helps surface hidden weaknesses early.
