How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability
This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.
Stability Over Everything
Hema requires extremely high stability for its offline stores and delivery operations; payment failures or order‑pickup issues can trigger massive customer complaints and social backlash.
Smart Dispatch Chain Analysis
Understanding critical service links and external dependencies is essential. The team drew a comprehensive chain diagram of the O2O smart‑dispatch system, covering scheduling, pressure testing, data sources, the rider platform, algorithm strategies, distributed computing, path planning, and the storage and middleware dependencies (DTS, Diamond, Tair, job DB, downgrade DB).
Stability Factor Analysis and Practices
3.1 Database Dependencies
Slow SQL – early failures were often caused by slow queries; ongoing governance is required.
High logical read rows – queries reading >100k rows can become slow under growth.
Resource metrics – monitor CPU, load, QPS spikes; e.g., a grid task caused CPU spikes up to 60% during Double‑11.
Database isolation – separate core, secondary, and archive databases to prevent non‑core workloads from affecting critical services.
Database downgrade – use read‑only replicas via Jingwei to protect core jobs.
Schema mismatches – inconsistent field sizes/types across upstream/downstream can cause write failures.
Capacity & indexes – keep tables <5M rows, remove unnecessary indexes.
Schema changes – assess impact of index or column modifications before deployment.
Encoding – store JSON data carefully to avoid character‑set issues.
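The schema‑mismatch item above lends itself to an automated pre‑deployment check. The following is a minimal sketch, assuming column metadata has already been extracted from both sides; the `Column` type and `find_schema_mismatches` helper are hypothetical names introduced for illustration, not part of Hema's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str  # e.g. "varchar", "bigint"
    length: int    # maximum length; 0 when not applicable


def find_schema_mismatches(upstream, downstream):
    """List columns whose downstream definition cannot hold upstream data."""
    problems = []
    down_by_name = {col.name: col for col in downstream}
    for col in upstream:
        target = down_by_name.get(col.name)
        if target is None:
            problems.append(f"{col.name}: missing downstream")
        elif target.sql_type != col.sql_type:
            problems.append(f"{col.name}: type {col.sql_type} vs {target.sql_type}")
        elif target.length < col.length:
            # A shorter downstream field is what causes write failures.
            problems.append(f"{col.name}: length {col.length} > {target.length}")
    return problems
```

Running such a check before each schema change would also cover part of the "Schema changes" item, since it flags a column that was widened on only one side.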
3.2 HSF Dependencies
Service timeout – keep HSF timeouts short (e.g., 3 s) to avoid thread‑pool exhaustion.
Retry strategy – a short timeout combined with retries dramatically reduces the failure rate.
Service caching – place a front‑end cache before stable, long‑latency interfaces; keep a back‑end cache as a fallback.
Service degradation – switch to cache or alternative sources when primary service is unavailable.
Service isolation – core services must not depend on non‑core services and vice‑versa.
Traffic estimation & stress testing – evaluate expected load before releasing new features.
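The timeout, retry, and degradation items above can be combined in one caller‑side wrapper. This is a generic sketch rather than the HSF client API: `call` stands for any remote call that accepts a timeout and raises `TimeoutError` or `ConnectionError`, and `fallback` stands for a cached or alternative source. The arithmetic helper assumes attempts fail independently, which real outages often violate.

```python
def combined_failure_rate(per_attempt_failure, attempts):
    """Chance that every attempt fails, assuming independent failures."""
    return per_attempt_failure ** attempts


def call_with_retry(call, timeout_s=3.0, max_attempts=2, fallback=None):
    """Try a short-timeout call a few times, then degrade to the fallback."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return call(timeout_s)
        except (TimeoutError, ConnectionError) as err:
            last_error = err  # remember the failure and try again
    if fallback is not None:
        return fallback()  # degraded path, e.g. a back-end cache
    raise last_error
```

Under the independence assumption, a 1% per‑attempt failure rate with two attempts gives `combined_failure_rate(0.01, 2)`, roughly 0.01%. Retries are only safe when the called interface is idempotent.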
3.3 HSF Service Provision
Controlled timeout – set realistic timeouts based on load testing.
Rate limiting – use Sentinel for QPS or thread‑level limits.
Idempotency – make provided interfaces idempotent so that callers can retry safely.
Service cache – front‑end cache for stable responses, back‑end cache for degradation.
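To make the rate‑limiting and idempotency items concrete, here is a small in‑process sketch. It does not use Sentinel's API; `FixedWindowLimiter` and `IdempotentHandler` are hypothetical names, the limiter is a simple fixed one‑second window, and neither class is thread‑safe as written.

```python
import time


class FixedWindowLimiter:
    """Allow at most max_qps calls within each one-second window."""

    def __init__(self, max_qps, clock=time.monotonic):
        self.max_qps = max_qps
        self.clock = clock
        self.window = None
        self.count = 0

    def try_acquire(self):
        current = int(self.clock())
        if current != self.window:
            # A new second has started: reset the counter.
            self.window = current
            self.count = 0
        if self.count >= self.max_qps:
            return False
        self.count += 1
        return True


class IdempotentHandler:
    """Run the handler once per request id and replay the stored result."""

    def __init__(self, handler):
        self.handler = handler
        self.results = {}

    def handle(self, request_id, payload):
        if request_id not in self.results:
            self.results[request_id] = self.handler(payload)
        return self.results[request_id]
```

A production service would keep the request‑id record in shared storage with an expiry rather than in a local dictionary; the dictionary here only illustrates the contract that a retried request must not repeat its side effects.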
3.4 Tair Dependencies
Product suitability – MDB for high QPS cache (no persistence), LDB for persistent storage, RDB for moderate QPS.
Capacity & QPS – ensure cache clusters can handle expected traffic.
Key/value limits – avoid large keys (>1 KB) or values (>10 KB) to prevent data loss.
Cache expiration – set appropriate TTLs so that many keys do not expire at once and trigger a cache stampede.
Cluster isolation – separate independent clusters to prevent cross‑region failures.
Distributed lock – handle lock timeouts with retries.
Data consistency – keep cache objects small and immutable.
Cache breakdown – protect DB with hot‑key TTL extension, in‑memory fallback, and lock‑based protection.
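The lock‑based protection and TTL items above can be sketched with an in‑process stand‑in for the cache. This is not the Tair client: `ProtectedCache` and its parameters are hypothetical, and the random TTL jitter is an assumed way of keeping keys from expiring together.

```python
import random
import threading
import time


class ProtectedCache:
    """Cache in which only one thread per key reloads an expired entry."""

    def __init__(self, loader, ttl_s=60.0, jitter_s=10.0, clock=time.monotonic):
        self.loader = loader
        self.ttl_s = ttl_s
        self.jitter_s = jitter_s
        self.clock = clock
        self.entries = {}    # key -> (value, expires_at)
        self.key_locks = {}  # key -> lock guarding reloads of that key
        self.guard = threading.Lock()

    def _lock_for(self, key):
        with self.guard:
            return self.key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have reloaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self.loader(key)  # the only call that reaches the DB
            expires = self.clock() + self.ttl_s + random.uniform(0, self.jitter_s)
            self.entries[key] = (value, expires)
            return value
```

With a distributed cache the per‑key lock would itself be a distributed lock with a timeout, which is where the "Distributed lock" item about retrying on lock timeouts applies.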
3.5 MetaQ Dependencies
Consumer registration – ensure online subscription to avoid missed messages.
Message size limit – 128 KB per message.
Send failures – monitor and retry on broker or network issues.
Consume retries – cap the number of retries (max 16) and space them with back‑off intervals.
QPS/TPS limits – keep below 3 000 per topic.
Backlog monitoring – set alert thresholds (e.g., 1 000 messages).
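The size limit and retry cap above can be expressed as two small guards. The sketch below takes 128 KB to mean 128 × 1024 bytes and uses a capped exponential back‑off; both are assumptions, and the actual MetaQ retry intervals are not given in the text. The function names are illustrative.

```python
MAX_MESSAGE_BYTES = 128 * 1024  # assumed reading of "128 KB"
MAX_CONSUME_RETRIES = 16


def fits_size_limit(body: bytes) -> bool:
    """Check a serialized message against the per-message size limit."""
    return len(body) <= MAX_MESSAGE_BYTES


def backoff_delay_s(attempt, base_s=1.0, cap_s=600.0):
    """Delay before retry `attempt` (1-based); None once retries are used up."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > MAX_CONSUME_RETRIES:
        return None  # give up; the message should go to manual handling
    return min(cap_s, base_s * 2 ** (attempt - 1))
```

Checking `fits_size_limit` on the producer side turns an oversized payload into an explicit error at the sender instead of a failed send that has to be diagnosed later.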
3.6 Jingwei Dependencies
JSON field corruption – avoid storing JSON in synchronized DB fields, since the content can be malformed during transfer.
Latency – monitor and alert on Jingwei lag; increase write concurrency if needed.
Task pause – auto‑restart detection or manual intervention.
3.7 DTS Dependencies
Task granularity – discretize grid/parallel tasks to avoid downstream hotspots.
Degradation – fallback to local scheduling when DTS is unavailable.
Monitoring – track DTS timeouts, failures, and call volume.
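One way to read the task‑granularity item is to split a large job into small chunks and stagger their start times, so downstream services see a steady trickle instead of one burst. The sketch below illustrates that reading; `plan_subtasks` is a hypothetical helper and does not reflect the DTS API.

```python
def plan_subtasks(item_ids, chunk_size, window_s):
    """Split items into chunks and spread chunk start offsets over a window."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = [item_ids[i:i + chunk_size]
              for i in range(0, len(item_ids), chunk_size)]
    # Evenly spaced offsets: chunk k starts k * step seconds into the window.
    step = window_s / len(chunks) if chunks else 0
    return [(round(i * step, 3), chunk) for i, chunk in enumerate(chunks)]
```

For example, ten items with a chunk size of four over a 60‑second window yield three sub‑tasks starting at 0, 20, and 40 seconds.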
3.8 Feature Switches
Push verification – confirm switch activation across machines.
Batch publishing – handle ordering dependencies between multiple switches.
Initialization – ensure switches are initialized before dependent services start.
Concurrency – use shadow instances for safe multi‑threaded updates.
ChangeFree integration – set up approval workflow for emergency switches.
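Two of the switch items above, push verification and safe concurrent updates, can be sketched as follows. The "shadow instance" is interpreted here as a copy‑on‑write snapshot: updates build a new map and swap it in, so readers never observe a half‑applied batch. `SwitchBoard` and `hosts_missing_switch` are hypothetical names, not the Diamond or ChangeFree API.

```python
import threading
from types import MappingProxyType


class SwitchBoard:
    """Feature switches with atomic, copy-on-write batch updates."""

    def __init__(self, initial):
        self._snapshot = MappingProxyType(dict(initial))
        self._write_lock = threading.Lock()

    def get(self, name, default=False):
        # Readers take no lock; they always see one complete snapshot.
        return self._snapshot.get(name, default)

    def update(self, changes):
        with self._write_lock:
            shadow = dict(self._snapshot)  # build the new state off to the side
            shadow.update(changes)
            self._snapshot = MappingProxyType(shadow)  # single reference swap


def hosts_missing_switch(reported, name, expected):
    """Return hosts whose reported switch value differs from the pushed one."""
    return sorted(host for host, switches in reported.items()
                  if switches.get(name) != expected)
```

Applying interdependent switches in one `update` call also addresses the batch‑publishing item, because the whole batch becomes visible at once.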
3.9 Monitoring
Traffic monitoring – QPS of services, dependencies, and scheduled tasks.
Error count / success rate – minute‑level aggregation to smooth spikes.
Error cause analysis – structured logging for rapid diagnosis.
Response time (RT) – monitor service latency with minute averages.
Dashboard – aggregate all health metrics for quick overview.
System exception monitoring – detect null‑pointer errors, data overflows, and type‑conversion errors.
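The minute‑level aggregation described above can be illustrated with a short sketch. The event format (epoch seconds plus a success flag) and the function names are assumptions made for this example.

```python
from collections import defaultdict


def success_rate_by_minute(events):
    """Aggregate (epoch_seconds, ok) events into {minute_start: success_rate}."""
    totals = defaultdict(lambda: [0, 0])  # minute -> [total, successes]
    for ts, ok in events:
        minute = int(ts) // 60 * 60
        totals[minute][0] += 1
        totals[minute][1] += 1 if ok else 0
    return {minute: good / total
            for minute, (total, good) in sorted(totals.items())}


def minutes_below(rates, threshold):
    """Return the minutes whose success rate falls below the alert threshold."""
    return [minute for minute, rate in rates.items() if rate < threshold]
```

Aggregating per minute means a single failed call does not page anyone, while a minute in which a third of the calls fail still stands out clearly.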
3.10 Gray Release
State dependency – avoid simultaneous state reads from different sources during rollout.
Traffic estimation – pilot 10 % traffic, then extrapolate for full rollout.
Rollback criteria – immediate rollback if business or system expectations are not met.
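The 10% pilot and its extrapolation can be sketched as below. Hash‑based bucketing is an assumed way of selecting the pilot population (the text does not say how traffic is split), and the linear extrapolation assumes the pilot is representative of full traffic. Both function names are illustrative.

```python
import zlib


def in_gray(entity_id: str, percent: int) -> bool:
    """Stable bucketing: the same id always lands in the same bucket (0-99)."""
    return zlib.crc32(entity_id.encode("utf-8")) % 100 < percent


def extrapolate_full_qps(pilot_peak_qps, pilot_fraction):
    """Linearly scale the pilot's peak QPS up to 100% of traffic."""
    if not 0 < pilot_fraction <= 1:
        raise ValueError("pilot_fraction must be in (0, 1]")
    return pilot_peak_qps / pilot_fraction
```

Because buckets are stable, raising `percent` from 10 to 50 only adds entities to the gray group and never moves one back to the old path, which helps with the state‑dependency concern. A pilot peak of 120 QPS at a fraction of 0.10 extrapolates to about 1,200 QPS at full rollout.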
3.11 Testing
Pre‑release dry‑run – generate traffic to validate functionality.
Pre‑release traffic injection – pull real traffic to stress test.
Online comparison – run new and old dispatch in parallel and compare results.
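The online comparison step can be sketched as a shadow run: the old dispatch result remains authoritative, while the new result is computed only for comparison. `compare_dispatch` and the callable signatures are hypothetical; real dispatch results would likely need a domain‑specific equality check rather than `!=`.

```python
def compare_dispatch(orders, old_dispatch, new_dispatch):
    """Run both dispatchers on the same orders and collect disagreements."""
    diffs = []
    for order in orders:
        old_result = old_dispatch(order)  # authoritative result
        try:
            new_result = new_dispatch(order)
        except Exception as err:
            # A crash in the new path is recorded, never propagated.
            diffs.append((order, old_result, f"error: {err!r}"))
            continue
        if new_result != old_result:
            diffs.append((order, old_result, new_result))
    return diffs
```

An empty diff list over a representative batch is one possible input to the rollout decision; a non‑empty list gives concrete orders to investigate before any traffic is switched.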
3.12 Emergency Response
Pre‑plan – define switches, rate‑limit tools, degradation paths, and fallback business processes.
Drills – conduct realistic fault‑injection exercises.
Incident handling – prioritize rapid containment, rollback changes, and coordinated communication.
Conclusion
Smart dispatch continues to evolve, with ongoing projects in strategy operation, intelligent diagnosis, and simulation; the team will keep applying its battle‑tested stability practices to meet new challenges.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
