How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability

This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.

Alibaba Cloud Developer

Feb 17, 2020

Stability Over Everything

Hema requires extremely high stability for its offline stores and delivery operations; payment failures or order‑pickup issues can trigger massive customer complaints and social backlash.

Smart Dispatch Chain Analysis

Understanding critical service links and external dependencies is essential. A comprehensive chain diagram was created to visualize the O2O smart dispatch system, covering scheduling, pressure testing, data sources, rider platform, algorithm strategies, distributed computing, path planning, and the storage and middleware components it relies on (DTS, Diamond, Tair, job DB, downgrade DB).

Stability Factor Analysis and Practices

3.1 Database Dependencies

Slow SQL – early failures were often caused by slow queries; ongoing governance is required.

High logical read rows – queries that logically read more than 100k rows tend to become slow SQL as data volume grows.

Resource metrics – monitor CPU, load, QPS spikes; e.g., a grid task caused CPU spikes up to 60% during Double‑11.

Database isolation – separate core, secondary, and archive databases to prevent non‑core workloads from affecting critical services.

Database downgrade – use read‑only replicas via Jingwei to protect core jobs.

Schema mismatches – inconsistent field sizes/types across upstream/downstream can cause write failures.

Capacity & indexes – keep tables <5M rows, remove unnecessary indexes.

Schema changes – assess impact of index or column modifications before deployment.

Encoding – store JSON data carefully to avoid character‑set issues.
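The schema-mismatch item above can be checked mechanically before a release. The following is a minimal sketch, assuming column metadata has already been exported into simple records; the `Column` type and `find_mismatches` helper are hypothetical names for illustration, not part of any Hema tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column description exported from a table definition."""
    name: str
    sql_type: str  # e.g. "varchar", "bigint"
    length: int    # maximum length; 0 when not applicable


def find_mismatches(upstream, downstream):
    """Report shared columns whose downstream definition cannot hold upstream data."""
    down = {c.name: c for c in downstream}
    problems = []
    for up in upstream:
        target = down.get(up.name)
        if target is None:
            continue  # column is not written downstream; nothing to compare
        if up.sql_type != target.sql_type:
            problems.append(f"{up.name}: type {up.sql_type} -> {target.sql_type}")
        elif up.length > target.length:
            problems.append(f"{up.name}: length {up.length} > {target.length}")
    return problems
```

A check like this only catches size and type differences; character-set differences would need a separate comparison.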

3.2 HSF Dependencies

Service timeout – keep HSF timeouts short (e.g., 3 s) to avoid thread‑pool exhaustion.

Retry strategy – short timeout + retries dramatically reduces failure rates.

Service caching – put a front cache in front of interfaces whose results are stable but slow to return; keep a back cache to fall back on when the call fails.

Service degradation – switch to cache or alternative sources when primary service is unavailable.

Service isolation – core services must not depend on non‑core services and vice‑versa.

Traffic estimation & stress testing – evaluate expected load before releasing new features.
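The timeout, retry, and degradation items above can be combined in one small wrapper. This is a rough sketch rather than HSF's own mechanism: `call_with_retry` is a hypothetical helper, the remote call is assumed to accept a timeout and raise `TimeoutError`, and the probability estimate assumes attempts fail independently, which real outages often violate.

```python
def call_with_retry(call, timeout_s=3.0, attempts=3, fallback=None):
    """Try a remote call a few times with a short timeout, then degrade."""
    last_error = None
    for _ in range(attempts):
        try:
            return call(timeout_s)  # assumed to raise TimeoutError when too slow
        except TimeoutError as exc:
            last_error = exc
    if fallback is not None:
        return fallback()  # e.g. read the last good value from a cache
    raise last_error


def all_attempts_fail_probability(p_single, attempts):
    """Chance that every attempt fails, assuming independent failures."""
    return p_single ** attempts
```

Under the independence assumption, a call that times out 10% of the time fails three attempts in a row about 0.1% of the time, which is the arithmetic behind "short timeout + retries". Retrying is only safe when the called service is idempotent.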

3.3 HSF Service Provision

Controlled timeout – set realistic timeouts based on load testing.

Rate limiting – use Sentinel for QPS or thread‑level limits.

Idempotency – make the provided services idempotent so that callers can retry safely.

Service cache – front‑end cache for stable responses, back‑end cache for degradation.
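One common way to get the idempotency described above is to key each request with a caller-supplied ID and replay the stored result on a retry. The sketch below assumes such a request ID exists; `IdempotentHandler` is a hypothetical name, and a production version would keep results in shared storage rather than process memory.

```python
import threading


class IdempotentHandler:
    """Run the wrapped handler at most once per request ID."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # request_id -> stored result
        self._lock = threading.Lock()

    def handle(self, request_id, payload):
        # Holding the lock while the handler runs serializes requests;
        # that keeps the sketch simple at the cost of throughput.
        with self._lock:
            if request_id in self._results:
                return self._results[request_id]  # retry: replay the result
            result = self._handler(payload)
            self._results[request_id] = result
            return result
```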

3.4 Tair Dependencies

Product suitability – MDB for high QPS cache (no persistence), LDB for persistent storage, RDB for moderate QPS.

Capacity & QPS – ensure cache clusters can handle expected traffic.

Key/value limits – avoid large keys (>1 KB) or values (>10 KB) to prevent data loss.

Cache expiration – set appropriate TTLs to avoid cache‑stampede.

Cluster isolation – separate independent clusters to prevent cross‑region failures.

Distributed lock – handle lock timeouts with retries.

Data consistency – keep cache objects small and immutable.

Cache breakdown – protect DB with hot‑key TTL extension, in‑memory fallback, and lock‑based protection.
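The expiration and cache-breakdown items above can be illustrated with an in-process cache. This is a sketch under assumptions, not Tair's API: `ProtectedCache` is a hypothetical class, jittered TTLs stand in for "appropriate TTLs", and a per-key lock stands in for the lock-based protection so that only one caller reloads an expired hot key.

```python
import random
import threading
import time


class ProtectedCache:
    """In-memory cache with jittered TTLs and one loader call per expired key."""

    def __init__(self, loader, ttl_s=60.0, jitter_s=10.0, clock=time.monotonic):
        self._loader = loader
        self._ttl_s = ttl_s
        self._jitter_s = jitter_s
        self._clock = clock
        self._data = {}   # key -> (value, expires_at)
        self._locks = {}  # key -> lock guarding reloads of that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            entry = self._fresh(key)  # another thread may have reloaded already
            if entry is not None:
                return entry[0]
            value = self._loader(key)  # only one thread per key reaches the DB
            jitter = random.uniform(0, self._jitter_s)  # spread expirations out
            self._data[key] = (value, self._clock() + self._ttl_s + jitter)
            return value
```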

3.5 MetaQ Dependencies

Consumer registration – confirm that the consumer's subscription is registered online; otherwise messages are missed.

Message size limit – 128 KB per message.

Send failures – monitor and retry on broker or network issues.

Consume retries – cap retries (at most 16) and apply back‑off intervals between attempts.

QPS/TPS limits – keep below 3 000 per topic.

Backlog monitoring – set alert thresholds (e.g., 1 000 messages).
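The consume-retry limit above can be sketched as a capped back-off loop. Only the cap of 16 retries comes from the text; the exponential schedule, the `consume` helper, and the dead-letter callback are illustrative assumptions and do not describe MetaQ's actual retry intervals.

```python
MAX_RETRIES = 16  # retry cap named in the text


def backoff_delays(base_s=1.0, factor=2.0, cap_s=300.0, max_retries=MAX_RETRIES):
    """Hypothetical schedule: exponential growth, capped at cap_s seconds."""
    return [min(cap_s, base_s * factor ** i) for i in range(max_retries)]


def consume(message, handler, sleep, dead_letter):
    """Try once, retry per the schedule, then hand the message to dead_letter.

    Returns the number of retries used, or None if every attempt failed.
    """
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        try:
            handler(message)
            return attempt
        except Exception:
            if attempt == len(delays):
                break  # retries exhausted
            sleep(delays[attempt])
    dead_letter(message)
    return None
```

Passing `sleep` in as a parameter keeps the loop testable; in a real consumer it would be `time.sleep` or the broker's own redelivery delay.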

3.6 Jingwei Dependencies

JSON field corruption – JSON stored in DB fields can be malformed during transfer, so avoid storing JSON in fields that are synchronized.

Latency – monitor and alert on Jingwei lag; increase write concurrency if needed.

Task pause – detect paused tasks and restart them automatically, or intervene manually.

3.7 DTS Dependencies

Task granularity – discretize grid/parallel tasks to avoid downstream hotspots.

Degradation – fallback to local scheduling when DTS is unavailable.

Monitoring – track DTS timeouts, failures, and call volume.
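One way to read "discretize" in the task-granularity item above is to give each sub-task a stable start offset inside the scheduling window, so that grid tasks do not all hit downstream systems in the same second. The sketch below assumes that reading; `start_offset_s` and `plan` are hypothetical helpers, and CRC32 is used only because it is stable across processes.

```python
import zlib
from collections import defaultdict


def start_offset_s(task_id: str, window_s: int = 60) -> int:
    """Map a task ID to a stable start offset within the window."""
    return zlib.crc32(task_id.encode("utf-8")) % window_s


def plan(task_ids, window_s: int = 60):
    """Group task IDs by the second at which they should start."""
    slots = defaultdict(list)
    for task_id in task_ids:
        slots[start_offset_s(task_id, window_s)].append(task_id)
    return dict(slots)
```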

3.8 Feature Switches

Push verification – confirm switch activation across machines.

Batch publishing – handle ordering dependencies between multiple switches.

Initialization – ensure switches are initialized before dependent services start.

Concurrency – use shadow instances for safe multi‑threaded updates.

ChangeFree integration – set up approval workflow for emergency switches.
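The batch-publishing and concurrency items above can be handled together by building a shadow copy of the switch set and swapping it in as a whole. This sketch assumes that is what "shadow instances" refers to; `SwitchBoard` is a hypothetical class and is not tied to Diamond or any switch framework.

```python
import threading
from types import MappingProxyType


class SwitchBoard:
    """Readers see an immutable snapshot; writers replace it in one step."""

    def __init__(self, initial):
        self._active = MappingProxyType(dict(initial))
        self._write_lock = threading.Lock()

    def is_on(self, name, default=False):
        return self._active.get(name, default)

    def snapshot(self):
        """Return the current read-only view, e.g. for one whole request."""
        return self._active

    def publish(self, changes):
        """Apply a batch of switch changes so readers never see half of it."""
        with self._write_lock:
            shadow = dict(self._active)  # shadow copy of the current state
            shadow.update(changes)
            self._active = MappingProxyType(shadow)  # single reference swap
```

Code that reads several related switches should take one `snapshot()` and read from it, so that a publish in the middle of a request cannot mix old and new values.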

3.9 Monitoring

Traffic monitoring – QPS of services, dependencies, and scheduled tasks.

Error count / success rate – minute‑level aggregation to smooth spikes.

Error cause analysis – structured logging for rapid diagnosis.

Response time (RT) – monitor service latency with minute averages.

Dashboard – aggregate all health metrics for quick overview.

System exception monitoring – detect null‑pointer errors, data overflows, and type‑conversion errors.
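The minute-level aggregation mentioned above amounts to bucketing call results by minute before computing a rate. A minimal sketch, assuming each call is logged as a timestamp plus a success flag; the function names and the 99% threshold are illustrative only.

```python
from collections import defaultdict


def success_rate_by_minute(events):
    """events: iterable of (epoch_seconds, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # minute -> [successes, calls]
    for ts, ok in events:
        bucket = totals[int(ts // 60)]
        bucket[0] += 1 if ok else 0
        bucket[1] += 1
    return {minute: s / n for minute, (s, n) in sorted(totals.items())}


def minutes_below(rates, threshold=0.99):
    """Minutes whose success rate falls under the alert threshold."""
    return [minute for minute, rate in rates.items() if rate < threshold]
```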

3.10 Gray Release

State dependency – avoid simultaneous state reads from different sources during rollout.

Traffic estimation – pilot 10 % traffic, then extrapolate for full rollout.

Rollback criteria – immediate rollback if business or system expectations are not met.
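The traffic-estimation step above can be sketched as stable bucketing plus a linear extrapolation. Both helpers are hypothetical, bucketing by store ID is an assumption about the rollout unit, and linear scaling is itself an assumption, since load does not always grow in proportion to traffic share.

```python
import zlib


def in_gray(store_id: str, percent: int = 10) -> bool:
    """Place a stable subset of stores, about `percent` of them, in the pilot."""
    return zlib.crc32(store_id.encode("utf-8")) % 100 < percent


def extrapolate_qps(pilot_qps: float, pilot_percent: float = 10.0) -> float:
    """Estimate full-rollout QPS from pilot QPS, assuming linear scaling."""
    return pilot_qps * 100.0 / pilot_percent
```

For example, 120 QPS observed on a 10% pilot extrapolates to 1,200 QPS at full rollout under this assumption.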

3.11 Testing

Pre‑release dry‑run – generate traffic to validate functionality.

Pre‑release traffic injection – pull real traffic to stress test.

Online comparison – run new and old dispatch in parallel and compare results.
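The online comparison above can be run as a shadow pass: the old dispatch result is the one that counts, and the new result is only recorded when it differs. This is a sketch with hypothetical names; `old_dispatch` and `new_dispatch` stand for any two functions that map an order to a dispatch decision.

```python
def compare_dispatch(orders, old_dispatch, new_dispatch):
    """Return (order, old_result, new_result_or_error) for every disagreement."""
    diffs = []
    for order in orders:
        old = old_dispatch(order)
        try:
            new = new_dispatch(order)
        except Exception as exc:
            # A failure in the new path is recorded, never propagated.
            diffs.append((order, old, repr(exc)))
            continue
        if new != old:
            diffs.append((order, old, new))
    return diffs
```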

3.12 Emergency Response

Pre‑plan – define switches, rate‑limit tools, degradation paths, and fallback business processes.

Drills – conduct realistic fault‑injection exercises.

Incident handling – prioritize rapid containment, rollback changes, and coordinated communication.

Conclusion

Smart dispatch continues to evolve, with ongoing projects in strategy operation, intelligent diagnosis, and simulation; the team will keep applying these battle‑tested stability practices to meet new challenges.
