How We Built a Robust Search Middle Platform: From Pain Points to Full‑Scale Quality Assurance
This article examines the main challenges facing a search middle platform: inaccurate impact assessment, unstable underlying clusters, and missing process standards. It then details a comprehensive quality-assurance strategy covering baseline test suites, stability practices, performance testing, emergency drills, and systematic monitoring to keep search services reliable.
Background
The search middle platform at Youzan evolved from ES middleware that lacked a clear "middle-platform" concept, so development effort was duplicated whenever a new business unit needed search capabilities. A unified search product was created to cut integration time and complexity, but guaranteeing its quality from 0 to 1 presented unique challenges.
1. Search Quality Pain Points
Impact Assessment Inaccuracy
Business teams cannot reliably evaluate whether ES upgrades or underlying infrastructure changes affect all upstream scenarios, making comprehensive regression testing essential.
Underlying Cluster Instability
ES cluster jitter, manual dual‑datacenter switchovers, and dependencies on HBase, Flink, DTS, NSQ, etc., cause frequent read/write failures; lack of monitoring and automated failover exacerbates the issue.
Missing Process Standards
Initially, the platform lacked formal processes; each business line relied on ad‑hoc agreements between developers and testers, leading to inconsistent quality and complex iteration cycles.
2. Multi‑Faceted Assurance Measures
2.1 Baseline Test Suite Completion
Traffic replay covering >80% of read scenarios.
CI‑integrated test cases to cover the remaining low‑traffic scenarios (roughly 20%) that replay misses.
Dedicated search test project bit-search-platform for small‑traffic and exception cases.
Scenario execution sets for core modules (e.g., C‑side product search, B‑side order search).
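The traffic-replay idea above can be sketched as a diff between recorded production results and the results a candidate cluster returns for the same queries. This is a minimal illustration, not Youzan's actual tooling; the function names and the record format are assumptions.

```python
from typing import Callable, Dict, List


def replay_and_diff(
    recorded: List[Dict],                          # [{"query": ..., "hits": [...]}]
    candidate_search: Callable[[object], List],    # runs one query on the candidate cluster
) -> List[Dict]:
    """Replay recorded production queries and collect hit-list mismatches.

    Any mismatch is a potential regression introduced by an ES upgrade
    or infrastructure change, flagged for human review.
    """
    regressions = []
    for case in recorded:
        new_hits = candidate_search(case["query"])
        if new_hits != case["hits"]:
            regressions.append({
                "query": case["query"],
                "expected": case["hits"],
                "actual": new_hits,
            })
    return regressions
```

In practice `candidate_search` would wrap a real search call against the new cluster; here it is a pluggable callable so the diff logic can be tested in isolation.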
2.2 Business and Cluster Stability Practices
Implemented Dubbo group isolation to separate high‑traffic product searches from order searches, preventing one workload from saturating shared proxy thread pools.
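The isolation principle behind Dubbo group isolation can be sketched language-agnostically: give each workload its own bounded thread pool so that saturating one pool cannot starve the other. The sketch below simulates this in Python; the group names and pool sizes are illustrative assumptions, not the platform's real configuration.

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated, bounded pool per workload group. A flood of product-search
# requests can exhaust only its own pool, mirroring how Dubbo group isolation
# keeps high-traffic product searches from saturating order-search capacity.
POOLS = {
    "product-search": ThreadPoolExecutor(max_workers=8, thread_name_prefix="product"),
    "order-search": ThreadPoolExecutor(max_workers=4, thread_name_prefix="order"),
}


def submit_search(group: str, fn, *args):
    """Route a search call to its group's dedicated thread pool."""
    return POOLS[group].submit(fn, *args)
```

The key design point is that the pools share nothing: queueing and rejection happen per group, so a slow downstream dependency degrades only the workload that depends on it.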
2.3 Performance‑Focused Testing
Baseline SLA verification for core index search workloads.
Stress tests for diverse ES usage patterns (nested, multi_search, aggregations).
Cluster‑level benchmark using official ES stress tools across single‑ and multi‑node setups.
Identified 26 performance issues, primarily around strong/weak dependencies, leading to throttling and degradation strategies.
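SLA verification of the kind listed above usually reduces to a percentile check over measured latencies. The following is a minimal sketch of such a gate; the P99 budget of 200 ms is an assumed placeholder, not a figure from the article.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, p99_budget_ms=200.0):
    """Pass/fail gate for a stress-test run against a P99 latency budget."""
    return percentile(latencies_ms, 99) <= p99_budget_ms
```

A gate like this runs after each stress round (nested queries, multi_search, aggregations) so a regression in any usage pattern fails the run rather than hiding in an average.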
2.4 Drills and Emergency Plans
Regularly conducted ~18 drills covering dual‑datacenter switchovers, index migrations, and routine failure simulations. Automated scripts reduced switch‑over time from >1 hour to under 10 minutes.
Key emergency actions include:
Proxy thread‑pool saturation – diagnose source, adjust refresh intervals, apply Tesla rate‑limiting.
High CPU load – clear caches, increase refresh_interval, migrate shards, or switch to backup datacenter.
Node failure – failover specific indices, rebuild from backups.
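One of the emergency actions above, raising refresh_interval to shed CPU load, maps to a dynamic index-settings update in Elasticsearch. The sketch below only builds the request payload (no network call); the index name and interval are illustrative.

```python
import json


def refresh_interval_payload(index: str, interval: str = "30s"):
    """Build the ES dynamic-settings request that raises refresh_interval,
    trading index freshness for lower refresh CPU during an incident.

    Returns (method, path, body) for a PUT /<index>/_settings call.
    """
    path = f"/{index}/_settings"
    body = {"index": {"refresh_interval": interval}}
    return "PUT", path, json.dumps(body)
```

Because refresh_interval is a dynamic setting, the change takes effect without reopening the index, which is what makes it usable as a first-response lever.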
2.5 Monitoring and Governance
Deployed a three‑stage monitoring framework:
Pre‑monitoring: Slow‑query stats, top queries, error codes, performance dashboards (11 Grafana panels) to detect misuse early.
Mid‑monitoring: Application‑level alerts (RT, thread‑pool, KV, NSQ), cluster metrics (CPU, disk, bulk, search load), middleware health.
Post‑monitoring: Periodic BCP reconciliation between HBase and ES, daily index request volume reports for optimization.
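The HBase-to-ES reconciliation in post-monitoring boils down to diffing two id-to-version maps and classifying the discrepancies. This is a simplified sketch under that assumption; the category names are mine, not the platform's.

```python
def reconcile(hbase_docs: dict, es_docs: dict) -> dict:
    """Compare the source of truth (HBase) against the ES index.

    Classifies discrepancies so each can be repaired differently:
    missing  -> re-index from HBase
    stale    -> overwrite the outdated ES copy
    orphaned -> delete from ES (no longer exists in HBase)
    """
    missing = [k for k in hbase_docs if k not in es_docs]
    stale = [k for k in hbase_docs
             if k in es_docs and es_docs[k] != hbase_docs[k]]
    orphaned = [k for k in es_docs if k not in hbase_docs]
    return {"missing": missing, "stale": stale, "orphaned": orphaned}
```

Run periodically, a diff like this catches silent write failures (dropped NSQ messages, failed bulk requests) that per-request monitoring misses.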
These measures cut online bug counts significantly in H1 2021.
3. Process Standardization
Established comprehensive guidelines for unit‑test coverage (L1 ≥ 85%, L2 ≥ 75%), cluster switch‑over checklists, and index migration procedures, turning ad‑hoc practices into documented standards.
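A coverage standard like the one above is only effective if CI enforces it. The sketch below shows a minimal gate using the thresholds stated in this section; the function and dictionary names are hypothetical.

```python
# Documented unit-test coverage floors per module tier (from the guidelines).
THRESHOLDS = {"L1": 0.85, "L2": 0.75}


def coverage_gate(level: str, line_coverage: float) -> bool:
    """CI gate: fail the build when a module's unit-test coverage
    drops below its tier's documented threshold."""
    return line_coverage >= THRESHOLDS[level]
```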
Future Outlook
Remaining challenges include flexible ranking, category prediction, and synonym handling. Advancing toward intelligent search will require richer testing frameworks and diversified assurance techniques.
Youzan Coder
Official Youzan tech channel, delivering technical insights and updates from the Youzan tech team.