How We Built a Robust Search Middle Platform: From Pain Points to Full‑Scale Quality Assurance
This article examines the main challenges facing a search middle platform: inaccurate impact assessment, unstable underlying clusters, and missing process standards. It then details a comprehensive quality-assurance strategy covering baseline test suites, stability practices, performance testing, emergency drills, and systematic monitoring to keep search services reliable.
Background
The search middle platform at Youzan evolved from ES middleware that lacked a clear "middle-platform" concept, so development effort was duplicated whenever a new business unit needed search capabilities. A unified search product was created to cut integration time and complexity, but guaranteeing its quality from 0 to 1 presented unique challenges.
1. Search Quality Pain Points
Impact Assessment Inaccuracy
Business teams cannot reliably evaluate whether ES upgrades or underlying infrastructure changes affect all upstream scenarios, making comprehensive regression testing essential.
Underlying Cluster Instability
ES cluster jitter, manual dual‑datacenter switchovers, and dependencies on HBase, Flink, DTS, NSQ, etc., cause frequent read/write failures; lack of monitoring and automated failover exacerbates the issue.
Missing Process Standards
Initially, the platform lacked formal processes; each business line relied on ad‑hoc agreements between developers and testers, leading to inconsistent quality and complex iteration cycles.
2. Multi‑Faceted Assurance Measures
2.1 Baseline Test Suite Completion
Traffic replay covering >80% of read scenarios.
CI‑integrated test cases to cover the remaining low‑traffic scenarios (roughly 20%) that replay misses.
Dedicated search test project bit-search-platform for small‑traffic and exception cases.
Scenario execution sets for core modules (e.g., C‑side product search, B‑side order search).
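The traffic-replay idea above can be sketched as a diff between recorded production results and the results a candidate cluster returns for the same queries. This is a minimal illustration, not Youzan's actual tooling; the function names and the record format are assumptions.

```python
from typing import Callable, Dict, List


def replay_and_diff(
    recorded: List[Dict],                          # [{"query": ..., "hits": [...]}]
    candidate_search: Callable[[object], List],    # runs one query on the candidate cluster
) -> List[Dict]:
    """Replay recorded production queries and collect hit-list mismatches.

    Any mismatch is a potential regression introduced by an ES upgrade
    or infrastructure change, flagged for human review.
    """
    regressions = []
    for case in recorded:
        new_hits = candidate_search(case["query"])
        if new_hits != case["hits"]:
            regressions.append({
                "query": case["query"],
                "expected": case["hits"],
                "actual": new_hits,
            })
    return regressions
```

In practice `candidate_search` would wrap a real search call against the new cluster; here it is a pluggable callable so the diff logic can be tested in isolation.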
2.2 Business and Cluster Stability Practices
Implemented Dubbo group isolation to separate high‑traffic product searches from order searches, preventing one workload from saturating shared proxy thread pools.
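The isolation principle behind Dubbo group isolation can be sketched language-agnostically: give each workload its own bounded thread pool so that saturating one pool cannot starve the other. The sketch below simulates this in Python; the group names and pool sizes are illustrative assumptions, not the platform's real configuration.

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated, bounded pool per workload group. A flood of product-search
# requests can exhaust only its own pool, mirroring how Dubbo group isolation
# keeps high-traffic product searches from saturating order-search capacity.
POOLS = {
    "product-search": ThreadPoolExecutor(max_workers=8, thread_name_prefix="product"),
    "order-search": ThreadPoolExecutor(max_workers=4, thread_name_prefix="order"),
}


def submit_search(group: str, fn, *args):
    """Route a search call to its group's dedicated thread pool."""
    return POOLS[group].submit(fn, *args)
```

The key design point is that the pools share nothing: queueing and rejection happen per group, so a slow downstream dependency degrades only the workload that depends on it.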
2.3 Performance‑Focused Testing
Baseline SLA verification for core index search workloads.
Stress tests for diverse ES usage patterns (nested, multi_search, aggregations).
Cluster‑level benchmark using official ES stress tools across single‑ and multi‑node setups.
Identified 26 performance issues, primarily around strong/weak dependencies, leading to throttling and degradation strategies.
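SLA verification of the kind listed above usually reduces to a percentile check over measured latencies. The following is a minimal sketch of such a gate; the P99 budget of 200 ms is an assumed placeholder, not a figure from the article.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, p99_budget_ms=200.0):
    """Pass/fail gate for a stress-test run against a P99 latency budget."""
    return percentile(latencies_ms, 99) <= p99_budget_ms
```

A gate like this runs after each stress round (nested queries, multi_search, aggregations) so a regression in any usage pattern fails the run rather than hiding in an average.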
2.4 Drills and Emergency Plans
Regularly conducted ~18 drills covering dual‑datacenter switchovers, index migrations, and routine failure simulations. Automated scripts reduced switch‑over time from >1 hour to under 10 minutes.
Key emergency actions include:
Proxy thread‑pool saturation – diagnose source, adjust refresh intervals, apply Tesla rate‑limiting.
High CPU load – clear caches, increase refresh_interval, migrate shards, or switch to backup datacenter.
Node failure – failover specific indices, rebuild from backups.
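One of the emergency actions above, raising refresh_interval to shed CPU load, maps to a dynamic index-settings update in Elasticsearch. The sketch below only builds the request payload (no network call); the index name and interval are illustrative.

```python
import json


def refresh_interval_payload(index: str, interval: str = "30s"):
    """Build the ES dynamic-settings request that raises refresh_interval,
    trading index freshness for lower refresh CPU during an incident.

    Returns (method, path, body) for a PUT /<index>/_settings call.
    """
    path = f"/{index}/_settings"
    body = {"index": {"refresh_interval": interval}}
    return "PUT", path, json.dumps(body)
```

Because refresh_interval is a dynamic setting, the change takes effect without reopening the index, which is what makes it usable as a first-response lever.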
2.5 Monitoring and Governance
Deployed a three‑stage monitoring framework:
Pre‑monitoring: Slow‑query stats, top queries, error codes, performance dashboards (11 Grafana panels) to detect misuse early.
Mid‑monitoring: Application‑level alerts (RT, thread‑pool, KV, NSQ), cluster metrics (CPU, disk, bulk, search load), middleware health.
Post‑monitoring: Periodic BCP reconciliation between HBase and ES, daily index request volume reports for optimization.
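The HBase-to-ES reconciliation in post-monitoring boils down to diffing two id-to-version maps and classifying the discrepancies. This is a simplified sketch under that assumption; the category names are mine, not the platform's.

```python
def reconcile(hbase_docs: dict, es_docs: dict) -> dict:
    """Compare the source of truth (HBase) against the ES index.

    Classifies discrepancies so each can be repaired differently:
    missing  -> re-index from HBase
    stale    -> overwrite the outdated ES copy
    orphaned -> delete from ES (no longer exists in HBase)
    """
    missing = [k for k in hbase_docs if k not in es_docs]
    stale = [k for k in hbase_docs
             if k in es_docs and es_docs[k] != hbase_docs[k]]
    orphaned = [k for k in es_docs if k not in hbase_docs]
    return {"missing": missing, "stale": stale, "orphaned": orphaned}
```

Run periodically, a diff like this catches silent write failures (dropped NSQ messages, failed bulk requests) that per-request monitoring misses.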
These measures cut online bug counts significantly in H1 2021.
3. Process Standardization
Established comprehensive guidelines for unit‑test coverage (L1 ≥ 85%, L2 ≥ 75%), cluster switch‑over checklists, and index migration procedures, turning ad‑hoc practices into documented standards.
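A coverage standard like the one above is only effective if CI enforces it. The sketch below shows a minimal gate using the thresholds stated in this section; the function and dictionary names are hypothetical.

```python
# Documented unit-test coverage floors per module tier (from the guidelines).
THRESHOLDS = {"L1": 0.85, "L2": 0.75}


def coverage_gate(level: str, line_coverage: float) -> bool:
    """CI gate: fail the build when a module's unit-test coverage
    drops below its tier's documented threshold."""
    return line_coverage >= THRESHOLDS[level]
```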
Future Outlook
Remaining challenges include flexible ranking, category prediction, and synonym handling. Advancing toward intelligent search will require richer testing frameworks and diversified assurance techniques.
Youzan Coder
Official Youzan tech channel, delivering technical insights and updates from the Youzan tech team.