
How 58.com Scaled Elasticsearch: Cluster Optimization, Automation, and Real‑World Practices

This article details 58.com’s journey with Elasticsearch, covering the challenges of disparate deployments, common problems like disk exhaustion and write slowdown, the governance and automation platform they built, development standards, service architecture, real‑world application cases, and future plans for version upgrades and intelligent diagnostics.

dbaplus Community

Apr 29, 2021

1. Cluster Optimization and Governance

Early on, each business unit ran its own Elasticsearch instances, which led to fragmented versions, mixed hardware, uncontrolled index growth, and no monitoring. When the database team took over governance, it found complex usage scenarios, inconsistent versions, clusters co‑deployed with data services, no accountable owner for index onboarding, and a cluster inventory tracked by hand in Excel. It responded with the following rules:

Restrict account permissions to limit index access per team.

Enforce index lifecycle management to retire stale data.

Plan shard counts carefully, keeping each shard under 100 GB.

Partition log indices by day (or hour for high‑volume logs).

Avoid sharing the same server nodes across multiple clusters.

Standardize hardware specifications.

2. Typical Issue I: Disk Exhaustion

Root causes include uncontrolled index onboarding, missing lifecycle policies, oversized shards, log indices that are not partitioned by time, server nodes shared across clusters, and inconsistent disk capacities.

Implement centralized account control to prevent arbitrary index loading.

Adopt index lifecycle policies to delete or shrink old data.

Limit shard size (recommended ≤100 GB) based on node count and disk space.

Split log indices by day, or by hour when daily indices become too large.

Isolate clusters on dedicated server partitions to prevent cross‑cluster disk pressure.
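The shard‑size rule above can be turned into a quick capacity check. The following is a minimal sketch, not the team's actual tooling: the function names are hypothetical, the 100 GB cap comes from the article, and the 0.85 disk ratio is an assumption that mirrors Elasticsearch's default low disk watermark of 85%.

```python
import math


def plan_primary_shards(index_gb: float, max_shard_gb: float = 100.0) -> int:
    """Smallest primary shard count that keeps each shard at or under the cap."""
    if index_gb <= 0 or max_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_gb / max_shard_gb))


def fits_on_cluster(index_gb: float, replicas: int, data_nodes: int,
                    disk_gb_per_node: float, max_disk_ratio: float = 0.85) -> bool:
    """Check that primaries plus replicas, spread evenly, stay under the disk ratio.

    max_disk_ratio is an assumed safety margin (0.85 matches the default
    low disk watermark); adjust it to the cluster's real settings.
    """
    if data_nodes <= 0:
        raise ValueError("data_nodes must be positive")
    total_gb = index_gb * (1 + replicas)
    return total_gb / data_nodes <= disk_gb_per_node * max_disk_ratio
```

For example, a 250 GB daily index needs at least three primary shards under the 100 GB cap. If even a single day's data does not fit on the cluster, that is a signal to shorten retention or split the index by hour.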

3. Typical Issue II: Write Slowdown

Factors affecting write performance include unnecessary large‑text writes, excessive shard counts, over‑replication, short refresh intervals, Logstash throughput limits, sub‑optimal translog settings, and the slower I/O of HDDs compared with SSDs.

Reduce unnecessary writes, especially large text fields.

Keep shard numbers moderate; too many shards degrade write speed.

Limit replica count when high write throughput is required.

Increase refresh interval (e.g., 5–60 seconds) for log indices.

Replace or supplement Logstash with higher‑throughput alternatives (e.g., Hangout or custom consumers).

Prefer SSDs for hot data; use cold/hot tiering to move older data to SATA disks.
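Several of these knobs are ordinary index settings. As a rough sketch, the helper below builds a request body that could be sent to the index `_settings` endpoint. The function name and default values are assumptions. The setting keys (`refresh_interval`, `number_of_replicas`, `translog.durability`) are standard Elasticsearch index settings.

```python
def write_tuning_settings(refresh_seconds: int = 30, replicas: int = 1,
                          async_translog: bool = False) -> dict:
    """Build a body for PUT /<index>/_settings aimed at write-heavy log indices."""
    if refresh_seconds < 1:
        raise ValueError("refresh_seconds must be at least 1")
    if replicas < 0:
        raise ValueError("replicas cannot be negative")
    index_settings = {
        # Longer refresh intervals create fewer, larger segments.
        "refresh_interval": f"{refresh_seconds}s",
        # Every replica repeats the indexing work of its primary.
        "number_of_replicas": replicas,
    }
    if async_translog:
        # Trade-off: async fsync is faster, but recent writes can be lost on a crash.
        index_settings["translog"] = {"durability": "async"}
    return {"index": index_settings}
```

Asynchronous translog durability is only worth considering for data that can tolerate losing a few seconds of writes, which is why it is off by default in this sketch.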

4. Development Standards

To ensure stability, the team defined guidelines for both log‑type and non‑log applications.

Log‑type applications

Provide capacity and throughput estimates (e.g., daily ingest volume, retention period).

Supply static mapping definitions; static mapping is strongly recommended.

Integrate with the central Kafka cluster, specifying topic and client ID.

Configure Logstash filter rules.

Default index settings: 5 primary shards, 1 replica (adjustable).

Retention policy not exceeding 30 days.

Index naming convention: prefix-YYYY-MM-DD (daily partitioning).

Restrict index creation to authorized accounts matching predefined prefixes.
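The log‑type rules above map fairly directly onto an index‑creation request. The sketch below is illustrative only: the account‑to‑prefix table and the function names are hypothetical, and `"dynamic": "strict"` is one assumed way to enforce a static mapping. The mapping layout shown is the typeless 7.x form; on 6.x the `properties` block is nested under a mapping type name.

```python
from datetime import date

# Hypothetical account -> allowed index prefixes table.
ALLOWED_PREFIXES = {"team_a_writer": ("applog", "nginx")}


def daily_index_name(prefix: str, day: date) -> str:
    """Apply the prefix-YYYY-MM-DD naming convention."""
    return f"{prefix}-{day:%Y-%m-%d}"


def may_create(account: str, index_name: str, acl: dict = ALLOWED_PREFIXES) -> bool:
    """Allow creation only when the name starts with one of the account's prefixes."""
    return any(index_name.startswith(p + "-") for p in acl.get(account, ()))


def create_index_body(properties: dict, shards: int = 5, replicas: int = 1) -> dict:
    """Body for PUT /<index> using the article's defaults for log indices."""
    return {
        "settings": {"index": {"number_of_shards": shards,
                               "number_of_replicas": replicas}},
        # "strict" rejects documents containing fields missing from the mapping.
        "mappings": {"dynamic": "strict", "properties": properties},
    }
```

In practice the permission check would be enforced by Elasticsearch's own role definitions rather than client code. The function here only illustrates the prefix‑matching rule.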

Non‑log applications

Provide capacity and throughput estimates.

Prioritize core business services for the main search platform.

Important services receive dedicated clusters; non‑core services share public clusters.

Supply static mapping.

Default index settings: 5 primary shards, 2 replicas (tunable per workload).

Enforce the same index‑creation permission model.

5. Service Architecture

The unified architecture separates node roles: dedicated master nodes, data nodes, and, for large clusters, optional client (coordinating) nodes. All clusters run version 6.8 or later to take advantage of role‑based security. The stack uses the IK analyzer (a third‑party plugin, not bundled with Elasticsearch) as the default for Chinese tokenization. It also implements hot‑cold data separation, automatically migrating indices during low‑load periods.
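Hot‑cold separation is commonly implemented with shard allocation filtering on a custom node attribute. The sketch below assumes an attribute named `box_type` (set per node via `node.attr.box_type`), a seven‑day hot window, and a 01:00 to 05:00 migration window. None of these values come from the article.

```python
def tier_for_age(age_days: int, hot_days: int = 7) -> str:
    """Decide which tier an index belongs to; hot_days is an assumed threshold."""
    return "hot" if age_days < hot_days else "cold"


def allocation_settings(tier: str, attribute: str = "box_type") -> dict:
    """Body for PUT /<index>/_settings that pins shards to nodes tagged with the tier."""
    if tier not in ("hot", "cold"):
        raise ValueError(f"unknown tier: {tier}")
    return {f"index.routing.allocation.require.{attribute}": tier}


def in_migration_window(hour: int, start: int = 1, end: int = 5) -> bool:
    """True during the assumed low-load window, start <= hour < end (24h clock)."""
    return start <= hour < end
```

Once the setting changes, Elasticsearch relocates the shards on its own. The scheduler only has to decide when to apply the change.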

6. Typical Application Practices

6.1 ELKB Overview

ELKB combines Elasticsearch, Logstash, Kibana, and Beats. Beats (e.g., Filebeat) collect data; Logstash transforms and enriches; Elasticsearch stores; Kibana visualizes.

6.2 Real‑time Log Platform

The early version relied solely on Logstash, which consumed a lot of resources. The current platform adds Filebeat for collection, uses Kafka for buffering, scales out with multiple Logstash instances (or Hangout), and stores the results in Elasticsearch. Kibana and custom dashboards provide real‑time monitoring and alerting.

6.3 MySQL Slow‑Log Real‑time Monitoring

Filebeat agents on each server collect the MySQL slow log and forward it to Kafka. Logstash (or a custom parser) filters the entries, Elasticsearch stores them, and Kibana or the internal DB platform visualizes them. End‑to‑end latency is under five seconds.
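Where a custom parser replaces Logstash, the filtering step amounts to turning one multi‑line slow‑log entry into a structured document. The following is a minimal sketch assuming the standard MySQL slow‑log header lines. The function name and output field names are hypothetical.

```python
import re

QUERY_LINE = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)
USER_LINE = re.compile(r"# User@Host: (?P<user>\w+)\[")


def parse_slow_entry(entry: str) -> dict:
    """Convert one slow-log entry (already merged into a single event) to a dict."""
    doc = {}
    sql_lines = []
    for line in entry.splitlines():
        if line.startswith("# Query_time:"):
            m = QUERY_LINE.match(line)
            if m:
                doc["query_time"] = float(m["query_time"])
                doc["lock_time"] = float(m["lock_time"])
                doc["rows_sent"] = int(m["rows_sent"])
                doc["rows_examined"] = int(m["rows_examined"])
        elif line.startswith("# User@Host:"):
            m = USER_LINE.match(line)
            if m:
                doc["user"] = m["user"]
        elif line.startswith("#") or line.startswith("SET timestamp="):
            continue  # other header lines are ignored in this sketch
        elif line.strip():
            sql_lines.append(line.strip())
    doc["sql"] = " ".join(sql_lines)
    return doc
```

The parser expects one complete entry per call, so the collector has to merge the lines of each entry into a single event beforehand, for example with Filebeat's multiline settings.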

7. Platform Construction

Since last year, a self‑service platform has been built to streamline Elasticsearch onboarding for developers and to provide DBA‑level operational tools.

7.1 User Side

Developers, data ops, and analysts can query, generate reports, and view cluster health.

Cluster owners see a dashboard with cluster list, Kibana URL, capacity, and request status.

Index‑creation follows a standardized request form covering retention, mapping, and ingestion method.

7.2 Management Side

One‑click deployment of Elasticsearch clusters and their supporting components: master, data, and client nodes, plus monitoring, Logstash, and Filebeat.

Dashboard shows IP, role, and status of each node.

Automated deployment pipelines reduce manual configuration errors.

7.3 Index Governance

Current governance relies on scripts for lifecycle management; future work aims to integrate these functions into the platform with audit trails.
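A lifecycle script of this kind can be as simple as selecting date‑suffixed indices that are older than the retention window. The sketch below is an assumption about how such a script might look, not the team's actual code. It relies on the prefix-YYYY-MM-DD naming convention and the 30‑day retention limit from the development standards.

```python
import re
from datetime import date, timedelta

DATED_INDEX = re.compile(r"^.+-(?P<day>\d{4}-\d{2}-\d{2})$")


def expired_indices(names, today: date, retention_days: int = 30) -> list:
    """Return the dated indices whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        m = DATED_INDEX.match(name)
        if not m:
            continue  # skip indices that do not follow the naming convention
        try:
            day = date.fromisoformat(m["day"])
        except ValueError:
            continue  # suffix looks like a date but is not a valid one
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

The returned list could be fed to a delete call or, for an audit trail, written to a log first. Indices that do not match the naming convention are never touched.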

8. Future Plans

Version Upgrade: Move to Elasticsearch 7.x for performance and relevance improvements, while handling breaking changes from 6.x.

Intelligent Cluster Diagnosis: Develop rule‑based or AI‑driven automated fault detection and remediation.

Private Cloud Exploration: Analyze workload characteristics (search vs. log) to allocate appropriate hardware, and adopt cloud‑native models to improve resource utilization.

9. Q&A Highlights

Data sync between Elasticsearch and Hadoop can use official connectors or custom programs.

Complex log formats can be handled with Filebeat multiline merging and Logstash filtering.

Options for syncing MySQL to Elasticsearch include dual‑write at the application layer and binlog parsing with Canal for real‑time needs (noting Canal’s single‑node throughput limits), or open‑source tools such as DataX, which runs as batch jobs.

Efficient secondary indexing can be achieved by storing reference IDs in Elasticsearch and performing a follow‑up lookup in MySQL/HBase.
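The two‑step lookup can be sketched as follows. Here `sqlite3` stands in for MySQL/HBase so the example stays self‑contained, and the `items` table and its columns are hypothetical. The `hits.hits[]._id` path matches the shape of an Elasticsearch search response.

```python
import sqlite3


def ids_from_hits(response: dict) -> list:
    """Extract document IDs from a search response, keeping relevance order."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def fetch_rows(conn: sqlite3.Connection, ids: list) -> list:
    """Fetch full records for the IDs from the primary store, in the same order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, price FROM items WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IDs missing from the primary store (e.g. deleted rows) are dropped.
    return [by_id[i] for i in ids if i in by_id]
```

Elasticsearch then only needs to hold the searchable fields and the ID, while the authoritative record stays in the primary store.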

Thank you for attending the session.
