
Elasticsearch Practices and Platform Construction at 58.com

This article details 58.com’s extensive use of Elasticsearch for search, analytics, and log processing. It covers cluster optimization and governance, typical issues such as disk exhaustion and write slowdown along with their solutions, development standards, the ELKB architecture, real‑time log and MySQL slow‑log applications, platform‑as‑a‑service construction, and the future roadmap.

DataFunTalk

Introduction

Elasticsearch is a distributed search and analytics engine built on Lucene, widely used for full‑text, structured, and analytical queries. 58.com employs it for internal search, large‑scale real‑time OLAP, log streams, user profiling, caching, security analytics, graph indexing, monitoring, and wiki retrieval.

1. Cluster Optimization and Governance

Early on, Elasticsearch clusters were independently managed by various business units, leading to fragmented deployments, version inconsistencies, mixed application‑data services, insufficient monitoring, heterogeneous hardware, uncontrolled index onboarding, and lack of platform management.

Key problems included index sprawl, missing lifecycle management, oversized shards, unpartitioned log indices, shared nodes across clusters, and varying disk capacities, resulting in red cluster health, write failures, query timeouts, OOM, and master node unresponsiveness.

Typical Issue 1: Disk Exhaustion

Caused by uncontrolled index creation, absence of lifecycle policies, oversized shards, monolithic log indices, and multiple clusters sharing the same servers.

Mitigations: enforce account‑level index permissions, implement index lifecycle management, plan shard counts (keep shard size < 100 GB), partition log indices by day/hour, avoid sharing servers across clusters, and standardize index settings (e.g., 5 shards, 1 replica).
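The shard‑planning rule above can be sketched as a small helper. This is a hypothetical function illustrating the article's "< 100 GB per shard" guideline, not an official Elasticsearch API; the names and default of 5 shards follow the standards described here.

```python
# Hypothetical shard-planning helper based on this article's guideline:
# keep each shard under 100 GB, with a 5-shard default.
import math

MAX_SHARD_SIZE_GB = 100  # the article's suggested ceiling per shard

def plan_shards(estimated_index_size_gb: float, default_shards: int = 5) -> int:
    """Return a primary-shard count that keeps each shard under the cap."""
    needed = math.ceil(estimated_index_size_gb / MAX_SHARD_SIZE_GB)
    return max(default_shards, needed)

print(plan_shards(80))    # small index -> stays at the 5-shard default
print(plan_shards(1200))  # 1.2 TB -> 12 shards of ~100 GB each
```

Note that primary-shard count must be chosen at index-creation time, which is why the capacity estimate comes first in the onboarding workflow.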

Typical Issue 2: Write Slowdown

Root causes include unnecessary writes, excessive shard counts, high replica factors, overly frequent refreshes (too‑small refresh intervals), Logstash bottlenecks, and slow disk I/O.

Solutions: prune unnecessary writes, set appropriate shard numbers, limit replicas, increase refresh interval (e.g., 5–60 s for logs), replace Logstash with higher‑throughput alternatives, use SSD for hot data and SATA for cold data.
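Two of these fixes are dynamic index settings. The sketch below shows the JSON body one would PUT to an index's `_settings` endpoint; the specific values are illustrative defaults within the ranges suggested above, not 58.com's exact configuration.

```python
# Illustrative settings payload for the write-slowdown fixes above.
# refresh_interval and number_of_replicas are both dynamic settings
# that can be changed on a live index via the _settings API.
import json

log_index_settings = {
    "index": {
        "refresh_interval": "30s",   # within the suggested 5-60 s range for logs
        "number_of_replicas": 1,     # limit replicas to reduce write amplification
    }
}

print(json.dumps(log_index_settings, indent=2))
```

Raising the refresh interval trades search freshness for indexing throughput, which is usually an easy trade for log data.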

Development Standards

For log‑type applications: provide capacity and throughput estimates, static mappings, Kafka topic/client‑ID, Logstash filter configs, default 5‑shard‑1‑replica index, 30‑day retention, daily index naming, and restrict unauthorized index creation.
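The daily naming and 30‑day retention rules can be sketched as follows; the `app-log` prefix and function names are made‑up examples, and the lexicographic comparison works only because the date suffix is zero‑padded.

```python
# Sketch of the daily index-naming and 30-day retention standards above.
# "app-log" is an invented example prefix.
from datetime import date, timedelta

RETENTION_DAYS = 30

def daily_index_name(prefix: str, day: date) -> str:
    return f"{prefix}-{day:%Y.%m.%d}"   # e.g. app-log-2020.01.15

def expired_indices(prefix: str, existing: list, today: date) -> list:
    """Return index names older than the retention cutoff."""
    cutoff_name = daily_index_name(prefix, today - timedelta(days=RETENTION_DAYS))
    # zero-padded dates sort lexicographically, so string comparison suffices
    return [name for name in existing if name < cutoff_name]

today = date(2020, 1, 31)
names = [daily_index_name("app-log", today - timedelta(days=d)) for d in range(40)]
print(expired_indices("app-log", names, today))  # the 9 indices from December 2019
```

In practice this deletion loop would run as a nightly cron job or via Elasticsearch's index lifecycle management.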

For non‑log applications: similar capacity estimates, business‑type evaluation, core services get dedicated clusters, default 5‑shard‑2‑replica index, and same access controls.

2. ELKB Architecture

ELKB (Elasticsearch‑Logstash‑Kibana‑Beats) forms a three‑layer stack: data extraction, storage, and visualization. Beats collect data, Logstash transforms it, Elasticsearch stores and indexes, and Kibana visualizes.
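The three layers can be sketched as plain functions: a "beats" source emitting raw lines, a "logstash" stage parsing them into documents, and an in‑memory stand‑in for the Elasticsearch index. This is purely a toy illustration of the data flow; no real ELKB APIs are used.

```python
# Toy sketch of the ELKB data flow: collect -> transform -> store.
# Kibana would then visualize the stored documents.
def beats_collect():
    """Stand-in for Beats: emit raw log lines."""
    yield "2020-01-15T10:00:00 ERROR payment timeout"
    yield "2020-01-15T10:00:01 INFO user login"

def logstash_transform(lines):
    """Stand-in for Logstash: parse lines into structured documents."""
    for line in lines:
        ts, level, *msg = line.split()
        yield {"@timestamp": ts, "level": level, "message": " ".join(msg)}

def elasticsearch_index(docs):
    """Stand-in for Elasticsearch: store the documents."""
    return list(docs)

docs = elasticsearch_index(logstash_transform(beats_collect()))
print(docs[0]["level"])  # -> ERROR
```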

3. Typical Application Practices

Real‑time Log Platform

Initial version used Logstash heavily, causing resource pressure. The modern platform uses Flume for collection, Kafka for buffering, multiple Logstash or Hangout instances for consumption, and Kibana for visualization, with hot data on SSD and cold data on HDD.
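The SSD/HDD split is typically implemented with Elasticsearch shard‑allocation filtering: nodes are tagged with a custom attribute (e.g. `node.attr.box_type: hot` on SSD machines), and indices are pinned to a tier via index settings. The attribute name `box_type` is a common community convention, not something mandated by Elasticsearch or confirmed by this article.

```python
# Sketch of hot/cold tiering via shard-allocation filtering.
# "box_type" is an assumed custom node attribute set in each node's
# elasticsearch.yml (node.attr.box_type: hot | cold).

# Fresh daily indices land on SSD-backed "hot" nodes:
hot_settings = {"index.routing.allocation.require.box_type": "hot"}

# After a few days, updating the setting migrates shards to HDD nodes:
cold_settings = {"index.routing.allocation.require.box_type": "cold"}

print(hot_settings, cold_settings)
```

Because `index.routing.allocation.require.*` is dynamic, flipping an index from hot to cold triggers shard relocation without downtime.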

MySQL Slow‑log Monitoring

Agents (Filebeat) collect slow‑log files, send them to Kafka, filter with Logstash, store in Elasticsearch, and expose via Kibana or internal DB platform, achieving sub‑5‑second latency.
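The filtering step amounts to parsing slow‑log entries into structured fields. The minimal parser below extracts the timing stats from the `# Query_time` line of MySQL's standard slow‑log format; in the real pipeline this parsing is done by Logstash filters, so this is only an illustration of the transformation.

```python
# Minimal parser for the "# Query_time" stats line of a MySQL slow-log
# entry -- the kind of field extraction the Logstash filter performs.
import re

QUERY_TIME_RE = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
)

def parse_stats(line: str) -> dict:
    """Extract query_time and lock_time as floats, or {} if not a stats line."""
    m = QUERY_TIME_RE.match(line)
    return {k: float(v) for k, v in m.groupdict().items()} if m else {}

sample = "# Query_time: 2.384151  Lock_time: 0.000220 Rows_sent: 10  Rows_examined: 100000"
print(parse_stats(sample))  # {'query_time': 2.384151, 'lock_time': 0.00022}
```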

4. Platform‑as‑a‑Service Construction

Provides a user portal for developers, data operators, and analysts to query, monitor, and request index resources, and a management portal for DBA‑level cluster deployment, monitoring, and index governance.

User side shows cluster lists, Kibana URLs, capacity, and index application workflow; management side displays node roles, one‑click automated deployment, and Zabbix‑Grafana monitoring integrated with Elasticsearch metrics.

5. Future Plans

Upgrade to Elasticsearch 7.x for performance and relevance improvements, develop intelligent cluster diagnostics to automate fault detection and remediation, and explore private‑cloud deployment to reduce resource waste and improve utilization.

Q&A Highlights

Answers cover Hadoop/Hive data sync, log format handling with Filebeat multiline, MySQL‑to‑Elasticsearch real‑time sync options (dual‑write, DataX, Canal), and secondary indexing strategies using ID‑based lookups.
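The multiline handling mentioned above can be sketched in a few lines: events start with a timestamp, and continuation lines (such as stack-trace frames) are appended to the previous event. This mirrors what Filebeat's multiline options do at collection time; the start pattern is an illustrative example, not 58.com's actual configuration.

```python
# Sketch of multiline log grouping, as done by Filebeat's multiline
# settings: lines not starting with a date continue the previous event.
import re

STARTS_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2}")  # a new event begins with a date

def group_multiline(lines):
    events, current = [], []
    for line in lines:
        if STARTS_EVENT.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

logs = [
    "2020-01-15 10:00:00 ERROR NullPointerException",
    "    at com.example.Service.run(Service.java:42)",
    "2020-01-15 10:00:05 INFO request handled",
]
print(len(group_multiline(logs)))  # -> 2 events (stack trace stays attached)
```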

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
