How Liulishuo Scaled Its Unified Monitoring Platform for Over 100 Million Users
This article examines the evolution of online education, introduces Liulishuo's massive English‑learning platform, and details the technical challenges, design choices, and architecture of its cloud‑native unified monitoring system that handles tens of terabytes of data daily.
Online Education Industry Overview
Since the 1990s, the internet has enabled online education products, which have evolved from pure text to rich media (images, audio) and now collect data at massive scale. In March this year, daily online time exceeded 2 million user-days, providing abundant data for AI-driven analysis of content, marketing, teacher management, and quality assessment.
Liulishuo Company Overview
Liulishuo is a leading technology-driven education company with a large AI team and a database of English speech recorded by Chinese speakers, containing roughly 37 billion minutes of conversation and 504 billion recorded sentences. Its flagship product, launched in 2013, integrates speech recognition, scoring, adaptive learning, contextual dialogue, pronunciation guidance, AI teachers, and gamified experiences, and it quickly gained market traction. The user base grew from millions to over 100 million, creating massive data-flow and analysis challenges for the IT architecture.
Challenges of the Unified Monitoring Platform
Liulishuo has no dedicated operations team, so its cloud-infrastructure developers had to build a unified monitoring platform covering SLA tracking, performance monitoring, alerting, and operational value (resource utilization, cost savings, and mapping of relationships between business services). Key requirements include:
Collecting heterogeneous data sources (K8s, ECS metrics, Istio logs, custom middleware metrics, cloud service metrics, business trace data, and real‑time cost data).
Dynamic discovery and real‑time collection of resources and organizational relationships.
Large‑scale storage and analysis of tens of terabytes of daily data with real‑time query capability.
High availability of the monitoring platform itself, eliminating single points of failure and enabling rapid recovery.
Technical Selection
The platform needed both log and metric solutions. After evaluating community and commercial options (Elasticsearch, Loki, SLS, Prometheus, OpenTSDB, InfluxDB), the final choices were:
Alibaba Cloud SLS for log storage and processing, because it unifies log and metric data, handles massive scale, and is a fully managed service.
Prometheus for time-series metrics, with SLS acting as a highly reliable remote storage backend.
In short, SLS was chosen for four reasons: unified storage for logs and metrics, strong performance at scale, a managed service that removes operational overhead, and built-in data enrichment.
Overall Architecture
Key Implementation Details
1. Automated resource discovery: The team developed a mechanism that dynamically detects IaaS/PaaS resources and automatically adds new assets to the monitoring and collection pipelines.
2. Log handling:
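One way to picture the discovery step is as a diff between what the cloud inventory reports and what is already monitored. The sketch below is illustrative only: the `Resource` record, `plan_sync`, and the sample IDs are hypothetical, and emitting Prometheus file-based service discovery target groups is an assumption, since the article does not say how new assets are handed to the collectors.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A discovered cloud asset (hypothetical shape, not taken from the article)."""
    resource_id: str
    kind: str        # e.g. "ecs" or "rds"
    address: str     # host:port that an exporter exposes
    department: str  # owner, as recorded in the catalog


def plan_sync(discovered, monitored_ids):
    """Diff discovery output against the monitored set; return (to_add, to_remove)."""
    discovered_by_id = {r.resource_id: r for r in discovered}
    to_add = [r for rid, r in discovered_by_id.items() if rid not in monitored_ids]
    to_remove = sorted(set(monitored_ids) - set(discovered_by_id))
    return to_add, to_remove


def to_file_sd(resources):
    """Render resources as Prometheus file-based service discovery target groups."""
    return [
        {
            "targets": [r.address],
            "labels": {
                "resource_id": r.resource_id,
                "kind": r.kind,
                "department": r.department,
            },
        }
        for r in resources
    ]


# Sample inventory: one asset is already monitored, one is new, one has disappeared.
discovered = [
    Resource("i-001", "ecs", "10.0.0.1:9100", "growth"),
    Resource("i-002", "ecs", "10.0.0.2:9100", "platform"),
]
to_add, to_remove = plan_sync(discovered, monitored_ids={"i-001", "i-old"})
print(json.dumps(to_file_sd(to_add), indent=2))
```

Carrying the owning department as a label at discovery time is what would later allow metrics to be grouped by organizational unit.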
Logtail streams logs from various services into separate SLS log stores.
Logs are classified by purpose: audit logs are archived to OSS, troubleshooting logs are retained for two weeks with full indexing, and AccessLog fields are only partially indexed to reduce cost.
NGINX AccessLog data is enriched with catalog information (department, application, method) from RDS to compute SLA and PXX metrics.
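The enrichment step can be sketched as a join between parsed access-log entries and a catalog table, followed by a per-department roll-up. Everything named here is an assumption for illustration: the `CATALOG` dictionary stands in for the RDS table, the log field names are hypothetical, SLA is taken to mean the share of non-5xx responses, and the percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict

# Hypothetical catalog rows, standing in for the table kept in RDS.
CATALOG = {
    "api.example.internal": {"department": "learning", "application": "course-api"},
    "pay.example.internal": {"department": "commerce", "application": "payment"},
}


def enrich(entry):
    """Attach department and application to one parsed access-log entry."""
    owner = CATALOG.get(entry["host"], {"department": "unknown", "application": "unknown"})
    return {**entry, **owner}


def percentile(values, p):
    """Nearest-rank percentile (p between 1 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def summarize(entries, p=95):
    """Per-department SLA (share of non-5xx responses) and latency percentile."""
    groups = defaultdict(list)
    for e in map(enrich, entries):
        groups[e["department"]].append(e)
    return {
        dept: {
            "sla": sum(1 for e in rows if e["status"] < 500) / len(rows),
            f"p{p}_ms": percentile([e["latency_ms"] for e in rows], p),
        }
        for dept, rows in groups.items()
    }


logs = [
    {"host": "api.example.internal", "status": 200, "latency_ms": 40},
    {"host": "api.example.internal", "status": 502, "latency_ms": 900},
    {"host": "api.example.internal", "status": 200, "latency_ms": 55},
    {"host": "api.example.internal", "status": 200, "latency_ms": 60},
    {"host": "pay.example.internal", "status": 200, "latency_ms": 120},
]
report = summarize(logs)
print(report)
```

In a managed log service the same join and aggregation would more likely be expressed as a query over the log store; the Python version only makes the data flow explicit.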
3. Monitoring:
Prometheus exporters collect metrics from cloud products and custom components.
A sidecar watches Git repositories and reloads Prometheus configuration on changes.
Recording rules, which precompute expensive expressions so that queries return quickly, are version-controlled in Git.
Alertmanager integrates with the internal alert center for advanced formatting and escalation.
Prometheus Remote Write streams data to SLS’s time‑series store to avoid single‑point failures and enable catalog‑based analysis.
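The Git-watching sidecar can be approximated as "fingerprint the checked-out configuration, reload only when it changes". The sketch below is an assumption about how such a sidecar might work, not the team's implementation: the function names are hypothetical, the `git pull` step is left out, and the reload call relies on Prometheus's `/-/reload` lifecycle endpoint, which is only available when Prometheus is started with `--web.enable-lifecycle`.

```python
import hashlib
import urllib.request
from pathlib import Path


def fingerprint(config_dir):
    """Hash the relative names and contents of all .yml files under config_dir."""
    digest = hashlib.sha256()
    for path in sorted(Path(config_dir).rglob("*.yml")):
        digest.update(str(path.relative_to(config_dir)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def reload_prometheus(base_url="http://localhost:9090"):
    """Ask Prometheus to re-read its configuration via the lifecycle endpoint."""
    request = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status == 200


def sync_once(config_dir, last_fingerprint, reload=reload_prometheus):
    """Trigger a reload only if the configuration changed; return the new fingerprint."""
    current = fingerprint(config_dir)
    if current != last_fingerprint:
        reload()
    return current
```

A sidecar loop would update the working copy, call `sync_once` with the previous fingerprint, and sleep. Passing `reload` as a parameter keeps the change-detection logic testable without a running Prometheus.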
4. Metric calculation:
Core metrics are derived from NGINX AccessLog (QPS, error rate, latency) without instrumenting applications.
Utilization, middleware, and infrastructure metrics come from Prometheus; catalog data allows aggregation per department or business unit.
Resulting metrics are stored in MySQL or Elasticsearch and backed up to OSS.
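To illustrate how QPS, error rate, and latency can come from access logs alone, the following sketch buckets parsed entries into fixed time windows. The field names (`ts`, `status`, `latency_ms`), the 60-second window, and the choice to count only 5xx responses as errors are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def window_metrics(entries, window_s=60):
    """Bucket access-log entries into fixed windows; compute QPS, error rate, mean latency."""
    buckets = defaultdict(list)
    for e in entries:
        window_start = e["ts"] - e["ts"] % window_s
        buckets[window_start].append(e)
    return {
        start: {
            "qps": len(rows) / window_s,
            "error_rate": sum(1 for e in rows if e["status"] >= 500) / len(rows),
            "avg_latency_ms": mean(e["latency_ms"] for e in rows),
        }
        for start, rows in sorted(buckets.items())
    }


# ts is seconds since an arbitrary epoch; the first three entries share one window.
entries = [
    {"ts": 0, "status": 200, "latency_ms": 10},
    {"ts": 10, "status": 500, "latency_ms": 30},
    {"ts": 59, "status": 200, "latency_ms": 20},
    {"ts": 60, "status": 200, "latency_ms": 40},
]
metrics = window_metrics(entries)
print(metrics)
```

Because the input is the gateway's own log, no application has to be instrumented; the trade-off is that only what NGINX records (status, timing, path) is available.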
Outcomes
The platform now supports almost all core monitoring needs, remains stable during traffic spikes, and delivers business value in three areas:
Monitoring & SLA: Real-time SLA per department/application enables company-wide improvement initiatives.
Issue diagnosis & fault isolation: Istio logs combined with catalog data generate a live business relationship graph, allowing rapid root-cause identification.
FinOps: Resource utilization and PXX metrics are calculated per team, guiding cost-optimization efforts.
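The business relationship graph used for fault isolation can be thought of as call edges aggregated from mesh logs and labeled with owners from the catalog. This is a minimal sketch under stated assumptions: the `source`, `destination`, and `status` fields are a simplified, hypothetical view of an Istio access-log entry, `OWNERS` stands in for the catalog, and the error-ratio threshold is arbitrary.

```python
from collections import Counter

# Hypothetical catalog: workload name -> owning department.
OWNERS = {"gateway": "platform", "course-api": "learning", "payment": "commerce"}


def build_graph(istio_entries):
    """Aggregate call edges (source -> destination) with request and error counts."""
    requests, errors = Counter(), Counter()
    for e in istio_entries:
        edge = (e["source"], e["destination"])
        requests[edge] += 1
        if e["status"] >= 500:
            errors[edge] += 1
    return {
        edge: {
            "requests": count,
            "errors": errors[edge],
            "source_dept": OWNERS.get(edge[0], "unknown"),
            "dest_dept": OWNERS.get(edge[1], "unknown"),
        }
        for edge, count in requests.items()
    }


def suspect_edges(graph, threshold=0.5):
    """Return edges whose error ratio meets the threshold, worst first."""
    ratios = {edge: info["errors"] / info["requests"] for edge, info in graph.items()}
    flagged = [edge for edge, ratio in ratios.items() if ratio >= threshold]
    return sorted(flagged, key=lambda edge: ratios[edge], reverse=True)


calls = [
    {"source": "gateway", "destination": "course-api", "status": 200},
    {"source": "gateway", "destination": "course-api", "status": 200},
    {"source": "course-api", "destination": "payment", "status": 503},
    {"source": "course-api", "destination": "payment", "status": 503},
    {"source": "course-api", "destination": "payment", "status": 200},
]
graph = build_graph(calls)
print(suspect_edges(graph))
```

Because each edge carries the departments on both ends, a failing dependency points directly at the team that owns it.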
Underlying Technology
The platform leverages Alibaba Cloud SLS as a cloud‑native observability service offering unified Log/Metric/Trace ingestion, high‑performance storage, zero‑ops management, and built‑in AIOps algorithms for anomaly detection, forecasting, and intelligent alerting.
