How Kuaishou Scaled Data Services with a Config‑Driven Big Data Platform
This article explains how Kuaishou’s data‑service platform tackles high development barriers and repetitive work by introducing a self‑service, configuration‑driven architecture, multi‑mode APIs, efficient data acceleration, and robust high‑availability mechanisms, while outlining its evolution and future roadmap.
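This article introduces Kuaishou's data‑service platform in three parts. The first covers the background, including the pain points of data development. The second introduces the big data service platform, including its architecture and key technical details. The third summarizes lessons learned and thoughts on the future.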
Background
Kuaishou is a data‑driven company where data plays a crucial role. Data development engineers are responsible for producing high‑quality structured data tables and building stable, reliable data services delivered via APIs. They face two main pain points: high development barriers for data services and repetitive development of those services.
Pain Point 1: High Barrier to Developing Data Services
Beyond creating data tables, engineers must consider:
Data delivery: business teams prefer flexible, decoupled data interfaces to raw tables.
Service development: services require micro‑service knowledge, service discovery, high concurrency, etc.
Permission and availability: ensure secure access and stable performance.
Operations: scaling, migration, decommission, interface changes, alerts, and other operational concerns.
These requirements mean engineers must not only build tables but also package them into independent, flexible, high‑availability, and secure data services, demanding skills beyond basic SQL and modeling, including Java and micro‑service development.
Pain Point 2: Repeated Development of Data Services
Many Kuaishou business lines (payment, live streaming, account, etc.) have similar data needs, leading to duplicated pipelines: data sync to online databases/caches, micro‑service development, and repeated implementation of similar services, wasting resources and slowing delivery.
Big Data Service Platform
The platform is a one‑stop, self‑service data platform. Users create, operate, and invoke data service interfaces through a “configuration‑as‑service” model: no hand‑coded services are needed, because a simple configuration automatically generates and deploys the service, which greatly improves efficiency.
System Architecture
Raw data resides in a Data Lake, is processed into domain‑organized data assets, stored in a data warehouse, accelerated to high‑speed storage, and finally exposed via various service interfaces.
The technical architecture supports both RPC and HTTP interfaces. RPC offers high throughput, efficient serialization, load balancing, flow control, degradation, and tracing. HTTP is simpler to integrate but less efficient.
Key Technology 1: Configuration‑as‑Development
There are two user roles: data service producers and consumers. Producers configure data source, acceleration target, interface type, and isolated test environments. After configuration, the platform automatically generates and deploys the service, after which consumers request access permissions to invoke it.
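As a rough illustration of what a producer might fill in, the sketch below models the configuration items named above as a Python dataclass and checks them before any deployment step. The field names, the allowed values, and the `validate` helper are assumptions made for illustration; they are not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical allowed values; the platform's real options are not documented here.
ACCELERATION_TARGETS = {"redis", "hbase", "druid", "clickhouse"}
INTERFACE_TYPES = {"kv", "sql", "union"}
ENVIRONMENTS = {"test", "prod"}


@dataclass(frozen=True)
class ServiceConfig:
    """Illustrative stand-in for what a data service producer configures."""
    name: str
    data_source: str            # e.g. a warehouse table name
    acceleration_target: str    # high-speed store the data is synced to
    interface_type: str         # which API form gets generated
    environment: str = "test"   # start in the isolated test environment


def validate(config: ServiceConfig) -> List[str]:
    """Return a list of problems; an empty list means the config looks deployable."""
    problems = []
    if config.acceleration_target not in ACCELERATION_TARGETS:
        problems.append(f"unknown acceleration target: {config.acceleration_target}")
    if config.interface_type not in INTERFACE_TYPES:
        problems.append(f"unknown interface type: {config.interface_type}")
    if config.environment not in ENVIRONMENTS:
        problems.append(f"unknown environment: {config.environment}")
    return problems
```

The point of the sketch is that the producer supplies only declarative data; everything after validation (code generation, deployment, permission setup) would be the platform's job.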
Key Technology 2: Multi‑Mode Service Forms
Data services are offered in several forms:
KV API : Simple key‑value lookups supporting millions of QPS with millisecond latency, auto‑generated via templates, returning Protobuf structures for easy ORM usage. Typical use cases include IP‑to‑geo lookup and user‑profile queries.
SQL API : Flexible complex queries built on OLAP/OLTP engines via a fluent API, supporting nested conditions, aggregation, pagination, or full data retrieval. Used for user segmentation based on multiple tags.
Union API : Composite API that merges multiple atomic APIs in serial or parallel, reducing client‑side calls and latency.
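To make the Union API idea concrete, here is a minimal sketch of composing atomic APIs in parallel and in serial. The atomic APIs are plain in-memory functions invented for this example, and the `union_parallel` and `union_serial` helpers are hypothetical names; the real platform composes generated services rather than local callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

# An "atomic API" here is any callable that takes a key and returns a dict of fields.
AtomicApi = Callable[[str], Dict[str, Any]]


def union_parallel(apis: Dict[str, AtomicApi], key: str) -> Dict[str, Dict[str, Any]]:
    """Call every atomic API with the same key concurrently; merge results by API name."""
    with ThreadPoolExecutor(max_workers=max(1, len(apis))) as pool:
        futures = {name: pool.submit(api, key) for name, api in apis.items()}
        return {name: future.result() for name, future in futures.items()}


def union_serial(
    first: AtomicApi,
    then: Callable[[Dict[str, Any]], Dict[str, Any]],
    key: str,
) -> Dict[str, Any]:
    """Feed the first API's result into a dependent second call and merge both results."""
    head = first(key)
    tail = then(head)
    return {**head, **tail}


# Hypothetical stand-ins for real KV-backed services.
def user_profile(user_id: str) -> Dict[str, Any]:
    return {"user_id": user_id, "city_code": "010"}


def user_tags(user_id: str) -> Dict[str, Any]:
    return {"tags": ["sports", "music"]}


def city_by_profile(profile: Dict[str, Any]) -> Dict[str, Any]:
    return {"city": {"010": "Beijing"}.get(profile["city_code"], "unknown")}
```

With the parallel form, the client makes one call and waits roughly as long as the slowest atomic API; with the serial form, the dependency between calls is resolved server‑side instead of over two client round trips.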
Key Technology 3: Efficient Data Acceleration
Data assets stored in slower engines need acceleration to meet online traffic demands. Two methods are used: full data acceleration and multi‑level caching.
Full Data Acceleration
Raw data from sources such as Kafka, MySQL, and logs is ingested, modeled, and synchronized to high‑speed stores like Redis, HBase, and Druid via a distributed scheduler built on DataX. The platform syncs up to 1.2 trillion rows (≈20 TB) per day.
Multi‑Level Caching
The platform stores data in Redis, HBase, Druid, ClickHouse, etc. Hot data is cached using additional layers. Users can configure cache strategies per API and apply compression (ZSTD, Snappy, GZIP) to reduce storage, sometimes by up to 90%.
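The combination of a hot‑data layer and compression can be sketched as a small in‑process LRU cache sitting in front of a slower lookup. This is an illustration only: the `TwoLevelCache` class is hypothetical, and it uses `zlib` (the DEFLATE compression that GZIP is built on) because it ships with Python, whereas ZSTD and Snappy would require third‑party packages. Actual savings depend heavily on the data.

```python
import zlib
from collections import OrderedDict
from typing import Callable, Optional


class TwoLevelCache:
    """Small in-process LRU (level 1) in front of a slower loader (level 2)."""

    def __init__(self, loader: Callable[[str], Optional[bytes]], capacity: int = 1024):
        self._loader = loader      # e.g. a lookup against Redis or HBase
        self._capacity = capacity
        self._hot: "OrderedDict[str, bytes]" = OrderedDict()  # values kept compressed
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[bytes]:
        if key in self._hot:
            self.hits += 1
            self._hot.move_to_end(key)  # mark as most recently used
            return zlib.decompress(self._hot[key])
        self.misses += 1
        value = self._loader(key)
        if value is not None:
            self._hot[key] = zlib.compress(value)
            if len(self._hot) > self._capacity:
                self._hot.popitem(last=False)  # evict the least recently used entry
        return value
```

Storing compressed values trades a little CPU on each hit for a smaller memory footprint, which is the same trade‑off a per‑API compression setting would expose.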
Key Technology 4: High‑Availability Guarantees
High availability is ensured through three mechanisms:
Elastic Service Framework
Resource Isolation
Full‑Link Monitoring
Elastic Service Framework
Services run in Kuaishou’s self‑developed elastic container cloud. RPC services register with KESS (service registry & discovery). Faulty instances are automatically removed. Full monitoring covers availability, latency, QPS, container CPU/memory, etc.
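The article does not describe how KESS detects faulty instances. One common approach, shown here purely as an assumption, is heartbeat‑based expiry: each instance reports in periodically, and discovery drops any instance whose last heartbeat is older than a time‑to‑live. The `Registry` class below is a toy model of that idea and is not the KESS API.

```python
import time
from typing import Callable, Dict, List


class Registry:
    """Toy service registry: instances heartbeat, stale ones are dropped on discovery."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._last_seen: Dict[str, Dict[str, float]] = {}  # service -> address -> timestamp

    def heartbeat(self, service: str, address: str) -> None:
        """Register an instance, or refresh it if it is already known."""
        self._last_seen.setdefault(service, {})[address] = self._clock()

    def discover(self, service: str) -> List[str]:
        """Return live instance addresses, removing any that missed their TTL."""
        now = self._clock()
        instances = self._last_seen.get(service, {})
        stale = [addr for addr, seen in instances.items() if now - seen > self._ttl]
        for addr in stale:
            del instances[addr]
        return sorted(instances)
```

Callers that resolve addresses through `discover` never see an instance that has stopped reporting, which is the effect the text describes as automatic removal.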
Resource Isolation
Isolation reduces impact of failures. Deployments are isolated by business line and priority (high/medium/low), ensuring independent operation. Multiple data services within a line can be mixed‑deployed to improve resource utilization.
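One way to picture this isolation scheme is as a grouping step: services that share a business line and priority land in the same deployment pool, while everything else is kept apart. The `Service` type and `plan_pools` function below are hypothetical and only illustrate the grouping, not how the platform actually schedules containers.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Service(NamedTuple):
    name: str
    business_line: str
    priority: str  # "high", "medium", or "low"


def plan_pools(services: List[Service]) -> Dict[Tuple[str, str], List[str]]:
    """Give each (business line, priority) pair its own pool; co-locate services within it."""
    pools: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for svc in services:
        pools[(svc.business_line, svc.priority)].append(svc.name)
    return dict(pools)
```

A failure or traffic spike in one pool then affects only the services that were deliberately mixed into it.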
Full‑Link Monitoring
Monitoring covers three aspects:
Data synchronization: monitors data quality, timeouts, and failures when syncing assets to fast storage.
Service stability: a sentinel service tracks API metrics such as latency and availability.
Business correctness: ensures API responses match underlying data assets for consistency.
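The business‑correctness check can be thought of as a sampled comparison between what the API returns and what the underlying data asset holds. The sketch below assumes both sides can be queried by key; `find_mismatches` and its parameters are illustrative names, not part of the platform.

```python
from typing import Any, Callable, Iterable, List, Tuple


def find_mismatches(
    keys: Iterable[str],
    from_asset: Callable[[str], Any],
    from_api: Callable[[str], Any],
) -> List[Tuple[str, Any, Any]]:
    """Return (key, expected, actual) for each sampled key where the API disagrees with the asset."""
    mismatches = []
    for key in keys:
        expected = from_asset(key)  # value in the source data asset
        actual = from_api(key)      # value served by the generated API
        if expected != actual:
            mismatches.append((key, expected, actual))
    return mismatches
```

In practice the sampled keys would be a small subset, and a non‑empty result would feed an alert rather than be inspected by hand.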
Conclusion and Outlook
Since 2017, the platform has supported diverse scenarios (live streaming, short video, e‑commerce, internal systems), with online QPS reaching 10 million at millisecond‑level latency. It offers multiple API modes, permission control, and an API marketplace, further empowering the business.
Future development will focus on:
Aligning closely with evolving business needs by abstracting and consolidating common data service capabilities.
Deepening data asset management, including registration, tagging, mapping, and open services.
The platform will evolve toward a unified OneService system, emphasizing:
Support for diverse data sources, including wide tables, files, and machine‑learning models.
Multiple data retrieval methods such as synchronous, asynchronous, push, and scheduled tasks.
A unified API gateway integrating permission control, rate limiting, and traffic management for both platform‑created and user‑developed APIs.
