
How Kuaishou Scaled Data Services with a Config‑Driven Big Data Platform

This article explains how Kuaishou’s data‑service platform tackles high development barriers and repetitive work by introducing a self‑service, configuration‑driven architecture, multi‑mode APIs, efficient data acceleration, and robust high‑availability mechanisms, while outlining its evolution and future roadmap.

ITFLY8 Architecture Home

Feb 2, 2021


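This article introduces Kuaishou's data service platform. The first part covers the background, including the pain points of data development. The second part describes the big data service platform, including its architecture and key technical details. The third part summarizes lessons learned and thoughts on the future.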

Background

Kuaishou is a data‑driven company where data plays a crucial role. Data development engineers are responsible for producing high‑quality structured data tables and building stable, reliable data services delivered via APIs. They face two main pain points: high development barriers for data services and repetitive development of those services.

Pain Point 1: High Barrier to Developing Data Services

Beyond creating data tables, engineers must consider:

Data delivery: business prefers flexible, decoupled data interfaces rather than raw tables.

Service development: services require micro‑service knowledge, service discovery, high concurrency, etc.

Permission and availability: ensure secure access and stable performance.

Operations: scaling, migration, decommission, interface changes, alerts, and other operational concerns.

These requirements mean engineers must not only build tables but also package them into independent, flexible, high‑availability, and secure data services, demanding skills beyond basic SQL and modeling, including Java and micro‑service development.

Pain Point 2: Repeated Development of Data Services

Many Kuaishou business lines (payment, live streaming, account, etc.) have similar data needs, leading to duplicated pipelines: data sync to online databases/caches, micro‑service development, and repeated implementation of similar services, wasting resources and slowing delivery.

Big Data Service Platform

The platform is a one-stop, self-service data platform on which users create data service interfaces, operate those services, and invoke them. It follows a "configuration-as-service" model: no hand-coded services are needed, because the platform generates and deploys a service automatically from a simple configuration, which greatly improves efficiency.

System Architecture

Raw data lands in a data lake, is processed into domain-organized data assets stored in a data warehouse, is accelerated into high-speed storage, and is finally exposed through various service interfaces.

The technical architecture supports both RPC and HTTP interfaces. RPC offers high throughput and efficient serialization, along with load balancing, flow control, degradation, and tracing. HTTP is simpler to use but less efficient.

Key Technology 1: Configuration‑as‑Development

There are two user roles: data service producers and consumers. Producers configure data source, acceleration target, interface type, and isolated test environments. After configuration, the platform automatically generates and deploys the service, after which consumers request access permissions to invoke it.
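The producer workflow above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: `ServiceConfig`, its field names, and `generate_service` are hypothetical, and an in-memory dict stands in for the accelerated store.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical producer-side configuration; field names are illustrative."""
    name: str
    data_source: str          # e.g. a warehouse table name
    acceleration_target: str  # e.g. "redis"
    interface_type: str       # "rpc" or "http"


def generate_service(
    config: ServiceConfig, store: Dict[str, Dict[str, Any]]
) -> Callable[[str], Dict[str, Any]]:
    """Build a lookup handler from configuration alone, with no per-service code."""
    if config.interface_type not in ("rpc", "http"):
        raise ValueError(f"unsupported interface type: {config.interface_type}")

    def handler(key: str) -> Dict[str, Any]:
        row = store.get(key)
        if row is None:
            return {"service": config.name, "found": False}
        return {"service": config.name, "found": True, "data": row}

    return handler


# Usage: the dict stands in for data already synced to the acceleration target.
config = ServiceConfig(
    name="user_profile",
    data_source="dw.user_profile",
    acceleration_target="redis",
    interface_type="rpc",
)
lookup = generate_service(config, {"u1": {"city": "Beijing"}})
```

The point of the sketch is that the handler is derived entirely from the configuration object, so adding a new service means adding a new configuration rather than writing new code.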

Key Technology 2: Multi‑Mode Service Forms

Data services are offered in several forms:

KV API: Simple key‑value lookups supporting millions of QPS with millisecond latency, auto‑generated via templates, returning Protobuf structures for easy ORM usage. Typical use cases include IP‑to‑geo lookup and user‑profile queries.

SQL API: Flexible complex queries built on OLAP/OLTP engines via a fluent API, supporting nested conditions, aggregation, pagination, or full data retrieval. Used for user segmentation based on multiple tags.

Union API: Composite API that merges multiple atomic APIs in serial or parallel, reducing client‑side calls and latency.
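To make the SQL API and Union API forms more concrete, here is a small sketch of a fluent query builder and a parallel union call. The `Query` class and `union_parallel` function are hypothetical names chosen for illustration. The article does not show the platform's real client API, and the builder covers only flat AND conditions and pagination.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

_ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


class Query:
    """Toy fluent builder that emits parameterized SQL (illustrative only)."""

    def __init__(self, table: str) -> None:
        # Table and column names are assumed to come from trusted configuration.
        self._table = table
        self._conditions: List[str] = []
        self._params: List[Any] = []
        self._limit = None
        self._offset = None

    def where(self, column: str, op: str, value: Any) -> "Query":
        if op not in _ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        self._conditions.append(f"{column} {op} ?")
        self._params.append(value)
        return self

    def page(self, number: int, size: int) -> "Query":
        self._limit = size
        self._offset = (number - 1) * size
        return self

    def build(self) -> Tuple[str, Tuple[Any, ...]]:
        sql = f"SELECT * FROM {self._table}"
        if self._conditions:
            sql += " WHERE " + " AND ".join(self._conditions)
        if self._limit is not None:
            sql += f" LIMIT {self._limit} OFFSET {self._offset}"
        return sql, tuple(self._params)


def union_parallel(apis: Dict[str, Callable[[str], Any]], key: str) -> Dict[str, Any]:
    """Call several atomic APIs concurrently and merge results by API name."""
    with ThreadPoolExecutor(max_workers=max(1, len(apis))) as pool:
        futures = {name: pool.submit(fn, key) for name, fn in apis.items()}
        return {name: future.result() for name, future in futures.items()}


# Usage: segment users by two tags, then merge two atomic lookups in one call.
sql, params = Query("user_tags").where("age", ">=", 18).where("city", "=", "Beijing").page(2, 50).build()
merged = union_parallel({"profile": lambda k: {"id": k}, "geo": lambda k: "CN"}, "u1")
```

With the union helper, the client makes one call instead of one per atomic API, and the total latency is roughly that of the slowest atomic call rather than the sum of all of them.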

Key Technology 3: Efficient Data Acceleration

Data assets stored in slower engines need acceleration to meet online traffic demands. Two methods are used: full data acceleration and multi‑level caching.

Full Data Acceleration

Raw data from sources such as Kafka, MySQL, and logs is ingested, modeled, and synchronized to high-speed stores such as Redis, HBase, and Druid by a distributed scheduler built on DataX. The platform syncs up to 1.2 trillion rows (≈20 TB) per day.
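The core of such a sync job is a batched copy from a source reader to a keyed target store. The sketch below shows only that idea under simplifying assumptions: it does not use DataX or its job format, the function names are hypothetical, and a dict stands in for a store such as Redis.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def in_batches(rows: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def sync_to_store(
    rows: Iterable[Dict[str, Any]],
    store: Dict[str, Dict[str, Any]],
    key_field: str,
    batch_size: int = 1000,
) -> int:
    """Write rows to the target store batch by batch; return the row count."""
    synced = 0
    for batch in in_batches(rows, batch_size):
        # One bulk write per batch keeps round trips to the target store low.
        store.update({row[key_field]: row for row in batch})
        synced += len(batch)
    return synced
```

Reading the source lazily and writing in fixed-size batches keeps memory bounded regardless of table size, which matters when a single job moves billions of rows.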

Multi‑Level Caching

The platform stores data in Redis, HBase, Druid, ClickHouse, and similar engines, and serves hot data from additional cache layers. Users can configure a cache strategy per API and apply compression (ZSTD, Snappy, or GZIP) to reduce storage, in some cases by up to 90%.
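A two-level read path with compressed cache entries might look like the sketch below. It is an assumption-laden illustration: `TwoLevelCache` is a hypothetical class, a dict stands in for the backing store, GZIP from the Python standard library is used in place of ZSTD or Snappy, and eviction is simple first-in-first-out rather than whatever strategy the platform exposes.

```python
import gzip
import json
from typing import Any, Dict, Optional


class TwoLevelCache:
    """Local compressed cache in front of a slower backing store (illustrative)."""

    def __init__(self, backing_store: Dict[str, Any], max_local_items: int = 1000) -> None:
        self._backing = backing_store
        self._local: Dict[str, bytes] = {}  # key -> gzip-compressed JSON
        self._max = max_local_items
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[Any]:
        blob = self._local.get(key)
        if blob is not None:
            self.hits += 1
            return json.loads(gzip.decompress(blob))
        self.misses += 1
        value = self._backing.get(key)
        if value is not None:
            if len(self._local) >= self._max:
                # Evict the oldest inserted entry (dicts keep insertion order).
                self._local.pop(next(iter(self._local)))
            # Values are assumed to be JSON-serializable.
            self._local[key] = gzip.compress(json.dumps(value).encode("utf-8"))
        return value
```

Compressing cached values trades a little CPU on each hit for a smaller memory footprint, which is the same trade the per-API compression setting makes.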

Key Technology 4: High‑Availability Guarantees

High availability is ensured through three mechanisms:

Elastic Service Framework

Resource Isolation

Full‑Link Monitoring

Elastic Service Framework

Services run in Kuaishou’s self‑developed elastic container cloud. RPC services register with KESS (service registry & discovery). Faulty instances are automatically removed. Full monitoring covers availability, latency, QPS, container CPU/memory, etc.
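The automatic removal of faulty instances can be illustrated with a heartbeat-based registry. This is a generic sketch, not the KESS API: `ServiceRegistry`, its methods, and the time-to-live rule are assumptions made for the example, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict, List


class ServiceRegistry:
    """Toy registry that drops instances whose heartbeat has expired."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, instance: str) -> None:
        """Register the instance, or refresh its last-seen timestamp."""
        self._last_seen.setdefault(service, {})[instance] = self._clock()

    def healthy_instances(self, service: str) -> List[str]:
        """Remove expired instances and return the remaining ones, sorted."""
        now = self._clock()
        instances = self._last_seen.get(service, {})
        for instance, seen in list(instances.items()):
            if now - seen > self._ttl:
                del instances[instance]
        return sorted(instances)


# Usage with a fake clock: the second instance stops sending heartbeats.
now = [0.0]
registry = ServiceRegistry(ttl_seconds=10, clock=lambda: now[0])
registry.heartbeat("user_profile", "10.0.0.1:8080")
registry.heartbeat("user_profile", "10.0.0.2:8080")
now[0] = 8.0
registry.heartbeat("user_profile", "10.0.0.1:8080")
now[0] = 15.0
alive = registry.healthy_instances("user_profile")
```

Callers that resolve instances through the registry stop routing to a failed instance as soon as its heartbeat expires, without any manual intervention.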

Resource Isolation

Isolation reduces impact of failures. Deployments are isolated by business line and priority (high/medium/low), ensuring independent operation. Multiple data services within a line can be mixed‑deployed to improve resource utilization.
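One way to picture this scheme is as a grouping of services into deployment pools keyed by business line and priority. The sketch below assumes exactly that reading; `DataService` and `plan_deployment_groups` are hypothetical names, and real placement would also account for capacity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

PRIORITIES = ("high", "medium", "low")


@dataclass(frozen=True)
class DataService:
    name: str
    business_line: str
    priority: str  # one of PRIORITIES


def plan_deployment_groups(
    services: Iterable[DataService],
) -> Dict[Tuple[str, str], List[str]]:
    """Group services so each (business line, priority) pair gets its own pool."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for service in services:
        if service.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {service.priority}")
        # Services sharing a key are co-deployed; different keys never share a pool.
        groups[(service.business_line, service.priority)].append(service.name)
    return dict(groups)
```

A failure or traffic spike in one pool then affects only services with the same business line and priority, while co-deployment inside a pool keeps utilization up.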

Full‑Link Monitoring

Monitoring covers three aspects:

Data synchronization: monitors data quality, timeouts, and failures when syncing assets to fast storage.

Service stability: a sentinel service tracks API metrics such as latency and availability.

Business correctness: ensures API responses match underlying data assets for consistency.
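The business-correctness check can be approximated by sampling keys and comparing what the API returns against the underlying data asset. The following is a minimal sketch under that assumption; `find_mismatches` is a hypothetical helper, and dicts stand in for both the warehouse table and the serving store.

```python
from typing import Any, Callable, Dict, Iterable, List


def find_mismatches(
    sample_keys: Iterable[str],
    api_lookup: Callable[[str], Any],
    source_of_truth: Dict[str, Any],
) -> List[str]:
    """Return the sampled keys whose API response differs from the source data."""
    mismatched = []
    for key in sample_keys:
        if api_lookup(key) != source_of_truth.get(key):
            mismatched.append(key)
    return mismatched


# Usage: "u2" was synced incorrectly, so the check flags it.
warehouse = {"u1": {"level": 3}, "u2": {"level": 5}}
served = {"u1": {"level": 3}, "u2": {"level": 4}}
bad_keys = find_mismatches(["u1", "u2"], served.get, warehouse)
```

Running such a comparison on a sample after each sync gives an alert signal that is independent of latency and availability metrics, which can look healthy even when the served data is wrong.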

Conclusion and Outlook

Since 2017, the platform has supported diverse scenarios (live streaming, short video, e-commerce, and internal systems), with online QPS reaching 10 million at millisecond-level latency. It offers multiple API modes, permission control, and an API marketplace, further empowering the business.

Future development will focus on:

Aligning closely with evolving business needs by abstracting and consolidating common data service capabilities.

Deepening data asset management, including registration, tagging, mapping, and open services.

The platform will evolve toward a unified OneService system, emphasizing:

Support for diverse data sources, including wide tables, files, and machine‑learning models.

Multiple data retrieval methods such as synchronous, asynchronous, push, and scheduled tasks.

A unified API gateway integrating permission control, rate limiting, and traffic management for both platform‑created and user‑developed APIs.
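As an illustration of what a unified gateway combines, the sketch below chains a permission check with per-caller rate limiting. It describes a possible shape for a roadmap item, not an existing component: `Gateway`, `TokenBucket`, and the token-bucket algorithm itself are assumptions chosen for the example.

```python
import time
from typing import Callable, Dict, Set, Tuple


class TokenBucket:
    """Standard token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock: Callable[[], float]) -> None:
        self._rate = rate
        self._capacity = capacity
        self._tokens = float(capacity)
        self._clock = clock
        self._updated = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._updated = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class Gateway:
    """Checks permission first, then applies a rate limit per (caller, API)."""

    def __init__(
        self,
        grants: Set[Tuple[str, str]],
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._grants = grants
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def check(self, caller: str, api: str) -> str:
        if (caller, api) not in self._grants:
            return "denied"
        pair = (caller, api)
        if pair not in self._buckets:
            self._buckets[pair] = TokenBucket(self._rate, self._capacity, self._clock)
        return "ok" if self._buckets[pair].allow() else "rate_limited"
```

Because the checks live in the gateway rather than in each service, the same policy would apply to platform-generated APIs and to APIs that users develop themselves.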

