
How Tencent Cloud Leverages Spark, ElasticSearch, and Flink for PB‑Scale Data Warehousing

The Cloud+ community and Kuaishou hosted a big‑data technology salon where experts detailed the evolution, architecture, and practical deployments of Spark‑based cloud data warehouses, ElasticSearch, Yarn, and Flink, highlighting trends, optimization techniques, and future directions for enterprise data analytics.

Tencent Cloud Developer

Aug 30, 2019


Event Overview

On August 24, the Cloud+ community (Tencent Cloud official developer community) partnered with Kuaishou to hold a "Big Data Technology Practice and Application" salon. Speakers from Tencent Cloud and Kuaishou presented the development history, architectural optimizations, and real‑world use cases of Spark, ElasticSearch, Yarn, MapReduce, and Flink.

AI and Big Data Relationship

Speaker Ding Xiaokun explained that AI relies heavily on high‑quality data; better data yields more accurate deep‑learning models. In typical recommendation pipelines, data extraction, preparation, model training, and model publishing form a cyclic iteration that demands efficient data‑model interaction.

To meet these demands, frameworks such as TensorFlowOnSpark, Angel (Tencent), and BigDL (Intel) integrate Spark’s compute engine with AI training frameworks, while GPU scheduling benefits from advances in Kubernetes and Hadoop 3.x.

Building a PB‑Scale Cloud Data Warehouse with Spark

Ding outlined four reasons Tencent Cloud chose Spark as the core compute engine:

Rich ecosystem and broad scenario support.

Multi‑language support (Python, SQL, R, Scala, Java) makes it easy to adopt.

Strong open‑source community enables deep technical insight.

DAG model, in‑memory RDD computation, and fine‑grained scheduling deliver superior performance.

Tencent Cloud’s data warehouse adopts a cloud‑native architecture with three node types (master, core compute, and elastic compute). This design supports horizontal scaling, elastic resource allocation, and separation of storage from compute. Several features improve stability and query efficiency: graceful container removal, virtual‑environment optimizations (an added node‑group layer in the topology, so that reads prefer the local node, then the node group, then the rack, and only then go off‑rack), and Bloom filter indexes on Parquet data.
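
The read strategy described above can be pictured as a distance ranking over replica locations. The following is a minimal sketch under stated assumptions, not Tencent Cloud's implementation: the `Location` type, the numeric distance scale, and the rack, host, and VM names are all hypothetical. Only the preference order (node, node group, rack, off-rack) comes from the talk.

```python
from typing import NamedTuple, Sequence


class Location(NamedTuple):
    """Hypothetical topology address: rack > node group > node."""
    rack: str
    node_group: str  # assumed here: virtual nodes that share one physical host
    node: str


def distance(reader: Location, replica: Location) -> int:
    """Smaller is closer: 0 same node, 1 same node group, 2 same rack, 3 off-rack."""
    if reader.rack != replica.rack:
        return 3
    if reader.node_group != replica.node_group:
        return 2
    if reader.node != replica.node:
        return 1
    return 0


def pick_replica(reader: Location, replicas: Sequence[Location]) -> Location:
    """Return the closest replica; on a tie, the first one listed wins."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: distance(reader, r))


reader = Location("rack1", "hostA", "vm1")
replicas = [
    Location("rack2", "hostC", "vm7"),  # off-rack
    Location("rack1", "hostB", "vm4"),  # same rack, different node group
    Location("rack1", "hostA", "vm2"),  # same node group as the reader
]
chosen = pick_replica(reader, replicas)
print(chosen.node)
```

Without the node-group level, `vm2` and `vm4` would look equally close (both are simply "same rack"), so a reader could pull data across the network even though a copy sits on the same physical host.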

Shuffle performance remains a critical factor in overall Spark job performance; Ding noted that ongoing innovations in memory and storage continue to improve Spark’s efficiency.
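
To see why shuffle tends to dominate cost, it helps to look at what the step does: every map task may send data to every reduce partition. The sketch below is a toy, single-process illustration in plain Python, not Spark's shuffle implementation. The function names and the sample records are hypothetical, and `zlib.crc32` stands in for a real partitioner only because it is stable across runs.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Stable hash partitioner (crc32 rather than Python's per-process salted hash())."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from every map task to its reduce partition.

    Returns the reduce-side inputs and the number of non-empty
    (map task, reduce partition) blocks that had to be moved.
    """
    reduce_inputs = defaultdict(list)
    blocks = set()
    for map_id, records in enumerate(map_outputs):
        for key, value in records:
            r = partition(key, num_reducers)
            reduce_inputs[r].append((key, value))
            blocks.add((map_id, r))
    return reduce_inputs, len(blocks)


def word_count(map_outputs, num_reducers):
    """Sum values per key after the shuffle; also report how many blocks moved."""
    reduce_inputs, blocks = shuffle(map_outputs, num_reducers)
    totals = {}
    for records in reduce_inputs.values():
        for key, value in records:
            totals[key] = totals.get(key, 0) + value
    return totals, blocks


# Three map tasks, each emitting (word, 1) pairs.
map_outputs = [
    [("spark", 1), ("flink", 1)],
    [("spark", 1), ("yarn", 1)],
    [("flink", 1), ("spark", 1)],
]
totals, blocks = word_count(map_outputs, num_reducers=2)
print(totals, blocks)
```

The number of blocks is bounded by map tasks times reduce partitions, which is why shuffle-heavy jobs put so much pressure on disk, memory, and network as they scale.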

ElasticSearch Product Architecture and Practices

ElasticSearch, introduced around 2010, evolved from a search engine into a full‑featured analytics platform. Built on Lucene’s inverted index, it offers full‑text search, NoSQL‑style document storage, and OLAP‑style analysis, all exposed through a RESTful management interface.
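
The inverted index behind full-text search is simple at its core: a map from each term to the documents that contain it. The sketch below is a deliberately naive version in plain Python, not Lucene's data structure. The tokenizer, the function names, and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (a very naive analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """AND query: return the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Spark shuffle tuning",
    2: "Flink and Spark streaming",
    3: "ElasticSearch full-text search",
}
index = build_index(docs)
print(search(index, "spark"))
print(search(index, "spark flink"))
```

A lookup touches only the posting sets of the query terms rather than scanning every document, which is what keeps search fast as the corpus grows.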

Tencent Cloud enhances ElasticSearch with X‑Pack security features (role‑based access control, SSL/TLS, audit logging), high‑availability across zones, VPC isolation, automated backups, and a recycle‑bin for data protection. The architecture supports three node roles (master, data, client) and employs cross‑zone replica placement, dedicated master nodes, and balanced shard allocation to avoid hotspots.
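
Cross-zone replica placement and balanced shard allocation can be illustrated with a small greedy allocator. This is a sketch under assumptions, not ElasticSearch's or Tencent Cloud's allocation algorithm: the `allocate` function, the node names, and the zone names are hypothetical, and each shard is assumed to have exactly one replica.

```python
from collections import Counter


def allocate(num_shards: int, nodes: dict[str, str]) -> dict[int, tuple[str, str]]:
    """Assign each shard a (primary, replica) pair of data nodes.

    `nodes` maps node name -> zone. A replica always lands in a different
    zone from its primary, and each pick takes the least-loaded candidate
    (ties broken by node name) to avoid hotspots.
    """
    if len(set(nodes.values())) < 2:
        raise ValueError("need data nodes in at least two zones")
    load = Counter({name: 0 for name in nodes})
    placement = {}
    for shard in range(num_shards):
        primary = min(nodes, key=lambda n: (load[n], n))
        load[primary] += 1
        other_zone = [n for n in nodes if nodes[n] != nodes[primary]]
        replica = min(other_zone, key=lambda n: (load[n], n))
        load[replica] += 1
        placement[shard] = (primary, replica)
    return placement


nodes = {"es-1": "zone-a", "es-2": "zone-a", "es-3": "zone-b", "es-4": "zone-b"}
placement = allocate(4, nodes)
per_node = Counter(n for pair in placement.values() for n in pair)
for shard, (primary, replica) in placement.items():
    print(shard, primary, replica)
```

With this placement, losing an entire zone still leaves one copy of every shard available. In a real cluster this behavior is configured rather than hand-coded; ElasticSearch's shard allocation awareness settings are the relevant feature.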


