LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

LeKe Sports built and continuously upgraded its Hadoop-based big data platform to meet data demands that grew rapidly to petabyte scale. The platform moved from a manual ETL-to-Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, a SQL-based query layer, Elasticsearch indexing, and cloud-hosted storage and backup.

Architecture Digest

Feb 11, 2017

LeKe Sports' big data platform is built on the Hadoop ecosystem and supports daily operational reports and core company metrics. As the business expanded in 2016 across both online and offline services, data volume surged from gigabytes to petabytes, which forced the team to rethink the platform's scalability and predictability.

Early ETL Reporting Platform

Initially, scattered statistical needs were met by using scripts to export business data into Elasticsearch, where complex queries could be run. By the end of 2016, growing demand and a wider range of reporting dimensions required a dedicated team and exposed the limits of this ad-hoc approach. The early reporting system relied on Oozie-scheduled Hive ETL jobs that loaded processed data into Elasticsearch, but the workflow was slow, involved many hand-offs, and offered limited query flexibility.

To address these shortcomings, the team developed an internal data visualization platform (DMS) that allowed multi‑dimensional web queries, though it still struggled with non‑customized, exploratory, and analytical requirements, leading to the planning of a more robust data analysis service.
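The load step of the early pipeline, where processed rows end up in Elasticsearch, can be illustrated with a minimal sketch. It only shows how ETL output rows might be serialized into the newline-delimited JSON body that the Elasticsearch bulk API accepts. The index name `daily_report`, the field names, and the sample rows are all hypothetical; the source does not describe the actual schema or loader.

```python
import json


def to_bulk_ndjson(rows, index_name, id_field):
    """Serialize ETL output rows into an Elasticsearch _bulk request body."""
    if not rows:
        return ""
    lines = []
    for row in rows:
        # Action line: names the target index and the document id.
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        lines.append(json.dumps(action, sort_keys=True))
        # Source line: the document itself.
        lines.append(json.dumps(row, sort_keys=True))
    # The bulk format is newline-delimited JSON and must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, standing in for the output of a Hive ETL job.
daily_report_rows = [
    {"report_id": "2016-12-01:beijing", "day": "2016-12-01",
     "city": "beijing", "active_users": 1200},
    {"report_id": "2016-12-01:shanghai", "day": "2016-12-01",
     "city": "shanghai", "active_users": 950},
]

bulk_body = to_bulk_ndjson(daily_report_rows, "daily_report", "report_id")
print(bulk_body)
```

Using a deterministic document id (here the day plus the city) means a rerun of the same job overwrites earlier documents instead of duplicating them, which matters for a scheduled workflow that may be retried.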

LeKe Data Platform 2.0

Version 2.0 introduced a major redesign covering data collection, processing, storage, and dirty-data alerts. The visualization layer now supports customizable query dimensions, and a SQL-based data access layer enables rapid queries for product and development teams.
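One way to picture "customizable query dimensions" on top of a SQL access layer is a small query builder that accepts only known dimensions and turns them into a `GROUP BY`. The sketch below is an illustration under stated assumptions, not the platform's implementation: the table `usage_facts`, its columns, and the allowed dimension set are hypothetical, and an in-memory SQLite database stands in for whatever SQL engine the platform actually uses.

```python
import sqlite3

# Hypothetical dimensions a user may pick in the visualization layer.
ALLOWED_DIMENSIONS = {"day", "city", "device_type"}


def build_report_query(dimensions):
    """Build an aggregate query grouped by the requested dimensions."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    unknown = [d for d in dimensions if d not in ALLOWED_DIMENSIONS]
    if unknown:
        # Column names cannot be bound as SQL parameters, so they are
        # checked against an allow-list before being placed in the query.
        raise ValueError(f"unsupported dimensions: {unknown}")
    cols = ", ".join(dimensions)
    return (f"SELECT {cols}, SUM(sessions) AS sessions "
            f"FROM usage_facts GROUP BY {cols} ORDER BY {cols}")


# In-memory SQLite database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_facts "
    "(day TEXT, city TEXT, device_type TEXT, sessions INTEGER)"
)
conn.executemany(
    "INSERT INTO usage_facts VALUES (?, ?, ?, ?)",
    [
        ("2016-12-01", "beijing", "treadmill", 40),
        ("2016-12-01", "beijing", "bike", 25),
        ("2016-12-01", "shanghai", "treadmill", 30),
        ("2016-12-02", "beijing", "treadmill", 35),
    ],
)

by_city = conn.execute(build_report_query(["city"])).fetchall()
by_city_and_device = conn.execute(
    build_report_query(["city", "device_type"])
).fetchall()
print(by_city)
print(by_city_and_device)
```

The allow-list is the main design point: it lets users choose dimensions freely without allowing arbitrary text into the generated SQL.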

Streaming Data Processing and Computation

LeKe’s smart-fitness devices generate massive daily data streams. The pipeline is: (1) users trigger device usage via the app, sending data to Alibaba Cloud ONS; (2) Spark Streaming consumes the data in real time; (3) a data model produces result sets; (4) results are displayed on a monitoring dashboard. Processed streaming results are also fed into Elasticsearch for multi-dimensional queries such as user-profile mining.
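The four pipeline steps can be sketched in plain Python to show the micro-batch idea behind this kind of stream processing. This is not Spark Streaming code and does not call the ONS client: an in-memory list stands in for the message queue, and the event fields (`ts`, `user_id`, `device_type`), the 10-second batch interval, and the per-device-type count used as the "data model" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

BATCH_SECONDS = 10  # hypothetical micro-batch interval


def micro_batches(events, batch_seconds=BATCH_SECONDS):
    """Group events into fixed time windows keyed by the window start."""
    batches = defaultdict(list)
    for event in events:
        window_start = event["ts"] - event["ts"] % batch_seconds
        batches[window_start].append(event)
    return dict(sorted(batches.items()))


def usage_per_device_type(batch):
    """Toy 'data model': count device activations per device type."""
    return Counter(event["device_type"] for event in batch)


# Step 1: device-usage events, standing in for messages read from the queue.
events = [
    {"ts": 1, "user_id": "u1", "device_type": "treadmill"},
    {"ts": 4, "user_id": "u2", "device_type": "bike"},
    {"ts": 9, "user_id": "u3", "device_type": "treadmill"},
    {"ts": 12, "user_id": "u1", "device_type": "bike"},
]

# Steps 2 and 3: consume the events batch by batch and compute a result set.
dashboard = {
    start: dict(usage_per_device_type(batch))
    for start, batch in micro_batches(events).items()
}

# Step 4: the result sets a monitoring dashboard would display.
print(dashboard)
```

In a real deployment the batching, fault tolerance, and distribution would be handled by the streaming engine; the sketch only shows the shape of the computation applied to each batch.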

Data Storage and Backup

Leveraging Alibaba Cloud services, the platform stores raw data in OSS and HDFS, loads data from local machines or OSS, and synchronizes columnar data to HBase. Regular backups and restores are handled by OSS, ensuring durability across the entire architecture.

The article concludes with a visual architecture diagram (omitted) and a copyright notice indicating the content originates from the original author.
