
Longgong Data Analysis Platform: Architecture and Solutions for Large‑Scale Structured Data

The Longgong Data Analysis Platform enables Idle Fish to capture, store, and analyze billions of structured product attributes in real time across more than 8,000 categories, using TableStore, MySQL, ODPS, and a distributed scheduler to achieve over 50% query speedup, 80% category coverage, and rapid support for search and recommendation teams.

Xianyu Technology

Jun 8, 2021


Idle Fish (Xianyu) has difficulty obtaining structured product attributes from individual (C‑end) sellers, because its lightweight publishing flow asks for little detail and so limits data quality. To enrich product understanding, a structured attribute supplement step was added to the publishing flow, which improved data capture without harming the user experience.

The Longgong Data Analysis Platform was built to provide real‑time, multi‑dimensional analytics for these attributes. Its design goals include real‑time coverage analysis, support for 8,000+ leaf categories, and unified management of category attributes, SPU data, and operational strategies.

Key challenges in the data pipeline are massive data volume (over 2 billion records), high QPS (15 k+), heterogeneous sources (10+ types), and complex analytical queries that ordinary databases cannot handle efficiently.

TableStore, a wide‑column store, was chosen for storing structured product information because of its scalability and high availability (99.99%). Online data resides in MySQL and is streamed to a source table; offline data produced by algorithms is ingested via ODPS + MQ and written to the same source table. An analytical database (ADB) handles complex SQL, real‑time indexing, and hot‑cold data separation.

For offline heterogeneous data sources, a unified standard table (idle_kgraph_std_source) in ODPS aggregates all algorithm outputs. A Blink task synchronizes this table to TableStore, merging multi‑source records per scene to reduce write operations.
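The per-scene merge can be illustrated with a minimal Python sketch. It is not the Blink job itself: the record shape (item id, scene, source, attributes) and the names `merge_by_scene`, `ocr_model`, and `title_ner` are assumptions made for illustration. It shows only how several source records for one item and scene collapse into a single row, so that one write replaces several.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (item_id, scene, source, attributes).
Record = Tuple[str, str, str, Dict[str, str]]


def merge_by_scene(records: Iterable[Record]) -> List[dict]:
    """Collapse all records that share (item_id, scene) into one row to write."""
    merged: Dict[Tuple[str, str], dict] = {}
    for item_id, scene, source, attrs in records:
        row = merged.setdefault(
            (item_id, scene),
            {"item_id": item_id, "scene": scene, "columns": {}},
        )
        # One column per source, so outputs from different algorithms
        # do not overwrite each other inside the merged row.
        row["columns"][source] = attrs
    return list(merged.values())


records = [
    ("item-1", "publish", "ocr_model", {"brand": "Apple"}),
    ("item-1", "publish", "title_ner", {"model": "iPhone 12"}),
    ("item-2", "publish", "ocr_model", {"brand": "Nike"}),
]
rows = merge_by_scene(records)
print(f"{len(records)} source records -> {len(rows)} writes")
```

Here three source records become two writes; the saving grows with the number of sources that report on the same item and scene.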

Data fusion follows rules defined by the product and operations teams. A distributed task scheduler splits full‑load jobs into shards, so that 600 million records can be processed in 40 minutes. Processing is idempotent, and incremental and full runs are isolated from each other.
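A rough sketch of sharded, idempotent full-load processing follows, assuming items are keyed by a numeric id and that fusion results are written as upserts. The function names, the in-memory `source` and `sink` dictionaries, and the placeholder `fuse` rule are all hypothetical; the real scheduler, and its isolation of incremental from full runs, is not modeled here.

```python
from typing import Callable, Dict, List, Tuple

Shard = Tuple[int, int]  # half-open id range [start, end)


def make_shards(min_id: int, max_id: int, shard_count: int) -> List[Shard]:
    """Split the inclusive id range [min_id, max_id] into contiguous shards."""
    total = max_id - min_id + 1
    size, extra = divmod(total, shard_count)
    shards: List[Shard] = []
    start = min_id
    for i in range(shard_count):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty shards when shard_count > total
            shards.append((start, end))
        start = end
    return shards


def process_shard(shard: Shard, source: Dict[int, dict], sink: Dict[int, dict],
                  fuse: Callable[[dict], dict]) -> None:
    """Upsert keyed by item id: re-running a shard rewrites the same rows."""
    start, end = shard
    for item_id in range(start, end):
        if item_id in source:
            sink[item_id] = fuse(source[item_id])


def fuse(raw: dict) -> dict:
    # Placeholder fusion rule (assumption): normalize attribute values.
    return {key: value.strip().lower() for key, value in raw.items()}


source = {i: {"brand": f" Brand{i} "} for i in range(1, 11)}
sink: Dict[int, dict] = {}
shards = make_shards(1, 10, 3)
for shard in shards:
    process_shard(shard, source, sink, fuse)

snapshot = dict(sink)
process_shard(shards[0], source, sink, fuse)  # simulate a retry of one shard

# Throughput implied by the figures in the text: 600 million records in 40 minutes.
required_rate = 600_000_000 // (40 * 60)  # 250,000 records per second
```

Because each shard writes by key instead of appending, a failed shard can simply be retried without leaving duplicates, which is what makes splitting the full load across workers safe.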

The analysis module separates dimension‑based queries (routed via a Distributor to specific processors) from filter/sort conditions, which are served from cache and applied in‑memory, improving query efficiency by over 50%.
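One way to read this split is sketched below, under stated assumptions: a `Distributor` class routes each dimension to a registered processor, caches that processor's result set, and then applies filter and sort conditions to the cached rows in memory. The class layout, method names, and sample rows are hypothetical and are not taken from the platform's code.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Processor = Callable[[str], List[Row]]


class Distributor:
    """Routes dimension queries to processors; filters and sorts in memory."""

    def __init__(self) -> None:
        self._processors: Dict[str, Processor] = {}
        self._cache: Dict[Tuple[str, str], List[Row]] = {}
        self.backend_calls = 0  # counts how often a processor actually ran

    def register(self, dimension: str, processor: Processor) -> None:
        self._processors[dimension] = processor

    def query(self, dimension: str, value: str,
              where: Optional[Callable[[Row], bool]] = None,
              sort_key: Optional[str] = None,
              descending: bool = False) -> List[Row]:
        key = (dimension, value)
        if key not in self._cache:
            # Only the dimension part of the query reaches the backend.
            self.backend_calls += 1
            self._cache[key] = self._processors[dimension](value)
        rows = self._cache[key]
        if where is not None:
            rows = [row for row in rows if where(row)]
        if sort_key is not None:
            rows = sorted(rows, key=lambda row: row[sort_key], reverse=descending)
        return list(rows)  # copy, so callers cannot mutate the cache


def by_category(value: str) -> List[Row]:
    # Stand-in for an analytical query against the backing store.
    return [
        {"category": value, "attribute": "brand", "coverage": 0.82},
        {"category": value, "attribute": "color", "coverage": 0.41},
        {"category": value, "attribute": "size", "coverage": 0.67},
    ]


distributor = Distributor()
distributor.register("category", by_category)

top = distributor.query("category", "phones",
                        where=lambda row: row["coverage"] > 0.5,
                        sort_key="coverage", descending=True)
again = distributor.query("category", "phones", sort_key="coverage")
```

Both calls share one backend query; changing only the filter or the sort order never goes back to the database, which is where a design like this would save time.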

Since deployment, the platform supports over 8,000 category dimensions, contributes to 80% category coverage, and provides rapid query capabilities for search and recommendation teams, significantly accelerating bad‑case attribution.

Future work includes tighter integration with product publishing and transaction data, richer visualizations, and feedback loops to algorithms for personalized attribute prediction.


