
Design and Implementation of SQL on Streaming (SQL 1.0 → SQL 2.0) in a Real‑Time Computing Platform

This article describes the evolution of a real‑time computing platform from SQL 1.0, built on Spark Structured Streaming, to SQL 2.0, powered by Flink‑SQL, covering dynamic tables, continuous queries, dimension‑table joins, cache optimization, DDL extensions, platformization, operational challenges, and the future roadmap.

Beike Product & Technology

The platform provides a unified real‑time computing environment supporting over 570 streaming tasks with daily log throughput of 104.1 billion records. To reduce development and maintenance costs, the team explored using SQL as a universal language for streaming workloads.

SQL on Streaming introduces the concept of treating a continuous data stream as a dynamic table that can be queried with standard SQL. Two update modes are defined: append mode (new records are inserted) and update mode (records update existing rows based on a key).
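As a minimal illustration (hypothetical Python, not platform code), the two modes differ only in how an incoming record lands in the table: append mode inserts every record, while update mode upserts on the declared key.

```python
# Minimal sketch of the two dynamic-table update modes described above.
class DynamicTable:
    def __init__(self, mode, key=None):
        assert mode in ("append", "update")
        self.mode = mode
        self.key = key      # upsert key, required in update mode
        self.rows = []      # append mode: ordered log of rows
        self.keyed = {}     # update mode: latest row per key

    def apply(self, row):
        if self.mode == "append":
            self.rows.append(row)            # every record is a new row
        else:
            self.keyed[row[self.key]] = row  # new record replaces the old row

    def snapshot(self):
        return list(self.rows) if self.mode == "append" else list(self.keyed.values())

events = [{"ucid": "u1", "system_type": "ios"},
          {"ucid": "u1", "system_type": "android"},
          {"ucid": "u2", "system_type": "ios"}]

append_tb = DynamicTable("append")
update_tb = DynamicTable("update", key="ucid")
for e in events:
    append_tb.apply(e)
    update_tb.apply(e)

print(len(append_tb.snapshot()))  # 3: every event inserted
print(len(update_tb.snapshot()))  # 2: u1's row was updated in place
```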

Continuous queries are expressed as R[t] = q(A[t]), meaning that at each timestamp t the snapshot A[t] of the dynamic table is queried to produce a result snapshot R[t].
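The equation can be checked with a small sketch (illustrative Python; a count-per-key query stands in for q): evaluating q over the full snapshot A[t] after each record yields the same result R[t] as maintaining the result incrementally.

```python
# Illustrative check of R[t] = q(A[t]) for a continuous count-per-key query.
from collections import Counter

def q(snapshot):
    # q: count events per ucid over the whole table snapshot
    return Counter(row["ucid"] for row in snapshot)

stream = [{"ucid": "u1"}, {"ucid": "u2"}, {"ucid": "u1"}]

A = []                    # dynamic table in append mode
incremental = Counter()   # incrementally maintained result
for row in stream:
    A.append(row)                 # A[t]: snapshot after t records
    incremental[row["ucid"]] += 1
    assert q(A) == incremental    # R[t] = q(A[t]) holds at every step

print(dict(incremental))  # {'u1': 2, 'u2': 1}
```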

SQL 1.0 was implemented on Spark Structured Streaming (version 2.3.1). It supported simple SELECT, UNION, and temporary‑view DAG composition, with configurable formatters for Kafka sources and sinks. Example DDL defining a Kafka source table, a Redis dimension table, and a console sink table:

-- Define the source table
create source table kafka_source_tb (
  system_type string extraction '$.system_type',
  client_os_type string extraction '$.client_os_type',
  ucid string extraction '$.ucid',
  ts bigint extraction '$.timestamp',
  watermark key ts delay for 5s
) with (
  type = 'kafka',
  dataType = 'json',
  brokers = 'off03-bigdata.mars.ljnode.com:9092',
  topics = 'data-pipeline-common-dev',
  groupId = 'test-group'
);

-- Define the dimension table
create dim table redis_dim_tb (
  ucid string,
  first_login_time string,
  device_id string,
  primary key ucid
) with (
  type = 'redis',
  server = '127.0.0.1',
  port = '6379',
  cache = 'all'
);

-- Define the sink table
create sink table console_sink_tb (
  ucid string,
  first_login_time string,
  device_id string,
  client_os_type string,
  system_type string,
  ts bigint
) with (
  type = 'console',
  dataType = 'json'
);
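The extraction '$.field' clauses in the source table above map fields of the JSON payload onto table columns. A simplified sketch of that projection, assuming only flat top-level '$.name' paths (the real formatter presumably handles more):

```python
import json

# Simplified sketch of the `extraction` clause: project a JSON Kafka
# record onto table columns via '$.field' paths. Only flat top-level
# paths are handled here.
SCHEMA = {
    "system_type": "$.system_type",
    "client_os_type": "$.client_os_type",
    "ucid": "$.ucid",
    "ts": "$.timestamp",
}

def extract_row(payload: str) -> dict:
    record = json.loads(payload)
    # '$.x' -> record['x']; missing fields become None
    return {col: record.get(path[2:]) for col, path in SCHEMA.items()}

row = extract_row('{"system_type": "app", "client_os_type": "ios", '
                  '"ucid": "u1", "timestamp": 1568012345000}')
print(row["ts"])  # 1568012345000
```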

An example ETL job joins the Kafka source with the Redis dimension table (first_login_time and device_id exist only in the dimension table) and inserts the result into the console sink:

insert into console_sink_tb (
  ucid, first_login_time, device_id, client_os_type, system_type
) select
  a.ucid, b.first_login_time, b.device_id, a.client_os_type, a.system_type
from kafka_source_tb a
join redis_dim_tb b on a.ucid = b.ucid;

SQL 2.0 shifts to Flink‑SQL to provide richer capabilities: dimension‑table joins, user‑defined functions, time‑window aggregations, and a unified DDL syntax. The extended DDL keeps the source, dimension, and sink table definitions shown above, carrying connection metadata in the with clause.


Dimension‑table join processing is achieved by parsing the SQL, extracting the join conditions, converting the stream table to a DataStream, and applying Flink's Async I/O to fetch the external dimension data. After rewriting, the remaining query runs against the joined intermediate table a_J_b:

select
  a_J_b.ucid,
  a_J_b.first_login_time,
  a_J_b.device_id,
  a_J_b.client_os_type,
  a_J_b.system_type
from a_J_b;
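The Async I/O lookup step can be sketched with Python's asyncio standing in for Flink's AsyncDataStream; the in-memory dimension store, field names, and simulated latency here are all illustrative.

```python
import asyncio

# Sketch of the dimension-table join: each stream record triggers a
# non-blocking lookup against the dimension store, and the joined row
# carries the a_J_b column layout (stream fields + dimension fields).
DIM_STORE = {  # stands in for Redis: ucid -> dimension fields
    "u1": {"first_login_time": "2019-01-01", "device_id": "d-001"},
}

async def lookup(ucid):
    await asyncio.sleep(0)        # simulated network round-trip
    return DIM_STORE.get(ucid, {})

async def async_join(stream_rows):
    async def join_one(row):
        dim = await lookup(row["ucid"])
        return {**row, **dim}     # merged row, like a_J_b
    # issue all lookups concurrently instead of blocking per record
    return await asyncio.gather(*(join_one(r) for r in stream_rows))

rows = asyncio.run(async_join([{"ucid": "u1", "system_type": "app"}]))
print(rows[0]["device_id"])  # d-001
```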

Cache optimization stores dimension data in memory with LRU or time‑based eviction to reduce external lookups. For complex join conditions (for example, joining on the expression a.col1 + a.col3), a temporary view computes the expression into a generated field beforehand, e.g.:

select
  a.col1,
  a.col2,
  a.col3,
  a.col1 + a.col3 AS __function_tmp_field__0
from a;
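The LRU and time-based eviction mentioned above can be sketched as follows (a minimal illustration; capacity and ttl are assumed parameter names, not the platform's configuration keys):

```python
import time
from collections import OrderedDict

# Minimal sketch of a dimension cache: LRU eviction bounded by
# `capacity`, plus time-based expiry after `ttl` seconds.
class DimCache:
    def __init__(self, capacity=1024, ttl=60.0):
        self.capacity, self.ttl = capacity, ttl
        self.data = OrderedDict()            # key -> (value, stored_at)

    def get(self, key):
        item = self.data.get(key)
        if item is None:
            return None
        value, stored_at = item
        if time.monotonic() - stored_at > self.ttl:
            del self.data[key]               # expired: force a reload
            return None
        self.data.move_to_end(key)           # mark as recently used
        return value

    def put(self, key, value):
        self.data[key] = (value, time.monotonic())
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict least recently used

cache = DimCache(capacity=2, ttl=60.0)
cache.put("u1", {"device_id": "d-001"})
cache.put("u2", {"device_id": "d-002"})
cache.get("u1")                          # touch u1; u2 becomes the LRU entry
cache.put("u3", {"device_id": "d-003"})  # exceeds capacity, evicts u2
print(cache.get("u2"))  # None -> would fall back to the external store
```

A miss (expired or evicted entry) would trigger the external lookup path described earlier, with the fetched row written back via put.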

Supported dimension sources include HBase, Redis, HTTP, and Phoenix, allowing businesses to sync their operational databases via binlog into these stores for low‑latency joins.

The platform now runs over 200 SQL 2.0 tasks covering real‑time data warehouse, transaction, commercialization, and leasing scenarios. It provides data source management, DDL generation, online SQL validation (ANTLR4 & Calcite), execution plan inspection, and debugging tools, all integrated into a unified monitoring and alerting system.

Future work focuses on data lineage visualization, traffic monitoring and alerting, adaptive resource allocation, large‑state optimization, and building a real‑time data warehouse ecosystem.


Tags: data engineering, Big Data, Flink, SQL, Streaming, Real‑Time Computing, Dimension Table