Understanding Stream‑Table Duality: How Apache Flink Provides a SQL API Using MySQL Binlog

This article explains why the native streaming engine Apache Flink can offer a SQL API by showing that MySQL binlog events form a time‑stamped data stream that can be replayed as a dynamic table, establishing a lossless stream‑table duality useful for both batch and streaming queries.

Big Data Technology & Architecture

Mar 14, 2019

Actual Problem

Many big‑data processing products (Hive, Spark, Flink) expose a SQL API. SQL, originally designed for relational databases, is naturally suited to batch queries where the data set is finite and a query returns a single result. In contrast, Flink operates on native streaming data where the data set is infinite and a query continuously updates its result.

Semantic Relationship Between Stream and Batch

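SQL works on relational tables, which in traditional databases are static at query time and therefore represent a finite batch. To answer why Flink can also provide a SQL API, we must first understand the semantic link between stream and batch. Take a click-event stream in which each event carries a timestamp and a username, and count the clicks per user. A batch job reads all events at once and returns one result; a streaming job reads them one by one and updates its result after each event. Once the streaming job has consumed the same events, its latest result equals the batch result, so the same data source and query logic have the same semantics in both modes.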
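The equivalence can be made concrete with a few lines of Python. This is only a minimal sketch, not Flink code: the event list, its small integer timestamps, and the function names `batch_count` and `streaming_count` are assumptions made for illustration. It counts clicks per user once over the whole data set and once event by event.

```python
from collections import Counter

# Hypothetical click events as (timestamp, username) pairs.
CLICKS = [
    (1, "Mary"),
    (2, "Bob"),
    (3, "Mary"),
    (4, "Llz"),
    (5, "Bob"),
    (6, "Mary"),
]


def batch_count(events):
    """Batch mode: the whole finite data set is visible, one result is returned."""
    return dict(Counter(user for _, user in events))


def streaming_count(events):
    """Streaming mode: consume one event at a time, yield the updated result each time."""
    counts = {}
    for _, user in events:
        counts[user] = counts.get(user, 0) + 1
        yield dict(counts)  # snapshot of the continuously updated result


batch_result = batch_count(CLICKS)
stream_results = list(streaming_count(CLICKS))

print(batch_result)
print(stream_results[-1])
```

The streaming version produces one intermediate result per event; only the last one is compared with the batch result, because that is the point at which both modes have seen the same data.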

Stream‑Table Relationship

Since streams and batches are semantically identical and SQL operates on tables, the question becomes whether a stream can be regarded as a table and vice versa. This article focuses on the duality of streams and tables.

MySQL Master‑Slave Replication

Before discussing stream‑table duality we review MySQL binlog, the core mechanism of master‑slave replication. The replication process consists of three steps:

Master records changes (change logs) as binary‑log events in the binlog.

Slave copies the master’s binary‑log events to its relay log.

Slave replays the relay‑log events, applying the changes to its data.
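The three steps can be sketched as a toy model in Python. It is an illustration under simplifying assumptions, not how MySQL is implemented: the classes `Master` and `Slave`, their methods, and the key/value table are hypothetical, and networking, threads, and log positions are reduced to list indexes.

```python
class Master:
    def __init__(self):
        self.table = {}
        self.binlog = []  # ordered list of change events

    def write(self, key, value):
        self.table[key] = value
        self.binlog.append((key, value))  # step 1: record the change in the binlog


class Slave:
    def __init__(self):
        self.table = {}
        self.relay_log = []
        self.applied = 0  # number of relay-log events already replayed

    def fetch(self, master):
        # Step 2: copy the master's events that are not yet in the relay log.
        self.relay_log.extend(master.binlog[len(self.relay_log):])

    def replay(self):
        # Step 3: apply the pending relay-log events in order.
        for key, value in self.relay_log[self.applied:]:
            self.table[key] = value
        self.applied = len(self.relay_log)


master = Master()
slave = Slave()
master.write("Mary", 1)
master.write("Bob", 1)
master.write("Mary", 2)

slave.fetch(master)
slave.replay()
print(slave.table == master.table)
```

Keeping the relay log separate from the replay step mirrors the description above: copying events and applying them are two distinct stages.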

Binlog Modes

MySQL supports three logging modes:

Statement‑based logging – events contain the original SQL statements.

Row‑based logging – events describe changes to individual rows (used in this article).

Mixed logging – defaults to statement‑based but switches to row‑based for certain cases (e.g., NDB engine, non‑deterministic functions, UDFs, temporary tables).
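Why the mode matters for non-deterministic statements can be shown with a small Python model. This is a hedged sketch, not MySQL behavior in detail: `run_statement` is a hypothetical stand-in for a statement that calls a function such as UUID(), and the dictionaries stand in for tables.

```python
import uuid


def run_statement(table):
    """Stand-in for a non-deterministic statement, e.g. an INSERT that calls UUID()."""
    row = {"id": 1, "token": str(uuid.uuid4())}
    table[row["id"]] = row
    return row


master = {}
written_row = run_statement(master)

# Statement-based logging: the replica re-executes the statement,
# so the non-deterministic function is evaluated a second time.
stmt_replica = {}
run_statement(stmt_replica)

# Row-based logging: the replica applies the row that the master actually wrote.
row_replica = {}
row_replica[written_row["id"]] = dict(written_row)

print(stmt_replica == master)
print(row_replica == master)
```

Row-based events carry the resulting data rather than the instruction that produced it, which is the property the rest of this article relies on.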

Binlog Format

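Each binary-log event consists of two parts: an event header and event data. The header includes a timestamp (seconds since the Unix epoch). Events are stored in the order in which they were written, so slaves replay changes in the original sequence, and the timestamp records when each change happened.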

+=====================================+
| event  | timestamp         0:4      |
| header +----------------------------+
|        | type_code         4:1      |
|        +----------------------------+
|        | server_id         5:4      |
|        +----------------------------+
|        | event_length      9:4      |
+--------+----------------------------+
| ... (version-specific fields)       |
+=====================================+
| event  | fixed part                 |
| data   +----------------------------+
|        | variable part              |
+=====================================+

The timestamp in the header is the key attribute that allows us to treat binlog events as a time‑stamped stream.
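Reading the four header fields listed in the diagram, where each entry is given as offset:length in bytes, can be sketched with Python's `struct` module. The sketch assumes little-endian integers, which is how the binlog stores them, and covers only the first 13 bytes. The function name `parse_header_prefix` and the sample values for type code, server id, and event length are hypothetical; the sample bytes are packed here rather than read from a real file.

```python
import struct
from datetime import datetime, timezone

# Little-endian: uint32 timestamp, uint8 type_code, uint32 server_id, uint32 event_length.
COMMON_HEADER = struct.Struct("<IBII")


def parse_header_prefix(buf):
    """Decode the first 13 bytes of an event header into a dict."""
    timestamp, type_code, server_id, event_length = COMMON_HEADER.unpack_from(buf, 0)
    return {
        "timestamp": timestamp,
        "time": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "type_code": type_code,
        "server_id": server_id,
        "event_length": event_length,
    }


# Hypothetical sample: the first timestamp from this article plus made-up other fields.
sample = COMMON_HEADER.pack(1525099013, 30, 1, 52)
header = parse_header_prefix(sample)

print(COMMON_HEADER.size)
print(header["time"].isoformat())
```

The decoded time is in UTC; the wall-clock times shown later in this article (for example 22:36:53 for 1525099013) correspond to a UTC+8 local time zone.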

Inspecting Binlog with MySQL Commands

Typical steps to view and manipulate binlog:

Check whether the binlog is enabled:

show variables like 'log_bin';

Check the binlog format (ROW mode is required):

show variables like 'binlog_format';

Reset the master, create a test table, and perform DML operations (INSERT, UPDATE).

Show the current binlog file and position:

show master status\G

Display binlog events:

show binlog events in 'binlog.000001';

Exporting Binlog to Text

Use mysqlbinlog to decode row events and write them out as readable text:

sudo mysqlbinlog --start-datetime='2018-04-29 00:00:03' \
    --stop-datetime='2018-05-02 00:30:00' \
    --base64-output=decode-rows -v /usr/local/mysql/data/binlog.000001 > ~/binlog.txt

The resulting binlog.txt contains a series of events such as Write_rows, Update_rows, and Query, each prefixed with a timestamp and the affected table.

Mapping DML Operations to Binlog Records

For each DML statement we can see the corresponding binlog header (timestamp) and the row data. Example mapping:

| DML | Binlog header (timestamp) | Data |
|---|---|---|
| INSERT INTO blink_tab(user, clicks) VALUES ('Mary', 1); | 1525099013 (2018-04-30 22:36:53) | INSERT INTO blinkdb.blink_tab: @1=1, @2='Mary', @3=1 |
| INSERT INTO blink_tab(user, clicks) VALUES ('Bob', 1); | 1525099026 (2018-04-30 22:37:06) | INSERT INTO blinkdb.blink_tab: @1=2, @2='Bob', @3=1 |
| UPDATE blink_tab SET clicks=2 WHERE user='Mary'; | 1525099035 (2018-04-30 22:37:15) | UPDATE blinkdb.blink_tab: WHERE @1=1, @2='Mary', @3=1 SET @3=2 |
| INSERT INTO blink_tab(user, clicks) VALUES ('Llz', 1); | 1525099047 (2018-04-30 22:37:27) | INSERT INTO blinkdb.blink_tab: @1=3, @2='Llz', @3=1 |
| UPDATE blink_tab SET clicks=2 WHERE user='Bob'; | 1525099056 (2018-04-30 22:37:36) | UPDATE blinkdb.blink_tab: WHERE @1=2, @2='Bob', @3=1 SET @3=2 |
| UPDATE blink_tab SET clicks=3 WHERE user='Mary'; | 1525099065 (2018-04-30 22:37:45) | UPDATE blinkdb.blink_tab: WHERE @1=1, @2='Mary', @3=2 SET @3=3 |

Simplified Binlog View

| timestamp | user | clicks |
|---|---|---|
| 1525099013 | Mary | 1 |
| 1525099026 | Bob | 1 |
| 1525099035 | Mary | 2 |
| 1525099047 | Llz | 1 |
| 1525099056 | Bob | 2 |
| 1525099065 | Mary | 3 |

Replaying the binlog in timestamp order yields the final table state:

| user | clicks |
|---|---|
| Mary | 3 |
| Bob | 2 |
| Llz | 1 |
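The replay can be sketched in Python using the six events from the mapping above. This is a simplified model rather than MySQL's replay logic: the tuple layout `(timestamp, op, before, after)`, the function name `replay`, and the use of column @1 as the row key are assumptions made for illustration.

```python
# Each event: (timestamp, operation, row image before, row image after).
# A row image is (@1 id, @2 user, @3 clicks), taken from the mapping in this article.
BINLOG = [
    (1525099013, "INSERT", None, (1, "Mary", 1)),
    (1525099026, "INSERT", None, (2, "Bob", 1)),
    (1525099035, "UPDATE", (1, "Mary", 1), (1, "Mary", 2)),
    (1525099047, "INSERT", None, (3, "Llz", 1)),
    (1525099056, "UPDATE", (2, "Bob", 1), (2, "Bob", 2)),
    (1525099065, "UPDATE", (1, "Mary", 2), (1, "Mary", 3)),
]


def replay(events):
    """Apply change events in timestamp order and return the table keyed by @1."""
    table = {}
    for _, op, before, after in sorted(events, key=lambda e: e[0]):
        if op == "INSERT":
            table[after[0]] = after
        elif op == "UPDATE":
            if table.get(before[0]) != before:
                raise ValueError(f"before image {before} does not match the table")
            if before[0] != after[0]:
                del table[before[0]]  # the key itself changed
            table[after[0]] = after
        elif op == "DELETE":
            table.pop(before[0], None)
    return table


final_rows = replay(BINLOG)
final_table = {user: clicks for _, user, clicks in final_rows.values()}
print(final_table)
```

Checking the before image on UPDATE is optional, but it makes the sketch fail loudly if events are replayed out of order.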

Stream‑Table Duality

The binlog events constitute a time‑stamped event stream. Replaying this stream reconstructs a table, and any change to a table can be expressed as an event in a stream. Thus a table can be viewed as a stream and a stream as a dynamic (changing) table. Both have schema, data, and a time attribute (event‑time for streams, operation time for tables). This lossless conversion is called stream‑table duality.
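Both directions of the conversion can be illustrated with a short Python round trip. It is a sketch under stated assumptions: the helper names `diff` and `apply_events`, the event tuples, and the sample snapshots are hypothetical. Successive table snapshots are turned into a stream of change events, and applying that stream to an empty table rebuilds the last snapshot.

```python
def diff(old, new):
    """Table -> stream: express the change between two snapshots as change events."""
    events = []
    for key in old.keys() - new.keys():
        events.append(("DELETE", key, old[key], None))
    for key, value in new.items():
        if key not in old:
            events.append(("INSERT", key, None, value))
        elif old[key] != value:
            events.append(("UPDATE", key, old[key], value))
    return events


def apply_events(table, events):
    """Stream -> table: apply change events in order to a copy of the table."""
    result = dict(table)
    for op, key, _, after in events:
        if op == "DELETE":
            del result[key]
        else:
            result[key] = after
    return result


# Hypothetical snapshots of a (user -> clicks) table over time.
snapshots = [
    {},
    {"Mary": 1},
    {"Mary": 1, "Bob": 1},
    {"Mary": 2, "Bob": 1},
]

stream = []
for old, new in zip(snapshots, snapshots[1:]):
    stream.extend(diff(old, new))

rebuilt = apply_events({}, stream)
print(stream)
print(rebuilt == snapshots[-1])
```

Because every intermediate snapshot can also be rebuilt by applying a prefix of the stream, no information is lost in either direction.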

In Flink, a dynamic table is a logical view over a continuously updating stream. Each incoming event updates the table snapshot, and SQL queries over the table are internally translated into stream operators that maintain the same semantics.
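How a query over a dynamic table can run as a stream operator is sketched below in plain Python. This is not Flink's implementation or API: it only imitates the idea for a query equivalent to SELECT user, COUNT(*) FROM clicks GROUP BY user. The operator name `grouped_count`, the "+" and "-" change markers, and the input list are assumptions. Each incoming click retracts the previous result row for that user, if any, and adds the updated one.

```python
def grouped_count(users):
    """Incrementally maintain a per-user count and yield result changes."""
    counts = {}
    for user in users:
        old = counts.get(user)
        if old is not None:
            yield ("-", user, old)  # retract the previous result row
        counts[user] = (old or 0) + 1
        yield ("+", user, counts[user])  # add the updated result row


def materialize(changes):
    """Turn the stream of result changes back into a result table."""
    table = {}
    for sign, user, count in changes:
        if sign == "-":
            del table[user]
        else:
            table[user] = count
    return table


clicks = ["Mary", "Bob", "Mary", "Llz", "Bob", "Mary"]
changes = list(grouped_count(clicks))
result = materialize(changes)

print(changes[:4])
print(result)
```

The output of the operator is itself a stream of changes, so the query result is again a dynamic table that downstream queries could consume in the same way.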

Conclusion

This article showed why Apache Flink, a native streaming engine, can provide a SQL API. Using the MySQL binlog as an example, we saw that a time-stamped stream of change events can be replayed to produce a table, and that every change to a table can be written as an event in such a stream. This stream-table duality is what allows SQL, originally designed for static batch tables, to be applied to streaming tasks in Flink, where queries run against dynamic tables.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
