Understanding Stream‑Table Duality: How Apache Flink Provides a SQL API Using MySQL Binlog
This article explains why the native streaming engine Apache Flink can offer a SQL API by showing that MySQL binlog events form a time‑stamped data stream that can be replayed as a dynamic table, establishing a lossless stream‑table duality useful for both batch and streaming queries.
Actual Problem
Many big‑data processing products (Hive, Spark, Flink) expose a SQL API. SQL, originally designed for relational databases, is naturally suited to batch queries where the data set is finite and a query returns a single result. In contrast, Flink operates on native streaming data where the data set is infinite and a query continuously updates its result.
Semantic Relationship Between Stream and Batch
SQL works on relational tables, which in traditional databases are static at query time, representing a finite batch. To answer why Flink can also provide a SQL API, we must understand the semantic link between stream and batch. An example with a click‑event stream (timestamp + username) shows that the same data source and query logic produce identical results in both streaming and batch modes, proving that the semantics are equivalent.
Stream‑Table Relationship
Since streams and batches are semantically identical and SQL operates on tables, the question becomes whether a stream can be regarded as a table and vice versa. This article focuses on the duality of streams and tables.
MySQL Master‑Slave Replication
Before discussing stream‑table duality we review MySQL binlog, the core mechanism of master‑slave replication. The replication process consists of three steps:
Master records changes (change logs) as binary‑log events in the binlog.
Slave copies the master’s binary‑log events to its relay log.
Slave replays the relay‑log events, applying the changes to its data.
Binlog Modes
MySQL supports three logging modes:
Statement‑based logging – events contain the original SQL statements.
Row‑based logging – events describe changes to individual rows (used in this article).
Mixed logging – defaults to statement‑based but switches to row‑based for certain cases (e.g., NDB engine, non‑deterministic functions, UDFs, temporary tables).
Binlog Format
Each binary‑log event consists of two parts: an event header and event data. The header includes a timestamp that orders events, ensuring slaves replay changes in the correct sequence.
+=====================================+
| event | timestamp 0:4 |
+----------------------------+--------+
| | type_code 4:1 |
+----------------------------+--------+
| | server_id 5:4 |
+----------------------------+--------+
| | event_length 9:4 |
+----------------------------+--------+
| ... (version‑specific fields) |
+=====================================+
| event | fixed part |
| data +----------------------------+
| | variable part |
+=====================================+The timestamp in the header is the key attribute that allows us to treat binlog events as a time‑stamped stream.
Inspecting Binlog with MySQL Commands
Typical steps to view and manipulate binlog:
Check if binlog is enabled: show variables like 'log_bin'; Check the binlog format (need ROW mode): show variables like 'binlog_format'; Reset the master, create a test table, and perform DML operations (INSERT, UPDATE).
Show the current binlog file and position: show master status\G Display binlog events:
show binlog events in 'binlog.000001';Exporting Binlog to Text
Use MySQLbinlog to decode rows and output readable text:
sudo MySQLbinlog --start-datetime='2018-04-29 00:00:03' \
--stop-datetime='2018-05-02 00:30:00' \
--base64-output=decode-rows -v /usr/local/MySQL/data/binlog.000001 > ~/binlog.txtThe resulting binlog.txt contains a series of events such as Write_rows, Update_rows, and Query, each prefixed with a timestamp and the affected table.
Mapping DML Operations to Binlog Records
For each DML statement we can see the corresponding binlog header (timestamp) and the row data. Example mapping:
DML
Binlog Header (timestamp)
Data
INSERT INTO blink_tab(user, clicks) VALUES ('Mary', 1);
1525099013 (2018‑04‑30 22:36:53) INSERT INTO blinkdb.blink_tab – @1=1, @2='Mary', @3=1
INSERT INTO blink_tab(user, clicks) VALUES ('Bob', 1);
1525099026 (2018‑04‑30 22:37:06) INSERT INTO blinkdb.blink_tab – @1=2, @2='Bob', @3=1
UPDATE blink_tab SET clicks=2 WHERE user='Mary';
1525099035 (2018‑04‑30 22:37:15) UPDATE blinkdb.blink_tab – WHERE @1=1, @2='Mary', @3=1 SET @3=2
INSERT INTO blink_tab(user, clicks) VALUES ('Llz', 1);
1525099047 (2018‑04‑30 22:37:27) INSERT INTO blinkdb.blink_tab – @1=3, @2='Llz', @3=1
UPDATE blink_tab SET clicks=2 WHERE user='Bob';
1525099056 (2018‑04‑30 22:37:36) UPDATE blinkdb.blink_tab – WHERE @1=2, @2='Bob', @3=1 SET @3=2
UPDATE blink_tab SET clicks=3 WHERE user='Mary';
1525099065 (2018‑04‑30 22:37:45) UPDATE blinkdb.blink_tab – WHERE @1=1, @2='Mary', @3=2 SET @3=3
Simplified Binlog View
timestamp
user
clicks
1525099013
Mary
1
1525099026
Bob
1
1525099035
Mary
2
1525099047
Llz
1
1525099056
Bob
2
1525099065
Mary
3
Replaying the binlog in timestamp order yields the final table state:
user
clicks
Mary
3
Bob
2
Llz
1
Stream‑Table Duality
The binlog events constitute a time‑stamped event stream. Replaying this stream reconstructs a table, and any change to a table can be expressed as an event in a stream. Thus a table can be viewed as a stream and a stream as a dynamic (changing) table. Both have schema, data, and a time attribute (event‑time for streams, operation time for tables). This lossless conversion is called stream‑table duality.
In Flink, a dynamic table is a logical view over a continuously updating stream. Each incoming event updates the table snapshot, and SQL queries over the table are internally translated into stream operators that maintain the same semantics.
Conclusion
This article demonstrated why Apache Flink, a native streaming engine, can provide a SQL API. By treating MySQL binlog as a time‑stamped data stream, Flink can replay the stream to produce a dynamic table. The stream‑table duality guarantees that SQL, originally designed for batch tables, can be safely used for streaming tasks in Flink.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
