Big Data 8 min read

Why Spark and Flink Can't Stream MySQL via JDBC (And What Works Instead)

This article explains the limitations of using JDBC for true streaming reads in Spark and Flink, demonstrates failed attempts with MySQL, shows workarounds that revert to batch processing, and recommends Flink CDC as the practical solution for incremental MySQL ingestion.

dbaplus Community

Dec 25, 2023

Why Spark and Flink Can't Stream MySQL via JDBC (And What Works Instead)

Both Apache Spark and Apache Flink can read data from any JDBC‑compatible database in batch mode, but they do not support true streaming (continuous incremental ingestion) via the generic JDBC connector. Official documentation and practical tests confirm that MySQL is not listed among the built‑in streaming sources for either engine.

Data source reading modes

Batch : read the entire dataset once.

Streaming : continuously monitor the source and read only new or changed records.

Streaming reads require a source‑specific streaming connector; a generic JDBC source cannot provide change events.

Spark JDBC attempt

Using Spark Structured Streaming with a MySQL JDBC source fails to start a streaming query because MySQL and generic JDBC are absent from the list of supported streaming sources. Adding the mysql‑connector‑java dependency and executing a streaming query results in an error, confirming the limitation. Switching to a regular Spark batch job with spark.read.jdbc(...) works, but this defeats the purpose of streaming.

Flink JDBC attempt

Flink provides a JDBC connector (different from the standard MySQL driver) that can be used in a streaming environment. The following Scala program creates a Flink table backed by MySQL via the JDBC connector and prints the first 100 rows:

package com.anryg.mysql.jdbc
import java.time.Duration
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

object FromMysql2Print {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000L)
    env.setStateBackend(new EmbeddedRocksDBStateBackend(true))
    env.getCheckpointConfig.setCheckpointStorage("hdfs://192.168.211.106:8020/tmp/flink_checkpoint/FromMysql2Print")
    env.getCheckpointConfig.setExternalizedCheckpointCleanup(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.getCheckpointConfig.setAlignedCheckpointTimeout(Duration.ofMinutes(1L))
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    val tableEnv = StreamTableEnvironment.create(env)
    tableEnv.executeSql("""
      CREATE TABLE data_from_mysql (
        client_ip STRING,
        domain STRING,
        time STRING,
        target_ip STRING,
        rcode INT,
        query_type INT,
        authority_record STRING,
        add_msg STRING,
        dns_ip STRING,
        PRIMARY KEY(client_ip, domain, time, target_ip, rcode, query_type) NOT ENFORCED
      ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://192.168.221.173:3306/test',
        'username' = '***',
        'password' = '***',
        'table-name' = 'test02'
      )
    """.stripMargin)
    tableEnv.executeSql("""
      SELECT * FROM data_from_mysql LIMIT 100
    """.stripMargin).print()
  }
}

When executed, the job reads a static snapshot of the MySQL table and then terminates, behaving like a batch job despite being launched in a streaming context.

Conclusion

Neither Spark nor Flink can achieve true incremental ingestion from MySQL using the standard JDBC connector; they fall back to batch semantics. For continuous change capture, the recommended solution is Flink CDC, which provides a dedicated MySQL change‑data‑capture connector. JDBC remains useful for one‑time imports or for legacy MySQL versions (e.g., 5.5 and earlier) that lack CDC support.

Reference implementation of a custom Spark JDBC streaming source (unverified): https://github.com/sutugin/spark-streaming-jdbc-source

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink Streaming mysql JDBC Spark CDC

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.