Integrating Flink SQL with Apache Zeppelin: Installation, Configuration, and Usage
This guide explains how to set up Apache Zeppelin as an interactive notebook for Flink SQL, covering download, environment configuration, Zeppelin and Flink interpreter settings on YARN, Hive integration, and step‑by‑step testing of streaming SQL queries.
Flink SQL has become a popular way to write stream and batch jobs in familiar SQL syntax. The default development workflow, however, still relies on the Java/Scala APIs, which makes it hard to build a shared platform around it. Apache Zeppelin offers an open-source, ready-to-use notebook that can serve as a complete Flink SQL development environment.
Zeppelin is a web‑based interactive data‑analysis notebook supporting SQL, Scala, Python and more. It uses plug‑in interpreters to forward user code to the appropriate backend such as Flink, providing high flexibility. Its architecture is illustrated in the accompanying diagram.
Configuring Zeppelin
Download the Zeppelin 0.9.0-preview2 release (zeppelin-0.9.0-preview2-bin-all.tgz), the latest at the time of writing, and extract it to a suitable directory. Then copy conf/zeppelin-env.sh.template to conf/zeppelin-env.sh and edit it to set the JDK path, enable Hadoop, and point to the Hadoop configuration directory:
# JDK directory
export JAVA_HOME=/opt/jdk1.8.0_172
# Enable Hadoop (required for YARN mode)
export USE_HADOOP=true
# Hadoop configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/hadoop-conf
Rename conf/zeppelin-site.xml.template to zeppelin-site.xml and modify the server address and port so the UI is reachable externally:
<!-- Server address, default 127.0.0.1 -->
<property>
<name>zeppelin.server.addr</name>
<value>0.0.0.0</value>
<description>Server binding address</description>
</property>
<!-- Server port, default 8080 -->
<property>
<name>zeppelin.server.port</name>
<value>18080</value>
<description>Server port.</description>
</property>
Start Zeppelin with bin/zeppelin-daemon.sh start; after the "Zeppelin start [ OK ]" message, open http://<server_address>:18080 to verify the UI.
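The setup steps above can be scripted. The following is a minimal sketch, not an official installer: prepare_zeppelin_conf is a hypothetical helper name, and the /opt install path in the usage comments is an assumption. It creates the two configuration files from the templates shipped in conf/ and leaves an existing file untouched, so local edits are not overwritten.

```shell
#!/bin/sh
# Hypothetical helper: create conf/zeppelin-env.sh and conf/zeppelin-site.xml
# from their shipped templates, skipping any file that already exists.
prepare_zeppelin_conf() {
  zeppelin_home="$1"
  for name in zeppelin-env.sh zeppelin-site.xml; do
    target="$zeppelin_home/conf/$name"
    if [ ! -f "$target" ]; then
      cp "$target.template" "$target" || return 1
    fi
  done
}

# Example usage (paths are assumptions, adjust to your environment):
#   tar -xzf zeppelin-0.9.0-preview2-bin-all.tgz -C /opt
#   prepare_zeppelin_conf /opt/zeppelin-0.9.0-preview2-bin-all
#   /opt/zeppelin-0.9.0-preview2-bin-all/bin/zeppelin-daemon.sh start
#   /opt/zeppelin-0.9.0-preview2-bin-all/bin/zeppelin-daemon.sh status
```

After the helper runs, the settings shown earlier (JAVA_HOME, USE_HADOOP, HADOOP_CONF_DIR, server address and port) still have to be edited into the two files by hand.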
For production use you may also adjust notebook storage to HDFS, enable recovery, and disable anonymous access by editing the corresponding properties in zeppelin-site.xml (not shown in detail here).
Configuring the Flink Interpreter on YARN
In the Zeppelin UI go to the Interpreter page, search for "flink" and edit its settings. Change the Interpreter Binding mode to Isolated per Note so each notebook runs its own interpreter process, similar to Flink’s per‑job mode on YARN.
Set the essential YARN‑related Flink parameters, for example:
FLINK_HOME – directory of Flink 1.11
HADOOP_CONF_DIR – Hadoop config directory
flink.execution.mode – set to yarn
flink.jm.memory – JobManager memory (MB)
flink.tm.memory – TaskManager memory (MB)
flink.tm.slot – number of slots per TaskManager
flink.yarn.appName – default YARN application name
flink.yarn.queue – default YARN queue
Hive Integration
If you need to query Hive tables or use HiveCatalog for metadata, configure the following:
HIVE_CONF_DIR – directory containing hive-site.xml
zeppelin.flink.enableHive – set to true
zeppelin.flink.hive.version – Hive version number
Copy the required Hive‑related jars into $FLINK_HOME/lib:
flink-connector-hive_2.11-1.11.0.jar
flink-hadoop-compatibility_2.11-1.11.0.jar
hive-exec-*.jar
# For Hive 1.x also add hive-metastore-1.*.jar, libfb303-0.9.2.jar, libthrift-0.9.2.jar
Make sure the Hive Metastore runs as a standalone service (backed by a database such as MySQL or PostgreSQL) rather than in embedded mode.
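Copying the jars can be sketched as a small shell function. This is an illustration under assumptions: copy_hive_jars is a hypothetical helper name, the source directory holding the jars is whatever location you collected them in, and the extra Hive 1.x jars are not handled. It copies the three jar patterns listed above into the lib directory of the given Flink home and fails if any of them is missing.

```shell
#!/bin/sh
# Hypothetical helper: copy the Hive-related jars from src_dir into
# <flink_home>/lib, returning non-zero if any expected jar is absent.
copy_hive_jars() {
  src_dir="$1"
  flink_home="$2"
  for pattern in \
      'flink-connector-hive_2.11-1.11.0.jar' \
      'flink-hadoop-compatibility_2.11-1.11.0.jar' \
      'hive-exec-*.jar'; do
    found=0
    # $pattern is left unquoted on purpose so the shell expands the glob.
    for jar in "$src_dir"/$pattern; do
      [ -f "$jar" ] || continue
      cp "$jar" "$flink_home/lib/" || return 1
      found=1
    done
    if [ "$found" -ne 1 ]; then
      echo "missing jar matching: $pattern" >&2
      return 1
    fi
  done
}

# Example usage (the source directory is an assumption):
#   copy_hive_jars /path/to/collected-jars "$FLINK_HOME"
```

Failing on a missing jar is a deliberate choice here: an incomplete lib directory would otherwise only show up later, as a class-loading error when the interpreter starts.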
Interpreter on YARN Resources
By default the interpreter runs on the Zeppelin node, which can become a single point of failure. Configure the following YARN resource parameters so the interpreter containers are launched on the cluster:
zeppelin.interpreter.yarn.resource.cores – vCores per container
zeppelin.interpreter.yarn.resource.memory – memory per container (MB)
zeppelin.interpreter.yarn.queue – YARN queue for the interpreter
After saving these settings, the Flink‑on‑Zeppelin integration is complete.
Testing Flink SQL on Zeppelin
Create a new notebook, select the flink interpreter, and add the first paragraph with %flink.conf to specify job‑level configurations (any Flink config key is supported). Use flink.execution.packages to add Maven dependencies if needed.
Add a second paragraph to create a Kafka source table using %flink.ssql (streaming SQL). The table definition can reference Zeppelin credentials for the Kafka bootstrap servers.
Add a third paragraph to query the streaming table with a SELECT statement; the results are displayed in real time within the notebook.
Running the paragraphs shows execution logs, and the Flink Job UI can be opened via the "FLINK JOB" button to inspect the JobGraph, as illustrated in the screenshots.
Beyond simple SELECT queries, INSERT statements can also be executed from Zeppelin, enabling richer data pipelines. Future articles will explore more advanced use‑cases of Flink SQL on Zeppelin.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
