Big Data 11 min read

Integrating Flink SQL with Apache Zeppelin: Installation, Configuration, and Usage

This guide explains how to set up Apache Zeppelin as an interactive notebook for Flink SQL, covering download, environment configuration, Zeppelin and Flink interpreter settings on YARN, Hive integration, and step‑by‑step testing of streaming SQL queries.

Big Data Technology & Architecture

Nov 6, 2020

Integrating Flink SQL with Apache Zeppelin: Installation, Configuration, and Usage

Flink SQL has become a popular way to write stream and batch jobs using familiar SQL syntax, but the default development approach still relies on Java/Scala APIs, making platformization difficult; Apache Zeppelin offers an open‑source, ready‑to‑use notebook that can serve as a complete Flink SQL development environment.

Zeppelin is a web‑based interactive data‑analysis notebook supporting SQL, Scala, Python and more. It uses plug‑in interpreters to forward user code to the appropriate backend such as Flink, providing high flexibility. Its architecture is illustrated in the accompanying diagram.

Configuring Zeppelin

Download the latest Zeppelin 0.9.0‑preview2 release (zeppelin‑0.9.0‑preview2‑bin‑all.tgz) and extract it to a suitable directory. Then edit conf/zeppelin-env.sh (renamed from the template) to set the JDK path, enable Hadoop, and point to Hadoop configuration:

# JDK directory
export JAVA_HOME=/opt/jdk1.8.0_172
# Enable Hadoop (required for YARN mode)
export USE_HADOOP=true
# Hadoop configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/hadoop-conf

Rename conf/zeppelin-site.xml.template to zeppelin-site.xml and modify the server address and port so the UI is reachable externally:

<!-- Server address, default 127.0.0.1 -->
<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
  <description>Server binding address</description>
</property>

<!-- Server port, default 8080 -->
<property>
  <name>zeppelin.server.port</name>
  <value>18080</value>
  <description>Server port.</description>
</property>

Start Zeppelin with bin/zeppelin-daemon.sh start; after the "Zeppelin start [ OK ]" message, open http://<server_address>:18080 to verify the UI.

For production use you may also adjust notebook storage to HDFS, enable recovery, and disable anonymous access by editing the corresponding properties in zeppelin-site.xml (not shown in detail here).

Configuring the Flink Interpreter on YARN

In the Zeppelin UI go to the Interpreter page, search for "flink" and edit its settings. Change the Interpreter Binding mode to Isolated per Note so each notebook runs its own interpreter process, similar to Flink’s per‑job mode on YARN.

Set the essential YARN‑related Flink parameters, for example:

FLINK_HOME – directory of Flink 1.11

HADOOP_CONF_DIR – Hadoop config directory

flink.execution.mode – set to yarn flink.jm.memory – JobManager memory (MB)

flink.tm.memory – TaskManager memory (MB)

flink.tm.slot – number of slots per TaskManager

flink.yarn.appName – default YARN application name

flink.yarn.queue – default YARN queue

Hive Integration

If you need to query Hive tables or use HiveCatalog for metadata, configure the following:

HIVE_CONF_DIR – directory containing hive‑site.xml zeppelin.flink.enableHive – set to true zeppelin.flink.hive.version – Hive version number

Copy the required Hive‑related jars into $FLINK_HOME/lib:

flink-connector-hive_2.11-1.11.0.jar
flink-hadoop-compatibility_2.11-1.11.0.jar
hive-exec-*.jar
# For Hive 1.x also add hive-metastore-1.*.jar, libfb303-0.9.2.jar, libthrift-0.9.2.jar

Make sure the Hive Metastore runs in standalone mode (e.g., MySQL or PostgreSQL) rather than embedded.

Interpreter on YARN Resources

By default the interpreter runs on the Zeppelin node, which can become a single point of failure. Configure the following YARN resource parameters so the interpreter containers are launched on the cluster:

zeppelin.interpreter.yarn.resource.cores – vCores per container

zeppelin.interpreter.yarn.resource.memory – memory per container (MB)

zeppelin.interpreter.yarn.queue – YARN queue for the interpreter

After saving these settings, the Flink‑on‑Zeppelin integration is complete.

Testing Flink SQL on Zeppelin

Create a new notebook, select the flink interpreter, and add the first paragraph with %flink.conf to specify job‑level configurations (any Flink config key is supported). Use flink.execution.packages to add Maven dependencies if needed.

Add a second paragraph to create a Kafka source table using %flink.ssql (streaming SQL). The table definition can reference Zeppelin credentials for the Kafka bootstrap servers.

Add a third paragraph to query the streaming table with a SELECT statement; the results are displayed in real time within the notebook.

Running the paragraphs shows execution logs, and the Flink Job UI can be opened via the "FLINK JOB" button to inspect the JobGraph, as illustrated in the screenshots.

Beyond simple SELECT queries, INSERT statements can also be executed from Zeppelin, enabling richer data pipelines. Future articles will explore more advanced use‑cases of Flink SQL on Zeppelin.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Flink SQL Configuration Hive YARN Zeppelin

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.