Installing and Configuring Kettle (Pentaho Data Integration) on Linux for Hadoop ETL

This guide provides a step‑by‑step tutorial on preparing a Linux environment, installing Java, GNOME Desktop, VNC remote access, Chinese language support, downloading and extracting Kettle, configuring its startup scripts, creating desktop shortcuts, and managing essential Kettle configuration files for successful Hadoop ETL development.

Big Data Technology & Architecture

1. Installation

Kettle is installed on Linux rather than Windows here because running it from Windows leads to user-permission and Hadoop integration problems.

1.1 Installation Environment

Choose the operating system (CentOS 7.2) and install Java 1.8 (the JRE is enough to run Kettle; the JDK is needed for development). Make sure the Kettle version (8.3) matches the Java version.

If you run Kettle on Windows instead, create an HDFS directory for the Windows user:

hdfs dfs -mkdir /user/WindowsUsername

Windows users should also set authentication.superuser.provider=NO_AUTH in config.properties and restart Kettle.

1.2 Installation Planning

Four virtual machines (50 GB disk, 8 GB RAM) are planned with IPs 172.16.1.101‑104. The tutorial focuses on the first host; the others follow the same steps.

OS: CentOS Linux release 7.2.1511 (Core)

Java version: openjdk 1.8.0_262

Kettle version: GA Release 8.3.0.0‑371
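Because the remaining hosts repeat the same steps, a small helper that enumerates the planned addresses can make it easier to run a check on each one. This is only a sketch: the function name kettle_hosts is hypothetical, and the ssh call in the usage comment assumes root SSH access, which the plan above does not state.

```shell
# Print the four planned host addresses (172.16.1.101 to 172.16.1.104), one per line.
kettle_hosts() {
    local i
    for i in 101 102 103 104; do
        echo "172.16.1.${i}"
    done
}

# Usage (assumes SSH access as root, which is an assumption):
#   for host in $(kettle_hosts); do
#       ssh root@"$host" 'java -version'
#   done
```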

2. Pre‑Installation Preparation

(1) Install Java Environment

Kettle requires Java 1.8. You can install the JDK via RPM or yum.

# Option 1: install from a downloaded JDK RPM
rpm -ivh jdk-8u261-linux-x64.rpm
# Option 2: install OpenJDK with yum
# Search for Java packages
yum search java | grep -i --color JDK
# Install Java 1.8
yum install -y java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel.x86_64
# Verify installation
java -version
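Since Kettle 8.3 expects Java 1.8, it can help to check the reported version in a script rather than by eye. The sketch below parses the first line of the java -version output; the helper names java_major_minor and is_java_8 are hypothetical, and the parsing assumes the usual version "x.y.z" format.

```shell
# Extract the major.minor part from a `java -version` line.
# Example input: openjdk version "1.8.0_262"
java_major_minor() {
    # Keep the first two dot-separated fields of the quoted version string.
    printf '%s\n' "$1" | sed -n 's/.*version "\([0-9]*\.[0-9]*\).*/\1/p'
}

# Succeed (exit status 0) only when the reported version is 1.8.
is_java_8() {
    [ "$(java_major_minor "$1")" = "1.8" ]
}

# Usage on a real host (java -version writes to stderr, hence 2>&1):
#   is_java_8 "$(java -version 2>&1 | head -n 1)" && echo "Java 1.8 found"
```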

(2) Install GNOME Desktop

Use yum groupinstall to add the GNOME Desktop environment.

yum groupinstall "GNOME Desktop" -y

(3) Configure Chinese Support

Install Chinese language packs and input methods, then set the system locale to zh_CN.UTF-8.

yum install kde-l10n-Chinese
localectl set-locale LANG=zh_CN.UTF-8

(4) Install and Configure VNC Remote Control

Disable firewalld and SELinux, install tigervnc-server, start the VNC server, and set it to launch at boot.

# Stop and disable firewalld
systemctl stop firewalld
systemctl disable firewalld
# Temporarily disable SELinux
setenforce 0
# Permanently disable SELinux
vim /etc/sysconfig/selinux   # set SELINUX=disabled
# Install TigerVNC server
yum install -y tigervnc-server
# Start VNC server
vncserver
# List VNC sessions
vncserver -list
# Enable service at boot
cp /lib/systemd/system/vncserver@.service /lib/systemd/system/vncserver@:1.service
systemctl daemon-reload
systemctl enable vncserver@:1.service
reboot

(5) Install VNC Viewer on Client

Download RealVNC Viewer, create a new connection to 172.16.1.101:1, and save the password.
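The :1 suffix in the connection address is a VNC display number, which by convention maps to TCP port 5900 plus that number. If you prefer to keep firewalld running instead of disabling it, this is the port you would open. The helper name vnc_port below is hypothetical.

```shell
# Convert a VNC display number (the ":1" in 172.16.1.101:1) to its TCP port.
vnc_port() {
    local display="${1#:}"   # accept either "1" or ":1"
    echo $((5900 + display))
}

# Usage (only relevant if firewalld stays enabled):
#   firewall-cmd --permanent --add-port="$(vnc_port 1)/tcp"
#   firewall-cmd --reload
```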

3. Install and Run Kettle

(1) Download and Extract

Download the Kettle zip from SourceForge and unzip it.

# Download package
wget https://sourceforge.net/projects/pentaho/files/Pentaho%208.3/client-tools/pdi-ce-8.3.0.0-371.zip
# Extract
unzip pdi-ce-8.3.0.0-371.zip
# Rename directory for clarity
mv data-integration pdi-ce-8.3.0.0-371

(2) Run Kettle

Change into the Kettle directory, make the shell scripts executable, and start Spoon.

cd pdi-ce-8.3.0.0-371/
chmod 755 *.sh
./spoon.sh

(3) Create Spoon Desktop Shortcut

On GNOME, create a .desktop file such as /root/Desktop/Spoon.desktop with the following content:

[Desktop Entry]
Encoding=UTF-8
Name=spoon
Exec=sh /root/pdi-ce-8.3.0.0-371/spoon.sh
Terminal=false
Type=Application

Refresh the desktop (F5) to see the shortcut, trust the launcher, and optionally set a custom icon.
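The launcher file can also be generated from a script, which is convenient when the same setup is repeated on several hosts. This is a minimal sketch: the function name write_spoon_launcher is hypothetical, and both directory arguments are assumptions you should adapt to your installation.

```shell
# Write a Spoon launcher into a desktop directory.
# $1: desktop directory, $2: Kettle installation directory.
write_spoon_launcher() {
    local desktop_dir="$1" kettle_dir="$2"
    cat > "${desktop_dir}/Spoon.desktop" <<EOF
[Desktop Entry]
Encoding=UTF-8
Name=spoon
Exec=sh ${kettle_dir}/spoon.sh
Terminal=false
Type=Application
EOF
    # Mark the launcher executable so the desktop can run it.
    chmod +x "${desktop_dir}/Spoon.desktop"
}

# Usage:
#   write_spoon_launcher /root/Desktop /root/pdi-ce-8.3.0.0-371
```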

4. Configuration

Kettle uses several configuration files located in the .kettle directory (or a custom KETTLE_HOME location).

.spoonrc – Spoon UI preferences and recent files.

jdbc.properties – JNDI database connection definitions.

kettle.properties – Global variables for jobs and transformations.

kettle.pwd – Password file for Carte services.

repositories.xml – Definitions of Pentaho repositories.

shared.xml – Shared objects such as steps and connections.
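To see which directory applies in a given shell session, the lookup can be sketched as follows: when KETTLE_HOME is set, the .kettle directory is expected beneath it, otherwise beneath the user's home directory. The function name kettle_config_dir and the example path are hypothetical.

```shell
# Print the directory where Kettle is expected to look for its configuration files.
kettle_config_dir() {
    # Fall back to the home directory when KETTLE_HOME is unset or empty.
    echo "${KETTLE_HOME:-$HOME}/.kettle"
}

# Usage (the path is an example only):
#   export KETTLE_HOME=/opt/etl-config
#   ls -la "$(kettle_config_dir)"
```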

4.1 jdbc.properties Example

SampleData/type=javax.sql.DataSource
SampleData/driver=org.h2.Driver
SampleData/url=jdbc:h2:file:samples/db/sampledb;IFEXISTS=TRUE
SampleData/user=PENTAHO_USER
SampleData/password=PASSWORD

4.2 kettle.properties Example

# connection parameters for the job server
DB_HOST=dbhost.domain.org
DB_NAME=sakila
DB_USER=sakila_user
DB_PASSWORD=sakila_password

# path from where to read input files
INPUT_PATH=/home/sakila/import

# path to store the error reports
ERROR_PATH=/home/sakila/import_errors

Variables can be referenced in transformations using ${VARIABLE} or %%VARIABLE%%.
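Shell wrappers around Kettle jobs sometimes need the same values, for example to create the input directory before a run. A minimal sketch for reading one key from kettle.properties follows; the function name get_kettle_property is hypothetical, and it only handles plain KEY=value lines (no line continuations or escapes).

```shell
# Print the value of one key from a Java-style properties file.
# Comment lines starting with # are skipped.
get_kettle_property() {
    local key="$1" file="$2"
    # Split on the first "=" only, so values may themselves contain "=".
    awk -v k="$key" '
        /^[[:space:]]*#/ { next }
        {
            pos = index($0, "=")
            if (pos > 0 && substr($0, 1, pos - 1) == k) {
                print substr($0, pos + 1)
                exit
            }
        }
    ' "$file"
}

# Usage:
#   mkdir -p "$(get_kettle_property INPUT_PATH "$HOME/.kettle/kettle.properties")"
```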

4.3 repositories.xml

Stores repository definitions; Spoon updates it automatically, but for deployment you may need to copy and edit it to match the production database.

4.4 shared.xml

Contains shared objects. Its location can be set via a variable, e.g., ${Internal.Transformation.Filename.Directory}/shared.xml.

5. Adjusting Startup Shell Scripts

If you need additional JARs, increase JVM heap size, or modify the graphical toolkit, edit the corresponding .sh scripts.

Classpath extension: place JARs under libext (the script automatically adds them).

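Heap size: modify the following line in the script

PENTAHO_DI_JAVA_OPTIONS="-Xms1024m -Xmx2048m -XX:MaxPermSize=256m"

or set the PENTAHO_DI_JAVA_OPTIONS environment variable before launching. Java 8 ignores -XX:MaxPermSize, so only -Xms and -Xmx take effect.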

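GTK version: the script exports SWT_GTK3 to select the GTK version that SWT uses (0 for GTK 2, 1 for GTK 3); change the value if Spoon reports SWT errors with the GTK version on your system.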
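For the heap-size setting, the environment-variable route avoids editing the script at all. The sketch below builds the options string from two sizes in megabytes; the function name kettle_java_options is hypothetical, and it assumes the startup script only applies its built-in default when PENTAHO_DI_JAVA_OPTIONS is unset.

```shell
# Build a JVM options string from minimum and maximum heap sizes in megabytes.
kettle_java_options() {
    local min_mb="$1" max_mb="$2"
    if [ "$min_mb" -gt "$max_mb" ]; then
        echo "minimum heap exceeds maximum heap" >&2
        return 1
    fi
    echo "-Xms${min_mb}m -Xmx${max_mb}m"
}

# Usage (sizes are examples; pick values that fit the host's 8 GB of RAM):
#   export PENTAHO_DI_JAVA_OPTIONS="$(kettle_java_options 2048 4096)"
#   ./spoon.sh
```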

6. Managing JDBC Drivers

All JDBC drivers are located in the lib directory. To add a new driver, copy its JAR into lib and restart Kettle. Remove old drivers to avoid conflicts.
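Copying a driver can be wrapped in a small function that checks both paths first, so a typo does not silently leave Kettle without the driver. This is a sketch only: the function name install_jdbc_driver is hypothetical, and the JAR name in the usage comment is a placeholder.

```shell
# Copy a JDBC driver JAR into the lib directory of a Kettle installation.
# $1: path to the driver JAR, $2: Kettle installation directory.
install_jdbc_driver() {
    local jar="$1" kettle_dir="$2"
    [ -f "$jar" ] || { echo "driver not found: $jar" >&2; return 1; }
    [ -d "${kettle_dir}/lib" ] || { echo "no lib directory under: $kettle_dir" >&2; return 1; }
    cp "$jar" "${kettle_dir}/lib/"
}

# Usage (the JAR name is a placeholder):
#   install_jdbc_driver ./some-jdbc-driver.jar /root/pdi-ce-8.3.0.0-371
#   # then restart Kettle so the driver is picked up
```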

Conclusion

The article covered selecting the OS, installing Java, GNOME, VNC, Chinese locale, downloading and extracting Kettle, creating desktop shortcuts, and the main Kettle configuration files, providing a complete guide for preparing a Linux platform for Hadoop‑based ETL with Kettle.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
