An Introduction to Kafka Connect: Architecture, Components, and Hands‑On Setup

This article introduces Kafka Connect, explaining its purpose as a scalable and reliable tool for moving data between Apache Kafka and external systems, detailing its core concepts, architecture, deployment modes, configuration files, and a step‑by‑step example that streams data from a file source to a file sink.

Big Data Technology & Architecture

Mar 2, 2021

Kafka Connect is an extensible, reliable framework for transferring data between Apache Kafka and other systems. It has shipped with Apache Kafka since version 0.9 and is also a core part of the Confluent Platform. It simplifies the creation, deployment, and management of connectors that ingest data from sources such as databases or export data to downstream storage, which makes the data available for low‑latency stream processing as well as batch analytics.

Background

Kafka is often used as the central hub in ETL pipelines, but upstream and downstream integration historically required separate tools like Flume or Logstash. Kafka Connect fills this gap by providing a scalable, fault‑tolerant pipeline that can move large volumes of data in and out of Kafka with minimal custom code.

Key Features

Standardized connector framework that reduces development effort.

Supports both distributed and standalone deployment modes.

REST API for managing connectors.

Automatic offset management.

Scalable worker architecture built on Kafka’s group management protocol.

Integration with stream and batch processing systems.

Typical Use Cases

When you need to move data between Kafka and external storage without modifying application code, Connect is usually the preferred solution. It offers out‑of‑the‑box configuration management, offset handling, and error handling, and it supports a wide range of data formats.

Architecture and Components

The main concepts are Connectors, Tasks, Workers, Converters, and Transforms.

Connectors : Define the source or sink system and the data flow direction.

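Tasks : Parallel units of work that execute the actual data transfer. Tasks hold no state themselves; in distributed mode their configuration and status are stored in Kafka topics, so a task can be restarted or moved to another worker.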

Workers : Processes that run connectors and tasks; can be standalone or part of a distributed cluster.

Converters : Translate between Kafka Connect’s internal data format and external byte representations (e.g., Avro, JSON).

Transforms : Simple functions (Single Message Transforms, or SMTs) that modify individual records on the fly; more complex transformations such as joins or aggregations are better done with KSQL or Kafka Streams.
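To make the relationship between a connector and its tasks more concrete, here is a minimal conceptual sketch in Python. It is not Kafka Connect code: the function name `group_partitions` and the table names are hypothetical, and in practice each connector implementation decides how to divide its work. The sketch only illustrates the idea that a connector splits its input into at most `tasks.max` groups, one per task.

```python
def group_partitions(elements, tasks_max):
    """Split `elements` into at most `tasks_max` contiguous, near-equal groups.

    Conceptual illustration only; a real connector chooses its own strategy.
    """
    if tasks_max < 1:
        raise ValueError("tasks_max must be at least 1")
    # Never create more groups than there are elements to assign.
    num_groups = min(tasks_max, len(elements))
    if num_groups == 0:
        return []
    base, extra = divmod(len(elements), num_groups)
    groups, start = [], 0
    for i in range(num_groups):
        # The first `extra` groups take one additional element.
        size = base + (1 if i < extra else 0)
        groups.append(elements[start:start + size])
        start += size
    return groups


# Hypothetical source tables that a database connector might need to copy.
tables = ["orders", "customers", "payments", "shipments", "refunds"]
for task_id, assigned in enumerate(group_partitions(tables, tasks_max=2)):
    print(f"task {task_id}: {assigned}")
```

With a single input, as in the file example later in this article, only one task can do useful work regardless of how high `tasks.max` is set.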

Installation and First Experience

Kafka Connect can run in two modes:

Standalone – a single process for development or small deployments.

Distributed – a scalable cluster of workers.

Example commands:

./connect-standalone.sh ../config/connect-standalone.properties ../config/connect-file-source.properties ../config/connect-file-sink.properties

and for distributed mode:

./connect-distributed.sh ../config/connect-distributed.properties

In both modes the worker exposes a REST API for managing connectors, by default on port 8083.
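As a rough sketch of how the REST API can be used, the Python snippet below builds requests for three standard endpoints: listing connectors, creating a connector, and reading a connector's status. The helper names `build_request` and `send` are hypothetical, and the base URL assumes a worker running on localhost with the default port. Creating connectors through `POST /connectors` is mostly relevant in distributed mode, where connector configurations are not passed on the command line.

```python
import json
import urllib.request

# Assumption: a Connect worker on this machine, default REST port.
CONNECT_URL = "http://localhost:8083"


def build_request(method, path, payload=None):
    """Build (but do not send) an HTTP request for the Connect REST API."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        CONNECT_URL + path, data=data, headers=headers, method=method
    )


def send(request, timeout=5):
    """Send a request to a running worker and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
        return json.loads(body) if body else None


# GET /connectors returns the names of all connectors on the worker.
list_connectors = build_request("GET", "/connectors")

# POST /connectors creates a connector from a name plus a config map.
create_source = build_request("POST", "/connectors", {
    "name": "local-file-source",
    "config": {
        "connector.class": "FileStreamSource",
        "tasks.max": "1",
        "file": "test.txt",
        "topic": "connect-test",
    },
})

# GET /connectors/<name>/status reports connector and task state.
source_status = build_request("GET", "/connectors/local-file-source/status")

# With a worker running, the requests can be sent like this:
# print(send(list_connectors))
```

Building the request separately from sending it keeps the example inspectable without a running cluster.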

Hands‑On Example

Two connectors are used: FileStreamSource reads test.txt and publishes lines to a Kafka topic, while FileStreamSink consumes the topic and writes to test.sink.txt.

Source connector configuration (${KAFKA_HOME}/config/connect-file-source.properties):

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test

Sink connector configuration (${KAFKA_HOME}/config/connect-file-sink.properties):

name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test
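One detail that is easy to miss in the two files above is that the source uses the singular key topic while the sink uses the plural key topics, which accepts a comma‑separated list. The following Python sketch checks that the two configurations are wired to the same topic. The parser is deliberately simplified (it handles plain key=value lines and comments only, not the full Java properties syntax), and the function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def sink_reads_from_source(source, sink):
    """Return True if the sink's `topics` list contains the source's `topic`."""
    sink_topics = {t.strip() for t in sink.get("topics", "").split(",") if t.strip()}
    return source.get("topic") in sink_topics


# Contents copied from the two connector files shown above.
SOURCE_PROPERTIES = """
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
"""

SINK_PROPERTIES = """
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test
"""

source = parse_properties(SOURCE_PROPERTIES)
sink = parse_properties(SINK_PROPERTIES)
print(sink_reads_from_source(source, sink))
```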

Standalone worker configuration (${KAFKA_HOME}/config/connect-standalone.properties):

bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
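# The internal.* converter settings below were deprecated in Kafka 2.0 and removed in 3.0; omit them on newer versions.
internal.key.converter=org.apache.kafka.connect.json.JsonConverter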
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
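The schemas.enable settings in this file affect what actually lands in the Kafka topic. With JsonConverter and schemas.enable=true, each value is wrapped in an envelope with a schema field and a payload field; with false, only the bare JSON value is written. The Python sketch below approximates the resulting bytes for a string value such as a line from test.txt. The function name `json_envelope` is hypothetical, and this is an illustration of the format rather than the converter's actual implementation.

```python
import json


def json_envelope(value, schemas_enable=True):
    """Approximate the bytes JsonConverter writes for a string value."""
    if schemas_enable:
        # Schema plus payload, as produced when schemas.enable=true.
        doc = {"schema": {"type": "string", "optional": False}, "payload": value}
    else:
        # Bare JSON value, as produced when schemas.enable=false.
        doc = value
    return json.dumps(doc, separators=(",", ":")).encode("utf-8")


print(json_envelope("hello flink01").decode("utf-8"))
print(json_envelope("hello flink01", schemas_enable=False).decode("utf-8"))
```

Consumers that read the topic outside of Connect need to be aware of the envelope, which is one common reason to set schemas.enable=false for plain JSON data.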

After starting the connectors, appending lines to test.txt (e.g., echo 'hello flink01' >> test.txt) results in the same lines appearing in test.sink.txt, demonstrating a working end‑to‑end pipeline.
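The manual check described above can also be scripted. The following Python sketch appends a line to the source file and polls the sink file until the line shows up or a timeout expires. The helper names are hypothetical, and the relative file paths assume the script runs in the same working directory as the Connect worker.

```python
import time
from pathlib import Path


def append_line(path, line):
    """Append one line to the source file that FileStreamSource is reading."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def wait_for_line(path, expected, timeout_s=30.0, poll_s=0.5):
    """Poll `path` until `expected` appears as a full line, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        sink_file = Path(path)
        if sink_file.exists():
            lines = sink_file.read_text(encoding="utf-8").splitlines()
            if expected in lines:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# With the standalone worker and both connectors running:
# append_line("test.txt", "hello flink01")
# print(wait_for_line("test.sink.txt", "hello flink01"))
```

Polling with a timeout is used because the sink file is written asynchronously, so the line may take a short while to appear.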

The article promises a follow‑up that will explore production‑grade usage of Kafka Connect in various companies.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
