An Introduction to Kafka Connect: Architecture, Components, and Hands‑On Setup
This article introduces Kafka Connect, explaining its purpose as a scalable and reliable tool for moving data between Apache Kafka and external systems, detailing its core concepts, architecture, deployment modes, configuration files, and a step‑by‑step example that streams data from a file source to a file sink.
Kafka Connect is an extensible, reliable tool for transferring data between Apache Kafka and other systems. It is part of the Apache Kafka project and is also bundled with the Confluent Platform. It simplifies the creation, deployment, and management of connectors that ingest data from databases or export data to downstream storage, enabling low‑latency streaming and batch analytics.
Background
Kafka is often used as the central hub in ETL pipelines, but upstream and downstream integration historically required separate tools like Flume or Logstash. Kafka Connect fills this gap by providing a scalable, fault‑tolerant pipeline that can move large volumes of data in and out of Kafka with minimal custom code.
Key Features
Standardized connector framework that reduces development effort.
Supports both distributed and standalone deployment modes.
REST API for managing connectors.
Automatic offset management.
Scalable worker architecture built on Kafka’s group management protocol.
Integration with stream and batch processing systems.
Typical Use Cases
When you need to move data between Kafka and external storage without modifying application code, Connect is the preferred solution. It offers out‑of‑the‑box configuration management, offset handling, and error handling, and it supports a wide range of data formats.
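To illustrate what offset handling saves you from writing, here is a minimal Python sketch of the underlying idea for a file source: remember how far you have read and resume from there. It is not Kafka Connect's own code. The function `read_new_lines` and the `offsets` dictionary (a stand-in for Connect's offset storage) are hypothetical names used only for this illustration.

```python
import os
import tempfile


def read_new_lines(path: str, offsets: dict) -> list:
    """Return complete lines appended since the last call and advance the stored offset."""
    position = offsets.get(path, 0)  # resume from the last recorded byte position
    with open(path, "rb") as f:
        f.seek(position)
        data = f.read()
    # Consume only complete lines; a trailing partial line is left for the next call.
    end = data.rfind(b"\n") + 1
    offsets[path] = position + end
    return data[:end].decode("utf-8").splitlines()


if __name__ == "__main__":
    offsets = {}  # hypothetical in-memory offset store
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.txt")
        with open(path, "w") as f:
            f.write("first\n")
        print(read_new_lines(path, offsets))
        with open(path, "a") as f:
            f.write("second\n")
        print(read_new_lines(path, offsets))
```

Because the offset is stored separately from the reader, a restarted process that reloads `offsets` would continue where the previous one stopped instead of re-reading the whole file.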
Architecture and Components
The main concepts are Connectors, Tasks, Workers, Converters, and Transforms.
Connectors: Define the source or sink system and the data flow direction.
Tasks: Parallel units of work that execute the actual data transfer; in distributed mode, their configuration and status are stored in Kafka topics.
Workers: Processes that run connectors and tasks; can be standalone or part of a distributed cluster.
Converters: Translate between Kafka Connect’s internal data format and the serialized bytes stored in Kafka (e.g., Avro, JSON).
Transforms: Simple functions (Single Message Transforms) that modify individual records on the fly; more complex transformations can be done with KSQL or Kafka Streams.
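To make the converter role concrete, the following Python sketch mimics, for plain string values only, the JSON that JsonConverter produces when `schemas.enable` is true versus false. The function name `to_json_bytes` is hypothetical; the real converter is Java code inside Kafka Connect and handles full schemas, so treat this as an approximation of the output shape rather than the implementation.

```python
import json


def to_json_bytes(value: str, schemas_enable: bool) -> bytes:
    """Approximate JsonConverter output for a required string value."""
    if schemas_enable:
        # With schemas enabled, the value is wrapped in a schema/payload envelope.
        envelope = {
            "schema": {"type": "string", "optional": False},
            "payload": value,
        }
        return json.dumps(envelope, separators=(",", ":")).encode("utf-8")
    # With schemas disabled, only the JSON-encoded payload is written.
    return json.dumps(value).encode("utf-8")


if __name__ == "__main__":
    print(to_json_bytes("hello flink01", True).decode())
    print(to_json_bytes("hello flink01", False).decode())
```

The envelope form is what a console consumer would typically show on the topic when the worker sets `value.converter.schemas.enable=true`; the sink connector's converter unwraps it again before the record is written out.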
Installation and First Experience
Kafka Connect can run in two modes:
Standalone – a single process for development or small deployments.
Distributed – a scalable cluster of workers.
Example commands, for standalone mode:
./connect-standalone.sh ../config/connect-standalone.properties ../config/connect-file-source.properties ../config/connect-file-sink.properties
and for distributed mode:
./connect-distributed.sh ../config/connect-distributed.properties
The REST API is exposed on port 8083 by default for managing connectors.
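In distributed mode, connectors are not passed on the command line; they are created through the REST API. The Python sketch below shows one way to turn a connector `.properties` file into the JSON body for `POST /connectors` and to build a status request. It only constructs the requests and does not send them. The base URL, the helper names (`parse_properties`, `create_connector_request`, `status_request`), and the simplified properties parsing are assumptions made for this illustration.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumption: a worker on the default port


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def create_connector_request(properties_text: str, base_url: str = CONNECT_URL):
    """Build (but do not send) a POST /connectors request."""
    config = parse_properties(properties_text)
    name = config.pop("name")  # the POST body carries the name as a top-level field
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_request(name: str, base_url: str = CONNECT_URL):
    """Build (but do not send) a GET /connectors/{name}/status request."""
    return urllib.request.Request(f"{base_url}/connectors/{name}/status", method="GET")


FILE_SOURCE = """
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
"""

if __name__ == "__main__":
    req = create_connector_request(FILE_SOURCE)
    print(req.get_method(), req.full_url)
    print(req.data.decode())
    # To send it, a worker must be running:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status, resp.read().decode())
```

Keeping request construction separate from sending makes the payload easy to inspect before it reaches a live worker.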
Hands‑On Example
Two connectors are used: FileStreamSource reads test.txt and publishes lines to a Kafka topic, while FileStreamSink consumes the topic and writes to test.sink.txt.
Source connector configuration (${KAFKA_HOME}/config/connect-file-source.properties):
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
Sink connector configuration (${KAFKA_HOME}/config/connect-file-sink.properties):
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test
Standalone worker configuration (${KAFKA_HOME}/config/connect-standalone.properties):
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
After starting the connectors, appending lines to test.txt (e.g., echo 'hello flink01' >> test.txt) results in the same lines appearing in test.sink.txt, demonstrating a working end‑to‑end pipeline.
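The manual check above can also be scripted. The following Python sketch appends a line to the source file and polls the sink file until the line shows up or a timeout expires. The helper names `append_line` and `wait_for_line`, the timeout, and the poll interval are assumptions chosen for this illustration; the file names come from the connector configurations above.

```python
import time


def append_line(path: str, line: str) -> None:
    """Append one line to a text file, like `echo 'line' >> path`."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def wait_for_line(path: str, expected: str,
                  timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll a file until it contains `expected` as a full line or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(path, "r", encoding="utf-8") as f:
                if expected in (line.rstrip("\n") for line in f):
                    return True
        except FileNotFoundError:
            pass  # the sink file may not exist yet if the sink task has not started
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example usage against the running pipeline from this article:
#   append_line("test.txt", "hello flink01")
#   print(wait_for_line("test.sink.txt", "hello flink01"))
```

A `False` result only means the line was not seen within the timeout; the worker logs are the place to look for the actual cause.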
The article promises a follow‑up that will explore production‑grade usage of Kafka Connect in various companies.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
