Big Data 9 min read

Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop

This guide explains how to continuously collect web‑service user behavior logs, route them through Flume agents to Kafka, and finally ingest them with Spark Streaming into HDFS, covering environment preparation, configuration files, deployment steps, and verification procedures.

Big Data Technology & Architecture

Aug 12, 2020

Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop

Scenario: continuously collect user‑behavior data from a web service and store it in a Hadoop data platform.

Preparation:

Deploy Nginx on the web server (e.g., 172.17.111.111) to generate access logs.

Data platform runs on Hadoop; final data is persisted to HDFS (e.g., 172.22.222.17‑19).

Place web server and data‑platform machines in the same data center for reliable network connectivity.

Solution & technology selection:

Use Flume to collect logs.

Send collected logs to Kafka as a unified destination.

Consume Kafka streams with Spark Streaming and write to HDFS (Hive, Kudu, etc.).

Deploy Flume in a distributed fashion: web node as source, a cluster of Flume agents on the big‑data side as sinks, communication via Avro.

Deploy Flume on the web server:

# Download
wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
# Extract
tar -zxvf apache-flume-1.8.0-bin.tar.gz
# Move to /opt
mv /home/apache-flume-1.8.0-bin /opt/flume-1.8.0

Edit the configuration file /opt/flume-1.8.0/conf/flume-conf.properties with the following content:

cd /opt/flume-1.8.0/conf
vi flume-conf.properties

# Agent definition
agent.sources=source1
agent.channels=channel1
agent.sinks=sink1 sink2 sink3

# Source (tail nginx log)
agent.sources.source1.type=exec
agent.sources.source1.command=tail -F /usr/local/nginx/logs/dev-biwx.belle.net.cn.log
agent.sources.source1.channels=channel1

# Sinks (Avro to three slave agents)
agent.sinks.sink1.type=avro
agent.sinks.sink1.channel=channel1
agent.sinks.sink1.hostname=172.22.222.17
agent.sinks.sink1.port=10000
agent.sinks.sink1.connect-timeout=200000

agent.sinks.sink2.type=avro
agent.sinks.sink2.channel=channel1
agent.sinks.sink2.hostname=172.22.222.18
agent.sinks.sink2.port=10000
agent.sinks.sink2.connect-timeout=200000

agent.sinks.sink3.type=avro
agent.sinks.sink3.channel=channel1
agent.sinks.sink3.hostname=172.22.222.19
agent.sinks.sink3.port=10000
agent.sinks.sink3.connect-timeout=200000

# Sink group for load‑balancing
agent.sinkgroups=g1
agent.sinkgroups.g1.sinks=sink1 sink2 sink3
agent.sinkgroups.g1.processor.type=load_balance
agent.sinkgroups.g1.processor.selector=round_robin

# Channel definition
agent.channels.channel1.type=file
agent.channels.channel1.checkpointDir=/var/checkpoint
agent.channels.channel1.dataDirs=/var/tmp
agent.channels.channel1.capacity=10000
agent.channels.channel1.transactionCapactiy=100

# Bind source and sinks to channel
agent.sources.source1.channels=channel1
agent.sinks.sink1.channel=channel1
agent.sinks.sink2.channel=channel1
agent.sinks.sink3.channel=channel1
:wq!

Configure each slave agent (example for slave1):

#Name the components on this agent
file2Kafka.sources = file2Kafka_source
file2Kafka.sinks = file2Kafka_sink
file2Kafka.channels = file2Kafka_channel

# Source (Avro)
file2Kafka.sources.file2Kafka_source.type = avro
file2Kafka.sources.file2Kafka_source.bind = 172.22.222.17
file2Kafka.sources.file2Kafka_source.port = 10000

# Sink (Kafka)
file2Kafka.sinks.file2Kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
file2Kafka.sinks.file2Kafka_sink.kafka.topic = testnginx
file2Kafka.sinks.file2Kafka_sink.kafka.bootstrap.servers = 172.22.222.17:9092,172.22.222.18:9092,172.22.222.20:9092
file2Kafka.sinks.file2Kafka_sink.kafka.flumeBatchSize = 20

# Channel (memory)
file2Kafka.channels.file2Kafka_channel.type = memory
file2Kafka.channels.file2Kafka_channel.capacity = 100000
file2Kafka.channels.file2Kafka_channel.dataDirs = 10000

# Bind source and sink to channel
file2Kafka.sources.file2Kafka_source.channels = file2Kafka_channel
file2Kafka.sinks.file2Kafka_sink.channel = file2Kafka_channel

Start the Flume agents on CDH nodes, then monitor their logs (example for slave1):

# Example for slave1
cd /var/log/flume-ng
tailf flume-cmf-flume-AGENT-bi-slave1.log

When the agents start without errors, launch the web‑side agent:

./flume-ng agent --conf ../conf -f ../conf/flume-conf.properties --name agent -Dflume.root.logger=INFO,console

Test the pipeline by starting a Kafka console consumer:

# On a Kafka broker
./kafka-console-consumer.sh --bootstrap-server 172.22.222.20:9092,172.22.222.17:9092,172.22.222.18:9092 --topic testnginx --from-beginning

Generate user actions on the web front‑end; the corresponding Nginx logs are captured by Flume, sent to Kafka, and displayed by the consumer, confirming that real‑time user behavior data is flowing through the system.

Finally, the article encourages readers to like, bookmark, and share the tutorial.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Kafka Hadoop Spark Streaming Flume

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.