Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop
This guide explains how to continuously collect web‑service user behavior logs, route them through Flume agents to Kafka, and finally ingest them with Spark Streaming into HDFS, covering environment preparation, configuration files, deployment steps, and verification procedures.
Scenario: continuously collect user‑behavior data from a web service and store it in a Hadoop data platform.
Preparation:
Deploy Nginx on the web server (e.g., 172.17.111.111) to generate access logs.
Data platform runs on Hadoop; final data is persisted to HDFS (e.g., 172.22.222.17‑19).
Place web server and data‑platform machines in the same data center for reliable network connectivity.
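Connectivity between the two sides can be checked before any Flume configuration is written. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils `timeout` command. The function names `port_open` and `check_hosts` and the default of SSH port 22 are illustrative assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port
# can be opened within 3 seconds. Relies on bash's /dev/tcp feature.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Report reachability for each host given as an argument.
# The port defaults to 22 (SSH), which is an assumption.
check_hosts() {
  local port="${PORT:-22}" host
  for host in "$@"; do
    if port_open "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
    fi
  done
}

# Usage, run from the web server:
#   check_hosts 172.22.222.17 172.22.222.18 172.22.222.19
#   PORT=10000 check_hosts 172.22.222.17   # once the Avro sources are listening
```

Running the same check with `PORT=10000` after the data-platform agents are up confirms that the Avro port used in this guide is reachable from the web server.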
Solution & technology selection:
Use Flume to collect logs.
Send collected logs to Kafka as a unified destination.
Consume Kafka streams with Spark Streaming and write to HDFS (Hive, Kudu, etc.).
Deploy Flume in a distributed fashion: an agent on the web node tails the Nginx log and forwards events over Avro to a cluster of Flume agents on the big-data side, which receive them through Avro sources and write them to Kafka.
Deploy Flume on the web server:
# Download
wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
# Extract
tar -zxvf apache-flume-1.8.0-bin.tar.gz
# Move to /opt
mv /home/apache-flume-1.8.0-bin /opt/flume-1.8.0
Edit the configuration file /opt/flume-1.8.0/conf/flume-conf.properties with the following content:
cd /opt/flume-1.8.0/conf
vi flume-conf.properties
# Agent definition
agent.sources=source1
agent.channels=channel1
agent.sinks=sink1 sink2 sink3
# Source (tail nginx log)
agent.sources.source1.type=exec
agent.sources.source1.command=tail -F /usr/local/nginx/logs/dev-biwx.belle.net.cn.log
agent.sources.source1.channels=channel1
# Sinks (Avro to three slave agents)
agent.sinks.sink1.type=avro
agent.sinks.sink1.channel=channel1
agent.sinks.sink1.hostname=172.22.222.17
agent.sinks.sink1.port=10000
agent.sinks.sink1.connect-timeout=200000
agent.sinks.sink2.type=avro
agent.sinks.sink2.channel=channel1
agent.sinks.sink2.hostname=172.22.222.18
agent.sinks.sink2.port=10000
agent.sinks.sink2.connect-timeout=200000
agent.sinks.sink3.type=avro
agent.sinks.sink3.channel=channel1
agent.sinks.sink3.hostname=172.22.222.19
agent.sinks.sink3.port=10000
agent.sinks.sink3.connect-timeout=200000
# Sink group for load‑balancing
agent.sinkgroups=g1
agent.sinkgroups.g1.sinks=sink1 sink2 sink3
agent.sinkgroups.g1.processor.type=load_balance
agent.sinkgroups.g1.processor.selector=round_robin
# Channel definition
agent.channels.channel1.type=file
agent.channels.channel1.checkpointDir=/var/checkpoint
agent.channels.channel1.dataDirs=/var/tmp
agent.channels.channel1.capacity=10000
agent.channels.channel1.transactionCapacity=100
# Bind source and sinks to channel
agent.sources.source1.channels=channel1
agent.sinks.sink1.channel=channel1
agent.sinks.sink2.channel=channel1
agent.sinks.sink3.channel=channel1
:wq!
Configure each slave agent (example for slave1):
#Name the components on this agent
file2Kafka.sources = file2Kafka_source
file2Kafka.sinks = file2Kafka_sink
file2Kafka.channels = file2Kafka_channel
# Source (Avro)
file2Kafka.sources.file2Kafka_source.type = avro
file2Kafka.sources.file2Kafka_source.bind = 172.22.222.17
file2Kafka.sources.file2Kafka_source.port = 10000
# Sink (Kafka)
file2Kafka.sinks.file2Kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
file2Kafka.sinks.file2Kafka_sink.kafka.topic = testnginx
file2Kafka.sinks.file2Kafka_sink.kafka.bootstrap.servers = 172.22.222.17:9092,172.22.222.18:9092,172.22.222.20:9092
file2Kafka.sinks.file2Kafka_sink.kafka.flumeBatchSize = 20
# Channel (memory)
file2Kafka.channels.file2Kafka_channel.type = memory
file2Kafka.channels.file2Kafka_channel.capacity = 100000
file2Kafka.channels.file2Kafka_channel.transactionCapacity = 10000
# Bind source and sink to channel
file2Kafka.sources.file2Kafka_source.channels = file2Kafka_channel
file2Kafka.sinks.file2Kafka_sink.channel = file2Kafka_channel
Start the Flume agents on the CDH nodes, then monitor their logs (example for slave1):
# Example for slave1
cd /var/log/flume-ng
tailf flume-cmf-flume-AGENT-bi-slave1.log
When the agents start without errors, launch the web‑side agent:
./flume-ng agent --conf ../conf -f ../conf/flume-conf.properties --name agent -Dflume.root.logger=INFO,console
Test the pipeline by starting a Kafka console consumer:
# On a Kafka broker
./kafka-console-consumer.sh --bootstrap-server 172.22.222.20:9092,172.22.222.17:9092,172.22.222.18:9092 --topic testnginx --from-beginning
Generate user actions on the web front‑end. The corresponding Nginx log lines are captured by Flume, sent to Kafka, and displayed by the consumer, which confirms that real‑time user behavior data is flowing through the system.
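To make this check less dependent on reading the consumer output by eye, one option is to send a request that carries a unique marker and then search the topic for it. The sketch below assumes bash and `curl`, that Nginx serves HTTP on 172.17.111.111 port 80, and that the log format records the request line including the query string. The function names and the `probe` query parameter are hypothetical. `--timeout-ms` makes the console consumer exit when no message arrives within the given interval.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a marker that is unlikely to already be in the log.
make_marker() {
  echo "flume-probe-$(date +%s)-$$"
}

# Hypothetical helper: read log lines on stdin and print how many contain the marker.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the function successful.
count_marker() {
  grep -c -F -- "$1" || true
}

# Usage (not run automatically; hosts and topic are taken from this guide):
#   marker=$(make_marker)
#   curl -s -o /dev/null "http://172.17.111.111/?probe=${marker}"
#   ./kafka-console-consumer.sh \
#       --bootstrap-server 172.22.222.17:9092 \
#       --topic testnginx --from-beginning --timeout-ms 15000 \
#     | count_marker "$marker"
```

A count of 1 or more suggests that the request travelled from Nginx through both Flume tiers into Kafka. A count of 0 points to a break somewhere along that path, and the agent logs are the next place to look.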
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.