
Why NiFi Beats Flink: Practical Data Flow for Recommendation Engines

This article explains why the team prefers Apache NiFi over Flink or Storm for data‑flow handling in information‑stream recommendation systems, outlines NiFi’s core components, features, cluster setup, custom processor development, and real‑world use cases such as HDFS, Elasticsearch, and RocketMQ integrations.

Mafengwo Technology

Why Not Choose Flink

In information‑stream recommendation scenarios, data is the raw material for model iteration and a key driver of metric growth, and the "data flow" runs through the entire recommendation pipeline. Traditional solutions often use Flink or Storm (hereafter "FS") to build data‑flow pipelines, but many cases require only simple data transport with no heavy transformation, making FS needlessly costly to develop and operate.

Apache NiFi, another member of the Apache family, excels at handling and distributing data flows. For a simple requirement such as adding a data path to Elasticsearch, using FS could take a morning for development, testing, deployment, and verification, whereas NiFi can accomplish the same in about five minutes.

This article introduces NiFi’s characteristics and usage within a recommendation engine, aiming to inspire further exploration.

NiFi as a Process‑Oriented Big Data Framework

NiFi was originally developed by the U.S. National Security Agency (NSA) as a visual, customizable data‑integration product. In 2014 the NSA contributed it to the Apache open‑source community, and it became a top‑level Apache project in July 2015.

2.1 NiFi Features

NiFi is designed for data‑flow management and can build pipelines between data centers. Its drag‑and‑drop UI, configurable parameters, and simple connections enable users to host data flows and automate inter‑system transfers without writing code. Compared with FS, NiFi offers:

Web UI for dragging and configuring components

No code development required for users

Support for multiple data sources

Automatic load balancing and back‑pressure

Easy monitoring

Scalable and recoverable architecture

Template reuse

The following section demonstrates how NiFi’s visual workflow translates to backend execution.

2.2 Framework and Cluster

NiFi runs on the Java Virtual Machine and consists of three core components: Web Server, Flow Controller, and Repository.

NiFi architecture diagram

Web Server: provides the HTTP‑based web UI for building and operating flows.

Flow Controller: the core component. It maintains and manages the connections between Processors, which are the actual processing units; each piece of functionality is encapsulated in a Processor.

NiFi ships with many built‑in Processors (for example, Amazon, attribute‑manipulation, and Hadoop processors) that can simply be dragged onto the canvas and configured.

If the built‑in Processors do not meet business needs, custom Processors can be developed.

Repository: NiFi maintains three repositories (FlowFile, Content, and Provenance), which store flow state, the actual data content, and provenance information respectively.
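The three repositories are backed by on‑disk directories configured in nifi.properties. A minimal sketch (the paths shown are the stock NiFi defaults, not the team's actual layout):

```
# nifi.properties: repository locations (values are the NiFi defaults)
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository
```

In production these typically point at dedicated disks, since the Content repository in particular holds the full payload of every FlowFile in flight.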

NiFi also supports cluster mode, where each node performs identical operations on different data. The cluster relies on ZooKeeper to elect a primary node and coordinate heartbeats.

NiFi cluster diagram

NiFi in Recommendation Engine Platforms: Applications and Extensions

3.1 Application Status

NiFi is already used in many online tasks of the recommendation platform, such as real‑time user behavior data landing in HDFS, exposure events stored in Elasticsearch, session data synchronization, and interest tag persistence in MySQL.

NiFi usage diagram

For example, a real‑time user‑behavior pipeline to HDFS uses Processors such as EvaluateJsonPath (evaluates JsonPath expressions against FlowFile content and writes the results to FlowFile attributes) and UpdateAttribute (simple attribute/field processing). Configuring these Processors and launching the job takes only minutes, dramatically improving efficiency.
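As a sketch of how such a pipeline is wired up: with its Destination property set to flowfile-attribute, an EvaluateJsonPath processor takes dynamic properties mapping an attribute name to a JsonPath expression. The field and attribute names below are illustrative, not the team's actual schema:

```
# EvaluateJsonPath dynamic properties (attribute name = JsonPath expression)
# Destination = flowfile-attribute
user.id   = $.userId
item.id   = $.itemId
event     = $.eventType
event.ts  = $.ts
```

A downstream UpdateAttribute processor can then derive further attributes from these values using NiFi Expression Language before the FlowFile is routed to PutHDFS.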

Key Processor groups/jobs include:

Real‑time user behavior data landing in Hive tables for hourly model training.

Streaming behavior data to Elasticsearch for monitoring (CTR, recall, ranking scores, etc.).

User‑screen profile snapshots stored in Elasticsearch to support historical queries and troubleshooting of online cases.

Support for content‑mining teams by storing recommendation pool image scores.

3.2 Extension Development

When business complexity grows, custom Processors may be needed. For instance, NiFi does not natively support RocketMQ, so a custom Processor is built by extending the AbstractProcessor class, implementing onTrigger, and adding lifecycle methods annotated with @OnScheduled and @OnStopped.

// Property exposed in the NiFi UI for configuring the target topic
private final PropertyDescriptor TOPIC = new PropertyDescriptor.Builder()
    .name("TOPIC")
    .displayName("TOPIC")
    .description("The RocketMQ topic that messages are sent to")
    .required(true)
    .addValidator(StandardValidators.NON_BLANK_VALIDATOR)
    .expressionLanguageSupported(ExpressionLanguageScope.NONE)
    .build();

The following diagram shows a Processor Group that continuously consumes a RocketMQ topic and writes data to HDFS.

RocketMQ to HDFS Processor Group

Key methods:

@Override
public void onTrigger(ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowfile = session.get();
    if (flowfile == null) {
        // Nothing queued on the incoming connection; wait for the next trigger
        return;
    }
    try {
        String topic = context.getProperty(TOPIC).getValue();
        // Read the FlowFile content and forward it to RocketMQ asynchronously
        session.read(flowfile, in -> {
            String message = IOUtils.toString(in);
            byte[] messageBytes = message.getBytes(RemotingHelper.DEFAULT_CHARSET);
            Message msg = new Message(topic, messageBytes);
            try {
                producer.send(msg, new SendCallback() {
                    @Override public void onSuccess(SendResult sendResult) {}
                    @Override public void onException(Throwable e) { getLogger().error("producer send failed", e); }
                });
            } catch (Exception e) { getLogger().error("producer send failed", e); }
        });
        session.getProvenanceReporter().send(flowfile, "rocketMQ");
        session.transfer(flowfile, SUCCESS);
    } catch (Exception e) { getLogger().error("onTrigger failed", e); }
}

@OnScheduled
public void onScheduled(final ProcessContext context) throws MQClientException {
    // Sketch of the producer initialization; the group name and the
    // NAMESRV_ADDR property descriptor are illustrative, not the team's actual names
    producer = new DefaultMQProducer("nifi_rocketmq_producer");
    producer.setNamesrvAddr(context.getProperty(NAMESRV_ADDR).getValue());
    producer.start();
}

Stopping a Processor triggers @OnStopped, which shuts down resources.

@OnStopped
public void stopProducer() {
    // Normal lifecycle event, so log at info level rather than error
    getLogger().info("OnStopped: " + producer);
    invalidProducer();
}

private synchronized void invalidProducer() {
    if (producer != null) { producer.shutdown(); }
    producer = null;
}

Practical Deployment

4.1 NiFi Cluster Setup

NiFi clusters depend on ZooKeeper. The team uses an external ZooKeeper ensemble. Core configuration resides in nifi.properties:

nifi.web.http.host=your_host
nifi.web.http.port=your_port
nifi.remote.input.host=your_host
nifi.remote.input.socket.port=port
nifi.cluster.is.node=true
nifi.cluster.node.address=your_host
nifi.cluster.node.protocol.port=port
nifi.zookeeper.connect.string=address1,address2,...
nifi.cluster.load.balance.port=port

The bootstrap.conf file sets JVM memory and GC logging:

java.arg.2=-Xms5120m
java.arg.3=-Xmx5120m
java.arg.20=-XX:+PrintGCDetails
java.arg.21=-XX:+PrintGCTimeStamps
java.arg.22=-XX:+PrintGCDateStamps
java.arg.23=-Xloggc:path_of_your_gc.log

The startup script nifi.sh defines the web process memory:

run_nifi_cmd="'${JAVA}' -cp '${BOOTSTRAP_CLASSPATH}' -Xms1024m -Xmx1024m ..."

4.2 Adding and Configuring Processors

After the cluster is up, access the Web UI via the configured IP and port. Use the toolbar to add Processor Groups and Processors. For a real‑world requirement of writing data from Kafka to Redis, the steps are:

Create a Processor Group by dragging the Processor Group icon from the toolbar onto the canvas, naming it, and clicking ADD.

Add a GetKafka Processor to consume Kafka data, configure its properties (broker list, topic, etc.), and set scheduling options.

Add a PutRedis Processor, configure connection details, and link it to the previous Processor.
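A minimal property sketch for the two processors follows. The host names, topic, and key pattern are illustrative, and PutRedis is not a stock NiFi processor, so its property names here are assumptions about the team's extension:

```
# GetKafka (illustrative values)
ZooKeeper Connection String : zk1:2181,zk2:2181,zk3:2181
Topic Name                  : user_behavior
Batch Size                  : 100

# PutRedis (custom extension; property names are assumed)
Redis Connection            : redis-host:6379
Key                         : ${kafka.topic}:${uuid}
```

The `${...}` values are NiFi Expression Language references to FlowFile attributes, so each consumed Kafka message is written under a distinct key.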

Commonly used Processors include:

GetKafka/PutKafka

GetHDFS/PutHDFS

PutElasticsearch

UpdateCounter

GetRocketMQ/PutRocketMQ (custom)

PutSQL

ExecuteSQL

Finally, right‑click a Processor and select **Start** to launch the job.

Conclusion

This article summarized the application and practice of Apache NiFi in an information‑stream recommendation engine. The team is now investigating driving all online recommendation business through NiFi's UI‑based configuration, eliminating manual database changes. The team's extension components and the official NiFi repositories are linked for further exploration.

Tags: Big Data, NiFi, Processor Development
Written by

Mafengwo Technology

External communication platform of the Mafengwo Technology team, regularly sharing articles on advanced tech practices, tech exchange events, and recruitment.
