Building a Scalable Ad Attribution Platform: Architecture & Real‑Time Data Flow

This article explains how to design and implement a scalable ad attribution platform, covering data collection, real‑time processing with Kafka, storage in HBase, deduplication strategies, attribution models, and configurable media integration to maximize ROI for marketers.

Huolala Tech

Aug 3, 2023

Background

Marketers aim to maximize ROI across many advertising channels, but they often struggle to answer questions such as:

Which channel and specific ad plan did a user come from?

Which channel brings the most new users or is best for activation?

Which channel has the best conversion and the most paying users?

How can low‑cost, high‑volume campaigns be run?

How can the advertising model be optimized?

Attribution platforms help answer these questions by providing detailed channel‑level data, enabling ROI calculation, audience analysis, dynamic optimization, and objective justification for marketing budgets.

John Wanamaker once said, “Half of my advertising budget is wasted, but I don’t know which half.”

With ad attribution technology, marketers can identify the ineffective half and make informed decisions.

Business Overview

An ad attribution platform tracks and analyzes ad performance, helping marketers understand how users reach the product through various channels, evaluate campaign effectiveness, and optimize media models to improve conversion rates and ROI. Common platforms include AppsFlyer, Adjust, Kochava, and Umeng.

The attribution flow typically follows click‑based mapping, consisting of several steps:

1. Ad Delivery

Define campaign strategy, select platforms, set budgets, and aim to expose the product or service.

2. User Click

The click is the user’s touchpoint that leads to the landing page.

3. Media Reports Click Data

Media sends click data containing device identifiers (IMEI, OAID, IDFA, MAC) and ad parameters.

4. Event Reporting

App‑side events are uploaded to the server or cloud, requiring accurate, secure, and stable transmission.

5. Event Consumption

Event data is pushed to Kafka and consumed by the attribution platform.

6. Attribution

The core step matches event data with click data using models such as last‑click, first‑click, average, time decay, or position‑based attribution.

7. Data Reporting

Attributed results are returned to the media platforms so that they can refine targeting, expand audiences, and deliver ads more precisely.

8. Data Analysis

Media metrics (impressions, clicks, downloads, payments) are analyzed to optimize ad strategy, targeting, placement, and timing.

Architecture Design

Technical Architecture

The system is split into four services: data collection, real‑time attribution, data reporting, and a configuration/monitoring platform.

Data Collection

Expose external APIs for media data ingestion and authorization callbacks.

Provide device‑level deduplication to distinguish new users from returning ones.

Require high stability and elastic scaling.

Real‑Time Attribution

Process only attribution logic, decoupled from media.

Push successful attributions to Kafka.

Consume all event data from Kafka, perform device matching, and support elastic scaling.

Data Reporting

Consume attribution results from Kafka and run as an independent outbound service, so the risk of calling external media APIs stays contained.

Low performance requirements because attributed volume is relatively small.

Configuration & Monitoring Platform

Internal visualization platform for media configuration, event management, attribution plan management, monitoring, and querying.

Key Technical Solutions

Data Stability and Storage

Peak traffic occurs around 9:30 am and 3:00 pm, with a trough around 4:00 am. Kafka buffers the spikes so that downstream services consume at a steady rate instead of absorbing the peak load directly.
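The buffering effect can be illustrated with a little arithmetic. The sketch below is a toy model, not a Kafka client: a queue receives a hypothetical number of events per tick and a consumer drains it at a fixed capacity. The class name, the load figures, and the capacity are all assumptions chosen for illustration.

```java
import java.util.Arrays;

/** Toy model of peak shaving: a queue absorbs bursts while the consumer drains at a fixed rate. */
final class PeakShavingDemo {

    /**
     * Returns the backlog left in the queue after each tick.
     * arrivals[i] is the number of events produced in tick i;
     * capacityPerTick is how many events the consumer can process per tick.
     */
    static long[] backlogPerTick(long[] arrivals, long capacityPerTick) {
        long[] backlog = new long[arrivals.length];
        long queued = 0;
        for (int i = 0; i < arrivals.length; i++) {
            queued += arrivals[i];                       // the burst lands in the queue
            queued -= Math.min(queued, capacityPerTick); // the consumer drains at its own pace
            backlog[i] = queued;
        }
        return backlog;
    }

    public static void main(String[] args) {
        long[] arrivals = {100, 900, 300, 100, 50, 50}; // hypothetical load with one spike
        System.out.println(Arrays.toString(backlogPerTick(arrivals, 400)));
        // prints [0, 500, 400, 100, 0, 0]
    }
}
```

In this model the consumer never handles more than 400 events per tick even though 900 arrive at the peak; the cost is a temporary backlog, which is why the attribution window has to tolerate some processing delay.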

HBase is used for storage because it supports massive real‑time reads/writes, random access, space‑efficient null fields, high scalability, and eliminates the need for joins.

The RowKey should be designed to avoid hotspots and kept to 16 bytes or less.
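One common way to meet both RowKey constraints is to hash the natural key. The sketch below is an assumption about how such a key could be built, not the platform's actual schema: it hashes a hypothetical mediaId and deviceId pair with MD5, whose digest is exactly 16 bytes and is spread evenly across the key space.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Builds fixed-length, evenly distributed row keys (hypothetical helper). */
final class RowKeys {
    private RowKeys() {}

    /** Returns a 16-byte key derived from the media and device identifiers. */
    static byte[] clickRowKey(String mediaId, String deviceId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // The separator keeps ("ab", "c") and ("a", "bc") from hashing to the same key.
            return md5.digest((mediaId + ":" + deviceId).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }
}
```

The trade-off is that a hashed key gives up meaningful range scans: rows can only be fetched by recomputing the key from the same inputs, which suits point lookups by device.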

Data Querying

Redis caches recent click data to mitigate latency during peak periods, ensuring that click‑to‑install windows (≈30 s) are respected.

Redis keys:

Precise attribution: att:{mediaId}:{deviceId}

Fuzzy attribution: att:{mediaId}:{IP}:{osVersion}:{model} (2‑minute window)

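import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

/**
 * Pseudo‑code
 * Click data Redis expiration time (in seconds, matching the TimeUnit used below)
 */
@Value("${redisClickDataExp:30}")
private Integer redisClickDataExp;

final String[] IDENTIFIER_FIELDS = {"oaid", "imei", "androidId", "idfa"};

// Precise attribution: one key per identifier that the click actually carries
for (String field : IDENTIFIER_FIELDS) {
    Object value = attributionDataMap.get(field);
    if (value == null || value.toString().isEmpty()) {
        continue; // skip missing identifiers so unrelated devices never share a "null" key
    }
    // Separators keep adjacent parts from running together; the lookup side must build the same string
    String key = "att:" + PromoteClient.USER.getCode() + ":" + field + ":" + value;
    redisTemplate.opsForValue().set(DigestUtils.md5DigestAsHex(key.getBytes(StandardCharsets.UTF_8)),
        JsonUtil.toJSONString(attributionData), redisClickDataExp, TimeUnit.SECONDS);
}

// Fuzzy attribution: IP + OS version + model, kept for a 2-minute window
String fuzzyKey = "att:" + PromoteClient.USER.getCode() + ":" + attributionData.getProperties().getIp()
    + ":" + attributionData.getProperties().getOsVersion() + ":" + attributionData.getProperties().getModel();
redisTemplate.opsForValue().set(DigestUtils.md5DigestAsHex(fuzzyKey.getBytes(StandardCharsets.UTF_8)),
    JsonUtil.toJSONString(attributionData), 2, TimeUnit.MINUTES);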

Real‑Time Attribution Logic

To keep business logic separate from data, Drools rule engine is used. Each consumer instance loads rules dynamically, allowing runtime configuration without code changes.

@Override
public void handle(ConsumerRecord<Object, Object> record) {
    // Deserialize event data; skip records that cannot be parsed
    AttributionData attributionData = JsonUtil.parseJson(String.valueOf(record.value()), AttributionData.class);
    if (attributionData == null) {
        return;
    }

    // Execute Drools rules against the event (a stateless session keeps no facts between calls)
    StatelessKieSession kieSession = evenTrackRulesService.getKieSession();
    kieSession.execute(attributionData);
}

Rules can be loaded on the fly:

@Override
public StatelessKieSession getKieSession() {
    // Lazily create the session from the current container and cache it
    if (kieSession == null) {
        kieSession = kieContainer.newStatelessKieSession();
        kieSession.setGlobal("ss", strategyService); // exposed to the rules as the global "ss"
    }
    return kieSession;
}

// Compiles the DRL text of every configured rule into a fresh KieContainer
private KieContainer loadContainer(Map<String, EventTrackRuleEntity> ruleEntityMap) {
    KieServices ks = KieServices.Factory.get();
    KieRepository kr = ks.getRepository();
    KieFileSystem kfs = ks.newKieFileSystem();
    for (Map.Entry<String, EventTrackRuleEntity> entry : ruleEntityMap.entrySet()) {
        String drl = entry.getValue().getRule();
        // Each rule becomes a virtual .drl file; the file name only needs to be unique
        kfs.write("src/main/resources/" + drl.hashCode() + ".drl", drl);
    }
    KieBuilder kb = ks.newKieBuilder(kfs);
    kb.buildAll();
    // Refuse to publish a rule set that does not compile
    if (kb.getResults().hasMessages(Message.Level.ERROR)) {
        log.error("rule online failed, msg: {}", kb.getResults().toString());
        throw new BizException(BizErrorCode.DROOL_K_CONTAINER_ERR, kb.getResults().toString());
    }
    return ks.newKieContainer(kr.getDefaultReleaseId());
}
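The snippet above does not show how a rebuilt container replaces the cached session while consumers are running. One possible approach, sketched here with plain Java predicates standing in for Drools rules, is to publish each fully built rule set through an atomic reference so that readers always see either the old set or the new one. RuleRegistry, reload, and matches are hypothetical names and are not part of the Drools API.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

/** Holds the currently active rule set; readers never observe a half-built set. */
final class RuleRegistry<T> {
    private final AtomicReference<List<Predicate<T>>> active =
            new AtomicReference<>(List.of());

    /** Publishes a fully built rule set in one atomic step. */
    void reload(List<Predicate<T>> newRules) {
        active.set(List.copyOf(newRules)); // immutable copy, safe to share between threads
    }

    /** Returns true if any active rule accepts the event. */
    boolean matches(T event) {
        // get() returns a snapshot, so a concurrent reload cannot change it mid-iteration
        for (Predicate<T> rule : active.get()) {
            if (rule.test(event)) {
                return true;
            }
        }
        return false;
    }
}
```

The same pattern would apply to a cached session: build the new container first, and only swap the reference once the build has succeeded.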

Deduplication

Device‑level deduplication uses AndroidID, IDFA/IDFV, or OPEN_ID. Business‑level deduplication uses userId or driverId to avoid reliance on device IDs and protect privacy.

Deep deduplication combines device ID with event type to prevent duplicate conversions during the attribution window.

// Device-level deduplication: stop if this device already has an attributed install
if (Boolean.TRUE.equals(attributionData.isDeviceIdDeduplication())) {
    Long installedId = eventTrackAppInstallService.findInstalledIdByDeviceId(deviceId);
    if (installedId != null) {
        return;
    }
}

// Business-level deduplication: stop if this user already has an attributed install
if (Boolean.TRUE.equals(attributionData.isUserIdDeduplication())) {
    Long installedId = eventTrackAppInstallService.findInstalledIdByUserId(userId);
    if (installedId != null) {
        return;
    }
}

// Deep deduplication: stop if this upload ID has already been attributed
if (Boolean.TRUE.equals(attributionData.isDeepDeduplication())) {
    Long installedId = eventTrackAppInstallService.findInstalledIdByUploadId(uploadId);
    if (installedId != null) {
        return;
    }
}
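The deep deduplication rule described above (device ID combined with event type, within the attribution window) can be sketched in isolation. This is a minimal in-memory illustration under assumed names (DeepDeduplicator, accept) and an assumed window length; a real deployment would need a store shared by all consumer instances.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Rejects repeated (device, event type) pairs inside the window. Not thread-safe. */
final class DeepDeduplicator {
    private final Duration window;
    private final Map<String, Instant> lastAccepted = new HashMap<>();

    DeepDeduplicator(Duration window) {
        this.window = window;
    }

    /** Returns true if the event should be attributed, false if it is a duplicate. */
    boolean accept(String deviceId, String eventType, Instant eventTime) {
        String key = deviceId + "|" + eventType;
        Instant previous = lastAccepted.get(key);
        // Events older than the last accepted one also count as duplicates here.
        if (previous != null && Duration.between(previous, eventTime).compareTo(window) < 0) {
            return false;
        }
        lastAccepted.put(key, eventTime);
        return true;
    }
}
```

With a 7-day window, a second install event from the same device one day later is rejected, while a payment event from that device, or an install event eight days later, is accepted.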

Attribution Models

Last‑click: credit goes to the final click.

First‑click: credit goes to the first click.

Average: credit is split evenly across all touches.

Time decay: recent interactions receive more credit.

Position‑based: the first and last clicks share 40 % of the credit, and the remaining 60 % is distributed evenly across the clicks in between.
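The five models differ only in how they split one unit of credit across a user's clicks, which a short sketch can make concrete. The class below is illustrative rather than the platform's implementation: the half-life form of time decay is an assumption (the text does not specify a decay function), and endpointShare is a hypothetical parameter for the combined share of the first and last clicks, so 0.4 corresponds to reading the 40 % figure as that combined share. Every method assumes at least one click, ordered oldest first.

```java
import java.util.Arrays;

/** Credit weights per click, ordered from first click to last; each result sums to 1. */
final class AttributionModels {
    private AttributionModels() {}

    static double[] lastClick(int n) {
        double[] credit = new double[n];
        credit[n - 1] = 1.0;
        return credit;
    }

    static double[] firstClick(int n) {
        double[] credit = new double[n];
        credit[0] = 1.0;
        return credit;
    }

    /** "Average" model: every click gets the same share. */
    static double[] linear(int n) {
        double[] credit = new double[n];
        Arrays.fill(credit, 1.0 / n);
        return credit;
    }

    /** Weight halves for every halfLifeDays of click age, then weights are normalized. */
    static double[] timeDecay(double[] ageDays, double halfLifeDays) {
        double[] credit = new double[ageDays.length];
        double total = 0;
        for (int i = 0; i < ageDays.length; i++) {
            credit[i] = Math.pow(0.5, ageDays[i] / halfLifeDays);
            total += credit[i];
        }
        for (int i = 0; i < credit.length; i++) {
            credit[i] /= total;
        }
        return credit;
    }

    /** First and last clicks split endpointShare; middle clicks split the rest evenly. */
    static double[] positionBased(int n, double endpointShare) {
        if (n <= 2) {
            return linear(n); // no middle clicks to receive the remainder
        }
        double[] credit = new double[n];
        Arrays.fill(credit, (1.0 - endpointShare) / (n - 2));
        credit[0] = endpointShare / 2;
        credit[n - 1] = endpointShare / 2;
        return credit;
    }
}
```

For four clicks, positionBased(4, 0.4) yields 0.2, 0.3, 0.3, 0.2, and timeDecay with click ages of 7 and 0 days and a 7-day half-life yields one third and two thirds.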

Mapping vs. Link Attribution

Mapping attribution uses device identifiers (IMEI > OAID > AndroidID > Model+IP) with a window period; link attribution relies only on channel identifiers and cannot support user‑level analysis.
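The identifier priority and window check of mapping attribution can be sketched as a lookup that tries the strongest identifier first and falls back to weaker ones. This is an in-memory illustration under assumptions: ClickMatcher, the identifier type names, and the "modelIp" composite key are hypothetical, and a plain map stands in for the click store.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ClickMatcher {
    /** Identifier types from strongest to weakest, following the order given in the text. */
    static final List<String> PRIORITY = List.of("imei", "oaid", "androidId", "modelIp");

    static final class Click {
        final String channel;
        final Instant clickedAt;

        Click(String channel, Instant clickedAt) {
            this.channel = channel;
            this.clickedAt = clickedAt;
        }
    }

    private final Map<String, Click> clicksByKey = new HashMap<>();
    private final Duration window;

    ClickMatcher(Duration window) {
        this.window = window;
    }

    void recordClick(String idType, String idValue, Click click) {
        clicksByKey.put(idType + ":" + idValue, click);
    }

    /** Returns the click found via the highest-priority identifier that is inside the window. */
    Optional<Click> match(Map<String, String> identifiers, Instant eventTime) {
        for (String idType : PRIORITY) {
            String value = identifiers.get(idType);
            if (value == null || value.isEmpty()) {
                continue; // the event does not carry this identifier
            }
            Click click = clicksByKey.get(idType + ":" + value);
            if (click == null || click.clickedAt.isAfter(eventTime)) {
                continue; // no click, or the click happened after the event
            }
            if (Duration.between(click.clickedAt, eventTime).compareTo(window) <= 0) {
                return Optional.of(click);
            }
        }
        return Optional.empty();
    }
}
```

An event that carries both an OAID and a model+IP pair is credited to the click found under the OAID; the model+IP click is used only when no stronger identifier matches inside the window.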

Implementation Highlights

Media configuration via UI reduces integration time and eliminates code changes.

Rule‑engine based event filtering enables dynamic selection of millions of event points.

Configurable request bodies use a tree‑structured model to map arbitrary media schemas.

Front‑end visualizations built with Vue.js provide reporting dashboards.
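The tree‑structured request body model mentioned in the highlights can be pictured as a template whose leaves name the attribution field to copy and whose inner nodes become nested objects. The sketch below is one possible shape for such a model; BodyNode and the media-side field names are hypothetical and do not come from any real media API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One node of a request-body template: either a leaf bound to a source field or an object. */
final class BodyNode {
    final String name;         // field name expected by the media API
    final String sourceField;  // attribution field to copy; null for an object node
    final List<BodyNode> children;

    private BodyNode(String name, String sourceField, List<BodyNode> children) {
        this.name = name;
        this.sourceField = sourceField;
        this.children = children;
    }

    static BodyNode leaf(String name, String sourceField) {
        return new BodyNode(name, sourceField, List.of());
    }

    static BodyNode object(String name, BodyNode... children) {
        return new BodyNode(name, null, List.of(children));
    }

    /** Renders this node: a leaf yields its value, an object yields a nested map. */
    Object render(Map<String, String> data) {
        if (sourceField != null) {
            return data.get(sourceField);
        }
        Map<String, Object> body = new LinkedHashMap<>(); // keeps the configured field order
        for (BodyNode child : children) {
            body.put(child.name, child.render(data));
        }
        return body;
    }
}
```

A template such as object("root", leaf("event_type", "eventType"), object("device", leaf("oaid", "oaid"))) rendered against an event with eventType "install" and oaid "o-1" produces a map equivalent to {event_type=install, device={oaid=o-1}}, ready to be serialized as JSON. Because the template is data rather than code, a new media schema can be stored as configuration.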

Future Directions

Enhanced data analysis for media effectiveness and audience profiling.

Anti‑fraud mechanisms with deeper event correlation.

Full‑chain data monitoring and alerting.

Integration with major market APIs for campaign creation, material management, and A/B testing.
