Building a Scalable Ad Attribution Platform: Architecture & Real‑Time Data Flow
This article explains how to design and implement a scalable ad attribution platform, covering data collection, real‑time processing with Kafka, storage in HBase, deduplication strategies, attribution models, and configurable media integration to maximize ROI for marketers.
Background
Marketers aim to maximize ROI across many advertising channels, but they often struggle to answer questions such as:
Which channel and specific ad plan a user came from?
Which channel brings the most new users or is best for activation?
Which channel has the best conversion and most paying users?
How to run low‑cost high‑volume campaigns?
How to optimize the advertising model?
Attribution platforms help answer these questions by providing detailed channel‑level data, enabling ROI calculation, audience analysis, dynamic optimization, and objective justification for marketing budgets.
John Wanamaker once said, “Half of my advertising budget is wasted, but I don’t know which half.”
With ad attribution technology, marketers can identify the ineffective half and make informed decisions.
Business Overview
An ad attribution platform tracks and analyzes ad performance. It helps marketers understand how users reach the product through various channels, evaluate campaign effectiveness, and optimize media models to improve conversion rates and ROI. Common platforms include AppsFlyer, Adjust, Kochava, and Umeng.
The attribution flow typically follows click‑based mapping, consisting of several steps:
1. Ad Delivery
Define campaign strategy, select platforms, set budgets, and aim to expose the product or service.
2. User Click
The click is the user’s touchpoint that leads to the landing page.
3. Media Reports Click Data
Media sends click data containing device identifiers (IMEI, OAID, IDFA, MAC) and ad parameters.
4. Event Reporting
App‑side events are uploaded to the server or cloud, requiring accurate, secure, and stable transmission.
5. Event Consumption
Event data is pushed to Kafka and consumed by the attribution platform.
6. Attribution
The core step matches event data with click data using models such as last‑click, first‑click, average, time decay, or position‑based attribution.
7. Data Reporting
Attributed results are returned to media to refine targeting, expand audiences, and enable precise push.
8. Data Analysis
Media metrics (impressions, clicks, downloads, payments) are analyzed to optimize ad strategy, targeting, placement, and timing.
Architecture Design
Technical Architecture
The system is split into four services: data collection, real‑time attribution, data reporting, and a configuration/monitoring platform.
Data Collection
Expose external APIs for media data ingestion and authorization callbacks.
Provide device‑level deduplication for new‑old user distinction.
Require high stability and elastic scaling.
Real‑Time Attribution
Process only attribution logic, decoupled from media.
Push successful attributions to Kafka.
Consume all event data from Kafka, perform device matching, and support elastic scaling.
Data Reporting
Consume from Kafka, offer an independent outbound service with controlled risk.
Low performance requirements because attributed volume is relatively small.
Configuration & Monitoring Platform
Internal visualization platform for media configuration, event management, attribution plan management, monitoring, and querying.
Key Technical Solutions
Data Stability and Storage
Peak traffic occurs around 9:30 am and 3:00 pm, with troughs at 4:00 am. Kafka smooths spikes and valleys, reducing service pressure.
HBase is used for storage because it supports massive real‑time reads and writes, random access by row key, sparse columns (null fields take up no space), and horizontal scaling. Its wide‑row model also keeps related fields in one row, which removes the need for joins.
RowKey design avoids hotspots; keep it ≤16 bytes.
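One way to meet both constraints is to hash the natural key. The sketch below is an illustration only: the source does not describe the actual RowKey layout, and the class name ClickRowKey and the choice of "mediaId:deviceId" as input are assumptions. An MD5 digest is exactly 16 bytes and spreads consecutive device IDs across regions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a fixed 16-byte row key from a media ID and a device ID. */
final class ClickRowKey {
    private ClickRowKey() {}

    /** Returns the raw 16-byte MD5 digest of "mediaId:deviceId". */
    static byte[] of(String mediaId, String deviceId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest((mediaId + ":" + deviceId).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }
}
```

The trade‑off of a hashed key is that range scans over neighbouring devices are no longer possible. That is usually acceptable here because attribution performs point lookups.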
Data Querying
Redis caches recent click data to mitigate latency during peak periods, ensuring that click‑to‑install windows (≈30 s) are respected.
Redis keys:
Precise attribution: att:{mediaId}:{deviceId}
Fuzzy attribution: att:{mediaId}:{IP}:{osVersion}:{model} (2‑minute window)
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

/**
 * Pseudo‑code
 * Click data Redis expiration time (seconds)
 */
@Value("${redisClickDataExp:30}")
private Integer redisClickDataExp;

final String[] IDENTIFIER_FIELDS = {"oaid", "imei", "androidId", "idfa"};

// Precise attribution: one cache entry per device identifier present in the click
for (String field : IDENTIFIER_FIELDS) {
    Object value = attributionDataMap.get(field);
    if (value == null) {
        continue; // skip missing identifiers so unrelated devices never share a "null" key
    }
    String key = "att:" + PromoteClient.USER.getCode() + ":" + field + ":" + value;
    redisTemplate.opsForValue().set(DigestUtils.md5DigestAsHex(key.getBytes(StandardCharsets.UTF_8)),
            JsonUtil.toJSONString(attributionData), redisClickDataExp, TimeUnit.SECONDS);
}

// Fuzzy attribution: IP + OS version + model, kept for a 2‑minute window
String fuzzyKey = "att:" + PromoteClient.USER.getCode() + ":" + attributionData.getProperties().getIp()
        + ":" + attributionData.getProperties().getOsVersion() + ":" + attributionData.getProperties().getModel();
redisTemplate.opsForValue().set(DigestUtils.md5DigestAsHex(fuzzyKey.getBytes(StandardCharsets.UTF_8)),
        JsonUtil.toJSONString(attributionData), 2, TimeUnit.MINUTES);

Real‑Time Attribution Logic
To keep business logic separate from data, the Drools rule engine is used. Each consumer instance loads rules dynamically, so rules can be reconfigured at runtime without code changes.
@Override
public void handle(ConsumerRecord<Object, Object> record) {
    // Deserialize event data
    AttributionData attributionData = JsonUtil.parseJson(String.valueOf(record.value()), AttributionData.class);
    // Execute Drools rules
    StatelessKieSession kieSession = evenTrackRulesService.getKieSession();
    kieSession.execute(attributionData);
}

Rules can be loaded on the fly:
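import org.kie.api.KieServices;
import org.kie.api.builder.KieBuilder;
import org.kie.api.builder.KieFileSystem;
import org.kie.api.builder.KieRepository;
import org.kie.api.builder.Message;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.StatelessKieSession;

@Override
public StatelessKieSession getKieSession() {
    // Create the session lazily and expose the strategy service to rules as global "ss"
    if (kieSession == null) {
        kieSession = kieContainer.newStatelessKieSession();
        kieSession.setGlobal("ss", strategyService);
    }
    return kieSession;
}

private KieContainer loadContainer(Map<String, EventTrackRuleEntity> ruleEntityMap) {
    KieServices ks = KieServices.Factory.get();
    KieRepository kr = ks.getRepository();
    KieFileSystem kfs = ks.newKieFileSystem();
    // Write every configured rule into the in-memory file system as its own .drl file
    for (Map.Entry<String, EventTrackRuleEntity> entry : ruleEntityMap.entrySet()) {
        String drl = entry.getValue().getRule();
        kfs.write("src/main/resources/" + drl.hashCode() + ".drl", drl);
    }
    // Compile all rules and fail fast if any of them does not build
    KieBuilder kb = ks.newKieBuilder(kfs);
    kb.buildAll();
    if (kb.getResults().hasMessages(Message.Level.ERROR)) {
        log.error("rule online failed, msg: {}", kb.getResults().toString());
        throw new BizException(BizErrorCode.DROOL_K_CONTAINER_ERR, kb.getResults().toString());
    }
    return ks.newKieContainer(kr.getDefaultReleaseId());
}

Deduplication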
Device‑level deduplication uses AndroidID, IDFA/IDFV, or OPEN_ID. Business‑level deduplication uses userId or driverId to avoid reliance on device IDs and protect privacy.
Deep deduplication combines device ID with event type to prevent duplicate conversions during the attribution window.
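The idea of keying deduplication on device ID plus event type can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the class DeepDeduplicator, its method name, and the in‑memory map are all hypothetical, and a real deployment would more likely keep this state in a shared store with expiry.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory deduplicator keyed by device ID plus event type. */
final class DeepDeduplicator {
    private final Map<String, Instant> firstSeen = new ConcurrentHashMap<>();
    private final Duration window;

    DeepDeduplicator(Duration window) {
        this.window = window;
    }

    /** Returns true if this (deviceId, eventType) pair has not been counted inside the window. */
    boolean firstInWindow(String deviceId, String eventType, Instant now) {
        String key = deviceId + ":" + eventType;
        boolean[] accepted = {false};
        // compute() is atomic per key, so concurrent duplicates cannot both be accepted
        firstSeen.compute(key, (k, seenAt) -> {
            if (seenAt == null || !now.isBefore(seenAt.plus(window))) {
                accepted[0] = true; // no record yet, or the previous one has aged out
                return now;
            }
            return seenAt; // duplicate inside the window: keep the original timestamp
        });
        return accepted[0];
    }
}
```

With a 7‑day window, a second "install" from the same device is rejected, while a "pay" event from that device is still accepted because it has a different key.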
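// deviceId, userId and uploadId are taken from the incoming event.
// Device-level deduplication: stop if this device already has an install record.
if (Boolean.TRUE.equals(attributionData.isDeviceIdDeduplication())) {
    Long installedId = eventTrackAppInstallService.findInstalledIdByDeviceId(deviceId);
    if (installedId != null) {
        return;
    }
}
// Business-level deduplication: stop if this user already has an install record.
if (Boolean.TRUE.equals(attributionData.isUserIdDeduplication())) {
    Long installedId = eventTrackAppInstallService.findInstalledIdByUserId(userId);
    if (installedId != null) {
        return;
    }
}
// Deep deduplication: stop if this uploadId has already been recorded.
if (Boolean.TRUE.equals(attributionData.isDeepDeduplication())) {
    Long installedId = eventTrackAppInstallService.findInstalledIdByUploadId(uploadId);
    if (installedId != null) {
        return;
    }
}

Attribution Models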
Last‑click: credit goes to the final click.
First‑click: credit goes to the first click.
Average: credit is split evenly across all touches.
Time decay: recent interactions receive more credit.
Position‑based: 40 % each to the first and last clicks, with the remaining 20 % split evenly across the touches in between.
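The models above differ only in how they split one conversion's credit across an ordered list of touches. The following sketch expresses each model as a weight vector that sums to 1. All names are hypothetical, the half‑life form of time decay is one common choice rather than something the source specifies, and the position‑based share per endpoint is left as a parameter because conventions differ.

```java
import java.util.Arrays;

/** Hypothetical helper that splits one conversion's credit across ordered touches. */
final class AttributionCredit {
    private AttributionCredit() {}

    /** Last-click: all credit goes to the final touch. */
    static double[] lastClick(int n) {
        double[] w = new double[requirePositive(n)];
        w[n - 1] = 1.0;
        return w;
    }

    /** First-click: all credit goes to the first touch. */
    static double[] firstClick(int n) {
        double[] w = new double[requirePositive(n)];
        w[0] = 1.0;
        return w;
    }

    /** Average: every touch receives the same share. */
    static double[] average(int n) {
        double[] w = new double[requirePositive(n)];
        Arrays.fill(w, 1.0 / n);
        return w;
    }

    /** Time decay: a touch's weight halves for every halfLifeHours of age before conversion. */
    static double[] timeDecay(double[] ageHours, double halfLifeHours) {
        double[] w = new double[requirePositive(ageHours.length)];
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            w[i] = Math.pow(0.5, ageHours[i] / halfLifeHours);
            sum += w[i];
        }
        for (int i = 0; i < w.length; i++) {
            w[i] /= sum; // normalize so the weights add up to 1
        }
        return w;
    }

    /** Position-based: endpointShare each to first and last, the rest spread over the middle. */
    static double[] positionBased(int n, double endpointShare) {
        requirePositive(n);
        if (n == 1) return new double[] {1.0};
        if (n == 2) return new double[] {0.5, 0.5};
        double[] w = new double[n];
        Arrays.fill(w, (1.0 - 2 * endpointShare) / (n - 2));
        w[0] = endpointShare;
        w[n - 1] = endpointShare;
        return w;
    }

    private static int requirePositive(int n) {
        if (n < 1) throw new IllegalArgumentException("at least one touch is required");
        return n;
    }
}
```

For example, timeDecay(new double[] {48, 24, 0}, 24) yields weights in the ratio 1 : 2 : 4, so the most recent touch receives 4/7 of the credit.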
Mapping vs. Link Attribution
Mapping attribution uses device identifiers (IMEI > OAID > AndroidID > Model+IP) with a window period; link attribution relies only on channel identifiers and cannot support user‑level analysis.
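The identifier priority used by mapping attribution can be sketched as an ordered lookup with a fuzzy fallback. This is a simplified illustration: ClickMatcher is a hypothetical class, the plain Map stands in for the real click store, the key layout is an assumption, and the window period is assumed to be enforced by expiring entries in that store.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical matcher: finds a click by identifier priority, then falls back to model + IP. */
final class ClickMatcher {
    // Identifier fields in descending priority, mirroring IMEI > OAID > AndroidID
    private static final List<String> PRIORITY = List.of("imei", "oaid", "androidId");

    // Lookup key -> click ID; a stand-in for the real click store
    private final Map<String, String> clickIndex;

    ClickMatcher(Map<String, String> clickIndex) {
        this.clickIndex = clickIndex;
    }

    /** Returns the matched click ID, or empty if neither precise nor fuzzy lookup hits. */
    Optional<String> match(String mediaId, Map<String, String> event) {
        for (String field : PRIORITY) {
            String value = event.get(field);
            if (value == null || value.isEmpty()) {
                continue; // identifier not reported by this device
            }
            String clickId = clickIndex.get("att:" + mediaId + ":" + field + ":" + value);
            if (clickId != null) {
                return Optional.of(clickId);
            }
        }
        // Fuzzy fallback: model + IP, the lowest-priority signal
        String model = event.get("model");
        String ip = event.get("ip");
        if (model != null && ip != null) {
            return Optional.ofNullable(clickIndex.get("att:" + mediaId + ":" + ip + ":" + model));
        }
        return Optional.empty();
    }
}
```

Because the loop stops at the first hit, an event that carries both an OAID and an AndroidID is attributed to the click stored under the OAID.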
Implementation Highlights
Media configuration via UI reduces integration time and eliminates code changes.
Rule‑engine based event filtering enables dynamic selection of millions of event points.
Configurable request bodies use a tree‑structured model to map arbitrary media schemas.
Front‑end visualizations built with Vue.js provide reporting dashboards.
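The tree‑structured request body mentioned above could look roughly like the sketch below. The source does not show the actual model, so BodyNode, its factory methods, and the field names in the usage note are all hypothetical. Each leaf copies one attribution field, each object node nests its children, and the rendered result is a nested map that any JSON library can serialize.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tree node describing one field of a media callback body. */
final class BodyNode {
    final String name;             // field name in the outgoing body
    final String sourceField;      // attribution field to copy (leaf nodes only)
    final List<BodyNode> children; // nested fields (object nodes only)

    private BodyNode(String name, String sourceField, List<BodyNode> children) {
        this.name = name;
        this.sourceField = sourceField;
        this.children = children;
    }

    static BodyNode leaf(String name, String sourceField) {
        return new BodyNode(name, sourceField, List.of());
    }

    static BodyNode object(String name, BodyNode... children) {
        return new BodyNode(name, null, List.of(children));
    }

    /** Renders a leaf to its value and an object node to a nested, insertion-ordered map. */
    Object render(Map<String, Object> attribution) {
        if (sourceField != null) {
            return attribution.get(sourceField);
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (BodyNode child : children) {
            out.put(child.name, child.render(attribution));
        }
        return out;
    }
}
```

A media schema that expects {"event_type": ..., "device": {"oaid": ..., "ip": ...}} would be configured as an object node with one leaf and one nested object. Supporting a different schema then means changing the stored tree, not the code.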
Future Directions
Enhanced data analysis for media effectiveness and audience profiling.
Anti‑fraud mechanisms with deeper event correlation.
Full‑chain data monitoring and alerting.
Integration with major market APIs for campaign creation, material management, and A/B testing.
