How FIN Boosts CTR in Online Food Ordering: A Spatial‑Temporal Modeling Breakthrough
The paper introduces FIN (Fragment and Integrate Network), a novel spatial‑temporal model that extracts multiple sub‑sequences from ultra‑long user behavior logs, applies simplified and multi‑head attention, and fuses them with physically meaningful set operations, achieving up to 5.7% CTR lift and 7.3% RPM improvement in real‑world food‑delivery advertising.
Introduction
In food‑delivery scenarios such as Ele.me and Meituan, time‑location (spatial‑temporal) signals are crucial for click‑through‑rate (CTR) prediction. Existing models like BASM/StEN handle limited‑length sequences and ignore multi‑dimensional spatial‑temporal relations, while long‑sequence models (SIM/ETA) lack spatial‑temporal awareness, limiting their effectiveness for location‑based services.
This work proposes FIN (Fragment and Integrate Network), a spatial‑temporal modeling framework designed for ultra‑long behavior sequences.
Problem & Challenges
CTR depends heavily on spatial‑temporal context: farther distance reduces order probability, and user preferences vary across meal periods. Users generate massive click histories (over 22% have >1000 clicks per year). Prior methods either truncate sequences (≈100 events) or fail to exploit rich spatial‑temporal cues, leading to information loss.
Method Overview
FIN consists of two core networks:
Fragment Network (FN): extracts four sub‑sequences from the ultra‑long user log based on distinct spatial‑temporal dimensions—location, meal‑time, recent short‑term, and full‑time‑space. Each sub‑sequence (still hundreds of events) is modeled with a simplified attention mechanism that balances computational cost and information gain, alongside conventional multi‑head attention for the most recent items.
Integrate Network (IN): performs element‑wise set operations (union, intersection, difference) on the FN outputs to create a unified sequence with explicit physical meaning, then applies multi‑head attention to capture the fused representation.
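The two attention flavors above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: here "simplified attention" is assumed to mean plain dot‑product scoring with no softmax or per‑head projections (linear in sequence length), while the recent items get standard softmax multi‑head target attention. Projection matrices are omitted for brevity.

```python
import numpy as np

def simplified_attention(query, keys):
    """Plausible 'simplified attention' over a long sub-sequence: score each
    behavior by a plain dot product with the query and return the weighted sum.
    No softmax, no projections -- cost stays linear in sequence length.
    (Sketch only; the paper's exact form may differ.)"""
    scores = keys @ query          # (L,) one score per behavior
    return scores @ keys           # weighted sum of behaviors -> (d,)

def multi_head_attention(query, keys, num_heads=4):
    """Standard multi-head target attention for the recent sub-sequence.
    Each head attends over its own slice of the embedding dimension."""
    d = query.shape[0]
    hd = d // num_heads
    out = []
    for h in range(num_heads):
        q = query[h * hd:(h + 1) * hd]
        k = keys[:, h * hd:(h + 1) * hd]
        w = np.exp(k @ q / np.sqrt(hd))   # scaled dot-product scores
        w /= w.sum()                      # softmax over the sequence
        out.append(w @ k)                 # per-head weighted sum
    return np.concatenate(out)
```

In production, the simplified variant is what makes it affordable to attend over sub‑sequences that still contain hundreds of events each.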
Fragment Network Details
Location Sub‑sequence (Geohash‑block Modeling): each latitude‑longitude pair is converted to a 6‑digit Geohash string. Queries retrieve behaviors from the matching Geohash block, and a simplified attention mechanism processes the long list.
Meal‑time Sub‑sequence: the day is divided into five meal periods (breakfast, lunch, afternoon tea, dinner, night snack) and further split into minute‑level buckets. Queries retrieve behaviors belonging to the current bucket.
Recent Short‑term Sub‑sequence: the few dozen most recent actions within the last month are extracted and modeled with multi‑head attention.
Full‑time‑space Sub‑sequence: to capture long‑term preferences outside the current location or period, the full two‑year history is de‑duplicated by shop, retaining occurrence counts as side information. The de‑duplicated sequence is truncated to the latest 100 events and modeled with multi‑head target attention.
All sub‑sequences share side information such as item ID, category, brand, geohash, time interval, weekday/weekend flag, minute bucket, delivery distance, and statistical features (7/14/30‑day exposure, clicks, orders, rating, dwell time).
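The retrieval step that produces these four sub‑sequences can be sketched as below. The record fields (`shop`, `geohash`, `period`, `ts`), the 30‑day recency window, and the assumption that the log is sorted by timestamp are all illustrative, not the paper's exact schema:

```python
from collections import Counter

def fragment(behaviors, query_geohash, query_period, now,
             recent_window_days=30, full_max_len=100):
    """Sketch of the Fragment Network's retrieval step: split one ultra-long
    behavior log into the four sub-sequences. Assumes `behaviors` is sorted
    by ascending timestamp; field names are hypothetical."""
    # Location: behaviors whose 6-digit Geohash matches the query context.
    location_seq = [b for b in behaviors if b["geohash"] == query_geohash]
    # Meal-time: behaviors falling in the current meal period / bucket.
    mealtime_seq = [b for b in behaviors if b["period"] == query_period]
    # Recent short-term: actions within the last month.
    cutoff = now - recent_window_days * 86400
    recent_seq = [b for b in behaviors if b["ts"] >= cutoff]
    # Full-time-space: de-duplicate by shop, keep occurrence counts as side
    # information, and truncate to the latest `full_max_len` events.
    counts = Counter(b["shop"] for b in behaviors)
    latest = {b["shop"]: b for b in behaviors}   # last record per shop
    full_seq = sorted(
        ({**latest[s], "count": counts[s]} for s in latest),
        key=lambda b: b["ts"], reverse=True)[:full_max_len]
    return {"location": location_seq, "mealtime": mealtime_seq,
            "recent": recent_seq, "full": full_seq}
```

Each returned sub‑sequence would then be enriched with the shared side information listed above before entering its attention module.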
Integrate Network
The four FN sub‑sequences are aligned in length (via truncation) and dimension (via linear projection). Element‑wise set operations generate a unified sequence:
Intersection of meal‑time and location captures preferences for a specific place and period.
Difference between full‑time‑space and location captures preferences from other locations.
Union of full‑time‑space and recent short‑term captures periodic interests.
Corresponding set operations are also applied to the query item’s side information, producing a unified query representation. Multi‑head attention then processes the fused sequence, and the resulting vectors from FN and IN are concatenated and fed to a top‑level MLP for final CTR estimation.
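One plausible element‑wise realization of these set operations is sketched below. The choice of max for union, min for intersection, and clipped subtraction for difference is an assumption for illustration; the paper does not prescribe these exact operators here:

```python
import numpy as np

def set_union(a, b):
    """Element-wise max as a soft union (assumed operator)."""
    return np.maximum(a, b)

def set_intersection(a, b):
    """Element-wise min as a soft intersection (assumed operator)."""
    return np.minimum(a, b)

def set_difference(a, b):
    """Clipped element-wise subtraction as a soft difference (assumed)."""
    return np.maximum(a - b, 0.0)

def integrate(loc, meal, recent, full):
    """Sketch of the Integrate Network's fusion. All inputs are assumed
    already aligned in length (truncation) and dimension (projection),
    shape (L, d). The three fused views are stacked into one unified
    sequence for the downstream multi-head attention."""
    place_and_period = set_intersection(meal, loc)  # this place & period
    other_locations  = set_difference(full, loc)    # other locations
    periodic         = set_union(full, recent)      # periodic interests
    return np.concatenate([place_and_period, other_locations, periodic],
                          axis=0)
```

The same three operators would be applied to the query item's side information to build the unified query that attends over this fused sequence.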
Experiments
Datasets
Amazon Books: 197,226 samples, 98,613 users, 130,880 items, with category and price buckets as spatial‑temporal proxies.
Google Local (New York): includes geohash‑encoded coordinates and category information.
Industrial (Ele.me): 10.7 billion samples, 70.3 M users, 2.8 M shops, with 6‑digit geohash and minute‑level time buckets; >55% of users have sequence length >500.
Baselines
DIN – short‑term attention model.
Avg‑Pooling Long DIN – average‑pooled long‑term representation concatenated with DIN.
SIM (hard) – LSH‑based long‑sequence model.
ETA – end‑to‑end long‑sequence model.
StEN – spatial‑temporal model using location and period.
FIN – the proposed fragment‑and‑integrate architecture.
Setup: Adam optimizer (lr = 0.001), 4‑head multi‑head attention, MLP layers 200×80×2, embedding size = 4, AUC as primary metric.
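For concreteness, the reported top‑level MLP shape can be sketched as a forward pass. The weights here are random placeholders (the real model is trained with Adam at lr = 0.001); ReLU activations between layers and a softmax on the 2‑way output are assumptions consistent with common CTR setups:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, dims=(200, 80, 2)):
    """Forward pass through an MLP with the reported 200 x 80 x 2 layout.
    Random placeholder weights; ReLU between hidden layers, softmax on top
    so the two outputs read as [P(non-click), P(click)]."""
    for i, d in enumerate(dims):
        w = rng.standard_normal((x.shape[-1], d)) * 0.01
        x = x @ w
        if i < len(dims) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers
    e = np.exp(x - x.max())          # numerically stable softmax
    return e / e.sum()
```

The MLP input would be the concatenation of the FN and IN attention outputs together with the other context features.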
Public Dataset Results
FIN outperforms all baselines, achieving the highest AUC. The gain over StEN demonstrates the benefit of additional sub‑sequences and physically meaningful fusion.
Ablation Study
Removing any FN component or the set‑operation fusion degrades performance, confirming the importance of multi‑dimensional sub‑sequence extraction, simplified attention, and IN‑level integration.
Industrial Dataset Results
FIN surpasses SIM, ETA, and StEN, delivering an AUC improvement of 0.0066 over the production baseline (SIM).
Online A/B Test
From June 17 to August 1, 2022, FIN achieved +5.7% CTR and +7.3% RPM compared with the previous online model (SIM). FIN has been fully deployed since August 2022.
Case Study
Analysis of click logs shows FIN significantly boosts matching efficiency and exposure for time‑sensitive categories (e.g., tea‑time drinks, night‑snack convenience stores) while maintaining stable performance for other periods, confirming its ability to capture current spatial‑temporal user interests.
Deployment Practices
Ele.me’s ad serving handles >10,000 QPS within a latency budget of tens of milliseconds. Real‑time and offline behavior streams are decoupled: real‑time sub‑sequences are refreshed every second, offline ones daily. Deep kernel fusion and CUDA‑Graph optimizations accelerate multi‑head attention in production.
Conclusion
FIN introduces a fragment‑and‑integrate paradigm that splits ultra‑long user histories into multiple spatial‑temporal sub‑sequences, models each with efficient attention, and fuses them via set operations with clear physical meaning. Deployed in Ele.me’s advertising system, FIN yields consistent CTR and revenue gains and provides a new direction for long‑sequence, multi‑dimensional user modeling.