How Gaode’s Spacetime‑GR Model Boosts POI Recommendation with AI‑Powered SFT and DPO
Gaode adapts its pre‑trained Spacetime‑GR model to POI recommendation through embedding‑based and generative ranking SFT, DPO alignment, and multimodal augmentation, turning its map app into a dynamic, AI‑driven “living map” and achieving significant offline CTR‑AUC improvements and online CTR gains.
Background
Gaode (Amap) aims to turn its static base map into a dynamic, AI‑driven “living map”. The pre‑trained Spacetime‑GR model encodes massive, de‑identified app usage behavior and learns a generic spatio‑temporal representation. The goal is to adapt this foundation model to downstream POI recommendation tasks.
Challenges
Applying large models to search and recommendation requires balancing their heavy computation against the low‑latency responses demanded by online services.
Proposed Post‑Training Methods
Embedding‑based Ranking SFT (E‑SFT): fine‑tune Spacetime‑GR as a dual‑tower encoder that produces user and POI embeddings (E_u, E_p). The embeddings are projected into a low‑dimensional space and optimized with an InfoNCE contrastive loss, then served as additional features to the existing ranking model.
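The contrastive objective can be sketched as follows. This is a minimal illustration assuming in‑batch negatives and a fixed temperature; the post does not specify the negative‑sampling scheme or hyperparameters:

```python
import numpy as np

def info_nce_loss(E_u, E_p, temperature=0.07):
    """InfoNCE over a batch of (user, positive-POI) embedding pairs.

    Row i of E_u is paired with row i of E_p; all other rows in the
    batch act as in-batch negatives (an assumption for illustration).
    """
    # Normalize so dot products become cosine similarities.
    E_u = E_u / np.linalg.norm(E_u, axis=1, keepdims=True)
    E_p = E_p / np.linalg.norm(E_p, axis=1, keepdims=True)
    logits = E_u @ E_p.T / temperature          # (B, B) similarity matrix
    # Log-softmax over each row's candidates, numerically stabilized.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return -np.diag(log_probs).mean()
```

Pulling positives onto the diagonal of the similarity matrix while pushing down all off‑diagonal entries is what lets the resulting E_u, E_p be compared with a cheap dot product at serving time.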
Generative Ranking SFT (G‑SFT): fine‑tune the model to output a POI relevance score directly. Candidate POI tokens are concatenated to the user sequence, and attention masks are modified so that each candidate attends only to the user context and its own token, making the score invariant to POI order. Training uses a cross‑entropy loss on the POI logits.
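The order invariance comes entirely from the attention mask. A minimal sketch of such a mask follows (a boolean matrix where True means "query may attend to key"; the causal masking among user tokens is an assumption typical of decoder‑style backbones, not stated in the post):

```python
import numpy as np

def build_gsft_mask(user_len, num_pois):
    """Attention mask for G-SFT scoring: True = query may attend to key.

    User tokens attend causally among themselves (assumed, matching a
    decoder-style backbone). Each candidate POI token attends to the
    full user context and to itself only, so no POI sees another POI
    and the resulting scores are invariant to candidate order.
    """
    total = user_len + num_pois
    mask = np.zeros((total, total), dtype=bool)
    mask[:user_len, :user_len] = np.tril(np.ones((user_len, user_len), dtype=bool))
    for i in range(num_pois):
        row = user_len + i
        mask[row, :user_len] = True   # sees the whole user sequence
        mask[row, row] = True         # sees only its own POI token
    return mask
```

Because each POI row is identical up to its own position, shuffling the candidates permutes rows without changing any individual score, which is what makes batch scoring of many candidates in one forward pass safe.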
DPO Alignment: apply Direct Preference Optimization to model recall and ranking jointly. Clicked POIs serve as positive samples and non‑clicked exposures as negatives; the model is trained to assign a higher output probability to each clicked POI than to its non‑clicked counterparts.
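In standard DPO, each clicked/non‑clicked pair is scored against a frozen reference policy. A minimal numpy sketch of that loss (β and the exact pairing scheme are illustrative assumptions; the post does not give the loss form):

```python
import numpy as np

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss on log-probabilities of clicked (pos) vs.
    non-clicked (neg) POIs under the policy and a frozen reference.
    All inputs are arrays of per-pair log-probabilities.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return np.logaddexp(0.0, -margin).mean()
```

The loss only depends on how much more the policy prefers the clicked POI than the reference does, which keeps the fine‑tuned model from drifting far from the pre‑trained Spacetime‑GR distribution.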
Multimodal Augmentation: enrich POI side information with textual fields (name, address, tags, comments) and images. These modalities are encoded by a pretrained multimodal LLM, pooled, and fused with the existing geographic and type embeddings before entering the dual‑tower encoder.
Model Architecture
Both E‑SFT and G‑SFT share a dual‑tower encoder where user features and candidate POI features are passed through Spacetime‑GR (excluding the final language‑model head) to obtain hidden states H_u and H_p. A learnable projection maps them to embeddings E_u and E_p. Cosine similarity between E_u and E_p is maximized for positive pairs and minimized for negatives via the InfoNCE loss. For G‑SFT, POI tokens are appended to the user sequence, attention masks are adjusted, and a cross‑entropy loss is applied to the POI ranking logits. DPO constructs positive/negative pairs from click logs and optimizes a preference‑based loss that encourages higher output probability for clicked items.
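For the G‑SFT branch, the cross‑entropy on POI logits can be read as a per‑candidate binary click objective. A minimal sketch (whether the loss is binary per POI or a softmax over the candidate set is not specified, so the binary form here is an assumption):

```python
import numpy as np

def poi_ranking_bce(logits, clicks):
    """Binary cross-entropy between sigmoid(logit) and the click label
    for each candidate POI (assumed per-candidate formulation).

    Uses the identity -[y*log(p) + (1-y)*log(1-p)] = log(1+e^x) - x*y
    with p = sigmoid(x), computed stably via logaddexp.
    """
    return np.mean(np.logaddexp(0.0, logits) - clicks * logits)
```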
Multimodal Augmentation
During post‑training, POI textual information (name, address, tags, comments) is concatenated into a text sequence, and associated images are processed by a multimodal LLM. The resulting hidden states are average‑pooled and combined with geographic and type encodings, providing richer POI representations for both E‑SFT and G‑SFT.
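The pooling‑and‑fusion step can be sketched as follows. The concatenate‑then‑project fusion and the weight matrix W_fuse are illustrative assumptions; the post only states that pooled multimodal hidden states are combined with the geographic and type encodings:

```python
import numpy as np

def fuse_poi_features(mm_hidden, geo_emb, type_emb, W_fuse):
    """Build an enriched POI representation.

    mm_hidden : (T, d_mm) hidden states from the multimodal LLM over
                the POI's text tokens and image patches.
    geo_emb, type_emb : existing geographic and type embeddings.
    W_fuse    : assumed learned fusion matrix,
                shape (d_mm + d_geo + d_type, d_out).
    """
    mm_pooled = mm_hidden.mean(axis=0)  # average-pool over tokens
    return np.concatenate([mm_pooled, geo_emb, type_emb]) @ W_fuse
```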
Offline Experiments
CTR‑AUC is used as the evaluation metric. Both E‑SFT and G‑SFT improve over the baseline online ranking model; G‑SFT yields a larger gain. Combining the two methods achieves the best performance, demonstrating their complementarity.
Online Experiments
Applying the fine‑tuned embeddings and scores to Gaode’s POI recommendation pipeline yields:
+5.24% PV CTR and +3.88% UV CTR for the ranking task (E‑SFT + G‑SFT).
+15.4% PV CTR and +7.1% UV CTR for the end‑to‑end DPO‑enhanced model.
Case Study
A user with a historical preference for hot springs receives a top‑ranked restaurant POI that is categorized as “chicken dish”. Multimodal text and image signals reveal a strong semantic link to hot springs, allowing the model to rank this POI highly despite its nominal category.
Conclusion
Spacetime‑GR can be effectively adapted to downstream recommendation tasks via embedding‑based SFT, generative SFT, and DPO alignment. Incorporating multimodal POI information further enriches representations, leading to consistent offline and online performance gains.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.
