How Gaode’s Spacetime‑GR Model Boosts POI Recommendation with AI‑Powered SFT and DPO

Gaode transforms its map app into a dynamic, AI‑driven “living map” by fine‑tuning the large Spacetime‑GR model through embedding‑based and generative ranking SFT, DPO alignment, and multimodal augmentation, achieving significant offline CTR‑AUC improvements and online CTR gains in POI recommendation.

Amap Tech

Background

Gaode (Amap) aims to turn its static base map into a dynamic, AI‑driven “living map”. The pre‑trained Spacetime‑GR model encodes massive, de‑identified app usage behavior and learns a generic spatio‑temporal representation. The goal is to adapt this foundation model to downstream POI recommendation tasks.

Challenges

Applying large models to search and recommendation requires balancing their heavy computation against the low‑latency responses demanded by online services.

Proposed Post‑Training Methods

Embedding‑based Ranking SFT (E‑SFT): fine‑tune Spacetime‑GR as a dual‑tower encoder to produce user and POI embeddings (E_u, E_p). The embeddings are projected to a low‑dimensional space and optimized with an InfoNCE contrastive loss, then used as additional features for the existing ranking model.
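The in‑batch InfoNCE objective can be sketched as follows. This is a minimal NumPy illustration, not Gaode's implementation: the function name `info_nce`, the temperature value, and the in‑batch negative sampling scheme are all assumptions for exposition.

```python
import numpy as np

def info_nce(user_emb, poi_emb, temperature=0.1):
    """In-batch InfoNCE: row i of poi_emb is the clicked (positive) POI
    for user i; the other POIs in the batch act as negatives."""
    # L2-normalise so dot products are cosine similarities
    u = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    p = poi_emb / np.linalg.norm(poi_emb, axis=1, keepdims=True)
    logits = u @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal; minimise their negative log-likelihood
    return -np.mean(np.diag(log_probs))
```

Pushing each user embedding toward its clicked POI and away from the other POIs in the batch is what makes E_u and E_p useful as ranking features.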

Generative Ranking SFT (G‑SFT): fine‑tune the model to directly output a POI relevance score. Candidate POI tokens are concatenated to the user sequence; attention masks are modified so that each POI only sees the user context and its own token, making the score invariant to POI order. Training uses a cross‑entropy loss on the POI logits.
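The modified attention mask is the key detail here. A minimal sketch of how such a mask could be constructed (the helper `gsft_attention_mask` and the causal masking of the user sequence are assumptions; the source only specifies that each candidate attends to the user context and itself):

```python
import numpy as np

def gsft_attention_mask(num_user_tokens, num_pois):
    """Boolean mask (True = may attend). User tokens attend causally
    among themselves; each appended candidate POI token attends to the
    full user context plus its own token, but never to other candidates,
    so each relevance score is invariant to candidate order."""
    n = num_user_tokens + num_pois
    mask = np.zeros((n, n), dtype=bool)
    # causal mask over the user-behaviour sequence
    for i in range(num_user_tokens):
        mask[i, : i + 1] = True
    # candidate rows: full user context + own token only
    for j in range(num_pois):
        row = num_user_tokens + j
        mask[row, :num_user_tokens] = True
        mask[row, row] = True
    return mask
```

Because no candidate can see another, scoring N candidates in one forward pass gives the same logits as scoring them one at a time.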

DPO Alignment: apply Direct Preference Optimization to jointly model recall and ranking. Clicked POIs are treated as positive samples, non‑clicked exposures as negatives. The model predicts the probability of outputting each POI and is trained to raise that probability for clicked POIs relative to non‑clicked ones.
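The standard DPO pairwise loss over such (clicked, non‑clicked) pairs can be written in a few lines. This is a generic sketch, not Gaode's training code; the function signature and the value of beta are assumptions.

```python
import numpy as np

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss on (clicked, non-clicked-exposure) POI pairs.
    logp_* are the policy model's log-probabilities of outputting the
    POI; ref_logp_* come from the frozen reference (SFT) model."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log sigmoid(margin), computed stably via log(1 + exp(-margin))
    return np.mean(np.logaddexp(0.0, -margin))
```

Minimising this loss widens the policy's log‑probability gap between clicked and non‑clicked POIs while the reference‑model terms keep the policy from drifting too far from the SFT model.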

Multimodal Augmentation: enrich POI side‑information with textual fields (name, address, tags, comments) and images. These modalities are encoded by a pretrained multimodal LLM, pooled, and fused with existing geographic and type embeddings before the dual‑tower encoder.

Model Architecture

Both E‑SFT and G‑SFT build on the same Spacetime‑GR backbone (excluding the final language‑model head). In E‑SFT, user features and candidate POI features are passed through the backbone's dual towers to obtain hidden states H_u and H_p, and a learnable projection maps them to embeddings E_u and E_p; cosine similarity between E_u and E_p is maximized for positive pairs and minimized for negatives via the InfoNCE loss. In G‑SFT, POI tokens are instead appended to the user sequence, attention masks are adjusted, and a cross‑entropy loss is applied to the POI ranking logits. DPO constructs positive/negative pairs from click logs and optimizes a preference‑based loss that encourages higher output probability for clicked items.

Dual‑tower architecture

DPO training framework

Multimodal Augmentation

During post‑training, POI textual information (name, address, tags, comments) is concatenated into a text sequence, and associated images are processed by a multimodal LLM. The resulting hidden states are average‑pooled and combined with geographic and type encodings, providing richer POI representations for both E‑SFT and G‑SFT.
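The fusion step described above amounts to pooling the LLM hidden states per modality and concatenating them with the structured encodings. A minimal NumPy sketch, assuming concatenation as the fusion operator and the hypothetical helper name `fuse_poi_features` (the article does not specify the exact fusion function):

```python
import numpy as np

def fuse_poi_features(text_hidden, image_hidden, geo_emb, type_emb):
    """Average-pool the multimodal LLM hidden states for the POI's text
    tokens (name, address, tags, comments) and image tokens, then fuse
    them with the geographic and type embeddings to form the enriched
    POI representation fed to the dual-tower encoder."""
    text_vec = text_hidden.mean(axis=0)    # (T, d) -> (d,)
    image_vec = image_hidden.mean(axis=0)  # (I, d) -> (d,)
    return np.concatenate([text_vec, image_vec, geo_emb, type_emb])
```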

Offline Experiments

CTR‑AUC is used as the evaluation metric. Both E‑SFT and G‑SFT improve over the baseline online ranking model; G‑SFT yields a larger gain. Combining the two methods achieves the best performance, demonstrating their complementarity.
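For reference, CTR‑AUC is the standard pairwise AUC computed over click labels: the probability that a randomly chosen clicked impression is scored above a randomly chosen non‑clicked one. A small self‑contained sketch (the O(P·N) pairwise form shown here is for clarity, not efficiency):

```python
import numpy as np

def ctr_auc(scores, clicks):
    """AUC over click labels: fraction of (clicked, non-clicked) pairs
    where the clicked impression gets the higher score; ties count 0.5."""
    pos = scores[clicks == 1]
    neg = scores[clicks == 0]
    diffs = pos[:, None] - neg[None, :]          # all pairwise gaps
    wins = (diffs > 0).sum() + 0.5 * (diffs == 0).sum()
    return wins / (len(pos) * len(neg))
```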

Offline CTR‑AUC results

Online Experiments

Applying the fine‑tuned embeddings and scores to Gaode’s POI recommendation pipeline yields:

+5.24 % PV CTR and +3.88 % UV CTR for the ranking task (E‑SFT + G‑SFT).

+15.4 % PV CTR and +7.1 % UV CTR for the end‑to‑end DPO‑enhanced model.

Online CTR improvements

Case Study

A user with a historical preference for hot springs receives a top‑ranked restaurant POI that is categorized as “chicken dish”. Multimodal text and image signals reveal a strong semantic link to hot springs, allowing the model to rank this POI highly despite its nominal category.

Case study illustration

Conclusion

Spacetime‑GR can be effectively adapted to downstream recommendation tasks via embedding‑based SFT, generative SFT, and DPO alignment. Incorporating multimodal POI information further enriches representations, leading to consistent offline and online performance gains.

Tags: multimodal, SFT, AI recommendation, DPO, Spacetime-GR
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.