Multi-modal Multi-query Search Session Modeling with Heterogeneous Graph Neural Networks
The paper introduces MUVCOG, a heterogeneous graph neural network that models multi‑modal, multi‑query search sessions on Mobile Taobao. By jointly learning attention‑based global and hierarchical local views through contrastive pre‑training, it yields universal session embeddings that markedly improve CTR prediction, query recommendation, and intent classification.
Abstract: Modeling contextual information in search sessions is crucial for e‑commerce. Users on Mobile Taobao switch among text queries, photo search, and similar‑item search, forming multi‑modal multi‑query (MM) sessions. Existing work, however, models only textual queries. This paper proposes a heterogeneous graph neural network (HGNN) framework, MUVCOG, that learns representations of MM sessions via a multi‑view contrastive pre‑training scheme, capturing intra‑query, inter‑query, and cross‑modal interactions. Experiments show significant gains on downstream tasks such as personalized click‑through‑rate (CTR) prediction, query recommendation, and query intent classification.
Background: Mobile Taobao supports text search, photo search, and similar‑item search. Users frequently alternate among these modalities, creating MM sessions where queries contain both textual and visual information. Statistical analysis reveals that MM sessions contain more queries and exhibit richer user intent than pure text sessions.
Method: An MM session is modeled as a heterogeneous directed graph where nodes represent words, images, and queries of different types. Two complementary views are designed:
Attention Global View (AGV): Performs modality‑wise attention aggregation followed by cross‑modal aggregation to obtain a global session representation.
Hierarchical Local View (HLV): First aggregates nodes within each query, then aggregates across queries, yielding a hierarchical representation.
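The two views differ only in aggregation order: AGV groups nodes by modality before fusing across modalities, while HLV groups nodes by query before fusing across queries. A minimal sketch of both, using simple dot-product attention over toy embeddings (the function names, the `ctx` context vector, and the data layout are illustrative, not the paper's actual architecture):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(ctx, node_vecs):
    """Attention-weighted aggregation of node vectors w.r.t. a context vector."""
    weights = softmax(node_vecs @ ctx)   # (n,) attention weights
    return weights @ node_vecs           # (d,) aggregated vector

def attention_global_view(session, ctx):
    """AGV sketch: aggregate within each modality, then across modalities.
    `session` maps a modality name ("word", "image", ...) to node embeddings."""
    modality_vecs = [attend(ctx, np.stack(nodes)) for nodes in session.values()]
    return attend(ctx, np.stack(modality_vecs))

def hierarchical_local_view(queries, ctx):
    """HLV sketch: aggregate nodes within each query, then across queries.
    `queries` is a list of per-query node-embedding lists."""
    query_vecs = [attend(ctx, np.stack(nodes)) for nodes in queries]
    return attend(ctx, np.stack(query_vecs))
```

Both functions return one session-level vector of the same dimension as the node embeddings, so the two views can be compared or mixed downstream.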
Both views are encoded by a graph neural network. For contrastive pre‑training, positive samples are generated by masking a random query, while hard negatives are selected as sessions with no overlapping clicked items but highest similarity. The two views are mixed during contrastive learning, and a binary classifier predicts whether a pair of sessions is positive or negative using binary cross‑entropy loss.
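The three ingredients of the pre-training scheme (query masking for positives, similarity-based hard negatives without click overlap, and a BCE-trained pair classifier) can be sketched as follows; the bilinear scorer and all helper names are assumptions for illustration, not the paper's exact classifier:

```python
import random
import numpy as np

def mask_random_query(session_queries):
    """Positive view: drop one randomly chosen query from the session."""
    i = random.randrange(len(session_queries))
    return session_queries[:i] + session_queries[i + 1:]

def hard_negative(anchor_emb, anchor_clicks, pool):
    """Hard negative: the most similar session with no clicked-item overlap.
    `pool` is a list of (session_embedding, clicked_item_set) pairs."""
    best, best_sim = None, -np.inf
    for emb, clicks in pool:
        if anchor_clicks & clicks:
            continue  # skip sessions sharing any clicked item
        sim = emb @ anchor_emb / (np.linalg.norm(emb) * np.linalg.norm(anchor_emb))
        if sim > best_sim:
            best, best_sim = emb, sim
    return best

def bce_pair_loss(emb_a, emb_b, label, w):
    """Binary cross-entropy on a session pair, scored by a bilinear form."""
    score = emb_a @ w @ emb_b
    p = 1.0 / (1.0 + np.exp(-score))      # probability the pair is positive
    eps = 1e-9
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))
```

In training, each anchor session would be paired with its masked positive (label 1) and a mined hard negative (label 0), and the loss backpropagated through the graph encoder.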
Experiments: A seven‑day Mobile Taobao log (clothing, beauty, electronics) is used, with sessions split by a 30‑minute inactivity threshold. The learned session embeddings are evaluated on:
CTR prediction (NDCG@10, HR@10, MRR@10) – MUVCOG‑M (multi‑modal) consistently outperforms baselines including LSTM and Transformer.
Query recommendation – MUVCOG improves all metrics over sequential models.
Query intent classification – MUVCOG‑M surpasses word2vec and BERT baselines.
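For reference, the three ranking metrics reported above reduce to simple formulas when each impression has a single positive item; a minimal sketch (the single-positive simplification is an assumption for clarity):

```python
import math

def rank_metrics(ranked_items, clicked, k=10):
    """HR@k, NDCG@k, and MRR@k for one positive item in a ranked list."""
    topk = ranked_items[:k]
    if clicked not in topk:
        return {"HR": 0.0, "NDCG": 0.0, "MRR": 0.0}
    rank = topk.index(clicked) + 1        # 1-based rank of the positive
    return {"HR": 1.0,
            "NDCG": 1.0 / math.log2(rank + 1),
            "MRR": 1.0 / rank}
```

Per-session values are averaged over the evaluation set to produce the reported NDCG@10, HR@10, and MRR@10.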
Results demonstrate that heterogeneous graph representations capture cross‑modal relationships better than pure sequential models, and that incorporating visual queries yields further improvements.
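The 30-minute inactivity rule used to split the logs into sessions can be sketched as follows (the event format, a time-sorted list of timestamp/event pairs, is an assumption for illustration):

```python
def split_sessions(events, gap_seconds=1800):
    """Split a time-sorted event stream into sessions wherever the gap
    between consecutive events exceeds `gap_seconds` (30 min by default)."""
    sessions, current = [], []
    last_ts = None
    for ts, event in events:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)  # inactivity gap: close current session
            current = []
        current.append(event)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```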
Conclusion: The proposed MUVCOG framework effectively pre‑trains heterogeneous graph representations for MM search sessions, providing universal embeddings that boost multiple e‑commerce downstream tasks without task‑specific fine‑tuning.
Alimama Tech