Multi-modal Multi-query Search Session Modeling with Heterogeneous Graph Neural Networks
The paper introduces MUVCOG, a heterogeneous graph neural network that models multi‑modal, multi‑query search sessions on Mobile Taobao. By jointly learning attention‑based global and hierarchical local views through contrastive pre‑training, it yields universal session embeddings that markedly improve CTR prediction, query recommendation, and intent classification.
Abstract: Modeling contextual information in search sessions is crucial for e‑commerce. Users on Mobile Taobao switch among text queries, photo search, and similar‑item search, forming multi‑modal multi‑query (MM) sessions. Existing work, however, models only textual queries. This paper proposes a heterogeneous graph neural network (HGNN) framework, MUVCOG, that learns representations of MM sessions via a multi‑view contrastive pre‑training scheme, capturing intra‑query, inter‑query, and cross‑modal interactions. Experiments show significant gains on downstream tasks such as personalized click‑through‑rate (CTR) prediction, query recommendation, and query intent classification.
Background: Mobile Taobao supports text search, photo search, and similar‑item search. Users frequently alternate among these modalities, creating MM sessions where queries contain both textual and visual information. Statistical analysis reveals that MM sessions contain more queries and exhibit richer user intent than pure text sessions.
Method: An MM session is modeled as a heterogeneous directed graph where nodes represent words, images, and queries of different types. Two complementary views are designed:
Attention Global View (AGV): Performs modality‑wise attention aggregation followed by cross‑modal aggregation to obtain a global session representation.
Hierarchical Local View (HLV): First aggregates nodes within each query, then aggregates across queries, yielding a hierarchical representation.
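The two views differ only in aggregation order: AGV groups nodes by modality before fusing across modalities, while HLV groups nodes by query before fusing across queries. A minimal sketch of both, using simple dot-product attention over toy embeddings (the function names, the `ctx` context vector, and the data layout are illustrative, not the paper's actual architecture):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(ctx, node_vecs):
    """Attention-weighted aggregation of node vectors w.r.t. a context vector."""
    weights = softmax(node_vecs @ ctx)   # (n,) attention weights
    return weights @ node_vecs           # (d,) aggregated vector

def attention_global_view(session, ctx):
    """AGV sketch: aggregate within each modality, then across modalities.
    `session` maps a modality name ("word", "image", ...) to node embeddings."""
    modality_vecs = [attend(ctx, np.stack(nodes)) for nodes in session.values()]
    return attend(ctx, np.stack(modality_vecs))

def hierarchical_local_view(queries, ctx):
    """HLV sketch: aggregate nodes within each query, then across queries.
    `queries` is a list of per-query node-embedding lists."""
    query_vecs = [attend(ctx, np.stack(nodes)) for nodes in queries]
    return attend(ctx, np.stack(query_vecs))
```

Both functions return one session-level vector of the same dimension as the node embeddings, so the two views can be compared or mixed downstream.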
Both views are encoded by a graph neural network. For contrastive pre‑training, positive samples are generated by masking a random query, while hard negatives are selected as sessions with no overlapping clicked items but highest similarity. The two views are mixed during contrastive learning, and a binary classifier predicts whether a pair of sessions is positive or negative using binary cross‑entropy loss.
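The three ingredients of the pre-training scheme (query masking for positives, similarity-based hard negatives without click overlap, and a BCE-trained pair classifier) can be sketched as follows; the bilinear scorer and all helper names are assumptions for illustration, not the paper's exact classifier:

```python
import random
import numpy as np

def mask_random_query(session_queries):
    """Positive view: drop one randomly chosen query from the session."""
    i = random.randrange(len(session_queries))
    return session_queries[:i] + session_queries[i + 1:]

def hard_negative(anchor_emb, anchor_clicks, pool):
    """Hard negative: the most similar session with no clicked-item overlap.
    `pool` is a list of (session_embedding, clicked_item_set) pairs."""
    best, best_sim = None, -np.inf
    for emb, clicks in pool:
        if anchor_clicks & clicks:
            continue  # skip sessions sharing any clicked item
        sim = emb @ anchor_emb / (np.linalg.norm(emb) * np.linalg.norm(anchor_emb))
        if sim > best_sim:
            best, best_sim = emb, sim
    return best

def bce_pair_loss(emb_a, emb_b, label, w):
    """Binary cross-entropy on a session pair, scored by a bilinear form."""
    score = emb_a @ w @ emb_b
    p = 1.0 / (1.0 + np.exp(-score))      # probability the pair is positive
    eps = 1e-9
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))
```

In training, each anchor session would be paired with its masked positive (label 1) and a mined hard negative (label 0), and the loss backpropagated through the graph encoder.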
Experiments: A seven‑day Mobile Taobao log (clothing, beauty, electronics) is used, with sessions split by a 30‑minute inactivity threshold. The learned session embeddings are evaluated on:
CTR prediction (NDCG@10, HR@10, MRR@10) – MUVCOG‑M (multi‑modal) consistently outperforms baselines including LSTM and Transformer.
Query recommendation – MUVCOG improves all metrics over sequential models.
Query intent classification – MUVCOG‑M surpasses word2vec and BERT baselines.
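For reference, the three ranking metrics reported above reduce to simple formulas when each impression has a single positive item; a minimal sketch (the single-positive simplification is an assumption for clarity):

```python
import math

def rank_metrics(ranked_items, clicked, k=10):
    """HR@k, NDCG@k, and MRR@k for one positive item in a ranked list."""
    topk = ranked_items[:k]
    if clicked not in topk:
        return {"HR": 0.0, "NDCG": 0.0, "MRR": 0.0}
    rank = topk.index(clicked) + 1        # 1-based rank of the positive
    return {"HR": 1.0,
            "NDCG": 1.0 / math.log2(rank + 1),
            "MRR": 1.0 / rank}
```

Per-session values are averaged over the evaluation set to produce the reported NDCG@10, HR@10, and MRR@10.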
Results demonstrate that heterogeneous graph representations capture cross‑modal relationships better than pure sequential models, and that incorporating visual queries yields further improvements.
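The 30-minute inactivity rule used to split the logs into sessions can be sketched as follows (the event format, a time-sorted list of timestamp/event pairs, is an assumption for illustration):

```python
def split_sessions(events, gap_seconds=1800):
    """Split a time-sorted event stream into sessions wherever the gap
    between consecutive events exceeds `gap_seconds` (30 min by default)."""
    sessions, current = [], []
    last_ts = None
    for ts, event in events:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)  # inactivity gap: close current session
            current = []
        current.append(event)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```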
Conclusion: The proposed MUVCOG framework effectively pre‑trains heterogeneous graph representations for MM search sessions, providing universal embeddings that boost multiple e‑commerce downstream tasks without task‑specific fine‑tuning.
Alimama Tech