Inside the 2024 KDD Cup ShopBench Challenge: Tasks, Data, and Evaluation Metrics

The 2024 KDD Cup introduces ShopBench, a large‑scale LLM benchmark that simulates real‑world online shopping with 57 tasks and over 20,000 questions. Five tracks cover concept understanding, knowledge reasoning, user‑behavior alignment, multilingual ability, and an all‑round setting, each scored with task‑specific metrics on a hidden test set.

Baobao Algorithm Notes

The 2024 KDD Cup features two LLM‑focused challenges: one organized by Meta (LLM+RAG) and one organized by Amazon. This article covers the Amazon challenge, which is built around the ShopBench benchmark.

Challenge Goal : Simplify online shopping by leveraging large language models (LLMs) to act as intelligent assistants that can understand product concepts, reason about knowledge, align with user behavior, and operate across multiple languages.

ShopBench Dataset : Derived from anonymized Amazon shopping data, the benchmark contains 57 tasks and 20,598 questions. It covers roughly 13,300 products across 400 categories, with 1,032 attributes, about 11,200 reviews, and about 4,500 queries. The development set is provided in JSON, and each record has four fields: input_field, output_field, task_type, and metric. The hidden test set includes only input_field and a boolean is_multiple_choice flag.
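
A minimal sketch of reading development records with the four fields named above. The one-JSON-object-per-line layout, the sample values, and the helper names parse_dev_records and count_by_task_type are assumptions made for illustration; the article does not specify the file layout or the exact strings used for task_type and metric.

```python
import json
from collections import Counter

# The four development-set fields named in the article.
REQUIRED_FIELDS = {"input_field", "output_field", "task_type", "metric"}


def parse_dev_records(lines):
    """Parse an iterable of JSON strings (assumed: one record per line)."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def count_by_task_type(records):
    """Count how many records belong to each task_type value."""
    return Counter(record["task_type"] for record in records)


# Made-up records for illustration only; real field values may differ.
SAMPLE_LINES = [
    '{"input_field": "Which category fits a yoga mat? 0. Sports 1. Garden", '
    '"output_field": 0, "task_type": "multiple-choice", "metric": "accuracy"}',
    '{"input_field": "Translate to English: tapis de yoga", '
    '"output_field": "yoga mat", "task_type": "generation", "metric": "bleu"}',
]

records = parse_dev_records(SAMPLE_LINES)
print(count_by_task_type(records))
```

With a real file, the same function can be fed an open file handle, since iterating over a text file yields one line at a time.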

Tracks :

Track 1 – Shopping Concept Understanding

Track 2 – Shopping Knowledge Reasoning

Track 3 – User‑Behavior Alignment

Track 4 – Multilingual Capability

Track 5 – All‑Round (single solution for Tracks 1‑4)

Task Types (all reformatted as text‑to‑text generation): multiple‑choice, retrieval, ranking, named‑entity recognition, and generation (including extraction, translation, and detailed description).
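
Because every task is posed as text generation, a solution has to turn free-form model output back into a structured answer before scoring. The sketch below assumes that multiple-choice answers are a single option index and that retrieval or ranking answers are a list of indices; the article does not state the expected answer format, and both helper names are hypothetical.

```python
import re


def parse_multiple_choice(text):
    """Return the first integer in the generated text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def parse_index_list(text, max_items=None):
    """Return integers in order of first appearance, dropping repeats.

    max_items optionally truncates the list, e.g. to 3 for a top-3 retrieval answer.
    """
    seen = []
    for token in re.findall(r"\d+", text):
        value = int(token)
        if value not in seen:
            seen.append(value)
    return seen if max_items is None else seen[:max_items]


print(parse_multiple_choice("The answer is 2."))
print(parse_index_list("3, 1, 3, 2", max_items=3))
```

Returning None for unparseable output keeps the failure explicit, so the caller can decide whether to fall back to a default answer.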

Schedule :

Registration opens: 2024‑03‑15 23:55 UTC

Phase 1: 2024‑03‑18 – 2024‑05‑10 UTC (all registered teams)

Phase 2 (top 25 %): 2024‑05‑15 – 2024‑07‑10 UTC

Winners notified: 2024‑07‑15; announced at KDD 2024 on 2024‑08‑26

Evaluation Framework : Each task uses a specific metric: accuracy for multiple‑choice, Hit@3 for retrieval, NDCG for ranking, micro‑averaged F1 for NER, ROUGE‑L for extraction, BLEU for translation, and cosine similarity of sentence embeddings for other generation tasks. Scores are macro‑averaged across the tasks within each track, and Track 5’s overall score is the average of the Track 1‑4 scores.
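
A rough sketch of three of these metrics and of the two averaging steps, using only the standard library. The organizers' exact definitions are not given in this article, so the choices here are assumptions: Hit@3 is taken as the fraction of relevant items found in the top three (a binary hit-or-miss variant is also common), and NDCG uses linear gain with a log2 discount. All function names are hypothetical.

```python
import math
from statistics import mean


def accuracy(predictions, references):
    """Share of predictions that exactly match their reference."""
    return mean(1.0 if p == r else 0.0 for p, r in zip(predictions, references))


def hit_at_3(retrieved, relevant):
    """Fraction of relevant items that appear among the first three retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:3])) / len(relevant)


def ndcg(gains_in_predicted_order):
    """NDCG for relevance gains listed in the order the model ranked the items."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    ideal = dcg(sorted(gains_in_predicted_order, reverse=True))
    return dcg(gains_in_predicted_order) / ideal if ideal > 0 else 0.0


def track_score(task_scores):
    """Macro-average: unweighted mean of per-task scores, whatever each task's size."""
    return mean(task_scores.values())


def all_round_score(track_scores):
    """Track 5 score as the plain average of the Track 1-4 scores."""
    return mean(track_scores[track] for track in (1, 2, 3, 4))


# Made-up per-task scores for one track.
print(track_score({"task_a": 0.5, "task_b": 1.0}))
print(all_round_score({1: 0.6, 2: 0.7, 3: 0.8, 4: 0.9}))
```

Macro-averaging means a task with a handful of questions counts as much as one with thousands, so weak performance on a small task cannot be hidden by strong performance on a large one.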
