Inside the 2024 KDD Cup ShopBench Challenge: Tasks, Data, and Evaluation Metrics
The 2024 KDD Cup introduces ShopBench, a large‑scale benchmark for LLMs that simulates real‑world online shopping with 57 tasks and over 20,000 questions. The competition built on it has five tracks: concept understanding, knowledge reasoning, user‑behavior alignment, multilingual ability, and an all‑round track. Submissions are scored with task‑specific metrics on a hidden test set.
The 2024 KDD Cup features two LLM‑focused challenges, one organized by Meta (LLM+RAG) and another by Amazon. Only the Amazon challenge, the subject of this article, is built around the ShopBench benchmark.
Challenge Goal : Simplify online shopping by leveraging large language models (LLMs) to act as intelligent assistants that can understand product concepts, reason about knowledge, align with user behavior, and operate across multiple languages.
ShopBench Dataset : Derived from anonymized Amazon shopping data, the benchmark contains 57 tasks and 20,598 questions covering approximately 13,300 products across 400 categories, with 1,032 attributes, ~11,200 reviews, and ~4,500 queries. The development set is provided in JSON, and each record has four fields: input_field, output_field, task_type, and metric. The hidden test set includes only input_field and a boolean is_multiple_choice flag.
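To make the record layout concrete, here is a minimal Python sketch of reading development records and reducing them to the test-set view. Only the field names (input_field, output_field, task_type, metric, is_multiple_choice) come from the description above. The sample values, the one-object-per-line layout, the "multiple-choice" label, and the helper names load_records and to_test_record are assumptions made for illustration.

```python
import json

# Hypothetical records: the field names follow the article, the values are invented.
SAMPLE_RECORDS = [
    {
        "input_field": "Which category best fits a 'stainless steel French press'? "
                       "0. Kitchen 1. Garden 2. Toys 3. Books",
        "output_field": "0",
        "task_type": "multiple-choice",
        "metric": "accuracy",
    },
    {
        "input_field": "Translate the product title into English: ...",
        "output_field": "...",
        "task_type": "generation",
        "metric": "bleu",
    },
]

# Assumption: one JSON object per line (the article only says "JSON").
SAMPLE_DEV_TEXT = "\n".join(json.dumps(r) for r in SAMPLE_RECORDS)


def load_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_test_record(record):
    """Reduce a development record to the two fields the hidden test set exposes."""
    return {
        "input_field": record["input_field"],
        # Assumption: the flag is true exactly for multiple-choice tasks.
        "is_multiple_choice": record["task_type"] == "multiple-choice",
    }


records = load_records(SAMPLE_DEV_TEXT)
test_view = [to_test_record(r) for r in records]
```

Because the test set hides task_type and metric, a solution has to infer the task from the prompt text itself, with is_multiple_choice as the only structural hint.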
Tracks :
Track 1 – Shopping Concept Understanding
Track 2 – Shopping Knowledge Reasoning
Track 3 – User‑Behavior Alignment
Track 4 – Multilingual Capability
Track 5 – All‑Round (single solution for Tracks 1‑4)
Task Types (all reformatted as text‑to‑text generation): multiple‑choice, retrieval, ranking, named‑entity recognition, and generation (including extraction, translation, and detailed description).
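Because every task is cast as text‑to‑text generation, a solution has to turn free-form model output back into a structured answer before it can be scored. The Python sketch below shows one way to do that. The answer formats (a single option index for multiple-choice, a comma-separated list of candidate indices for retrieval and ranking) and the function names are assumptions for illustration, not the organizers' specification.

```python
def parse_multiple_choice(text):
    """Return the first decimal digit in the response as an int, or None if absent."""
    for ch in text:
        if ch.isdecimal():
            return int(ch)
    return None


def parse_id_list(text):
    """Parse a comma-separated response such as '3, 1,2' into [3, 1, 2].

    Tokens that are not plain non-negative integers are skipped.
    """
    ids = []
    for token in text.split(","):
        token = token.strip()
        if token.isdecimal():
            ids.append(int(token))
    return ids
```

Tolerant parsing matters here: a response like "Answer: 3" should still count as option 3, while an unparseable response should fall through to a well-defined "no answer" value rather than raise.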
Schedule :
Registration opens: 2024‑03‑15 23:55 UTC
Phase 1: 2024‑03‑18 – 2024‑05‑10 UTC (all registered teams)
Phase 2 (top 25%): 2024‑05‑15 – 2024‑07‑10 UTC
Winners notified: 2024‑07‑15; announced at KDD 2024 on 2024‑08‑26
Evaluation Framework : Each task is scored with its own metric: accuracy for multiple‑choice, Hit@3 for retrieval, NDCG for ranking, micro‑averaged F1 for NER, ROUGE‑L for extraction, BLEU for translation, and cosine similarity of sentence embeddings for other generation tasks. Per‑task scores are macro‑averaged within each track, and Track 5’s overall score is the average of the Track 1‑4 scores.
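A few of these metrics and the two averaging steps can be sketched in plain Python. This is an illustration under stated assumptions, not the official scorer: Hit@3 is taken here to be 1 if any relevant item appears in the top three results, NDCG uses the common log2 discount, and all function names and the track keys ("track1" to "track5") are hypothetical. The organizers' exact definitions may differ.

```python
import math


def accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def hit_at_k(ranked_ids, relevant_ids, k=3):
    """1.0 if any relevant item is among the first k results, else 0.0 (assumed definition)."""
    return 1.0 if any(i in relevant_ids for i in ranked_ids[:k]) else 0.0


def ndcg(ranked_ids, relevance):
    """NDCG for one query; relevance maps item id -> graded gain."""
    def dcg(gains):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

    actual = dcg([relevance.get(i, 0.0) for i in ranked_ids])
    ideal = dcg(sorted(relevance.values(), reverse=True))
    return actual / ideal if ideal > 0 else 0.0


def macro_average(scores):
    """Unweighted mean, so every task counts equally regardless of its size."""
    return sum(scores) / len(scores)


def track_scores(task_scores):
    """task_scores maps 'track1'..'track4' to a list of per-task scores."""
    per_track = {track: macro_average(s) for track, s in task_scores.items()}
    # Track 5 is the average of the four individual track scores.
    per_track["track5"] = macro_average(
        [per_track[t] for t in ("track1", "track2", "track3", "track4")]
    )
    return per_track


example = track_scores(
    {"track1": [0.5, 0.7], "track2": [0.8], "track3": [0.6], "track4": [0.4]}
)
```

Macro-averaging means a task with a few hundred questions weighs as much as one with thousands, so a weak result on a small task can pull a track score down noticeably.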
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
