Design of a Symbolic Evaluation Set for Knowledge-Base Question Answering (KBQA)
The paper introduces an evaluation set for knowledge-base question answering that focuses on symbolic logic. It defines three kinds of element symbols (single, deterministic set, and non-deterministic set) annotated with linguistic tags, uses them to construct complex queries in the celebrity domain, and shows that current commercial KBQA systems reach only modest accuracy, which points to a need for more robust reasoning.
Knowledge-base question answering (KBQA) combines natural language understanding, knowledge graphs, and natural language generation to answer factual queries. It is widely deployed in vertical applications such as smart speakers, service robots, e-commerce, and food delivery.
Existing public benchmarks (e.g., WebQuestions, ComplexQuestions, QALD) have gradually increased in difficulty, but they still overlap in question types. This article proposes a new evaluation set that emphasizes symbolic logic and interpretability.
The proposed test set is built around three core components of a KBQA system: semantic parsing, semantic matching, and reasoning. These are further decomposed into entity extraction, predicate mapping, knowledge representation, graph retrieval, and query operators (e.g., max, min, intersection).
To model questions symbolically, the authors define three kinds of element symbols:
Single elements (e.g., a single attribute like "height").
Deterministic set elements (e.g., the intersection of "food that can be cooked" and "fruit").
Non-deterministic set elements, which require a quantifier (universal or existential) to be resolved.
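The three kinds of element symbols can be sketched as small data types. This is a minimal illustration, not the paper's implementation: the class names, the `resolve` method, and the contents of `TOY_KB` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Literal, Tuple

# Hypothetical toy knowledge base: category name -> member entities.
TOY_KB: Dict[str, FrozenSet[str]] = {
    "food that can be cooked": frozenset({"tomato", "potato", "apple"}),
    "fruit": frozenset({"tomato", "apple", "grape"}),
}


@dataclass(frozen=True)
class SingleElement:
    """One attribute of one entity, e.g. a person's height."""
    entity: str
    attribute: str


@dataclass(frozen=True)
class DeterministicSet:
    """A set that is fully fixed by a set operation over KB categories."""
    categories: Tuple[str, ...]
    op: Literal["intersection", "union"] = "intersection"

    def resolve(self, kb: Dict[str, FrozenSet[str]]) -> FrozenSet[str]:
        sets = [kb[c] for c in self.categories]
        if self.op == "intersection":
            return frozenset.intersection(*sets)
        return frozenset.union(*sets)


@dataclass(frozen=True)
class NonDeterministicSet:
    """A set whose truth value depends on a quantifier over its members."""
    base: DeterministicSet
    quantifier: Literal["forall", "exists"]


cookable_fruit = DeterministicSet(("food that can be cooked", "fruit"))
print(sorted(cookable_fruit.resolve(TOY_KB)))  # ['apple', 'tomato']
```

Keeping the quantifier as a separate wrapper around a deterministic set mirrors the distinction drawn above: the members are known, but the answer depends on whether all or only some of them must satisfy a condition.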
Each element is annotated with tags to capture linguistic variations:
sbj tags: polysemy, alias, typo, omission.
pred tags: polysemous predicate, alias predicate, implicit predicate.
obj tags: length, amount, time, temperature, volume, string.
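One way to picture the tagging scheme is as three enumerations attached to each question. The sketch below is an assumption about how such annotations might be stored; the `AnnotatedQuestion` type, the `tag_coverage` helper, the sample questions, and the tags assigned to them are hypothetical and only illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class SbjTag(Enum):
    POLYSEMY = "polysemy"
    ALIAS = "alias"
    TYPO = "typo"
    OMISSION = "omission"


class PredTag(Enum):
    POLYSEMOUS = "polysemous predicate"
    ALIAS = "alias predicate"
    IMPLICIT = "implicit predicate"


class ObjTag(Enum):
    LENGTH = "length"
    AMOUNT = "amount"
    TIME = "time"
    TEMPERATURE = "temperature"
    VOLUME = "volume"
    STRING = "string"


@dataclass(frozen=True)
class AnnotatedQuestion:
    """A question plus the linguistic tags on its subject, predicate, object."""
    text: str
    sbj_tags: Tuple[SbjTag, ...] = ()
    pred_tags: Tuple[PredTag, ...] = ()
    obj_tags: Tuple[ObjTag, ...] = ()


def tag_coverage(questions: Iterable[AnnotatedQuestion]) -> Counter:
    """Count how often each tag occurs, to spot gaps in the test set."""
    counts: Counter = Counter()
    for q in questions:
        counts.update(q.sbj_tags + q.pred_tags + q.obj_tags)
    return counts


# Hypothetical questions; the tag assignments are illustrative guesses.
questions = [
    AnnotatedQuestion("How tall is Big Tom?", (SbjTag.ALIAS,), (), (ObjTag.LENGTH,)),
    AnnotatedQuestion("Where is Tom from?", (), (PredTag.IMPLICIT,), (ObjTag.STRING,)),
]
coverage = tag_coverage(questions)
print(coverage[SbjTag.ALIAS], coverage[SbjTag.TYPO])  # 1 0
```

A coverage count of this kind makes it easy to see which linguistic variations a draft test set has not yet exercised.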
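Using these symbols, the authors construct questions for the entertainment-person domain, about celebrities such as 胡歌 (Hu Ge) or 周润发 (Chow Yun-fat). Example question patterns include:
Single-element comparisons: "娜姐的年龄是不是38岁了?" ("Is Sister Na 38 years old?", equality), "娜姐的年龄是不是大于40岁?" ("Is Sister Na older than 40?", greater-than).
Deterministic multi-element queries: "邓超和孙俪的体重是多少" ("How much do Deng Chao and Sun Li weigh?", union), "王诗龄是李湘的女儿吗" ("Is Wang Shiling Li Xiang's daughter?", containment).
Non-deterministic queries with quantifiers: "张学友和刘德华的年龄都大于40岁吗" ("Are Jacky Cheung and Andy Lau both older than 40?", universal), "杨丞琳和罗志祥体重有大于110斤的吗" ("Does either Rainie Yang or Show Lo weigh more than 110 jin?", existential).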
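The question patterns above map naturally onto a handful of operators. The following sketch shows one possible encoding; the function names are assumptions, and the entities and values in `TOY_KB` are placeholders rather than real facts about any person.

```python
import operator
from typing import Any, Callable, Dict, Iterable

# Hypothetical placeholder facts; not real data about any celebrity.
TOY_KB: Dict[str, Dict[str, Any]] = {
    "person_a": {"age": 45, "weight_jin": 120, "daughter": {"person_c"}},
    "person_b": {"age": 38, "weight_jin": 105, "daughter": set()},
}

Comparator = Callable[[Any, Any], bool]


def compare(kb, entity: str, attr: str, op: Comparator, threshold) -> bool:
    """Single-element comparison, e.g. 'is X's age equal to 38?'."""
    return op(kb[entity][attr], threshold)


def union_lookup(kb, entities: Iterable[str], attr: str) -> Dict[str, Any]:
    """Union query: return the attribute for every listed entity."""
    return {e: kb[e][attr] for e in entities}


def contains(kb, entity: str, relation: str, candidate: str) -> bool:
    """Containment query, e.g. 'is C a daughter of A?'."""
    return candidate in kb[entity][relation]


def for_all(kb, entities, attr, op: Comparator, threshold) -> bool:
    """Universal quantifier: every entity must satisfy the comparison."""
    return all(compare(kb, e, attr, op, threshold) for e in entities)


def exists(kb, entities, attr, op: Comparator, threshold) -> bool:
    """Existential quantifier: at least one entity satisfies it."""
    return any(compare(kb, e, attr, op, threshold) for e in entities)


pair = ["person_a", "person_b"]
print(compare(TOY_KB, "person_b", "age", operator.eq, 38))    # True
print(union_lookup(TOY_KB, pair, "weight_jin"))
print(contains(TOY_KB, "person_a", "daughter", "person_c"))   # True
print(for_all(TOY_KB, pair, "age", operator.gt, 40))          # False
print(exists(TOY_KB, pair, "weight_jin", operator.gt, 110))   # True
```

The universal and existential cases differ only in `all` versus `any`, which is why a system that handles single-element comparisons can still fail on quantified questions if it never aggregates over the entity set.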
The evaluation was conducted on several commercial KBQA products (XiaoAi, Tmall, XiaoDu, Shuyan Technology) focusing on the celebrity‑knowledge domain. Metrics were based on answer relevance assessed by human judges.
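As a rough sketch of how human judgments could be turned into a per-system score, the snippet below computes the fraction of answers judged correct. It assumes each judgment is a binary correct/incorrect label, which the summary does not confirm; the system names and judgments are placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical judgments: (system, question type, judged correct?).
Judgment = Tuple[str, str, bool]

judgments = [
    ("system_x", "single", True),
    ("system_x", "single", False),
    ("system_x", "universal", False),
    ("system_y", "single", True),
]


def accuracy_by_system(rows: Iterable[Judgment]) -> Dict[str, float]:
    """Fraction of questions judged correct, grouped by system."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for system, _question_type, is_correct in rows:
        total[system] += 1
        correct[system] += int(is_correct)
    return {system: correct[system] / total[system] for system in total}


scores = accuracy_by_system(judgments)
print(scores["system_y"])  # 1.0
```

Grouping by the question type instead of the system would give a breakdown by symbolic category with the same logic.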
Results show that current systems still struggle with the proposed test set; the best performer (XiaoDu) answered only 41% of the queries correctly. For complex symbolic queries such as "who is heavier, 周润发 (Chow Yun-fat) or 谢娜 (Xie Na)?", Shuyan Technology returned not only the correct answer but also the underlying values it compared, demonstrating support for max-type operators.
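A max-type operator that also exposes its reasoning values might look like the sketch below. This is an illustration of the idea only, not a description of how any of the evaluated products work; the function name and the placeholder weights are assumptions.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical placeholder values; not real data about any person.
TOY_KB: Dict[str, Dict[str, float]] = {
    "person_a": {"weight_kg": 80.0},
    "person_b": {"weight_kg": 50.0},
}


def argmax_with_evidence(
    kb: Dict[str, Dict[str, float]], entities: Iterable[str], attr: str
) -> Tuple[str, Dict[str, float]]:
    """Return the entity with the largest attribute value, plus all
    values that were compared, so the answer can be explained."""
    values = {e: kb[e][attr] for e in entities}
    # On a tie, max() keeps the first entity in iteration order.
    winner = max(values, key=values.get)
    return winner, values


winner, evidence = argmax_with_evidence(TOY_KB, ["person_a", "person_b"], "weight_kg")
print(f"{winner} is heavier; compared values: {evidence}")
```

Returning the compared values alongside the winner is what makes the answer interpretable: a judge can verify both the retrieved facts and the comparison itself.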
Overall, the paper presents a systematic method for constructing high‑level symbolic evaluation sets that can reveal the atomic capabilities of KBQA systems and guide future improvements.
Tencent Cloud Developer