How OAG Shrinks a Million‑Token Ontology to 11% While Keeping LLM Reasoning Power

This article presents the OAG (Ontology‑Augmented Generation) architecture, which uses a three‑stage pipeline of semantic filtering, graph‑based path pruning, and format conversion to cut enterprise‑scale ontologies by up to 89% in token count, while limiting inference accuracy loss to around 3% and adding only ~240 ms of latency.

AsiaInfo Technology: New Tech Exploration

Introduction

Enterprise knowledge ontologies often reach millions of characters, far exceeding the context window of large language models (LLMs). Injecting the full ontology is neither economical nor scalable, and most inference tasks only need a small subset of the knowledge. OAG (Ontology‑Augmented Generation) addresses this by dynamically extracting the minimal required sub‑ontology for each query.

Design Principles

OAG follows the principle “activate on demand, prune dynamically”. Before sending a query to the LLM, OAG performs semantic analysis to identify the exact ontology elements needed, generates a compact sub‑graph, and injects it transparently into the prompt.

Semantic Filtering: From Full Ontology to Subset

The filtering stage consists of three layers:

Lexical matching: detects entity and action words in the query and matches them against ontology class and property names.

Semantic expansion: expands matched concepts to include their subclasses and inherited attributes (e.g., "customer" expands to "VIP customer" and "regular customer").

Dependency closure: recursively adds all objects referenced by the matched ones, ensuring the resulting sub‑graph is semantically closed.

This three‑layer approach mirrors how human experts activate only the relevant parts of their knowledge base.
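
To make the three layers concrete, here is a minimal Python sketch of the filter. The `OntologyIndex` structure, its field names, and the substring‑based lexical matching are illustrative assumptions; the article does not disclose OAG's internal data model.

```python
# Hypothetical sketch of the three-layer semantic filter; the index
# structure and matching heuristics are assumptions, not OAG's code.
from dataclasses import dataclass, field

@dataclass
class OntologyIndex:
    subclasses: dict[str, set[str]] = field(default_factory=dict)  # class -> direct subclasses
    references: dict[str, set[str]] = field(default_factory=dict)  # class -> classes it references

def semantic_filter(query_terms: set[str], classes: set[str],
                    idx: OntologyIndex) -> set[str]:
    # Layer 1: lexical matching against class/property names.
    matched = {c for c in classes
               if any(t.lower() in c.lower() for t in query_terms)}

    # Layer 2: semantic expansion to subclasses
    # (e.g. "Customer" -> "VIPCustomer", "RegularCustomer").
    frontier = list(matched)
    while frontier:
        cls = frontier.pop()
        for sub in idx.subclasses.get(cls, ()):
            if sub not in matched:
                matched.add(sub)
                frontier.append(sub)

    # Layer 3: dependency closure over referenced objects, so the
    # resulting sub-graph is semantically closed.
    frontier = list(matched)
    while frontier:
        cls = frontier.pop()
        for ref in idx.references.get(cls, ()):
            if ref not in matched:
                matched.add(ref)
                frontier.append(ref)
    return matched
```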

Path Pruning: Graph‑Theoretic Intervention

Ontologies can be viewed as directed graphs where nodes are classes and edges are relationships. OAG applies graph algorithms to keep only the most relevant paths:

Initial attempts used exhaustive BFS, which quickly became infeasible for graphs with >300 nodes.

Subsequent Dijkstra‑based shortest‑path methods improved speed but discarded alternative paths that might be semantically important.

The final solution is a coverage‑oriented shortest‑path algorithm: starting from a core object set, a depth‑limited BFS (default depth = 6) discovers reachable targets, then greedily selects new roots to cover remaining objects, merging all visited nodes into a compact sub‑graph.

When queries span multiple connected components, OAG performs component‑wise root selection to avoid full‑graph traversal.
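
The sketch below illustrates the coverage‑oriented pruning described above, under simplifying assumptions: the graph is an adjacency dict, and new roots are picked from the uncovered set with arbitrary tie‑breaking (a fuller implementation would score candidate roots by how many uncovered targets they reach). OAG's production algorithm is not published here.

```python
# Coverage-oriented, depth-limited pruning: an illustrative sketch.
from collections import deque

def coverage_prune(graph: dict[str, set[str]], core: set[str],
                   targets: set[str], max_depth: int = 6) -> set[str]:
    kept: set[str] = set()
    uncovered = set(targets)
    roots = deque(core)
    while uncovered and roots:
        root = roots.popleft()
        # Depth-limited BFS (default depth = 6) from the current root.
        visited = {root}
        frontier = deque([(root, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth >= max_depth:
                continue
            for nbr in graph.get(node, ()):
                if nbr not in visited:
                    visited.add(nbr)
                    frontier.append((nbr, depth + 1))
        # Merge visited nodes into the compact sub-graph.
        kept |= visited
        uncovered -= visited
        # Greedily seed a new root from the still-uncovered targets.
        if uncovered and not roots:
            roots.append(next(iter(uncovered)))
    return kept
```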

Format Conversion: TTL to Markdown

TTL (Turtle) is a W3C‑standard serialization for ontologies, but it carries many redundant prefixes that waste LLM tokens. OAG converts TTL to a concise Markdown representation, flattening inheritance hierarchies and inlining relationships (e.g., `- subscribes: [Plan]`). This conversion reduces token count by 35‑65% while preserving semantic fidelity.
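
As an illustration, a minimal converter could be built on rdflib (an assumption; the article does not name OAG's tooling). It strips namespace prefixes and emits one Markdown section per class:

```python
# Hedged sketch of TTL -> Markdown flattening using rdflib.
from rdflib import Graph, RDF, RDFS, OWL

def local(term) -> str:
    # Strip namespace prefixes that cost tokens, e.g.
    # http://example.org/telecom#Plan -> Plan
    return str(term).rsplit("#", 1)[-1].rsplit("/", 1)[-1]

def ttl_to_markdown(ttl: str) -> str:
    g = Graph().parse(data=ttl, format="turtle")
    lines = []
    for cls in sorted(g.subjects(RDF.type, OWL.Class)):
        lines.append(f"## {local(cls)}")
        for parent in g.objects(cls, RDFS.subClassOf):
            lines.append(f"- subClassOf: {local(parent)}")
        # Inline object properties whose domain is this class,
        # e.g. "- subscribes: [Plan]".
        for prop in g.subjects(RDFS.domain, cls):
            for rng in g.objects(prop, RDFS.range):
                lines.append(f"- {local(prop)}: [{local(rng)}]")
    return "\n".join(lines)
```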

Overall Pipeline

The end‑to‑end workflow is:

1. Semantic filtering (three‑layer pipeline) → filtered sub‑ontology.
2. Path pruning on the filtered graph → minimal connected sub‑graph.
3. TTL‑to‑Markdown conversion → LLM‑friendly text.

Typical latency for the whole pipeline ranges from 200 ms to 500 ms, well below the 2‑10 s inference time of the LLM itself.
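
Wiring the stages together might look like the sketch below, which reuses the `semantic_filter`, `coverage_prune`, `ttl_to_markdown`, and `local` sketches above; the whitespace tokenization and triple‑level subgraph extraction are illustrative simplifications, not OAG's actual orchestration.

```python
# End-to-end pipeline sketch composing the earlier stage sketches.
from rdflib import Graph

def extract_subgraph_ttl(ontology_ttl: str, kept: set[str]) -> str:
    # Keep only triples whose subject's local name survived pruning
    # (reuses local() from the converter sketch above).
    g, sub = Graph().parse(data=ontology_ttl, format="turtle"), Graph()
    for s, p, o in g:
        if local(s) in kept:
            sub.add((s, p, o))
    return sub.serialize(format="turtle")

def oag_pipeline(query: str, ontology_ttl: str, graph, classes, idx) -> str:
    terms = set(query.lower().split())               # naive tokenization, sketch only
    subset = semantic_filter(terms, classes, idx)    # 1. semantic filtering
    kept = coverage_prune(graph, core=subset, targets=subset)  # 2. path pruning
    return ttl_to_markdown(extract_subgraph_ttl(ontology_ttl, kept))  # 3. conversion
```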

Performance Evaluation

Key metrics from telecom‑domain experiments:

Compression ratio: up to 95% for simple queries (5‑10 objects) and 70‑80% for complex queries (50+ objects).

Inference accuracy: full ontology 92% vs. OAG‑filtered 89% (≈3% loss, mainly in edge cases).

End‑to‑end latency contribution: < 10% of total response time.

Query‑type impact: global‑overview queries compress less well; an "importance‑ranking" fallback retains high‑frequency objects when token limits are exceeded.
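
A rough sketch of that fallback is shown below; the frequency table, the `render` callable, and the four‑characters‑per‑token estimate are all illustrative assumptions.

```python
# Importance-ranking fallback sketch: when the filtered sub-ontology
# still exceeds the token budget, keep the highest-frequency objects.
def importance_fallback(objects: list[str], freq: dict[str, int],
                        render, token_budget: int) -> list[str]:
    kept, used = [], 0
    for obj in sorted(objects, key=lambda o: freq.get(o, 0), reverse=True):
        cost = len(render(obj)) // 4   # rough chars-per-token estimate
        if used + cost > token_budget:
            break
        kept.append(obj)
        used += cost
    return kept
```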

Design Reflections

Challenges identified include defining “relevant” objects, choosing an appropriate path‑depth limit (currently 6 layers, domain‑dependent), handling ontology evolution (automatic regression testing after changes), and exploring interactive filtering where the LLM can request additional context on‑the‑fly.

Conclusion

OAG redefines the role of ontologies in LLM‑driven agents from static knowledge burdens to dynamic, on‑demand reasoning aids. By combining semantic filtering, graph‑based pruning, and format conversion, OAG achieves up to 89% token reduction with minimal accuracy loss, paving the way for scalable enterprise‑level AI agents.

Tags: AI agents · LLM · graph algorithms · ontology · token optimization · knowledge compression · semantic filtering