A Comprehensive Survey of Trustworthy Agentic AI: Safety, Robustness, Privacy, and System Security
This survey systematically reviews trustworthy agentic AI, focusing on safety and robustness as well as privacy and system security, mapping risks and safeguards across the agent lifecycle, proposing unified metrics and benchmarks, and discussing high‑risk real‑world applications and open challenges.
Introduction
When large language models evolve from chat assistants to autonomous agents that can plan, invoke tools, retain memory, and act continuously, new trustworthiness concerns arise. A single erroneous judgment can propagate through the perception‑planning‑action‑reflection‑learning loop, potentially triggering real‑world tool usage, modifying external systems, leaking data, or corrupting memory.
Paper authors: Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu et al.
Paper URL: https://arxiv.org/abs/2605.23989
The survey concentrates on two strands that matter most in high‑risk deployments: Safety & Robustness and Privacy & System Security. It maps risks and mitigations onto the agent lifecycle and organizes evaluation metrics, benchmarks, release thresholds, and real‑world case studies around that lifecycle.
Preliminaries
Definition and Components of an Agent
An agentic AI system is defined as a system with persistent goals that can perceive its environment, plan over multiple steps, affect external systems through tools or actuators, and reflect on outcomes, all under human supervision, privacy policies, and operational constraints.
Typical components include goal specifications, perception modules, planning/reasoning modules, tool/action layers, episodic and semantic memory, world models, and reflection/learning modules. Human administrators provide goals, permissions, budgets, and supervision; the environment returns states, results, and rewards.
Foundations of Reinforcement Learning and Preference Optimization
Agent decision‑making can be modeled as an MDP or POMDP, where the system selects actions based on observations and internal state, receiving new states and rewards. Real‑world tasks often involve partial observability, sparse and delayed rewards, and reliance on memory to infer hidden states, complicating credit assignment, exploration, and policy stability.
Reinforcement learning, imitation learning, offline RL, and preference optimization affect trustworthiness. Pure reward maximization can lead to reward gaming; offline data may contain biases and hazardous actions; human preference data can be inconsistent. Safe RL incorporates risk constraints, cumulative cost limits, conditional value‑at‑risk, or shield mechanisms to block unsafe actions.
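Two of the safe RL ingredients named above can be sketched in a few lines of Python. This is a minimal illustration rather than the survey's method: `cvar` computes an empirical conditional value‑at‑risk over sampled episode costs, and `shield` swaps an unsafe proposed action for a fallback. The function names, the `is_unsafe` predicate, and the sample values are assumptions made for the example.

```python
from typing import Callable, Sequence


def cvar(costs: Sequence[float], alpha: float = 0.9) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs."""
    if not costs:
        raise ValueError("costs must be non-empty")
    ordered = sorted(costs)
    # Size of the tail; always keep at least one sample.
    tail = max(1, round(len(ordered) * (1 - alpha)))
    worst = ordered[-tail:]
    return sum(worst) / len(worst)


def shield(action: str, is_unsafe: Callable[[str], bool], fallback: str) -> str:
    """Return the fallback when the proposed action is judged unsafe."""
    return fallback if is_unsafe(action) else action


# Hypothetical episode costs: nine harmless runs and one costly failure.
episode_costs = [0.0] * 9 + [10.0]
mean_cost = sum(episode_costs) / len(episode_costs)
tail_cost = cvar(episode_costs, alpha=0.9)
```

Here the mean cost is 1.0 while the CVaR at alpha = 0.9 is 10.0, which is why a constraint on the tail can reject a policy that looks acceptable on average.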
Core Dimensions of Trustworthiness
Safety & Robustness
Safety concerns whether the system avoids unintended harm; robustness asks if the system maintains stable behavior under noise, adversarial perturbations, and distribution shifts. Errors in perception can cause dangerous plans; planning errors can trigger irreversible tool actions; reflection errors can embed long‑term risks.
Risks and mitigations per stage:
Perception: input contamination, adversarial examples, sensor failures, retrieval errors; mitigated by data augmentation, adversarial training, source verification, input sanitization, out‑of‑distribution detection.
Planning: goal misinterpretation, reward hacking, unsafe exploration, missing constraints; mitigated by constrained MDPs, conservative planning, risk‑sensitive objectives, constitutional rules.
Action: tool misuse, privilege escalation, parameter errors; mitigated by tool whitelists, least‑privilege, parameter validation, sandboxing, transactional execution, human approval for high‑risk actions.
Reflection & Learning: short‑term errors becoming long‑term policies, memory poisoning; mitigated by trajectory auditing, simulation testing, memory provenance tagging, regression gating, canary releases.
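The action‑stage mitigations in the list above (tool whitelists, parameter validation, human approval for high‑risk actions) can be combined into a single gate in front of every tool call. The sketch below is one possible shape for such a gate; the tool names, the `ToolPolicy` fields, and the returned strings are hypothetical and not taken from the survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed_params: frozenset  # parameter names the tool may receive
    high_risk: bool = False    # whether a human must approve the call


# Hypothetical allowlist: anything not listed here is denied.
POLICIES = {
    "search_docs": ToolPolicy(frozenset({"query"})),
    "send_payment": ToolPolicy(frozenset({"recipient", "amount"}), high_risk=True),
}


def authorize(tool: str, params: dict, approved_by_human: bool = False) -> str:
    """Decide whether a proposed tool call may run."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "deny: tool not on allowlist"
    unexpected = set(params) - policy.allowed_params
    if unexpected:
        return f"deny: unexpected parameters {sorted(unexpected)}"
    if policy.high_risk and not approved_by_human:
        return "hold: human approval required"
    return "allow"
```

The gate fails closed: an unknown tool or an unexpected parameter is rejected before the approval question is even considered.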
Multi‑agent environments introduce deception, collusion, error cascades, and responsibility diffusion. Long‑duration tasks face error accumulation, goal drift, and supervision decay; protocol‑level constraints, identity authentication, global budgets, checkpoints, and interruptibility are key mechanisms.
Privacy & System Security
Privacy concerns the proper collection, use, storage, and deletion of personal or sensitive data; system security concerns the resilience of data, tools, and execution environments against malicious attacks. Agents can access emails, documents, browsers, databases, and credentials, expanding the attack surface.
Representative threats and mitigations:
Perception: indirect prompt injection via webpages, emails, or tool outputs; mitigated by zero‑trust input handling, content‑instruction separation, source authentication, injection detection, and reduced privileges for untrusted content.
Planning & Memory: context leakage, cross‑task data mixing, long‑term storage of sensitive info, memory poisoning; mitigated by data minimization, purpose limitation, differential privacy (ε,δ), memory partitioning, retention limits, and secure deletion.
Action: credential theft, over‑privileged tool calls, code execution, data exfiltration; mitigated by secret vaults, short‑lived tokens, least‑privilege, data loss prevention filters, network egress controls, and encrypted, tamper‑evident logs.
Learning: retained training data, model supply‑chain risks, dependency vulnerabilities; mitigated by software bill of materials, component provenance reviews, and security regression testing for each upgrade.
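Memory partitioning and retention limits from the list above can be illustrated with a small in‑memory store. This is a sketch under stated assumptions: the class name, the per‑task partition key, and the `source` provenance tag are invented for the example, and dropping Python references is not secure deletion in the storage sense.

```python
import time
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    source: str       # provenance tag, e.g. "email" or "calendar"
    stored_at: float  # timestamp in seconds


class PartitionedMemory:
    """Keeps each task's memory separate and expires old items on read."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._partitions = {}  # task_id -> list of MemoryItem

    def write(self, task_id: str, text: str, source: str, now: float = None) -> None:
        now = time.time() if now is None else now
        self._partitions.setdefault(task_id, []).append(MemoryItem(text, source, now))

    def read(self, task_id: str, now: float = None) -> list:
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        items = self._partitions.get(task_id, [])
        fresh = [m for m in items if m.stored_at >= cutoff]
        if task_id in self._partitions:
            self._partitions[task_id] = fresh  # drop expired items
        return [m.text for m in fresh]

    def delete_task(self, task_id: str) -> None:
        """Remove a whole partition, e.g. on a user deletion request."""
        self._partitions.pop(task_id, None)
```

Because reads are scoped to a single `task_id`, data written for one task cannot surface in another task's context, which addresses the cross‑task mixing risk named above.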
Consolidated Metrics and Benchmarks
From Outcome to Process Evaluation
Agent evaluation must record both result metrics (task success, harm incidence, adversarial success, privacy breach, policy violation rates) and process metrics (trajectory integrity, constraint violations, tool call legality, sensitive data exposure, human takeover frequency, recovery time, log coverage). Step‑level safety does not guarantee overall trajectory safety.
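The last point, that step‑level safety does not imply trajectory‑level safety, can be shown with a toy check. The sketch assumes each step carries an integer "exposure" cost (for instance, the number of sensitive records touched); the per‑step limit and cumulative budget are hypothetical numbers.

```python
from typing import Sequence


def steps_ok(costs: Sequence[int], per_step_limit: int) -> bool:
    """Every individual step stays within its own limit."""
    return all(c <= per_step_limit for c in costs)


def trajectory_ok(costs: Sequence[int], per_step_limit: int, budget: int) -> bool:
    """Steps are individually fine AND the cumulative cost fits the budget."""
    return steps_ok(costs, per_step_limit) and sum(costs) <= budget


# Four steps, each touching 3 records; limit 5 per step, 10 per trajectory.
exposures = [3, 3, 3, 3]
per_step_verdict = steps_ok(exposures, per_step_limit=5)
trajectory_verdict = trajectory_ok(exposures, per_step_limit=5, budget=10)
```

Every step passes on its own, yet the trajectory touches 12 records against a budget of 10, so a process metric that only inspects single steps would miss the violation.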
Mapping Scenarios to Metrics
Evaluation should start from concrete scenarios and threat models rather than a generic checklist. Domains such as autonomous driving, medical diagnosis, and enterprise assistants have differing harm radii, reversibility, and regulatory requirements, demanding domain‑specific release thresholds.
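One way to encode domain‑specific release thresholds is to gate on the worst scenario rather than on an average across scenarios. The sketch below assumes per‑scenario violation rates have already been measured; the domain names, threshold values, and scenario names are hypothetical and carry no regulatory meaning.

```python
from typing import Dict

# Hypothetical ceilings on the violation rate allowed in ANY single scenario.
RELEASE_THRESHOLDS = {
    "autonomous_driving": 0.001,
    "healthcare": 0.005,
    "enterprise_assistant": 0.05,
}


def mean_rate(rates: Dict[str, float]) -> float:
    """Average violation rate across scenarios (shown only for contrast)."""
    return sum(rates.values()) / len(rates)


def release_ok(domain: str, rates: Dict[str, float]) -> bool:
    """Pass only if the worst scenario is within the domain's threshold."""
    if not rates:
        raise ValueError("at least one scenario is required")
    return max(rates.values()) <= RELEASE_THRESHOLDS[domain]


rates = {"email_triage": 0.0, "calendar": 0.01, "payments": 0.08}
looks_fine_on_average = mean_rate(rates) <= RELEASE_THRESHOLDS["enterprise_assistant"]
passes_gate = release_ok("enterprise_assistant", rates)
```

In this example the average rate is below the threshold, but the payments scenario alone exceeds it, so the gate blocks the release.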
Evaluation Pipeline
The paper proposes a seven‑stage pipeline: offline regression replay of known failures, rare‑event simulation, sandboxed tool execution, automated and human red‑team testing, read‑only shadow deployment on live traffic, limited canary release, and continuous production monitoring. Each stage must retain auditable traces of inputs, model versions, plans, tool parameters, permission decisions, environment feedback, memory updates, and human interventions.
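The staged structure can be sketched as a sequence of gates that stops at the first failure and keeps a trace of what ran. The stage identifiers below paraphrase the seven stages listed above; the `run_pipeline` function and the idea of passing each stage as a zero‑argument check are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

STAGES = [
    "offline_regression_replay",
    "rare_event_simulation",
    "sandboxed_tool_execution",
    "red_team_testing",
    "shadow_deployment",
    "canary_release",
    "production_monitoring",
]


def run_pipeline(
    checks: Dict[str, Callable[[], bool]],
) -> Tuple[List[Tuple[str, bool]], Optional[str]]:
    """Run stages in order; return the trace and the first failed stage."""
    trace: List[Tuple[str, bool]] = []
    for stage in STAGES:
        check = checks.get(stage)
        # A stage with no registered check fails closed.
        passed = bool(check()) if check is not None else False
        trace.append((stage, passed))
        if not passed:
            return trace, stage
    return trace, None
```

A real trace would also record the inputs, model version, and permission decisions for each stage; here it only stores the stage name and verdict.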
Real‑World Applications in High‑Risk Domains
Autonomous Driving
Risks include adverse weather, occlusion, long‑tail scenarios, and multi‑agent interaction. Safeguards involve multi‑sensor fusion, V2X cooperation, simulation‑based verification, safe RL, and runtime shielding. Privacy and system security concerns cover location trace leakage, V2X spoofing, interference, and vehicle control interface attacks, mitigated by secure communication, authentication, data anonymization, and automotive network security standards.
Healthcare
Medical agents may assist diagnosis, clinical decision‑making, record summarization, and workflow coordination. Risks span hallucinations, error propagation, ignored uncertainty, and unsafe autonomous actions. Requirements include multi‑center validation, human‑in‑the‑loop oversight, confidence estimation, continuous monitoring, access control, federated learning, end‑to‑end encryption, audit trails, and compliance with HIPAA/GDPR.
Intelligent Assistants & Enterprise Systems
Assistants that access email, calendars, code repositories, payments, and internal knowledge bases face indirect prompt injection, tool failure, memory poisoning, and credential theft. Protections focus on sandboxing, least‑privilege, temporary credentials, input sanitization, policy enforcement, and comprehensive audit logging. Financial and trading agents add market manipulation, erroneous order, and compliance risks; enterprise coding agents risk malicious dependencies and dangerous command execution.
Challenges and Solutions
Self‑Evolution and Runtime Verification
Agents that continuously learn and modify memory can drift from their verified baseline. Future systems need provenance tracking of updates, policy diff analysis, security regression testing, and runtime invariant checks. Change‑point detection, checkpoints, rollbacks, and staged roll‑outs should become standard components.
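Runtime invariant checks, checkpoints, and rollbacks can be combined in a small wrapper around mutable agent state. The sketch below is illustrative only: the `GuardedState` class, the dictionary‑shaped state, and the example invariant are assumptions, not a design from the survey.

```python
import copy
from typing import Callable, List


class GuardedState:
    """Applies updates only if all invariants still hold afterwards."""

    def __init__(self, state: dict, invariants: List[Callable[[dict], bool]]):
        self.state = state
        self.invariants = invariants
        self._checkpoint = copy.deepcopy(state)

    def apply(self, update: Callable[[dict], None]) -> bool:
        # Try the update on a copy so a rejected change leaves no trace.
        candidate = copy.deepcopy(self.state)
        update(candidate)
        if all(inv(candidate) for inv in self.invariants):
            self.state = candidate
            return True
        return False

    def checkpoint(self) -> None:
        """Record the current state as the verified baseline."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Restore the most recent checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)
```

For example, with the invariant `lambda s: s["max_spend"] <= 100`, a self‑update that raises `max_spend` to 10000 is rejected, while harmless memory writes go through and can still be undone with `rollback()`.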
Trustworthy Personalization
Personalization requires long‑term user data, increasing leakage, mis‑profiling, and manipulation risks. Viable approaches include on‑device processing, layered consent, fine‑grained deletion, usage limitation, and privacy budgeting. Users should be able to view and correct what the system remembers.
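Privacy budgeting can be sketched as an accountant that tracks how much differential‑privacy budget has been spent on a user's data. The example relies on basic sequential composition, under which the ε values of successive queries add up; it ignores δ and tighter composition results. The class name and the numbers are hypothetical.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def spend(self, epsilon: float) -> bool:
        """Reserve epsilon for a query; refuse if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            return False
        self.spent += epsilon
        return True
```

With a total budget of 1.0, queries costing 0.5 and 0.25 are accepted, a further 0.5 query is refused, and 0.25 remains for later use.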
Efficiency, Explainability, and Accountability
Trustworthiness mechanisms add computation, latency, and human cost; systems must balance safety with utility. Explainability should move from post‑hoc rationales to verifiable causal evidence (which observations were used, why a tool was chosen, which rule blocked an action). Accountability demands clear responsibility boundaries among developers, deployers, tool providers, and users.
Long‑Duration Deployments
Core difficulties include error accumulation, delayed consequences, sparse rewards, planning‑action mismatch, supervision scaling, and intractable evaluation. Hierarchical task decomposition, risk budgeting, staged checkpoints, backtracking replanning, and interruptibility can mitigate risks, though mature unified solutions remain lacking.
Open‑Source Agent Security Cases
The survey examines open ecosystems such as OpenClaw and Moltbook, highlighting three factors that become dangerous in combination: exposure to untrusted content, access to sensitive data, and the ability to communicate or execute externally. When all three are present, instructions hidden in untrusted content can direct the agent to exfiltrate secrets. Plugin marketplaces and inter‑agent communication introduce supply‑chain risks, since malicious components can propagate via dependencies or shared memory. Effective defenses include isolating trust domains, narrowing permissions, restricting outbound channels, validating tool parameters, and maintaining full audit trails.
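The three‑factor condition lends itself to a simple configuration check. The sketch below assumes an agent's capabilities can be summarized as three booleans; the `AgentCapabilities` type and its field names are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentCapabilities:
    reads_untrusted_content: bool
    accesses_sensitive_data: bool
    can_send_externally: bool


def present_factors(caps: AgentCapabilities) -> List[str]:
    """List which of the three risk factors this configuration has."""
    flags = {
        "untrusted content": caps.reads_untrusted_content,
        "sensitive data": caps.accesses_sensitive_data,
        "external channel": caps.can_send_externally,
    }
    return [name for name, enabled in flags.items() if enabled]


def exfiltration_path_open(caps: AgentCapabilities) -> bool:
    """True only when all three factors are present at once."""
    return len(present_factors(caps)) == 3
```

Removing any one factor, for instance by blocking outbound channels for sessions that read untrusted web pages, closes this particular path even if the other two remain.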
Conclusions
Trustworthy agentic AI must be treated as a systems‑engineering problem rather than mere model alignment. Risks permeate perception, planning, action, reflection, and learning, and are amplified by tool permissions, long‑term memory, multi‑agent collaboration, and long‑duration operation.
Achieving safety and robustness requires systems to avoid harm under uncertainty and adversarial conditions; privacy and system security demand end‑to‑end protection of data, credentials, and execution environments. Realizing these goals calls for a layered guarantee stack spanning pre‑deployment threat modeling, training‑time constraints, runtime safeguards, and post‑deployment auditing.
Practitioners should first define the operational design domain and high‑risk actions, then implement least‑privilege and verification mechanisms, evaluate both outcome and process metrics with non‑averaged safety thresholds, and continuously monitor model, tool, memory, and policy changes after launch. Trustworthiness is an ongoing governance capability throughout the agent lifecycle.
