How Anthropic Uses Claude for Self‑Service Data Analytics: Beyond Removing SQL Barriers

Anthropic reports that Claude handles about 95% of its internal business-analysis queries at roughly 95% accuracy. The harder work is embedding enterprise data definitions, governance and validation into an agent harness, which takes Skills, context layers and rigorous offline testing to avoid silent failures.

Architect

Jun 6, 2026

First look at the numbers

Anthropic reports that roughly 95% of internal business‑analysis queries are handled by Claude and that the overall accuracy is about 95%.

Neither figure means that 95% of data-science decisions are delegated to the model, or that a simple chat window attached to a warehouse would yield the same result. They reflect a shift of manual, repetitive, relatively stable data requests to an “agent-by-process” approach, which frees analysts to focus on causal modelling, forecasting, machine learning and other complex analyses.

Quietly wrong answers

Typical demos connect a large model directly to a warehouse, let users ask natural‑language questions, generate SQL, run the query and return charts and conclusions. The model can misinterpret definitions at the first step:

What time window does “last week” refer to?

Does “enterprise customers” exclude test, fraudulent or internal accounts?

Which revenue definition (billed, recognized, net) should be used?

These errors go beyond SQL syntax; the generated query may target the wrong table, field or time range, and business users often cannot see the underlying data model. Anthropic calls this a “false sense of precision” – answers look polished but may be incorrect.
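To make the first ambiguity concrete, here is a minimal Python sketch showing that two reasonable readings of “last week” usually select different dates. The function names and the two window definitions are illustrative assumptions, not something Anthropic describes.

```python
from datetime import date, timedelta


def last_week_candidates(today: date) -> dict[str, tuple[date, date]]:
    """Return inclusive (start, end) ranges for two readings of 'last week'."""
    # Reading 1 (assumed): the seven days ending yesterday.
    trailing = (today - timedelta(days=7), today - timedelta(days=1))
    # Reading 2 (assumed): the Monday-to-Sunday week before the current one.
    this_monday = today - timedelta(days=today.weekday())
    previous_week = (this_monday - timedelta(days=7), this_monday - timedelta(days=1))
    return {"trailing_7_days": trailing, "previous_iso_week": previous_week}


def needs_clarification(today: date) -> bool:
    """True when the readings disagree, so the agent should ask or state its choice."""
    windows = last_week_candidates(today)
    return len(set(windows.values())) > 1
```

The two readings coincide only when the question is asked on a Monday; on any other day the agent has to pick one, and the user cannot tell which unless the answer says so.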

Data is not code

When writing code, agents receive abundant hard feedback (compile errors, failing tests, runtime logs). In data analysis, a query can run successfully and a chart can render while the answer is still wrong because the underlying definition was wrong. Without hard feedback, debugging is much harder.

Seeing is not using correctly

Teams often dump historical SQL, dashboards, notebooks and field documentation into a retrieval store, assuming the agent will simply copy past analyst behaviour. Anthropic performed an ablation experiment: giving the agent access to thousands of dashboards and notebooks changed accuracy by less than one percentage point, even though the correct answer existed in the corpus for about 80% of error cases. The bottleneck is not the availability of material but selecting the authoritative source, handling expired content, accounting for team-specific context, and knowing when a document cannot be applied directly.

a16z’s “Your Data Agents Need Context” and Snowflake’s “Agent Context Layer” make the same point: what data agents lack is not more text but the ability to connect business definitions, data sources and governance rules.
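One way to picture the selection problem is a filter that runs before the agent reads anything. The sketch below is an assumption-laden illustration: the `Candidate` fields, the `certified` flag and the 90-day freshness limit are hypothetical, not part of any system described in the sources.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    certified: bool       # hypothetical flag: a data owner marked it authoritative
    deprecated: bool
    last_validated: date


def pick_source(candidates: list[Candidate], today: date,
                max_age_days: int = 90) -> Optional[Candidate]:
    """Return the most recently validated certified source, or None if none qualifies."""
    usable = [
        c for c in candidates
        if c.certified
        and not c.deprecated
        and (today - c.last_validated).days <= max_age_days
    ]
    if not usable:
        return None  # better to escalate or ask than to guess from stale material
    return max(usable, key=lambda c: c.last_validated)
```

Returning `None` rather than the "least bad" candidate is a deliberate choice here: a retrieved dashboard that looks relevant but is stale or uncertified is exactly the kind of material that produces a polished wrong answer.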

Skills are procedures

In Claude Code, a Skill is a directory containing a SKILL.md file, which can bundle reference docs, scripts, templates and other resources alongside its instructions. Skills encode how to query a warehouse, when to clarify time windows, which metrics have authoritative definitions, which privacy constraints apply, and how to format the final answer with source, freshness, confidence and owner.

A typical Skill checklist includes:

When to default to the semantic layer.

When to clarify time windows, denominators and user scope.

Which metrics have authoritative definitions and which tables are deprecated.

Which questions should return data only, without business conclusions.

Privacy or restricted‑data checks enforced at the system layer.

Final answer must include source, data freshness, confidence and responsible owner.
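The last checklist item can be enforced in code rather than left to the prompt. The following is a minimal sketch, assuming a hypothetical `SourceFooter` structure and a three-level confidence scale; neither comes from Anthropic's write-up.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_CONFIDENCE = {"high", "medium", "low"}  # assumed scale


@dataclass(frozen=True)
class SourceFooter:
    source: str        # table or semantic-layer metric the answer came from
    data_as_of: date   # freshness of the underlying data
    confidence: str
    owner: str         # team or person responsible for the definition


def format_answer(body: str, footer: SourceFooter) -> str:
    """Append a source footer, refusing to emit an answer with missing fields."""
    if not footer.source.strip() or not footer.owner.strip():
        raise ValueError("answer must name a source and an owner")
    if footer.confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unknown confidence level: {footer.confidence!r}")
    return (
        f"{body}\n\n"
        f"Source: {footer.source} | Data as of: {footer.data_as_of.isoformat()} | "
        f"Confidence: {footer.confidence} | Owner: {footer.owner}"
    )
```

Raising on an incomplete footer turns a stylistic guideline into a hard failure, which gives the harness a little of the hard feedback that code agents get for free.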

Data‑level harness

The system is more than “User → Agent → Database → Answer”. It consists of:

Authoritative datasets to prune candidate sources.

Semantic layer for consistent metrics.

Lineage and transformation graphs showing data provenance.

Skills that codify analyst procedures.

Offline evaluation exposing blind spots.

Adversarial review that adds ~32% tokens and ~72% latency, yielding a 6% accuracy gain.

Source footers exposing trustworthiness.

User‑error feedback loops that update docs, tests and Skills.

This harness does not guarantee perfect answers, but it embeds why a query was made, which definition was used, how it was validated and how to correct errors.

The story behind the 95%

Maintaining Skills is costly; without upkeep accuracy can drift from ~95% to ~65% within weeks. Anthropic now stores Skill markdown and data‑model changes in the same repository and flags mismatches during PR reviews. About 90% of data‑model PRs now include Skill changes.
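A mismatch check of this kind can be very small. The sketch below assumes a hypothetical repository layout with data models under `models/` and Skills under `skills/`, and takes the list of changed paths as input; how Anthropic actually implements its PR check is not described in the source.

```python
from pathlib import PurePosixPath
from typing import Optional

MODEL_DIRS = {"models"}   # hypothetical location of data-model definitions
SKILL_DIRS = {"skills"}   # hypothetical location of Skill markdown


def _top_level(path: str) -> str:
    """Return the first path component, or an empty string for an empty path."""
    parts = PurePosixPath(path).parts
    return parts[0] if parts else ""


def skill_sync_warning(changed_paths: list[str]) -> Optional[str]:
    """Return a review warning when data models change without any Skill change."""
    touches_model = any(_top_level(p) in MODEL_DIRS for p in changed_paths)
    touches_skill = any(_top_level(p) in SKILL_DIRS for p in changed_paths)
    if touches_model and not touches_skill:
        return ("Data-model files changed but no Skill files were updated; "
                "confirm the affected Skills are still accurate.")
    return None
```

The check only flags the mismatch for a reviewer; it cannot tell whether a Skill edit is actually needed, so it is a prompt for human judgment rather than a gate.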

Adversarial review improves accuracy by 6% but adds token and latency overhead, which may be acceptable for high‑risk analyses but must be weighed for routine queries.
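The overhead is easy to put into numbers for a given workload. This sketch applies the reported ~32% token and ~72% latency overheads to a baseline query cost; the `QueryCost` type and the risk-based policy are hypothetical illustrations.

```python
from dataclasses import dataclass

TOKEN_OVERHEAD = 0.32     # ~32% more tokens, as reported
LATENCY_OVERHEAD = 0.72   # ~72% more latency, as reported


@dataclass(frozen=True)
class QueryCost:
    tokens: float
    latency_s: float


def with_adversarial_review(base: QueryCost) -> QueryCost:
    """Estimate the cost of the same query with adversarial review enabled."""
    return QueryCost(
        tokens=base.tokens * (1 + TOKEN_OVERHEAD),
        latency_s=base.latency_s * (1 + LATENCY_OVERHEAD),
    )


def should_review(risk: str) -> bool:
    """Hypothetical policy: pay the overhead only for high-risk analyses."""
    return risk == "high"
```

Under these figures, a query that takes 10,000 tokens and 20 seconds would take about 13,200 tokens and a little over 34 seconds with review, which is why routing by risk rather than reviewing everything is worth considering.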

Permissions cannot be enforced via prompts alone; row‑level security, privacy fields and financial data must be enforced at the system layer.
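As a small illustration of enforcement below the prompt, the sketch uses Python's built-in `sqlite3` authorizer hook as a stand-in for a warehouse's own access controls. The table, the restricted column and the helper names are hypothetical; a production warehouse would rely on its native row-level security and column masking instead.

```python
import sqlite3

# Hypothetical policy: (table, column) pairs the agent's connection may never read.
RESTRICTED = {("customers", "email")}


def deny_restricted(action, arg1, arg2, db_name, source):
    """Authorizer callback: for SQLITE_READ, arg1 is the table and arg2 the column."""
    if action == sqlite3.SQLITE_READ and (arg1, arg2) in RESTRICTED:
        return sqlite3.SQLITE_DENY
    return sqlite3.SQLITE_OK


def agent_connection() -> sqlite3.Connection:
    """Build a demo database, then lock it down before the agent can use it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
    conn.execute("INSERT INTO customers VALUES ('Acme', 'ops@acme.example')")
    conn.set_authorizer(deny_restricted)
    return conn


def run_agent_sql(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Run agent-generated SQL; the database, not the prompt, rejects restricted reads."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError as exc:
        raise PermissionError(f"query blocked by policy: {exc}") from exc
```

Whatever SQL the model generates, a query that touches `customers.email` fails at the database layer, so a cleverly worded request cannot talk its way past the restriction.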

Organizational silos break the maintenance chain: if data engineering, BI, finance, sales and product analysis each maintain separate definitions, the agent sees a fragmented world. Human analysts can still rely on experience to spot stale dashboards or fields, but agents need that tacit knowledge encoded as maintainable, auditable assets.

Silent failures remain unsolved: the hardest errors are those no one points out. A polished but wrong answer can propagate into reports and decisions; downstream checks (source footers, reviewer sign‑offs, KPI reconciliations) only mitigate risk partially.

Data science does not disappear

LLM APIs make teams think they can bypass data scientists, but stable systems still require traceability, error classification, evaluation sets, metric design, label quality and drift monitoring.

Professional roles shift from manually producing each analysis to designing data definitions, evaluation methods, error taxonomies, validation pipelines and feedback loops.

Karpathy’s AutoResearch demonstrates a loop that runs fixed‑budget experiments, evaluates with a val_bpb metric, and decides whether to keep changes. Although not directly transferable to data analysis, it shows how a tightly scoped harness can drive continuous improvement.

Practical checklist for teams starting AI analytics

Identify 10 high‑frequency business questions.

Bind each question to a single primary metric definition.

Select a few canonical datasets with clear owners.

Collect dozens of real questions with expected answers and provenance for offline evaluation.

Implement a thin knowledge‑routing Skill that encodes the analysis procedure.

Require every answer to include source, freshness, confidence and owner.

Set red‑line boundaries: require human review or SQL‑only responses for PII, finance or leadership reports.

Monitor semantic‑layer hit rate, correction‑language ratio and KPI reconciliation results.

When an error occurs, add it to the evaluation set, documentation and Skill.

Start small: verify that the closed loop works for the selected high-frequency queries, ensure definitions stay in sync, enforce permission boundaries, and weigh cost against latency before expanding scope.
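The offline-evaluation step in the checklist can start as a few dozen lines. Below is a minimal sketch, assuming numeric answers, a hypothetical `EvalCase` record and a 1% relative tolerance; `answer_fn` stands in for whatever calls the agent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: float
    provenance: str   # where the expected answer came from, e.g. a signed-off report


def run_offline_eval(
    answer_fn: Callable[[str], Optional[float]],
    cases: list[EvalCase],
    rel_tol: float = 0.01,
) -> tuple[float, list[tuple[EvalCase, Optional[float]]]]:
    """Score the agent on known questions and return (accuracy, failures)."""
    failures = []
    for case in cases:
        got = answer_fn(case.question)
        # A missing answer or one outside the tolerance counts as a failure.
        if got is None or abs(got - case.expected) > rel_tol * abs(case.expected):
            failures.append((case, got))
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures
```

Each production error that gets corrected can be appended as a new `EvalCase`, so the evaluation set grows in the places where the agent has already been wrong.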

Conclusion

Anthropic’s Claude self‑service analytics does more than erase SQL friction; it lifts hidden knowledge from analysts, dashboards, Slack discussions and historical queries into a system layer. When an agent reaches the execution layer, the system must supply definition, validation and braking mechanisms, otherwise the agent’s efficiency can mask serious errors.

References

Anthropic: How Anthropic enables self‑service data analytics with Claude – https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude

Claude Code Docs: Extend Claude with skills – https://code.claude.com/docs/en/skills

Anthropic: The Complete Guide to Building Skills for Claude – https://resources.anthropic.com/hubfs/The-Complete-Guide-to-Building-Skill-for-Claude.pdf

Hamel Husain: The Revenge of the Data Scientist – https://hamel.dev/blog/posts/revenge/index.html

Josh Wills: Data Scientist definition (2012) – https://x.com/josh_wills/status/198093512149958656

Andrej Karpathy: AutoResearch – https://github.com/karpathy/autoresearch

Tristan Handy: BI’s Second Unbundling – https://roundup.getdbt.com/p/bis-second-unbundling

a16z: Your Data Agents Need Context – https://a16z.com/your-data-agents-need-context/

Snowflake: The Agent Context Layer for Trustworthy Data Agents – https://www.snowflake.com/en/blog/agent-context-layer-trustworthy-data-agents/

Genloop: What Anthropic got right about agentic analytics, and got wrong for everyone else – https://genloop.ai/blogs/anthropic-agentic-analytics-what-they-got-right-and-wrong

Benn Stancil: The Context Layer – https://benn.spicytakes.org/post/2025-08-29-the-context-layer

Atlan: Why AI Agents Need an Enterprise Context Layer in 2026 – https://atlan.com/know/why-ai-agents-need-an-enterprise-context-layer/

Aakash Gupta: Discussion on Karpathy Loop – https://x.com/aakashgupta/status/2038132294817656978
