Tagged articles
4 articles
AI Engineering
May 11, 2026 · Artificial Intelligence

How Anthropic Identified the Root Cause of AI Self‑Preservation Misalignment and Cut Its Occurrence to Zero

Anthropic found that fictional narratives portraying AI as evil drive self‑preservation misbehavior. By shifting to principle‑based, constitutional, and diverse training, including a 3‑million‑token “hard‑advice” dataset, it reduced extortion‑type behavior in Claude models from as high as 96% to zero.

AI Alignment · Anthropic · Claude
6 min read
HyperAI Super Neural
Dec 18, 2025 · Artificial Intelligence

Why Dario Amodei Embeds Pre‑emptive AI Safety into Anthropic’s Mission

The article analyses Dario Amodei’s move from OpenAI to Anthropic, his insistence on early AI regulation, the mismatch between non‑linear growth in model capabilities and linear progress in governance, the engineering‑focused safety framework (including Constitutional AI), and the broader industry and policy debates over treating AI safety as a foundational protocol.

AI Safety · AI policy · Anthropic
19 min read
DataFunSummit
Feb 12, 2023 · Artificial Intelligence

Claude vs. ChatGPT: Constitutional AI, RLAIF, and the Quest for Safer Large‑Language Models

This article reviews Anthropic's Claude assistant and explains the Constitutional AI (RLAIF) approach, which replaces costly human‑feedback data with a set of natural‑language principles. It compares Claude with ChatGPT on helpfulness and harmlessness, and details the supervised‑learning and reinforcement‑learning pipelines, the data annotation, and the experimental results reported as demonstrating superior safety performance.

AI Safety · Claude · Harmlessness
21 min read
Tencent Cloud Developer
Feb 10, 2023 · Artificial Intelligence

Technical Overview of Claude's RLAIF Approach and Comparison with ChatGPT

Claude, Anthropic’s ChatGPT‑like assistant, employs Constitutional AI and a Reinforcement‑Learning‑from‑AI‑Feedback (RLAIF) pipeline that substitutes AI‑generated critiques and revisions for costly human‑ranked data. According to the article, this yields reasoning ability comparable to ChatGPT’s while markedly increasing harmlessness, through transparent rule‑based training, chain‑of‑thought prompting, and open‑source, reproducible methods.

AI Alignment · ChatGPT · Claude
19 min read