When AI Starts Getting Real Work Done, Are We Ready to Evaluate It?
The article analyzes recent AI updates, from DeepSeek's DSpark inference boost and FlashAttention‑4's kernel redesign to Codex UI tweaks and design‑mode tools, and argues that the competition is shifting from answering questions to actually completing tasks. It highlights three layers of progress, the evaluation challenges, and the practical questions we must now ask of AI agents.
