Alibaba Cloud Developer
Mar 27, 2026 · Artificial Intelligence

How OpenClaw Empowers a Self‑Evolving Bank Manager Assistant

This article details a three‑day deep dive into OpenClaw, demonstrating how a self‑iterating AI assistant for bank relationship managers can be built, validated, and refined through autonomous agent communication, scheduled tasks, and memory‑driven reflection.

AI agents · Agent Evaluation · Automation
20 min read
PaperAgent
Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework in which AI agents iteratively code, compete, and improve across classic board games. It uses three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI benchmark · Agent Evaluation · CATArena
8 min read
Fun with Large Models
Aug 20, 2025 · Artificial Intelligence

DeepSeek V3.1 Review: 128K Context, Knowledge, Programming & Agent Skills Near Claude 4

DeepSeek V3.1, released on August 19, expands the context length to 128K tokens and updates its knowledge base to July 2024. The author’s benchmarks suggest its programming and agent capabilities now rival Claude 4; the review includes detailed prompt examples, code generation demos, and performance comparisons.

Agent Evaluation · Claude 4 · Context Length
9 min read
DataFunTalk
Jul 14, 2025 · Artificial Intelligence

Can Kimi K2 Beat Claude and Gemini in Coding and Agent Tasks?

This in‑depth review examines Kimi K2’s new focus on agent and coding abilities, comparing its performance on 3D HTML generation, code generation, and real‑world agent tasks against Claude 4 and Gemini 2.5, while also evaluating cost, openness, and practical usability for developers.

AI coding · Agent Evaluation · Kimi K2
15 min read