How Top AI Models Survived a Year‑Long Virtual Startup Simulation

A year-long YC-Bench simulation pits twelve leading large language models against a virtual startup environment, revealing stark differences in profitability, cost efficiency, memory handling, and strategic decision-making: fewer than half of the models ended the year profitable, and only a handful achieved high cost-performance ratios.

Benchmark Overview

YC‑Bench implements a partially observable Markov decision process (POMDP) in which each evaluated large language model (LLM) acts as the CEO of a simulated startup. The company begins with $200,000 capital and eight employees. Over a simulated year (≈365 days) the model repeatedly receives market orders, assigns employees to tasks, and delivers the work to earn revenue and reputation.
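
To make the setup concrete, below is a minimal sketch of the agent-environment loop as described above. The names (Observation, Action, run_episode, env, llm_ceo) are illustrative assumptions for exposition, not the actual YC-Bench interface.

```python
from dataclasses import dataclass, field

# Minimal sketch of the simulation loop described above. All names here
# are illustrative assumptions, not the benchmark's real API.

@dataclass
class Observation:
    day: int                  # current simulated day (out of ~365)
    capital: float            # remaining cash
    reputation: float         # gates access to high-value orders
    open_orders: list         # market orders currently on offer
    employee_status: list     # busy/idle only; true proficiencies stay hidden

@dataclass
class Action:
    accepted_order_ids: list = field(default_factory=list)
    assignments: dict = field(default_factory=dict)   # order_id -> employee_id
    scratchpad_note: str = ""                         # persistent memory write

def run_episode(env, llm_ceo, horizon_days=365):
    """One simulated year: observe, decide, then let the environment settle
    revenue, penalties, salaries, and reputation for the round."""
    obs = env.reset(initial_capital=200_000, num_employees=8)
    for _ in range(horizon_days):
        action = llm_ceo.decide(obs)      # partially observable decision
        obs, done = env.step(action)      # pays completed orders, charges salaries
        if done or obs.capital <= 0:      # bankruptcy ends the run early
            break
    return obs.capital                    # final capital is the headline metric
```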

Simulation Mechanics

Each order originates from one of four domains (training, inference, research, data engineering) and has hidden skill requirements for the eight employees. Employees possess fixed but unknown proficiency vectors; the model can only infer these by observing task completion times and outcomes, gradually improving the employee‑task matching.
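
YC-Bench does not prescribe how the agent should infer these hidden proficiencies. One simple approach, sketched below purely as an assumption, is to track a smoothed success rate per employee-domain pair and route new work accordingly.

```python
from collections import defaultdict

# Hedged sketch: one plausible way an agent could estimate hidden employee
# proficiencies from observed outcomes. Not prescribed by the benchmark.

class SkillEstimator:
    def __init__(self):
        # (employee_id, domain) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def update(self, employee_id, domain, succeeded):
        record = self.stats[(employee_id, domain)]
        record[0] += int(succeeded)
        record[1] += 1

    def best_employee(self, domain, candidates):
        # Laplace-smoothed success rate so unseen pairs still get explored.
        def score(emp):
            succ, att = self.stats[(emp, domain)]
            return (succ + 1) / (att + 2)
        return max(candidates, key=score)

est = SkillEstimator()
est.update("emp_3", "inference", succeeded=True)
est.update("emp_5", "inference", succeeded=False)
print(est.best_employee("inference", ["emp_3", "emp_5"]))  # -> emp_3
```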

Successful task completion yields revenue and a reputation increase. Reputation is required to unlock high‑value orders that have strict entry thresholds. Missing a deadline incurs a 35% penalty on the payment and a reputation drop.
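
As a worked illustration of the payout rule, assuming the 35% penalty applies to the order's full payment (which is how the description reads) and using placeholder reputation deltas since the exact values are not given here:

```python
def settle_order(payment, reputation, met_deadline,
                 penalty_rate=0.35, rep_gain=1.0, rep_loss=1.0):
    """Illustrative settlement rule based on the description above;
    the actual reputation deltas in YC-Bench are not specified here."""
    if met_deadline:
        return payment, reputation + rep_gain
    return payment * (1 - penalty_rate), reputation - rep_loss

# A $10,000 order delivered late pays only $6,500 and costs reputation.
print(settle_order(10_000, reputation=50, met_deadline=False))  # (6500.0, 49.0)
```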

After every successful task the fixed monthly salary of the involved employees rises permanently, creating a compounding labor‑cost pressure that forces the model to prioritize higher‑margin orders.
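
The compounding effect is worth spelling out. Assuming, purely for illustration, a 5% raise per completed task (the actual increment is not stated here), labor cost grows geometrically with output:

```python
def monthly_salary(base_salary, tasks_completed, raise_per_task=0.05):
    """Hypothetical compounding raise; the real per-task increment in
    YC-Bench is not given in this article."""
    return base_salary * (1 + raise_per_task) ** tasks_completed

# An employee starting at $5,000/month costs ~$13,266/month after 20 tasks,
# so late-year orders must carry much higher margins to stay profitable.
print(round(monthly_salary(5_000, 20)))  # 13266
```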

Approximately 35% of orders are designated “malicious clients.” Their workload inflates three‑fold after acceptance, often causing missed deadlines and severe penalties.
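
A back-of-the-envelope expected-value calculation shows why blind acceptance is costly. Only the 35% malicious share and the 35% late penalty come from the benchmark description; the miss probability below is an assumption for illustration.

```python
def expected_payment(payment, p_malicious=0.35, p_miss_if_malicious=0.9,
                     penalty_rate=0.35):
    """Expected payout of blindly accepting an order. p_miss_if_malicious
    is an illustrative assumption, not a benchmark parameter."""
    honest = (1 - p_malicious) * payment
    malicious = p_malicious * (
        p_miss_if_malicious * payment * (1 - penalty_rate)
        + (1 - p_miss_if_malicious) * payment
    )
    return honest + malicious

# Blindly accepting a $10,000 order is worth roughly $8,900 in expectation,
# before counting the opportunity cost of tying employees up on 3x the work.
print(expected_payment(10_000))  # ~8897.5
```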

Memory Constraints and Scratchpad

The conversational context window is truncated to the most recent 20 interaction rounds. To compensate, the environment provides a persistent scratchpad that the model can write to and read from across rounds. Effective models record malicious client IDs, reputation thresholds, and employee performance notes in the scratchpad and consult it before accepting new orders.
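
A minimal sketch of this write-then-consult pattern follows; the file-based interface and field names are assumptions made for the example, since the benchmark exposes its own scratchpad mechanism.

```python
import json

# Hedged sketch of using a persistent scratchpad to survive the 20-round
# context truncation. The file-based interface here is an assumption.

SCRATCHPAD = "scratchpad.json"

def load_notes():
    try:
        with open(SCRATCHPAD) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"blacklist": [], "rep_thresholds": {}, "employee_notes": {}}

def save_notes(notes):
    with open(SCRATCHPAD, "w") as f:
        json.dump(notes, f, indent=2)

def should_accept(order, notes):
    """Consult persisted knowledge before accepting: skip known-bad clients."""
    return order["client_id"] not in notes["blacklist"]

notes = load_notes()
notes["blacklist"].append("client_17")          # observed 3x workload inflation
notes["employee_notes"]["emp_2"] = "fast on data engineering"
save_notes(notes)

order = {"client_id": "client_17", "payment": 12_000}
print(should_accept(order, load_notes()))       # False: declined without re-learning
```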

Experimental Setup

Twelve state-of-the-art LLMs were evaluated, including Claude Opus 4.6, GLM-5, GPT-5.4, Gemini 3 Flash, Kimi-K2.5, and Grok 4.20.

Each model runs three random seeds for a full simulated year.

Metrics recorded: final capital, cash‑flow trajectory, number of bankruptcies, inference cost per run, cost‑performance ratio (revenue per dollar spent), number of scratchpad entries, and frequency of malicious‑client acceptance.
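
Of these metrics, the cost-performance ratio is the least standard; as described, it is simply revenue earned per dollar of inference spend. The figures below are illustrative only:

```python
def cost_performance(total_revenue, inference_cost):
    """Revenue earned per dollar of inference spend, as defined above."""
    return total_revenue / inference_cost

# Illustrative only: the same $500,000 of simulated revenue on $86 of
# inference spend yields ~5,800x, while on $2 it yields 250,000x.
print(round(cost_performance(500_000, 86)))   # 5814
print(round(cost_performance(500_000, 2)))    # 250000
```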

Key Findings

Profitability

Only five of the twelve models ended the year with non‑negative profit; the remaining seven fell below the initial $200,000 capital, with several declaring bankruptcy mid‑simulation.

Claude Opus 4.6 (Anthropic) – average final capital $1.27 M

GLM‑5 (Zhipu) – $1.21 M

GPT‑5.4 (OpenAI) – slightly behind the top two

Cost‑Performance

Inference cost per run varied widely. Claude Opus 4.6 averaged $86 per run (≈70 min), while Kimi‑K2.5 cost under $2 per run and delivered the highest revenue per dollar. GLM‑5’s cost‑efficiency was roughly ten times better than Claude Opus, and Kimi‑K2.5 outperformed Gemini 3 Flash by a factor of 2.5 in cost‑performance.

Memory and Malicious‑Client Handling

Top-performing models used the scratchpad to blacklist malicious clients, reducing their exposure to inflated workloads. Across 30 full runs, only 20 contained any blacklist entries; the best models wrote blacklist rules only about a quarter as often as lower-ranked models, yet they accepted malicious orders at a rate below 9%, well under the 35% baseline.
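
The article does not spell out exactly how the malicious-acceptance rate is computed. A plausible reading, sketched below with an invented order-log schema, is the share of accepted orders that came from malicious clients, which blind acceptance would push toward the ~35% prevalence baseline.

```python
def malicious_share_of_accepted(orders):
    """Share of accepted orders that came from malicious clients.
    The schema (dicts with 'malicious' and 'accepted' flags) is illustrative."""
    accepted = [o for o in orders if o["accepted"]]
    if not accepted:
        return 0.0
    return sum(o["malicious"] for o in accepted) / len(accepted)

log = [
    {"malicious": True,  "accepted": False},   # blacklisted, declined
    {"malicious": False, "accepted": True},
    {"malicious": False, "accepted": True},
    {"malicious": True,  "accepted": True},    # one slipped through
]
print(malicious_share_of_accepted(log))  # 1 of 3 accepted orders -> ~0.33
```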

Tool Usage

High-scoring models (Claude Opus, GPT-5.4) frequently invoked command-line utilities to gather task details before committing to an order, and averaged more than one scratchpad entry per 10 rounds. Lower-ranked models rarely used such tools, effectively deciding without any gathered intelligence about the orders they accepted.

Behavioral Patterns

Claude Opus 4.6 – ~34 scratchpad updates per run, focused on high‑trust clients, examined task details ~155 times before acceptance; occasional lapse on a blacklisted client caused a late‑year revenue dip.

Gemini 3 Flash – minimal scratchpad interaction, accepted orders blindly, suffered 12 malicious‑client failures but stayed afloat due to volume.

Claude Sonnet 4.6 – pursued high parallelism (up to 16 concurrent tasks), leading to 41% failure from resource dilution and 12% from malicious clients.

Grok 4.20 – highly cautious, recorded cash‑flow warnings, but ultimately accepted a zero‑success client when cash was low, causing a fatal stall.

Conclusions

The benchmark reveals a structural gap in current LLMs: strong reasoning does not automatically translate into robust long‑term execution. Memory limits, risk‑avoidance strategies, and tool‑use proficiency separate elite models from mediocre ones. Even in a simplified text‑only environment, leading models exhibit critical deficiencies in planning, risk management, and consistent action that must be addressed before reliable deployment in real‑world, long‑horizon business tasks.

References

https://github.com/collinear-ai/yc-bench

https://arxiv.org/pdf/2604.01212v1

Tags: memory management, simulation, AI, benchmark, cost efficiency, YC-Bench
Written by SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
