Can Neural Computers Replace Traditional CPUs? Inside the Latest AI Harness Designs
This article analyzes the emerging concept of Neural Computers, explains how harness engineering unifies compute, memory, and I/O into a single learned runtime, and presents detailed experimental results from the NCCLIGen and NCGUIWorld prototypes.
Neural Computer Overview
Definition and CNC criteria
A Neural Computer (NC) unifies computation, memory, and I/O into a single latent runtime state. The latent state itself serves as working memory, while pixel‑level observations and action tokens act as the I/O interface. The paper defines a Completely Neural Computer (CNC) as a system that satisfies four criteria: (1) unified runtime state, (2) latent‑state memory, (3) integrated visual/action I/O, and (4) fully neural computation.
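The four CNC criteria can be made concrete with a toy control loop in which one latent vector is simultaneously the runtime state and the working memory, and only pixels and action tokens cross the boundary. All names below (`NeuralComputerState`, `nc_step`, `update_fn`) are illustrative, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NeuralComputerState:
    latent: np.ndarray  # unified runtime state that doubles as working memory

def nc_step(state, frame, action, update_fn):
    """One tick of a toy neural computer: fold the pixel observation and the
    action token into the latent state; a decoder would render the next frame
    from that state alone, so pixels/actions are the only I/O interface."""
    state.latent = update_fn(state.latent, frame, action)
    return state.latent
```

The point of the sketch is criterion (4): every transition is a learned function of the latent state, with no symbolic registers or external tape.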
Video‑model based NC prototypes
Overall architecture
Both prototypes are built on the video generation model Wan2.1. The model consumes a text prompt and the first frame, encodes frames with a VAE, extracts visual features with CLIP, and encodes the prompt with T5. A DiT (diffusion Transformer) core updates the latent state, and the decoder predicts the next frame.
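The pipeline above can be sketched end to end. The stub encoders and the iterative refinement loop below are toy stand-ins for illustration only, not Wan2.1 internals:

```python
import numpy as np

def vae_encode(frame):       # stub for the VAE frame encoder
    return frame.mean(axis=-1)

def clip_features(frame):    # stub for CLIP visual features
    return frame.std(axis=-1)

def t5_encode(prompt):       # stub for the T5 text encoder
    return np.full(4, float(len(prompt)))

def dit_update(latent, visual, text, steps=4):
    """Iterative diffusion-style refinement of the latent state,
    conditioned on visual and text features."""
    for _ in range(steps):
        latent = 0.5 * latent + 0.25 * visual + 0.25 * text
    return latent

def predict_next_frame(prompt, first_frame):
    latent = vae_encode(first_frame)
    latent = dit_update(latent, clip_features(first_frame), t5_encode(prompt))
    return latent  # a real decoder would map this latent back to pixels
```

The structural takeaway is that text and visual conditioning both enter through the DiT core; the latent state is the only thing carried across steps.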
CLI prototype – NCCLIGen (Terminal Neural Computer)
Dataset construction
CLIGen (General): ~820k video streams (~1,100 h) of real terminal workflows captured via asciinema.
CLIGen (Clean): ~250k deterministic Dockerized scripts, filtered to ~128k high-quality trajectories.
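A filtering pass in the spirit of CLIGen (Clean) might look like the sketch below. The specific quality checks (minimum length, blank-frame ratio) are invented for illustration; the paper's actual filtering criteria are not specified here:

```python
def filter_trajectories(trajectories, min_frames=8, max_blank_ratio=0.2):
    """Keep only trajectories that are long enough and mostly non-blank.
    Each trajectory is a dict with a 'frames' list of rendered screens."""
    kept = []
    for traj in trajectories:
        frames = traj["frames"]
        blanks = sum(1 for f in frames if not f.strip())
        if len(frames) >= min_frames and blanks / len(frames) <= max_blank_ratio:
            kept.append(traj)
    return kept
```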
Model architecture
Input: text prompt + first terminal frame image.
Encoding: VAE-encoded frame + CLIP visual features + T5 text encoding.
Core: DiT diffusion Transformer that iteratively updates the latent state.
Output: predicted next terminal frame.
Key experimental results
Font size and readability: average PSNR = 40.77 dB, SSIM = 0.989. Fonts ≥13 px remain clear; 6 px fonts become locally blurry, indicating that adequate font size is essential for stable NC training.
Early training saturation: metrics plateau around 25k training steps. Extending to 460k steps yields only marginal changes, suggesting that structural patterns are captured early and further gains require higher-quality supervision.
Prompt specificity: detailed, word-by-word prompts improve PSNR by ≈5 dB, demonstrating that precise textual scaffolding enables accurate text-to-pixel alignment.
Character-level precision: the model attains near-practical character rendering quality.
Symbolic reasoning bottleneck: native symbolic reasoning accuracy is low (≈4%). System-level conditioning (reprompting) raises accuracy to ≈83%, showing that external conditioning can compensate for current reasoning limitations.
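For reference, PSNR as reported in the readability results is the standard peak signal-to-noise ratio; this is the textbook definition, not code from the paper:

```python
import numpy as np

def psnr(ref, pred, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    prediction: 10 * log10(MAX^2 / MSE). Higher is better; ~40 dB, as
    reported above, indicates near-imperceptible pixel error."""
    mse = np.mean((ref.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```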
GUI prototype – NCGUIWorld (Desktop Neural Computer)
Action‑conditioning injection modes
External: actions are provided as an external condition before the VAE input, keeping them separate from the main token stream.
Contextual: frames and actions are concatenated into a single sequence with temporal masking, allowing shared attention.
Residual: action deltas are added to the hidden state as a residual modulation.
Internal: a dedicated action cross-attention layer is inserted after the main cross-attention and before the feed-forward network, tightly integrating actions into the backbone.
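The Internal mode can be sketched as a Transformer block with an extra cross-attention stage. The toy single-head attention below (no learned projections) and all shapes are illustrative assumptions, not the prototype's actual layers:

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Toy single-head cross-attention without learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def internal_block(hidden, text_tokens, action_tokens, ffn):
    """One block in the 'Internal' injection layout: action cross-attention
    sits between the main (text) cross-attention and the feed-forward net."""
    hidden = hidden + cross_attention(hidden, text_tokens)    # main cross-attn
    hidden = hidden + cross_attention(hidden, action_tokens)  # action cross-attn
    return hidden + ffn(hidden)                               # feed-forward
```

Placing the action attention inside the block, rather than at the input boundary as in the External mode, is what the paper means by "deep integration."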
Key experimental results
Data quality vs. scale: target-oriented interaction data (≈110 h of Claude CUA trajectories) yields far better performance than large volumes of passive data, emphasizing the importance of curated interaction datasets.
Explicit visual supervision for cursor tracking: adding position + SVG mask/reference supervision raises cursor accuracy from 8.7% (position only) to 98.7%.
Injection depth impact: the Internal injection mode achieves the highest structural similarity (SSIM) and lowest Fréchet Video Distance (FVD), confirming that deep integration of actions improves temporal fidelity.
Key insight: Treating the cursor as a visual object to be learned, rather than as an abstract coordinate, is critical for reliable GUI interaction.
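One way to realize that insight is to supervise a rendered cursor mask alongside the coordinates, so the model must get the cursor right as pixels, not just as a number. The mask rasterization and loss weighting below are an illustrative sketch, not the paper's training objective:

```python
import numpy as np

def render_cursor_mask(h, w, x, y, size=2):
    """Rasterize a square cursor mask of the given half-size at (x, y)."""
    mask = np.zeros((h, w))
    mask[max(0, y - size):y + size + 1, max(0, x - size):x + size + 1] = 1.0
    return mask

def cursor_loss(pred_xy, true_xy, pred_mask, true_mask, mask_weight=1.0):
    """Combined supervision: squared coordinate error plus a pixel-level
    MSE on the cursor mask, treating the cursor as a visual object."""
    pos = np.sum((np.asarray(pred_xy) - np.asarray(true_xy)) ** 2)
    mask = np.mean((pred_mask - true_mask) ** 2)
    return pos + mask_weight * mask
```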
https://arxiv.org/pdf/2604.06425
https://metauto.ai/neuralcomputer
https://claude.com/blog/claude-managed-agents
https://www.anthropic.com/engineering/harness-design-long-running-apps