AI News Flash: DeepSeek Multimodal Breakthrough, Codex Major Update, Grok 4.3 Launch (May 1‑2)
This roundup covers OpenAI's Codex upgrade, which adds Workspace Agents and a claimed 40% gain in token efficiency; xAI's Grok 4.3 API, offering a 128K context window at pricing 60% below GPT‑4.5; Ant Group's open‑source Ling 2.6‑1T model; DeepSeek's multimodal Visual Primitives framework and its sudden removal; plus the ongoing GPT‑Plus account bans and how to mitigate them.
OpenAI Codex Major Update
Core changes
Workspace Agents launched (announced Apr 22, fully rolled out May 1) for Business, Enterprise, Edu, and Teacher plans.
Designed to replace Custom GPTs for repeatable team workflows.
Supports cross‑tool orchestration, scheduled triggers, and result write‑back (see the sketch after this list).
Programming capabilities enhanced.
Context handling improved.
Token efficiency claimed to increase by 40%.
Computational usage capabilities enhanced.
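OpenAI has not published a public schema for Workspace Agents, so the snippet below is a purely illustrative sketch of what a repeatable team workflow with a scheduled trigger and result write‑back might contain; every field name here is an assumption, not OpenAI's API.

```python
from dataclasses import dataclass, field

# Illustrative only: these field names are assumptions, not OpenAI's Workspace Agents schema.
@dataclass
class WorkspaceAgentSketch:
    name: str
    instructions: str                                  # what the agent does on each run
    tools: list[str] = field(default_factory=list)     # cross-tool orchestration targets
    schedule: str = "0 9 * * MON"                      # cron-style scheduled trigger
    write_back: str | None = None                      # where results are written back

weekly_digest = WorkspaceAgentSketch(
    name="weekly-metrics-digest",
    instructions="Summarize last week's dashboard metrics and flag anomalies.",
    tools=["spreadsheet", "dashboard", "email"],
    write_back="team-wiki/weekly-digest",
)
print(weekly_digest)
```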
xAI Grok 4.3 API
Key parameters
Context window: 128K tokens.
Multimodal support: image + text.
Real‑time search: X‑platform live data.
API pricing: 60% lower than GPT‑4.5.
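xAI's API has so far been OpenAI‑compatible, so a minimal multimodal call might look like the sketch below; the model id "grok-4.3" and the exact image‑input shape are assumptions, since the launch notes do not spell them out.

```python
# Minimal sketch against xAI's OpenAI-compatible endpoint.
# Assumption: the model id "grok-4.3" and image_url support mirror earlier Grok releases.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",   # xAI's OpenAI-compatible API endpoint
    api_key="XAI_API_KEY",            # replace with a real key
)

response = client.chat.completions.create(
    model="grok-4.3",                 # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is trending on X about open-source LLMs today?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```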
Early feedback (early stage)
Real‑time search accuracy: ★★★★ (better than GPT‑4.5).
Code generation: ★★★ (weaker than Claude Opus 4.7).
Multimodal understanding: ★★★★ (on par with Gemini 3.1).
Ant Group Ling 2.6‑1T Open‑Source Model
Model specifications
Total parameters: 1.02 trillion.
Active parameters (MoE): 420 billion.
Context window: 1 million tokens.
License: MIT (commercial use allowed).
Deployment threshold: 4 × H100 can run the full‑scale version.
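If the weights ship in the usual Hugging Face format, loading on the 4 × H100 setup mentioned above could look roughly like this; the repository id is a placeholder, and an MoE checkpoint of this size would in practice still lean on quantization or expert offloading.

```python
# Hedged sketch: the repo id is a placeholder, not a confirmed release name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-2.6-1T"   # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # shard layers/experts across all visible GPUs
    torch_dtype="auto",       # keep the dtype stored in the checkpoint
    trust_remote_code=True,   # MoE releases often ship custom modeling code
)

inputs = tokenizer("Explain the Engram memory architecture in one paragraph.",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```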
Core technologies
Engram memory architecture compresses the KV‑Cache by 90% (see the sizing sketch after this list).
ClawEval benchmark shows token efficiency savings of 40‑60% versus Opus/GPT.
Multilingual capability (Chinese, English, code) reaches top‑tier performance.
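To put the 90% KV‑Cache reduction in context at a 1‑million‑token window, here is a back‑of‑the‑envelope estimate; the layer count, KV‑head count, and head dimension are placeholder values, not published Ling 2.6 figures.

```python
# Back-of-the-envelope KV-cache sizing; all architecture numbers are placeholders.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; fp16/bf16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

baseline = kv_cache_bytes(seq_len=1_000_000, n_layers=80, n_kv_heads=8, head_dim=128)
compressed = baseline * 0.10   # the claimed 90% reduction

print(f"baseline KV cache : {baseline / 2**30:.1f} GiB")    # ~305 GiB
print(f"after 90% cut     : {compressed / 2**30:.1f} GiB")  # ~31 GiB
```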
DeepSeek Multimodal Model – Open Then Deleted
Timeline of events
Apr 29: Team lead Chen Xiaokang posts “Now we can see you.”
Apr 29 evening: An “image recognition mode” appears in a limited gray‑release test of the web demo.
Apr 30 early morning: GitHub upload of technical report “Thinking with Visual Primitives”.
Apr 30 late night: Paper and code repository deleted (GitHub 404).
May 1‑2: Industry discussion about DeepSeek’s actions.
Problem definition (from report)
Multimodal models fail on complex tasks not because of perception gaps (“can’t see clearly”) but because of reference gaps (“can’t point precisely”). Example: when counting apples in an image, a model that never explicitly points at each apple miscounts in much the same way a person would, and natural‑language references such as “the left one” or “the second one” are ambiguous in complex scenes.
Solution – Visual Primitives framework
Elevate the point token <|point|> and the bounding‑box token <|box|> to the smallest thinking units.
Traditional multimodal reasoning: "How many apples in the picture?" → think → answer: "3" ❌ (easy to miscount)
Visual Primitives reasoning: "How many apples in the picture?" → <|point|> (12,34) → <|point|> (56,78) → <|point|> (90,12) → answer: "3" ✅ (precise anchoring)
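To make the contrast concrete, here is a small sketch of how an application might consume coordinate‑anchored output: it extracts every <|point|> primitive from the reasoning trace and counts the anchors instead of trusting the bare number. The token syntax follows the example above and should be treated as an assumption, since the original report is no longer available.

```python
import re

# Extract (x, y) anchors emitted as visual primitives in a reasoning trace.
# The token syntax mirrors the example above and is an assumption.
POINT = re.compile(r"<\|point\|>\s*\((\d+)\s*,\s*(\d+)\)")

trace = '<|point|> (12,34) → <|point|> (56,78) → <|point|> (90,12) → answer: "3"'

anchors = [(int(x), int(y)) for x, y in POINT.findall(trace)]
print(anchors)       # [(12, 34), (56, 78), (90, 12)]
print(len(anchors))  # 3 -- the count comes from explicit anchors, not from the text
```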
Technical architecture
Image input → DeepSeek‑ViT (visual encoder)
↓ CSA sparse attention (7056× compression)
↓ DeepSeek‑V4‑Flash backbone (2840 billion parameters)
↓ Generate coordinate‑aware reasoning process (Visual Primitives)
↓ Output precise answer
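Read top to bottom, the pipeline is a straightforward module composition; the sketch below only mirrors the reported stages, and every class name and interface in it is hypothetical.

```python
# Purely structural sketch of the reported pipeline; names and interfaces are hypothetical.
class VisualPrimitivesPipeline:
    def __init__(self, vit_encoder, csa_compressor, backbone, decoder):
        self.vit_encoder = vit_encoder        # DeepSeek-ViT stand-in
        self.csa_compressor = csa_compressor  # sparse-attention compression stage
        self.backbone = backbone              # language backbone stand-in
        self.decoder = decoder                # emits text plus <|point|>/<|box|> tokens

    def answer(self, image, question):
        patches = self.vit_encoder(image)              # image -> patch embeddings
        visual_ctx = self.csa_compressor(patches)      # heavy compression of visual tokens
        hidden = self.backbone(visual_ctx, question)   # joint reasoning over image + text
        return self.decoder(hidden)                    # coordinate-anchored answer string
```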
Performance (benchmark scores)
Spatial reasoning: 92.3% (vs. GPT‑5.4 87.1%, Claude‑4.6 85.6%, Gemini‑3.1 88.9%).
Visual QA: 89.7% (tied with Gemini‑3.1, above Claude‑4.6 87.1%).
Maze navigation: 96.8% (vs. GPT‑5.4 78.2%, Claude‑4.6 76.5%, Gemini‑3.1 81.3%).
Path tracing: 94.1% (vs. GPT‑5.4 82.7%, Claude‑4.6 80.9%, Gemini‑3.1 85.4%).
Early user feedback (limited gray‑release test)
Image recognition accuracy: 4.5/5 (better than GPT‑4.5).
Spatial reasoning: 4.8/5 (significantly ahead of competitors).
Response speed: 3.9/5 (slightly slower than Claude).
Multimodal interaction: 4.3/5 (on par with Gemini).
Speculated reasons for paper deletion
Technical leakage risk – Visual Primitives could be quickly copied.
Patent protection – public disclosure might affect pending patents.
Product not ready – report revealed core parameters of an unreleased product.
Internal strategy shift – multimodal model may be bundled with the upcoming V4 release.
Comparison: DeepSeek Multimodal vs. Competitors
Open‑source status: DeepSeek planned open‑source (now withdrawn); GPT‑5.5, Claude Opus 4.7, and Gemini 3.1 are closed‑source.
Context window: DeepSeek 1M tokens; GPT‑5.5 2M; Claude Opus 4.7 1M; Gemini 3.1 Pro 1M.
Modalities: DeepSeek vision + language; GPT‑5.5 full‑modal; Claude Opus 4.7 vision + language; Gemini 3.1 Pro full‑modal.
Spatial reasoning rating: DeepSeek ★★★★★; GPT‑5.5 ★★★★; Claude Opus 4.7 ★★★★; Gemini 3.1 Pro ★★★★.
Cost‑performance rating: DeepSeek ★★★★★; GPT‑5.5 ★★★; Claude Opus 4.7 ★★★; Gemini 3.1 Pro ★★★★.
Availability: DeepSeek in limited gray‑release testing; the others are generally available.
Implications for Developers
New multimodal paradigm
Developers building multimodal applications should focus on:
Enabling models to reference exact image regions.
Integrating spatial coordinates with language reasoning.
Evaluating model performance on spatial tasks such as counting, navigation, and path tracing.
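A minimal way to act on the last point is to score a model on counting prompts with known ground truth; the sketch below assumes answers arrive as plain text ending in a number and is not tied to any particular vendor's API.

```python
import re

# Tiny spatial-task evaluation sketch: compare the model's final number with the
# known ground-truth count. `ask_model` is a placeholder for any chat API call.
def final_number(answer: str) -> int | None:
    numbers = re.findall(r"\d+", answer)
    return int(numbers[-1]) if numbers else None

def counting_accuracy(ask_model, cases):
    correct = 0
    for image_path, question, expected in cases:
        correct += int(final_number(ask_model(image_path, question)) == expected)
    return correct / len(cases)

cases = [
    ("apples.png", "How many apples are in the picture?", 3),
    ("maze.png", "How many junctions does the shortest path pass through?", 7),
]
# accuracy = counting_accuracy(my_model_call, cases)
```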
Open‑source vs. closed‑source dynamics
DeepSeek V4 (Apr 24): open‑source, 1 M context.
DeepSeek multimodal (Apr 30): originally planned open‑source, later withdrawn.
Ling 2.6 (May 1): open‑source under MIT, zero API cost.
GPT‑5.5 / Claude 4.7: closed‑source but performance‑leading.
API price trends
Grok 4.3: 60% cheaper than GPT‑4.5.
DeepSeek V4 API (reported): 70% cheaper than Claude.
Ling 2.6: open‑source, no API fees.