AI News Flash: DeepSeek Multimodal Breakthrough, Codex Major Update, Grok 4.3 Launch (May 1‑2)
This roundup covers OpenAI's Codex upgrade, which adds Workspace Agents and a claimed 40% gain in token efficiency; xAI's Grok 4.3 API, offering a 128K context window at pricing 60% below GPT‑4.5; Ant Group's open‑source Ling 2.6‑1T model; DeepSeek's multimodal Visual Primitives framework and its sudden removal; plus the ongoing GPT‑Plus account bans and how to mitigate them.
OpenAI Codex Major Update
Core changes
Workspace Agents launched (announced Apr 22, fully rolled out May 1) for Business, Enterprise, Edu, and Teacher plans.
Designed to replace Custom GPTs for repeatable team workflows.
Supports cross‑tool orchestration, scheduled triggers, and result write‑back (see the sketch after this list).
Programming capabilities enhanced.
Context handling improved.
Token efficiency claimed to increase by 40%.
Computational usage capabilities enhanced.
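OpenAI has not published a public schema for Workspace Agents, so the snippet below is a purely illustrative sketch of what a repeatable team workflow with a scheduled trigger and result write‑back might contain; every field name here is an assumption, not OpenAI's API.

```python
from dataclasses import dataclass, field

# Illustrative only: these field names are assumptions, not OpenAI's Workspace Agents schema.
@dataclass
class WorkspaceAgentSketch:
    name: str
    instructions: str                                  # what the agent does on each run
    tools: list[str] = field(default_factory=list)     # cross-tool orchestration targets
    schedule: str = "0 9 * * MON"                      # cron-style scheduled trigger
    write_back: str | None = None                      # where results are written back

weekly_digest = WorkspaceAgentSketch(
    name="weekly-metrics-digest",
    instructions="Summarize last week's dashboard metrics and flag anomalies.",
    tools=["spreadsheet", "dashboard", "email"],
    write_back="team-wiki/weekly-digest",
)
print(weekly_digest)
```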
xAI Grok 4.3 API
Key parameters
Context window: 128K tokens.
Multimodal support: image + text.
Real‑time search: X‑platform live data.
API pricing: 60% lower than GPT‑4.5.
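xAI's API has so far been OpenAI‑compatible, so a minimal multimodal call might look like the sketch below; the model id "grok-4.3" and the exact image‑input shape are assumptions, since the launch notes do not spell them out.

```python
# Minimal sketch against xAI's OpenAI-compatible endpoint.
# Assumption: the model id "grok-4.3" and image_url support mirror earlier Grok releases.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",   # xAI's OpenAI-compatible API endpoint
    api_key="XAI_API_KEY",            # replace with a real key
)

response = client.chat.completions.create(
    model="grok-4.3",                 # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is trending on X about open-source LLMs today?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```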
Early feedback (early stage)
Real‑time search accuracy: ★★★★ (better than GPT‑4.5).
Code generation: ★★★ (weaker than Claude Opus 4.7).
Multimodal understanding: ★★★★ (on par with Gemini 3.1).
Ant Group Ling 2.6‑1T Open‑Source Model
Model specifications
Total parameters: 1.02 trillion.
Active parameters (MoE): 420 billion.
Context window: 1 million tokens.
License: MIT (commercial use allowed).
Deployment threshold: 4 × H100 can run the full‑scale version.
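If the weights ship in the usual Hugging Face format, loading on the 4 × H100 setup mentioned above could look roughly like this; the repository id is a placeholder, and an MoE checkpoint of this size would in practice still lean on quantization or expert offloading.

```python
# Hedged sketch: the repo id is a placeholder, not a confirmed release name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-2.6-1T"   # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # shard layers/experts across all visible GPUs
    torch_dtype="auto",       # keep the dtype stored in the checkpoint
    trust_remote_code=True,   # MoE releases often ship custom modeling code
)

inputs = tokenizer("Explain the Engram memory architecture in one paragraph.",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```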
Core technologies
Engram memory architecture compresses the KV‑Cache by 90% (see the sizing sketch after this list).
ClawEval benchmark shows token efficiency savings of 40‑60% versus Opus/GPT.
Multilingual capability (Chinese, English, code) reaches top‑tier performance.
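To put the 90% KV‑Cache reduction in context at a 1‑million‑token window, here is a back‑of‑the‑envelope estimate; the layer count, KV‑head count, and head dimension are placeholder values, not published Ling 2.6 figures.

```python
# Back-of-the-envelope KV-cache sizing; all architecture numbers are placeholders.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; fp16/bf16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

baseline = kv_cache_bytes(seq_len=1_000_000, n_layers=80, n_kv_heads=8, head_dim=128)
compressed = baseline * 0.10   # the claimed 90% reduction

print(f"baseline KV cache : {baseline / 2**30:.1f} GiB")    # ~305 GiB
print(f"after 90% cut     : {compressed / 2**30:.1f} GiB")  # ~31 GiB
```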
DeepSeek Multimodal Model – Open Then Deleted
Timeline of events
Apr 29: Team lead Chen Xiaokang posts “Now we can see you.”
Apr 29 evening: An “image recognition mode” appears in a limited gray‑release test of the web demo.
Apr 30 early morning: GitHub upload of technical report “Thinking with Visual Primitives”.
Apr 30 late night: Paper and code repository deleted (GitHub 404).
May 1‑2: Industry discussion about DeepSeek’s actions.
Problem definition (from report)
Multimodal models fail on complex tasks not because of perception gaps (“can’t see clearly”) but because of reference gaps (“can’t point precisely”). Example: when counting apples in an image, a model that never explicitly points at each apple miscounts in much the same way a person would, and natural‑language references such as “the left one” or “the second one” are ambiguous in complex scenes.
Solution – Visual Primitives framework
Elevate the point token <|point|> and the bounding‑box token <|box|> to the smallest thinking units.
Traditional multimodal reasoning: "How many apples in the picture?" → think → answer: "3" ❌ (easy to miscount)
Visual Primitives reasoning: "How many apples in the picture?" → <|point|> (12,34) → <|point|> (56,78) → <|point|> (90,12) → answer: "3" ✅ (precise anchoring)
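To make the contrast concrete, here is a small sketch of how an application might consume coordinate‑anchored output: it extracts every <|point|> primitive from the reasoning trace and counts the anchors instead of trusting the bare number. The token syntax follows the example above and should be treated as an assumption, since the original report is no longer available.

```python
import re

# Extract (x, y) anchors emitted as visual primitives in a reasoning trace.
# The token syntax mirrors the example above and is an assumption.
POINT = re.compile(r"<\|point\|>\s*\((\d+)\s*,\s*(\d+)\)")

trace = '<|point|> (12,34) → <|point|> (56,78) → <|point|> (90,12) → answer: "3"'

anchors = [(int(x), int(y)) for x, y in POINT.findall(trace)]
print(anchors)       # [(12, 34), (56, 78), (90, 12)]
print(len(anchors))  # 3 -- the count comes from explicit anchors, not from the text
```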
Technical architecture
Image input → DeepSeek‑ViT (visual encoder)
↓ CSA sparse attention (7056× compression)
↓ DeepSeek‑V4‑Flash backbone (2840 billion parameters)
↓ Generate coordinate‑aware reasoning process (Visual Primitives)
↓ Output precise answer
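Read top to bottom, the pipeline is a straightforward module composition; the sketch below only mirrors the reported stages, and every class name and interface in it is hypothetical.

```python
# Purely structural sketch of the reported pipeline; names and interfaces are hypothetical.
class VisualPrimitivesPipeline:
    def __init__(self, vit_encoder, csa_compressor, backbone, decoder):
        self.vit_encoder = vit_encoder        # DeepSeek-ViT stand-in
        self.csa_compressor = csa_compressor  # sparse-attention compression stage
        self.backbone = backbone              # language backbone stand-in
        self.decoder = decoder                # emits text plus <|point|>/<|box|> tokens

    def answer(self, image, question):
        patches = self.vit_encoder(image)              # image -> patch embeddings
        visual_ctx = self.csa_compressor(patches)      # heavy compression of visual tokens
        hidden = self.backbone(visual_ctx, question)   # joint reasoning over image + text
        return self.decoder(hidden)                    # coordinate-anchored answer string
```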
Performance (benchmark scores)
Spatial reasoning: 92.3% (vs. GPT‑5.4 87.1%, Claude‑4.6 85.6%, Gemini‑3.1 88.9%).
Visual QA: 89.7% (tied with Gemini‑3.1, above Claude‑4.6 87.1%).
Maze navigation: 96.8% (vs. GPT‑5.4 78.2%, Claude‑4.6 76.5%, Gemini‑3.1 81.3%).
Path tracing: 94.1% (vs. GPT‑5.4 82.7%, Claude‑4.6 80.9%, Gemini‑3.1 85.4%).
Early user feedback (limited gray‑release test)
Image recognition accuracy: 4.5/5 (better than GPT‑4.5).
Spatial reasoning: 4.8/5 (significantly ahead of competitors).
Response speed: 3.9/5 (slightly slower than Claude).
Multimodal interaction: 4.3/5 (on par with Gemini).
Speculated reasons for paper deletion
Technical leakage risk – Visual Primitives could be quickly copied.
Patent protection – public disclosure might affect pending patents.
Product not ready – report revealed core parameters of an unreleased product.
Internal strategy shift – multimodal model may be bundled with the upcoming V4 release.
Comparison: DeepSeek Multimodal vs. Competitors
Open‑source status: DeepSeek planned open‑source (now withdrawn); GPT‑5.5, Claude Opus 4.7, and Gemini 3.1 are closed‑source.
Context window: DeepSeek 1M tokens; GPT‑5.5 2M; Claude Opus 4.7 1M; Gemini 3.1 Pro 1M.
Modalities: DeepSeek vision + language; GPT‑5.5 full‑modal; Claude Opus 4.7 vision + language; Gemini 3.1 Pro full‑modal.
Spatial reasoning rating: DeepSeek ★★★★★; GPT‑5.5 ★★★★; Claude Opus 4.7 ★★★★; Gemini 3.1 Pro ★★★★.
Cost‑performance rating: DeepSeek ★★★★★; GPT‑5.5 ★★★; Claude Opus 4.7 ★★★; Gemini 3.1 Pro ★★★★.
Availability: DeepSeek in limited gray‑release testing; the others are generally available.
Implications for Developers
New multimodal paradigm
Developers building multimodal applications should focus on:
Enabling models to reference exact image regions.
Integrating spatial coordinates with language reasoning.
Evaluating model performance on spatial tasks such as counting, navigation, and path tracing.
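A minimal way to act on the last point is to score a model on counting prompts with known ground truth; the sketch below assumes answers arrive as plain text ending in a number and is not tied to any particular vendor's API.

```python
import re

# Tiny spatial-task evaluation sketch: compare the model's final number with the
# known ground-truth count. `ask_model` is a placeholder for any chat API call.
def final_number(answer: str) -> int | None:
    numbers = re.findall(r"\d+", answer)
    return int(numbers[-1]) if numbers else None

def counting_accuracy(ask_model, cases):
    correct = 0
    for image_path, question, expected in cases:
        correct += int(final_number(ask_model(image_path, question)) == expected)
    return correct / len(cases)

cases = [
    ("apples.png", "How many apples are in the picture?", 3),
    ("maze.png", "How many junctions does the shortest path pass through?", 7),
]
# accuracy = counting_accuracy(my_model_call, cases)
```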
Open‑source vs. closed‑source dynamics
DeepSeek V4 (Apr 24): open‑source, 1 M context.
DeepSeek multimodal (Apr 30): originally planned open‑source, later withdrawn.
Ling 2.6 (May 1): open‑source under MIT, zero API cost.
GPT‑5.5 / Claude 4.7: closed‑source but performance‑leading.
API price trends
Grok 4.3: 60% cheaper than GPT‑4.5.
DeepSeek V4 API (reported): 70% cheaper than Claude.
Ling 2.6: open‑source, no API fees.