How SkyReels, DeepSeek NSA, Grok‑3, and KG²RAG Are Shaping the Next AI Wave
This issue reviews China's first open‑source short‑film model SkyReels‑V1, DeepSeek's Native Sparse Attention breakthrough, xAI's massive Grok‑3 deployment on 200k H100 GPUs, and a knowledge‑graph‑guided RAG framework, highlighting their performance gains, architectural innovations, and industry impact.
Market and Voices
China's First Open‑Source AI Short‑Film Model: SkyReels‑V1
On February 18, 2025, SkyReels‑V1 was released as China's first open‑source AI model for short‑film generation. This human‑centric video foundation model synthesizes coherent, vivid short‑film clips from text prompts, handling facial expressions, scene transitions, and action details with near‑photoreal quality. It supports more than 60 behavior semantics, enabling diverse genres such as romance, suspense, and urban life while dramatically reducing production costs for individual creators and professional studios alike.
SkyReels‑V1’s open‑source nature invites developers to build custom extensions, potentially accelerating innovation across the short‑film ecosystem.
Valuable Technologies
DeepSeek's Native Sparse Attention (NSA)
DeepSeek introduced a new architecture called Native Sparse Attention (NSA) that tackles the quadratic O(L²) cost of traditional Transformers, especially for long‑context tasks. By redesigning the attention engine and optimizing hardware usage, NSA achieves a 5.8× training speedup and an 11.6× inference acceleration.
Training phase: forward pass 9× faster, backward pass 6× faster; a task that previously required 100 days now finishes in 17 days.
Inference phase: first‑token latency reduced to 89 ms for 32k context; memory usage drops to 13 GB, enabling consumer‑grade GPUs to run ultra‑long models.
Benchmarks: an average score improvement of 3% across nine tasks (MMLU, GSM8K, HumanEval, etc.); perfect retrieval accuracy on 64k Needle‑in‑a‑Haystack tests; a 163% accuracy gain on AIME math problems.
The NSA engine consists of three cooperating attention modules:
Compressed Attention: processes tokens in 32‑token blocks with a 16‑token stride, reducing the effective token count to 1/8 while preserving global patterns.
Selective Attention: dynamically selects the top‑N blocks based on the compressed results, loading 64‑token blocks sized to GPU memory bandwidth and using GQA to eliminate KV‑cache redundancy.
Sliding Window: maintains a 512‑token local context window to ensure syntactic coherence, with separate parameters to avoid local dominance.
These modules are fused via learnable gates, enabling continuous block‑wise memory access and balanced arithmetic intensity, which together break the memory‑wall limitation.
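To make the three‑branch design concrete, here is a minimal single‑query sketch in PyTorch. The block sizes (32‑token compression blocks with a 16‑token stride, 64‑token selection blocks, a 512‑token window) come from the description above; the mean‑pooling, block‑scoring, and gating details are simplifying assumptions, not DeepSeek's actual implementation.

```python
import torch

def attend(q, k, v):
    # Standard scaled dot-product attention for a single query vector.
    w = torch.softmax(k @ q / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

def nsa_decode_step(q, k, v, w_gate, top_n=4):
    L, d = k.shape

    # Branch 1 -- compressed attention: mean-pool keys/values over
    # 32-token blocks taken every 16 tokens, then attend to the pooled
    # sequence (a coarse, global view of the whole context).
    starts = range(0, L - 32 + 1, 16)
    k_cmp = torch.stack([k[s:s + 32].mean(0) for s in starts])
    v_cmp = torch.stack([v[s:s + 32].mean(0) for s in starts])
    o_cmp = attend(q, k_cmp, v_cmp)

    # Branch 2 -- selective attention: rank contiguous 64-token blocks
    # (here by a pooled-key score; NSA derives this from the compressed
    # branch) and run exact attention over the top-N blocks only.
    n_blocks = L // 64
    block_keys = k[: n_blocks * 64].reshape(n_blocks, 64, d).mean(1)
    top = (block_keys @ q).topk(min(top_n, n_blocks)).indices
    idx = torch.cat([torch.arange(b * 64, (b + 1) * 64) for b in top.tolist()])
    o_sel = attend(q, k[idx], v[idx])

    # Branch 3 -- sliding window: exact attention over the most recent
    # 512 tokens, preserving local syntactic coherence.
    o_win = attend(q, k[-512:], v[-512:])

    # Learnable gates fuse the three branches; a fixed random projection
    # stands in here for the trained gate network.
    g = torch.sigmoid(w_gate @ q)
    return g[0] * o_cmp + g[1] * o_sel + g[2] * o_win

# Toy usage: a 4,096-token context, one decode-step query.
d, L = 64, 4096
q, k, v = torch.randn(d), torch.randn(L, d), torch.randn(L, d)
out = nsa_decode_step(q, k, v, w_gate=torch.randn(3, d))
print(out.shape)  # torch.Size([64])
```

Under these settings each query touches roughly L/16 pooled tokens, top‑N × 64 selected tokens, and a 512‑token window rather than all L tokens, which is where the quoted latency and memory savings come from.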
NSA demonstrates that algorithmic efficiency can coexist with high accuracy, suggesting a shift from brute‑force scaling toward smarter compute paradigms.
xAI Grok‑3 on 200k H100 GPUs
On February 18, 2025, xAI unveiled Grok‑3, the first large model trained on a 200,000‑GPU H100 cluster. The model comes in a full‑scale version and a lightweight “mini” variant, featuring a new DeepSearch engine and chain‑of‑thought reasoning. Subscription tiers include an X‑Platform Premium+ plan and a standalone SuperGrok service ($30/month).
Benchmark: the first model to break 1400 Elo on the Chatbot Arena leaderboard, ranking first across all categories.
Task performance: top scores in coding, mathematics, creative writing, instruction following, and multi‑turn dialogue.
Mathematics: 93 points on the 2024 AIME, far ahead of DeepSeek‑V3 (39) and GPT‑4o (85).
Scientific reasoning: 75 points on GPQA, surpassing Gemini 2 Pro (68).
Coding: 57 points on LiveCodeBench (LCB), a 58% improvement over DeepSeek‑V3.
Training cost details: 200k H100 GPUs consumed 4 × 10⁸ GPU‑hours, equivalent to 12.8× the compute used for GPT‑4. Peak data‑center power reached 250 MW, powered by Tesla Megapack batteries and liquid‑cooling systems. Despite the massive scale, performance gains were under 10 %, raising questions about diminishing returns of pure scaling.
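A quick back‑of‑envelope check of those figures, assuming near‑full cluster utilization: 4 × 10⁸ GPU‑hours ÷ 200,000 GPUs = 2,000 hours, or roughly 83 days of continuous training per GPU.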
Karpathy’s early evaluation highlighted Grok‑3’s ability to generate complex game code comparable to OpenAI’s premium models, while noting occasional failures on emoji‑decoding puzzles.
KG²RAG: Knowledge‑Graph‑Guided Retrieval‑Augmented Generation
The KG²RAG framework augments standard Retrieval‑Augmented Generation (RAG) by injecting structured knowledge‑graph information to overcome three key limitations of traditional RAG: isolated and redundant document chunks, limited reasoning capability, and incoherent context stitching.
Offline Document Processing: documents are split into chunks, then entities and relations are extracted and linked into a knowledge graph, establishing factual connections between chunks.
KG‑Enhanced Chunk Retrieval: an initial semantic‑similarity search retrieves seed chunks; graph‑guided expansion then traverses the knowledge graph (e.g., via BFS) to fetch related chunks, increasing diversity and relevance.
KG‑Based Context Organization: a weighted undirected graph over the retrieved chunks is built, and a maximum‑spanning‑tree algorithm removes redundancy, yielding a coherent, self‑consistent passage for the LLM (see the sketch after this list).
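As a rough illustration of the retrieval and organization steps, the sketch below uses networkx. The toy graph, edge weights, and chunk names are invented; only the BFS‑expansion‑then‑maximum‑spanning‑tree flow follows the framework's description.

```python
import networkx as nx

# Offline step (stand-in): chunks linked by shared entities/relations,
# with edge weights standing in for association strength.
kg = nx.Graph()
kg.add_edge("c1", "c2", weight=0.9)  # e.g. both mention the same entity
kg.add_edge("c2", "c3", weight=0.7)
kg.add_edge("c1", "c3", weight=0.4)
kg.add_edge("c3", "c4", weight=0.8)
kg.add_edge("c2", "c4", weight=0.3)

def kg2rag_retrieve(graph, seeds, hops=1):
    # Graph-guided expansion: BFS out to `hops` edges from each seed
    # chunk, pulling in related chunks the vector search missed.
    retrieved = set(seeds)
    for s in seeds:
        reached = nx.single_source_shortest_path_length(graph, s, cutoff=hops)
        retrieved.update(reached)

    # Context organization: build the induced subgraph of retrieved
    # chunks, then keep its maximum spanning tree so every chunk stays
    # connected through its strongest factual links while weak,
    # redundant edges are dropped.
    sub = graph.subgraph(retrieved)
    mst = nx.maximum_spanning_tree(sub, weight="weight")

    # A DFS order over the tree yields one coherent passage for the LLM.
    return list(nx.dfs_preorder_nodes(mst, source=seeds[0]))

print(kg2rag_retrieve(kg, seeds=["c1"], hops=1))
# -> ['c1', 'c2', 'c3']: the weak c1-c3 edge is pruned, so the
#    context reads c1 -> c2 -> c3 along the strongest links.
```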
Experiments show that KG²RAG improves response quality by providing fact‑level relationships and better context continuity, enabling large language models to reason more effectively.
The authors argue that graph‑guided expansion and KG‑based organization constitute a novel direction for RAG research, offering a practical path to higher‑quality LLM outputs.
ZhongAn Tech Team
ZhongAn is China's first fully online insurance company. Through technological innovation, we aim to make insurance simpler, warmer, and more valuable. Powered by technology, we support more than 50 billion RMB in policies and serve 600 million users with smart, personalized solutions. Follow here for ZhongAn's hardcore tech content and article shares.