Tagged articles
43 articles
Page 1 of 1
Machine Heart
Machine Heart
May 19, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

Recent open‑weight LLMs such as Gemma 4, Laguna XS.2, ZAYA1‑8B, and DeepSeek V4 introduce KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, and compressed attention mechanisms that dramatically reduce memory and compute overhead for very long contexts while preserving model quality.

KV sharingLLMarchitecture
0 likes · 25 min read
How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs
AI Engineer Programming
AI Engineer Programming
May 4, 2026 · Artificial Intelligence

RAG in the Long-Context Era: Challenges, Benchmarks, and Context Engineering

The article analyzes how expanding LLM context windows to millions of tokens reshape Retrieval‑Augmented Generation, detailing chunking trade‑offs, embedding retrieval limits, attention U‑shaped distribution, benchmark results, and the emerging practice of Context Engineering for optimal end‑to‑end pipelines.

BenchmarkingEmbedding RetrievalLLM
0 likes · 10 min read
RAG in the Long-Context Era: Challenges, Benchmarks, and Context Engineering
Machine Heart
Machine Heart
Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Inference AccelerationKV cache reductionLCA
0 likes · 10 min read
LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction
ArcThink
ArcThink
Apr 27, 2026 · Artificial Intelligence

Why GPT‑5.5 Is a True Generational Leap: Deep Dive vs. Claude Opus 4.7

GPT‑5.5, the first fully retrained base model since GPT‑4.5, delivers an 11.7‑point jump on ARC‑AGI‑2, wins 9 of 10 shared benchmarks, shows superior agent and ultra‑long‑context performance, yet incurs higher latency and token pricing, while Claude Opus 4.7 excels on deep‑reasoning tasks, marking a multi‑pole era for frontier AI.

AI benchmarksClaude Opus 4.7GPT-5.5
0 likes · 16 min read
Why GPT‑5.5 Is a True Generational Leap: Deep Dive vs. Claude Opus 4.7
ArcThink
ArcThink
Apr 27, 2026 · Artificial Intelligence

GPT-5.5 Deep Dive: What Makes This True Generational Leap Stand Out?

GPT‑5.5, the first fully retrained base model since GPT‑4.5, delivers an 11.7‑point jump on ARC‑AGI‑2, dramatic long‑context gains, and wins 9 of 10 shared benchmarks against GPT‑5.4, while a side‑by‑side comparison with Claude Opus 4.7 shows each model excelling in different domains, heralding a multi‑polar era for frontier AI.

AgentBenchmarkClaude Opus 4.7
0 likes · 16 min read
GPT-5.5 Deep Dive: What Makes This True Generational Leap Stand Out?
CodeTrend
CodeTrend
Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeekFP4Mixture of Experts
0 likes · 5 min read
DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design
Architect's Tech Stack
Architect's Tech Stack
Apr 25, 2026 · Artificial Intelligence

DeepSeek‑V4 Launch: 1.6 T Parameters, 1 M‑Token Context, Programming Skills Lead Open‑Source Rankings

DeepSeek released the V4 series—V4‑Pro (1.6 T total, 49 B active) and V4‑Flash (284 B total, 13 B active)—featuring three architectural upgrades, three inference modes, mixed‑precision FP4/FP8 weights, and benchmark results that place its programming ability at the top of open‑source models while supporting a million‑token context window.

AI ArchitectureBenchmarkDeepSeek
0 likes · 5 min read
DeepSeek‑V4 Launch: 1.6 T Parameters, 1 M‑Token Context, Programming Skills Lead Open‑Source Rankings
Shuge Unlimited
Shuge Unlimited
Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source

DeepSeek V4, released shortly after GPT‑5.5, offers two models—V4‑Pro (1.6 T parameters) and V4‑Flash (284 B parameters)—that introduce a hybrid CSA/HCA attention architecture to enable efficient million‑token context, achieve dramatic FLOPs and KV savings, deliver competitive programming and agent benchmarks, and adopt a disruptive pricing strategy, while also exposing training‑stability tricks and highlighting both strengths and remaining gaps.

BenchmarkDeepSeek-V4LLM
0 likes · 25 min read
DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source
Design Hub
Design Hub
Apr 24, 2026 · Artificial Intelligence

When DeepSeek V4 Meets GPT‑5.5: How Workflows Are Splitting Apart

Two heavyweight LLMs launched on the same day—DeepSeek V4 emphasizing open, ultra‑long‑context, deployable foundations, and GPT‑5.5 pushing agentic, tool‑using execution—highlight a clear industry fork between owning work context and delegating task execution.

Agentic AIDeepSeekGPT-5.5
0 likes · 13 min read
When DeepSeek V4 Meets GPT‑5.5: How Workflows Are Splitting Apart
Machine Heart
Machine Heart
Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unveiled: Dual Versions with 1M Token Context and New Mixed‑Attention Architecture

DeepSeek V4 launches two models—Flash and Pro—both supporting up to 1 million token context and 384 K output tokens, offering non‑thinking and thinking modes with a reasoning_effort parameter, and featuring mixed attention, manifold‑constrained hyperconnections, a Muon optimizer, massive training data, and up to 73% FLOPs reduction versus V3.

AI modelCambriconDeepSeek-V4
0 likes · 5 min read
DeepSeek V4 Unveiled: Dual Versions with 1M Token Context and New Mixed‑Attention Architecture
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Apr 17, 2026 · Artificial Intelligence

Claude Opus 4.7’s Visual and Long‑Context Leap: Near‑Full Vision and 1M‑Token Tasks Redefine Knowledge Work

Claude Opus 4.7, announced as Anthropic’s most capable publicly available model, dramatically improves visual reasoning, long‑context task handling and instruction following, delivering up to a 2.4‑fold boost on benchmarks such as XBOW, SWE‑bench and structural biology, while also introducing new security guardrails and token‑usage costs.

AI benchmarksAnthropicClaude Opus 4.7
0 likes · 11 min read
Claude Opus 4.7’s Visual and Long‑Context Leap: Near‑Full Vision and 1M‑Token Tasks Redefine Knowledge Work
Machine Heart
Machine Heart
Apr 17, 2026 · Artificial Intelligence

Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context

Google Research introduces Memory Caching (MC), a technique that gives RNNs growing memory capacity, bridging the gap with Transformers to enable ultra‑long context processing while reducing memory demands, and demonstrates its effectiveness through extensive language‑modeling and recall experiments.

AI ArchitectureGoogle ResearchMemory Caching
0 likes · 7 min read
Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Apr 8, 2026 · Artificial Intelligence

Unlocking 8‑Hour Autonomous Coding: GLM‑5.1’s Leap with Kunlun XPU

The open‑source GLM‑5.1 model, adapted to Baidu Baige's Kunlun XPU via the vLLM‑Kunlun Plugin, delivers record‑breaking SWE‑bench scores, eight‑hour autonomous coding, long‑context handling up to 64K tokens, and scalable deployment across tens of thousands of chips, showcasing end‑to‑end AI acceleration.

GLM-5.1Kunlun XPUModel Deployment
0 likes · 8 min read
Unlocking 8‑Hour Autonomous Coding: GLM‑5.1’s Leap with Kunlun XPU
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 10, 2026 · Artificial Intelligence

How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes

InfLLM‑V2 introduces a dense‑sparse switchable attention framework that preserves the original dense‑attention parameters while enabling efficient long‑context training, matching full‑attention performance on benchmarks such as RULER, LongBench, and chain‑reasoning tasks, and delivering up to 2.3× end‑to‑end inference speedup without degrading short‑sequence abilities.

InfLLM-V2Transformerdense-sparse attention
0 likes · 16 min read
How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes
SuanNi
SuanNi
Feb 26, 2026 · Artificial Intelligence

How Alibaba’s Qwen3.5 Series Redefines Efficient Large‑Model Design

Alibaba’s newly released Qwen3.5 series—spanning 27B, 35B, and 122B parameter models—demonstrates how hybrid compute, high‑quality data, and reinforcement‑learning can boost multimodal understanding, ultra‑long‑context handling, and multilingual support while drastically lowering hardware requirements, marking a shift from pure scaling to efficient AI evolution.

AI ArchitectureMultimodal AIlong context
0 likes · 7 min read
How Alibaba’s Qwen3.5 Series Redefines Efficient Large‑Model Design
PaperAgent
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLMLinear AttentionModel architecture
0 likes · 17 min read
How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits
AI Engineering
AI Engineering
Feb 14, 2026 · Artificial Intelligence

DeepSeek‑V4‑Lite‑285B Hits 100% Recall in 256K Token Tests – A Needle‑in‑a‑Haystack Benchmark

Community testing of DeepSeek's rumored V4‑Lite‑285B model using the OpenAI MRCR 8‑pin standard shows perfect 1.0000 scores on several 128K‑token samples and a 256K‑token sample, achieving 100% recall in native 256K context while longer contexts drop to about 60%, with a note that the "needle‑in‑a‑haystack" method may be exploitable by DSA mechanisms.

DeepSeekLLMlong context
0 likes · 3 min read
DeepSeek‑V4‑Lite‑285B Hits 100% Recall in 256K Token Tests – A Needle‑in‑a‑Haystack Benchmark
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models, achieving comparable accuracy while cutting compute and memory costs, enabling million‑token inference on a single RTX 5090 and delivering up to 3.5× speed‑up over Qwen3‑8B.

Hybrid Position EncodingLLM efficiencyLinear Attention
0 likes · 18 min read
Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090
Data Party THU
Data Party THU
Feb 4, 2026 · Artificial Intelligence

How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained

This article analyzes Sakana AI's three recent papers that challenge traditional Transformer long‑sequence handling by removing positional embeddings, reconstructing position awareness, and adding a fast‑weight external memory, showing how each approach improves ultra‑long text understanding.

Memory MechanismPositional EmbeddingTransformer
0 likes · 12 min read
How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained
PaperAgent
PaperAgent
Jan 6, 2026 · Artificial Intelligence

How Recursive Language Models Enable Unlimited Context for LLMs

Recursive Language Models (RLM) offer a cost‑effective alternative to expanding LLM context windows by storing prompts as variables and enabling recursive calls, allowing models to process over 100,000 tokens, with experiments showing superior performance and lower median costs compared to baseline approaches.

AI researchLLM scalingPrompt engineering
0 likes · 5 min read
How Recursive Language Models Enable Unlimited Context for LLMs
Data Party THU
Data Party THU
Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes Q, K, V tensors to drastically reduce KV cache size and attention complexity, and demonstrates superior convergence, lower perplexity, and faster inference on long‑sequence tasks compared with existing attention variants.

KV cacheRoPETensor Product Attention
0 likes · 11 min read
How Tensor Product Attention Redefines Long‑Context Transformers
Architects' Tech Alliance
Architects' Tech Alliance
Sep 19, 2025 · Artificial Intelligence

Why Nvidia’s Rubin CPX GPU Could Revolutionize Long-Context AI Inference

Nvidia's Rubin CPX GPU, unveiled in September 2025, uses GDDR7 memory and a split‑stage architecture to dramatically boost token‑per‑second rates for long‑context inference, while its integration into third‑generation Oberon servers promises higher power density, improved ROI, and scalable data‑center deployments.

AI inferenceData centerGPU architecture
0 likes · 9 min read
Why Nvidia’s Rubin CPX GPU Could Revolutionize Long-Context AI Inference
Instant Consumer Technology Team
Instant Consumer Technology Team
Sep 11, 2025 · Artificial Intelligence

How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework

REFRAG (REpresentation For RAG) introduces a novel decoding framework that compresses, senses, and expands context using precomputed chunk embeddings, achieving up to 30.85× faster first-token generation and 16× larger context windows without sacrificing perplexity, as validated across diverse long‑context tasks.

LLMRAGchunk embeddings
0 likes · 18 min read
How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework
AI Algorithm Path
AI Algorithm Path
Aug 20, 2025 · Artificial Intelligence

DeepSeek V3.1 Open‑Source: Unlocking a New Era of Long‑Context AI

DeepSeek V3.1, a 685‑billion‑parameter open‑source model, supports up to 128,000 tokens, delivers mixed‑architecture capabilities, matches top‑tier closed systems in benchmarks, and its rapid community adoption signals a shift toward democratized AI development and new industry dynamics.

AI PerformanceDeepSeeklarge language model
0 likes · 6 min read
DeepSeek V3.1 Open‑Source: Unlocking a New Era of Long‑Context AI
DataFunTalk
DataFunTalk
Jul 16, 2025 · Artificial Intelligence

MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context

MiniMax’s latest M1 model, unveiled after a $300 million funding round, showcases a 4.56‑trillion‑parameter hybrid‑expert architecture with lightning attention, supporting up to one million tokens, and leverages reinforcement‑learning techniques to enhance long‑context handling, inference efficiency, and system‑2 reasoning capabilities.

AI scalingModel architecturehybrid attention
0 likes · 16 min read
MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context
iQIYI Technical Product Team
iQIYI Technical Product Team
Jul 3, 2025 · Artificial Intelligence

Three iQIYI AI Papers Break New Ground at ACL 2025 & INTERSPEECH 2025

iQIYI’s AI research team secured three paper acceptances—two at ACL 2025 (including a main conference and a Findings paper) and one at INTERSPEECH 2025—covering long‑context large language model evaluation, Chinese novel summarization, and efficient Thai speech recognition, with links to each work.

ACL 2025AI researchINTERSPEECH 2025
0 likes · 7 min read
Three iQIYI AI Papers Break New Ground at ACL 2025 & INTERSPEECH 2025
AIWalker
AIWalker
Jun 18, 2025 · Artificial Intelligence

Six New Directions for Large Language Models

Large language models are booming, and this article highlights six cutting‑edge research directions—LLM‑plus synthetic data, reward modeling, inference techniques, LLM‑as‑a‑Judge, safety alignment, and long‑context handling—each illustrated with recent papers, experimental results, and links to code repositories.

InferenceLLMReward Modeling
0 likes · 9 min read
Six New Directions for Large Language Models
Baobao Algorithm Notes
Baobao Algorithm Notes
Jun 6, 2025 · Artificial Intelligence

What AI Programming Agents Reveal About RL, Feedback Loops, and Long‑Context Challenges

In a deep dive into the Cursor team's podcast, core members dissect the current hurdles of AI programming agents, covering feedback‑mechanism design, reinforcement‑learning reward sparsity, tool‑chain integration, long‑context handling, and emerging attention mechanisms that shape the future of code‑centric AI.

AI programmingattention mechanismslong context
0 likes · 35 min read
What AI Programming Agents Reveal About RL, Feedback Loops, and Long‑Context Challenges
AntTech
AntTech
Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI researchdocument understandinglarge language models
0 likes · 6 min read
Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase
21CTO
21CTO
Apr 7, 2025 · Artificial Intelligence

Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities

Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models—featuring Mixture‑of‑Experts architectures, unprecedented 10‑million‑token context windows, and state‑of‑the‑art performance across vision, language, and multimodal benchmarks—while emphasizing efficient training, open‑source availability, and robust safety safeguards.

AI SafetyLlama 4Mixture of Experts
0 likes · 14 min read
Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities
DataFunTalk
DataFunTalk
Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick and Behemoth—featuring multimodal capabilities, Mixture‑of‑Experts design, up to 10 million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4Mixture of ExpertsModel Training
0 likes · 14 min read
Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context
Fighter's World
Fighter's World
Apr 5, 2025 · Artificial Intelligence

Is Gemini 2.5 Pro the Turning Point for Google’s AI Strategy?

The article analyses Google’s Gemini 2.5 Pro as a decisive shift toward a “Reasoning Model”, detailing its architectural focus on inference, benchmark breakthroughs such as Humanity’s Last Exam and GPQA Diamond, long‑context capability, multimodal strengths, Vibe‑coding experience, and the roadmap for future Gemini models.

AI strategyBenchmarkGemini 2.5 Pro
0 likes · 25 min read
Is Gemini 2.5 Pro the Turning Point for Google’s AI Strategy?
Architect
Architect
Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI ArchitectureContext ParallelLLM training
0 likes · 13 min read
Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts
Architecture Digest
Architecture Digest
Feb 24, 2025 · Artificial Intelligence

MoBA: Mixture of Block Attention for Long‑Context Large Language Models

The article introduces MoBA, a Mixture‑of‑Block‑Attention mechanism that applies Mixture‑of‑Experts principles to transformer attention, enabling efficient long‑context processing for large language models while maintaining performance comparable to full attention through sparse, trainable block selection and seamless switching.

Attention MechanismLLMMixture of Experts
0 likes · 12 min read
MoBA: Mixture of Block Attention for Long‑Context Large Language Models
Bilibili Tech
Bilibili Tech
Sep 18, 2024 · Artificial Intelligence

Index-1.9B-32K: A 2% GPT-Size Model with Powerful Long-Context Capabilities

Index-1.9B-32K is a 1.9B-parameter model with a 32K token context window, achieving strong long‑text performance comparable to larger models while using only about 2% of GPT‑4’s compute, trained via long pre‑training and supervised fine‑tuning, with a trade‑off of reduced short‑context ability.

AIFine-tuningevaluation
0 likes · 12 min read
Index-1.9B-32K: A 2% GPT-Size Model with Powerful Long-Context Capabilities
NewBeeNLP
NewBeeNLP
Aug 3, 2024 · Artificial Intelligence

Extending LLM Context to 1M Tokens: SAMBA, CoPE, RoPE, Retrieval Heads & Infini‑Attention

This article reviews recent research on extending large language model context windows to millions of tokens, covering SAMBA's hybrid architecture, Contextual Position Encoding (CoPE), RoPE base length theory, Retrieval Head analysis, and the memory‑efficient Infini‑Attention mechanism.

LLM researchefficient attentionlarge language models
0 likes · 10 min read
Extending LLM Context to 1M Tokens: SAMBA, CoPE, RoPE, Retrieval Heads & Infini‑Attention
Baobao Algorithm Notes
Baobao Algorithm Notes
May 31, 2024 · Industry Insights

Do Scaling Laws Still Hold? Deep Dive into Synthetic Data, New Model Architectures, and Long‑Context Solutions

In a May 15 round‑table, experts debated the validity of scaling laws, the role of synthetic and semi‑synthetic data in overcoming data bottlenecks, explored alternatives to the Transformer such as RNN‑based and hybrid designs, evaluated the practicality of Mixture‑of‑Experts models, and examined two main strategies—KV‑cache compression and input‑context reduction—to enable truly long‑context processing.

Mixture of Expertslong context
0 likes · 13 min read
Do Scaling Laws Still Hold? Deep Dive into Synthetic Data, New Model Architectures, and Long‑Context Solutions
AI Large Model Application Practice
AI Large Model Application Practice
May 3, 2024 · Artificial Intelligence

Can Giant Context LLMs Replace RAG? Exploring the Limits of Long‑Context Retrieval

This article examines whether the rapid growth of large‑language‑model context windows can eliminate the need for retrieval‑augmented generation, presenting experimental needle‑in‑a‑haystack tests, analysis of model performance across token lengths and needle positions, and practical guidance using an open‑source evaluation tool.

AILLMNeedle-in-a-Haystack
0 likes · 13 min read
Can Giant Context LLMs Replace RAG? Exploring the Limits of Long‑Context Retrieval
Java Tech Enthusiast
Java Tech Enthusiast
Feb 16, 2024 · Artificial Intelligence

Google's Gemini 1.5: Breakthrough in Long-Context Understanding and Multimodal Capabilities

Google’s Gemini 1.5, a new multimodal Mixture‑of‑Experts model, supports up to a million‑token context (10 million internally), can understand text, video, audio and code, learns a new language from a single prompt, and is already being used by Samsung, Jasper and Quora, positioning it as a direct challenger to OpenAI’s flagship models.

Gemini 1.5Google AILLM
0 likes · 7 min read
Google's Gemini 1.5: Breakthrough in Long-Context Understanding and Multimodal Capabilities
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, pushes the limits of LLMs by supporting ultra‑long context lengths up to 70 K tokens, leveraging novel RaggedAttention and a DNN‑based AutoTuner to deliver superior performance, memory efficiency, and low‑latency inference across diverse workloads.

AI InfrastructureAutoTunerLLM inference
0 likes · 11 min read
BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner
ITPUB
ITPUB
Mar 22, 2023 · Artificial Intelligence

What Can GPT‑4 Do? Vision, Long Memory, Safer AI and More

OpenAI’s GPT‑4 arrives with multimodal vision, a dramatically longer context window, higher exam scores, Socratic prompting, improved safety, and new partnerships, while still in research mode and subject to bias and code‑trust limitations.

AI SafetyGPT-4large language model
0 likes · 7 min read
What Can GPT‑4 Do? Vision, Long Memory, Safer AI and More