Tagged articles

43 articles

May 19, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

Recent open‑weight LLMs such as Gemma 4, Laguna XS.2, ZAYA1‑8B, and DeepSeek V4 introduce KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, and compressed attention mechanisms that dramatically reduce memory and compute overhead for very long contexts while preserving model quality.

KV sharing · LLM · architecture

0 likes · 25 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

Beyond Transformers: SubQ Achieves 12‑Million‑Token Context at Just 5% of Opus Cost

The SubQ model introduces Subquadratic Sparse Attention (SSA), a content‑dependent routing mechanism that reduces attention complexity to linear, enabling a 12‑million‑token context window with a 52.2× speedup and only 5% of Opus's cost, as demonstrated on MRCR v2, RULER, and SWE‑Bench benchmarks.

LLM · SubQ · long context

0 likes · 14 min read

AI Engineer Programming

May 4, 2026 · Artificial Intelligence

RAG in the Long-Context Era: Challenges, Benchmarks, and Context Engineering

The article analyzes how expanding LLM context windows to millions of tokens reshapes Retrieval‑Augmented Generation, detailing chunking trade‑offs, the limits of embedding retrieval, the U‑shaped attention distribution, benchmark results, and the emerging practice of Context Engineering for optimal end‑to‑end pipelines.

Benchmarking · Embedding Retrieval · LLM

0 likes · 10 min read

Machine Heart

Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Inference Acceleration · KV cache reduction · LCA

0 likes · 10 min read

ArcThink

Apr 27, 2026 · Artificial Intelligence

Why GPT‑5.5 Is a True Generational Leap: Deep Dive vs. Claude Opus 4.7

GPT‑5.5, the first fully retrained base model since GPT‑4.5, delivers an 11.7‑point jump on ARC‑AGI‑2, wins 9 of 10 shared benchmarks, shows superior agent and ultra‑long‑context performance, yet incurs higher latency and token pricing, while Claude Opus 4.7 excels on deep‑reasoning tasks, marking a multi‑pole era for frontier AI.

AI benchmarks · Claude Opus 4.7 · GPT-5.5

0 likes · 16 min read

ArcThink

Apr 27, 2026 · Artificial Intelligence

GPT-5.5 Deep Dive: What Makes This True Generational Leap Stand Out?

GPT‑5.5, the first fully retrained base model since GPT‑4.5, delivers an 11.7‑point jump on ARC‑AGI‑2, dramatic long‑context gains, and wins 9 of 10 shared benchmarks against GPT‑5.4, while a side‑by‑side comparison with Claude Opus 4.7 shows each model excelling in different domains, heralding a multi‑polar era for frontier AI.

Agent · Benchmark · Claude Opus 4.7

0 likes · 16 min read

CodeTrend

Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeek · FP4 · Mixture of Experts

0 likes · 5 min read

Architect's Tech Stack

Apr 25, 2026 · Artificial Intelligence

DeepSeek‑V4 Launch: 1.6 T Parameters, 1 M‑Token Context, Programming Skills Lead Open‑Source Rankings

DeepSeek released the V4 series—V4‑Pro (1.6 T total, 49 B active) and V4‑Flash (284 B total, 13 B active)—featuring three architectural upgrades, three inference modes, mixed‑precision FP4/FP8 weights, and benchmark results that place its programming ability at the top of open‑source models while supporting a million‑token context window.

AI Architecture · Benchmark · DeepSeek

0 likes · 5 min read

Shuge Unlimited

Apr 25, 2026 · Artificial Intelligence

DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source

DeepSeek V4, released shortly after GPT‑5.5, offers two models—V4‑Pro (1.6 T parameters) and V4‑Flash (284 B parameters)—that introduce a hybrid CSA/HCA attention architecture to enable efficient million‑token context, achieve dramatic FLOPs and KV savings, deliver competitive programming and agent benchmarks, and adopt a disruptive pricing strategy, while also exposing training‑stability tricks and highlighting both strengths and remaining gaps.

Benchmark · DeepSeek-V4 · LLM

0 likes · 25 min read

Design Hub

Apr 24, 2026 · Artificial Intelligence

When DeepSeek V4 Meets GPT‑5.5: How Workflows Are Splitting Apart

Two heavyweight LLMs launched on the same day—DeepSeek V4 emphasizing open, ultra‑long‑context, deployable foundations, and GPT‑5.5 pushing agentic, tool‑using execution—highlight a clear industry fork between owning work context and delegating task execution.

Agentic AI · DeepSeek · GPT-5.5

0 likes · 13 min read

Machine Heart

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unveiled: Dual Versions with 1M Token Context and New Mixed‑Attention Architecture

DeepSeek V4 launches two models—Flash and Pro—both supporting up to 1 million token context and 384 K output tokens, offering non‑thinking and thinking modes with a reasoning_effort parameter, and featuring mixed attention, manifold‑constrained hyperconnections, a Muon optimizer, massive training data, and up to 73% FLOPs reduction versus V3.

AI model · Cambricon · DeepSeek-V4

0 likes · 5 min read

Machine Learning Algorithms & Natural Language Processing

Apr 17, 2026 · Artificial Intelligence

Claude Opus 4.7’s Visual and Long‑Context Leap: Near‑Full Vision and 1M‑Token Tasks Redefine Knowledge Work

Claude Opus 4.7, announced as Anthropic’s most capable publicly available model, dramatically improves visual reasoning, long‑context task handling and instruction following, delivering up to a 2.4‑fold boost on benchmarks such as XBOW, SWE‑bench and structural biology, while also introducing new security guardrails and token‑usage costs.

AI benchmarks · Anthropic · Claude Opus 4.7

0 likes · 11 min read

Machine Heart

Apr 17, 2026 · Artificial Intelligence

Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context

Google Research introduces Memory Caching (MC), a technique that gives RNNs growing memory capacity, bridging the gap with Transformers to enable ultra‑long context processing while reducing memory demands, and demonstrates its effectiveness through extensive language‑modeling and recall experiments.

AI Architecture · Google Research · Memory Caching

0 likes · 7 min read

Baidu Intelligent Cloud Tech Hub

Apr 8, 2026 · Artificial Intelligence

Unlocking 8‑Hour Autonomous Coding: GLM‑5.1’s Leap with Kunlun XPU

The open‑source GLM‑5.1 model, adapted to Baidu Baige's Kunlun XPU via the vLLM‑Kunlun Plugin, delivers record‑breaking SWE‑bench scores, eight‑hour autonomous coding, long‑context handling up to 64K tokens, and scalable deployment across tens of thousands of chips, showcasing end‑to‑end AI acceleration.

GLM-5.1 · Kunlun XPU · Model Deployment

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

Mar 10, 2026 · Artificial Intelligence

How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes

InfLLM‑V2 introduces a dense‑sparse switchable attention framework that preserves the original dense‑attention parameters while enabling efficient long‑context training, matching full‑attention performance on benchmarks such as RULER, LongBench, and chain‑reasoning tasks, and delivering up to 2.3× end‑to‑end inference speedup without degrading short‑sequence abilities.

InfLLM-V2 · Transformer · dense-sparse attention

0 likes · 16 min read

SuanNi

Feb 26, 2026 · Artificial Intelligence

How Alibaba’s Qwen3.5 Series Redefines Efficient Large‑Model Design

Alibaba’s newly released Qwen3.5 series—spanning 27B, 35B, and 122B parameter models—demonstrates how hybrid compute, high‑quality data, and reinforcement‑learning can boost multimodal understanding, ultra‑long‑context handling, and multilingual support while drastically lowering hardware requirements, marking a shift from pure scaling to efficient AI evolution.

AI Architecture · Multimodal AI · long context

0 likes · 7 min read

PaperAgent

Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLM · Linear Attention · Model architecture

0 likes · 17 min read

AI Engineering

Feb 14, 2026 · Artificial Intelligence

DeepSeek‑V4‑Lite‑285B Hits 100% Recall in 256K Token Tests – A Needle‑in‑a‑Haystack Benchmark

Community testing of DeepSeek's rumored V4‑Lite‑285B model using the OpenAI MRCR 8‑needle test shows perfect 1.0000 scores on several 128K‑token samples and a 256K‑token sample, achieving 100% recall within the native 256K context, while scores at longer contexts drop to about 60%; the article also notes that the "needle‑in‑a‑haystack" method may be exploitable by DSA mechanisms.

DeepSeek · LLM · long context

0 likes · 3 min read

Machine Learning Algorithms & Natural Language Processing

Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models, achieving comparable accuracy while cutting compute and memory costs, enabling million‑token inference on a single RTX 5090 and delivering up to 3.5× speed‑up over Qwen3‑8B.

Hybrid Position Encoding · LLM efficiency · Linear Attention

0 likes · 18 min read

Data Party THU

Feb 4, 2026 · Artificial Intelligence

How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained

This article analyzes Sakana AI's three recent papers that challenge traditional Transformer long‑sequence handling by removing positional embeddings, reconstructing position awareness, and adding a fast‑weight external memory, showing how each approach improves ultra‑long text understanding.

Memory Mechanism · Positional Embedding · Transformer

0 likes · 12 min read

PaperAgent

Jan 6, 2026 · Artificial Intelligence

How Recursive Language Models Enable Unlimited Context for LLMs

Recursive Language Models (RLM) offer a cost‑effective alternative to expanding LLM context windows by storing prompts as variables and enabling recursive calls, allowing models to process over 100,000 tokens, with experiments showing superior performance and lower median costs compared to baseline approaches.

AI research · LLM scaling · Prompt engineering

0 likes · 5 min read

Data Party THU

Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes Q, K, V tensors to drastically reduce KV cache size and attention complexity, and demonstrates superior convergence, lower perplexity, and faster inference on long‑sequence tasks compared with existing attention variants.

KV cache · RoPE · Tensor Product Attention

0 likes · 11 min read

Architects' Tech Alliance

Sep 19, 2025 · Artificial Intelligence

Why Nvidia’s Rubin CPX GPU Could Revolutionize Long-Context AI Inference

Nvidia's Rubin CPX GPU, unveiled in September 2025, uses GDDR7 memory and a split‑stage architecture to dramatically boost token‑per‑second rates for long‑context inference, while its integration into third‑generation Oberon servers promises higher power density, improved ROI, and scalable data‑center deployments.

AI inference · Data center · GPU architecture

0 likes · 9 min read

Instant Consumer Technology Team

Sep 11, 2025 · Artificial Intelligence

How REFRAG Cuts LLM Decoding Time by 30×: A New Efficient RAG Framework

REFRAG (REpresentation For RAG) introduces a novel decoding framework that compresses, senses, and expands context using precomputed chunk embeddings, achieving up to 30.85× faster first-token generation and 16× larger context windows without sacrificing perplexity, as validated across diverse long‑context tasks.

LLM · RAG · chunk embeddings

0 likes · 18 min read

AI Algorithm Path

Aug 20, 2025 · Artificial Intelligence

DeepSeek V3.1 Open‑Source: Unlocking a New Era of Long‑Context AI

DeepSeek V3.1, a 685‑billion‑parameter open‑source model, supports up to 128,000 tokens, delivers mixed‑architecture capabilities, matches top‑tier closed systems in benchmarks, and its rapid community adoption signals a shift toward democratized AI development and new industry dynamics.

AI Performance · DeepSeek · large language model

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Jul 17, 2025 · Artificial Intelligence

How ChunkFlow Boosts Long-Context Model Training Up to 4.5× Faster

The paper "Efficient Long Context Fine-tuning with Chunk Flow" introduces ChunkFlow, a training framework that reorganizes variable‑length sequences into fixed‑size chunks, achieving up to 4.53× speedup and more stable GPU memory usage for large language models.

ChunkFlow · GPU Optimization · LLM training

0 likes · 7 min read

DataFunTalk

Jul 16, 2025 · Artificial Intelligence

MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context

MiniMax’s latest M1 model, unveiled after a $300 million funding round, showcases a 456‑billion‑parameter Mixture‑of‑Experts architecture with lightning attention, supports up to one million tokens of context, and leverages reinforcement‑learning techniques to enhance long‑context handling, inference efficiency, and System‑2 reasoning capabilities.

AI scaling · Model architecture · hybrid attention

0 likes · 16 min read

iQIYI Technical Product Team

Jul 3, 2025 · Artificial Intelligence

Three iQIYI AI Papers Break New Ground at ACL 2025 & INTERSPEECH 2025

iQIYI’s AI research team had three papers accepted: two at ACL 2025 (one main‑conference paper and one Findings paper) and one at INTERSPEECH 2025. The papers cover long‑context large language model evaluation, Chinese novel summarization, and efficient Thai speech recognition, and the article links to each work.

ACL 2025 · AI research · INTERSPEECH 2025

0 likes · 7 min read

AIWalker

Jun 18, 2025 · Artificial Intelligence

Six New Directions for Large Language Models

Large language models are booming, and this article highlights six cutting‑edge research directions: LLMs combined with synthetic data, reward modeling, inference techniques, LLM‑as‑a‑Judge, safety alignment, and long‑context handling. Each is illustrated with recent papers, experimental results, and links to code repositories.

Inference · LLM · Reward Modeling

0 likes · 9 min read

Baobao Algorithm Notes

Jun 6, 2025 · Artificial Intelligence

What AI Programming Agents Reveal About RL, Feedback Loops, and Long‑Context Challenges

In a deep dive into the Cursor team's podcast, core members dissect the current hurdles of AI programming agents, covering feedback‑mechanism design, reinforcement‑learning reward sparsity, tool‑chain integration, long‑context handling, and emerging attention mechanisms that shape the future of code‑centric AI.

AI programming · attention mechanisms · long context

0 likes · 35 min read

AntTech

Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI research · document understanding · large language models

0 likes · 6 min read

21CTO

Apr 7, 2025 · Artificial Intelligence

Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities

Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models—featuring Mixture‑of‑Experts architectures, unprecedented 10‑million‑token context windows, and state‑of‑the‑art performance across vision, language, and multimodal benchmarks—while emphasizing efficient training, open‑source availability, and robust safety safeguards.

AI Safety · Llama 4 · Mixture of Experts

0 likes · 14 min read

DataFunTalk

Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick and Behemoth—featuring multimodal capabilities, Mixture‑of‑Experts design, up to 10 million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4 · Mixture of Experts · Model Training

0 likes · 14 min read

Fighter's World

Apr 5, 2025 · Artificial Intelligence

Is Gemini 2.5 Pro the Turning Point for Google’s AI Strategy?

The article analyses Google’s Gemini 2.5 Pro as a decisive shift toward a “Reasoning Model”, detailing its architectural focus on inference, benchmark breakthroughs such as Humanity’s Last Exam and GPQA Diamond, long‑context capability, multimodal strengths, Vibe‑coding experience, and the roadmap for future Gemini models.

AI strategy · Benchmark · Gemini 2.5 Pro

0 likes · 25 min read

Architect

Feb 24, 2025 · Artificial Intelligence

Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

AI Architecture · Context Parallel · LLM training

0 likes · 13 min read

Architecture Digest

Feb 24, 2025 · Artificial Intelligence

MoBA: Mixture of Block Attention for Long‑Context Large Language Models

The article introduces MoBA, a Mixture‑of‑Block‑Attention mechanism that applies Mixture‑of‑Experts principles to transformer attention, enabling efficient long‑context processing for large language models while maintaining performance comparable to full attention through sparse, trainable block selection and seamless switching.

Attention Mechanism · LLM · Mixture of Experts

0 likes · 12 min read

Bilibili Tech

Sep 18, 2024 · Artificial Intelligence

Index-1.9B-32K: A 2% GPT-Size Model with Powerful Long-Context Capabilities

Index-1.9B-32K is a 1.9B-parameter model with a 32K token context window, achieving strong long‑text performance comparable to larger models while using only about 2% of GPT‑4’s compute, trained via long pre‑training and supervised fine‑tuning, with a trade‑off of reduced short‑context ability.

AI · Fine-tuning · evaluation

0 likes · 12 min read

NewBeeNLP

Aug 3, 2024 · Artificial Intelligence

Extending LLM Context to 1M Tokens: SAMBA, CoPE, RoPE, Retrieval Heads & Infini‑Attention

This article reviews recent research on extending large language model context windows to millions of tokens, covering SAMBA's hybrid architecture, Contextual Position Encoding (CoPE), RoPE base length theory, Retrieval Head analysis, and the memory‑efficient Infini‑Attention mechanism.

LLM research · efficient attention · large language models

0 likes · 10 min read

Baobao Algorithm Notes

May 31, 2024 · Industry Insights

Do Scaling Laws Still Hold? Deep Dive into Synthetic Data, New Model Architectures, and Long‑Context Solutions

In a May 15 round‑table, experts debated the validity of scaling laws, the role of synthetic and semi‑synthetic data in overcoming data bottlenecks, explored alternatives to the Transformer such as RNN‑based and hybrid designs, evaluated the practicality of Mixture‑of‑Experts models, and examined two main strategies—KV‑cache compression and input‑context reduction—to enable truly long‑context processing.

Mixture of Experts · long context

0 likes · 13 min read

AI Large Model Application Practice

May 3, 2024 · Artificial Intelligence

Can Giant Context LLMs Replace RAG? Exploring the Limits of Long‑Context Retrieval

This article examines whether the rapid growth of large‑language‑model context windows can eliminate the need for retrieval‑augmented generation, presenting experimental needle‑in‑a‑haystack tests, analysis of model performance across token lengths and needle positions, and practical guidance using an open‑source evaluation tool.

AI · LLM · Needle-in-a-Haystack

0 likes · 13 min read

Java Tech Enthusiast

Feb 16, 2024 · Artificial Intelligence

Google's Gemini 1.5: Breakthrough in Long-Context Understanding and Multimodal Capabilities

Google’s Gemini 1.5, a new multimodal Mixture‑of‑Experts model, supports up to a million‑token context (10 million internally), can understand text, video, audio and code, learns a new language from a single prompt, and is already being used by Samsung, Jasper and Quora, positioning it as a direct challenger to OpenAI’s flagship models.

Gemini 1.5 · Google AI · LLM

0 likes · 7 min read

Alibaba Cloud Big Data AI Platform

Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, pushes the limits of LLMs by supporting ultra‑long context lengths up to 70 K tokens, leveraging novel RaggedAttention and a DNN‑based AutoTuner to deliver superior performance, memory efficiency, and low‑latency inference across diverse workloads.

AI Infrastructure · AutoTuner · LLM inference

0 likes · 11 min read

ITPUB

Mar 22, 2023 · Artificial Intelligence

What Can GPT‑4 Do? Vision, Long Memory, Safer AI and More

OpenAI’s GPT‑4 arrives with multimodal vision, a dramatically longer context window, higher exam scores, Socratic prompting, improved safety, and new partnerships, while still in research mode and subject to bias and code‑trust limitations.

AI Safety · GPT-4 · large language model

0 likes · 7 min read
