Geek Labs
Apr 10, 2026 · Artificial Intelligence

Boost AI Smarts and Cut Costs with Open‑Source Memory and Compression Tools

The article analyzes why AI chats are costly (the full conversation context is resent on every turn) and presents two open‑source projects, mempalace and caveman, which together pair a large‑scale memory system with aggressive token compression, sharply cutting token usage and cost while preserving reasoning ability.
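To make the pairing concrete, here is a minimal Python sketch of the pattern the two projects embody: retrieve only the relevant memories and compress them before prompting, instead of replaying the whole chat. Every class and function below is illustrative; none of it is the actual mempalace or caveman API.

```python
# Illustrative sketch only: these are NOT the mempalace or caveman APIs.
from collections import Counter
import math

class MemoryStore:
    """Toy long-term memory: keeps snippets, retrieves by bag-of-words overlap."""
    def __init__(self):
        self.snippets = []

    def add(self, text: str) -> None:
        self.snippets.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = Counter(query.lower().split())
        def score(s: str) -> float:
            c = Counter(s.lower().split())
            return sum((q & c).values()) / math.sqrt(len(c) + 1)
        return sorted(self.snippets, key=score, reverse=True)[:k]

def compress(text: str, budget: int = 32) -> str:
    """Crude stand-in for aggressive token compression: drop filler words
    and truncate to a word budget (real compressors preserve far more)."""
    stop = {"the", "a", "an", "of", "to", "and", "is", "it", "that"}
    kept = [w for w in text.split() if w.lower() not in stop]
    return " ".join(kept[:budget])

store = MemoryStore()
store.add("User prefers concise answers and Python examples.")
store.add("Project uses PostgreSQL 16 with pgvector for embeddings.")

query = "Which database does the project use?"
context = " | ".join(compress(s) for s in store.retrieve(query))
prompt = f"Context: {context}\nQuestion: {query}"
print(prompt)  # far fewer tokens than replaying the whole chat history
```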

AI memory · LLM efficiency · caveman
7 min read
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models at comparable accuracy while cutting compute and memory costs. The design enables million‑token inference on a single RTX 5090 and delivers up to a 3.5× speed‑up over Qwen3‑8B.
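As a rough illustration of what "sparse‑linear hybrid" means, the numpy sketch below combines a sliding‑window (sparse) head with a linear‑attention head. The window size, feature map, and head layout are assumptions for exposition, not SALA's actual design.

```python
# Toy sparse + linear hybrid attention; shapes and phi are illustrative.
import numpy as np

def local_attention(q, k, v, window=4):
    """Sparse head: each position attends only to the last `window` keys."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ v[lo:i + 1]
    return out

def linear_attention(q, k, v):
    """Linear head: phi(q) @ (phi(k)^T v), O(n*d^2) instead of O(n^2*d)."""
    phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                              # (d, d) summary of all keys
    z = qf @ kf.sum(axis=0)                    # normalizer per query
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(3, n, d))
# Hybrid: concatenate a sparse local head with a linear global head.
hybrid = np.concatenate([local_attention(q, k, v), linear_attention(q, k, v)], axis=-1)
print(hybrid.shape)  # (16, 16)
```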

Hybrid Position Encoding · LLM efficiency · Linear Attention
18 min read
Tencent Technical Engineering
Nov 10, 2025 · Artificial Intelligence

How Large Language Models Evolved in 2025: From DeepSeek to Kimi‑K2 and Beyond

This article maps the rapid evolution of open‑source large language models in 2025, explains underlying architectural breakthroughs such as MLA, MoE, and NSA, and compares dozens of models, including DeepSeek‑V3, OLMo2, Gemma3, Llama4, Qwen3, and Kimi‑K2. It also highlights the emergence of powerful AI assistants such as Dola, giving developers a concise technical roadmap.
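Of the breakthroughs named above, Mixture of Experts is the easiest to show in a few lines: a gate picks the top‑k experts per token, so only a fraction of the parameters is active. The sketch below is a generic top‑k routing toy, not any specific model's implementation.

```python
# Generic top-k MoE routing toy; sizes and gating details are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is just a small feed-forward weight matrix here.
experts = rng.normal(size=(n_experts, d_model, d_model))
gate_w = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ gate_w
    idx = np.argsort(logits)[-top_k:]            # chosen experts
    weights = np.exp(logits[idx])
    weights /= weights.sum()                     # softmax over chosen only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, idx))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (8,); only 2 of 4 experts ran
```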

AI Assistant · LLM efficiency · Mixture of Experts
44 min read
Xiaohe Frontend Team
Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework uses a lightweight encoder to compress retrieved text into semantic tags, letting large language models answer queries with far fewer tokens, lower latency, and higher throughput. Core meaning is preserved, and the compressed information can be placed flexibly within prompts.
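The core trick can be sketched in a few lines: if each retrieved chunk collapses to one embedding, a 64‑token passage costs a single input position. The mean‑pool "encoder" and all sizes below are placeholders, not Meta's actual REFRAG components.

```python
# Placeholder sketch of compressing retrieved chunks into single embeddings.
import numpy as np

rng = np.random.default_rng(0)
d_model, chunk_len = 16, 64

def tiny_encoder(chunk_token_embs: np.ndarray) -> np.ndarray:
    """Stand-in for the lightweight encoder: pool a chunk into one vector."""
    return chunk_token_embs.mean(axis=0)

# Three retrieved passages, each 64 tokens of d_model-dim embeddings.
retrieved = [rng.normal(size=(chunk_len, d_model)) for _ in range(3)]
compressed = np.stack([tiny_encoder(c) for c in retrieved])   # (3, d_model)

query_embs = rng.normal(size=(12, d_model))                   # 12-token query
# Flexible placement: compressed chunk vectors are spliced into the
# sequence wherever the prompt template wants them.
llm_input = np.concatenate([compressed, query_embs], axis=0)
print(llm_input.shape)  # (15, d_model) vs. (204, d_model) uncompressed
```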

LLM efficiency · RAG · model compression
8 min read
AI Frontier Lectures
Jun 20, 2025 · Artificial Intelligence

How GCA Achieves 1000× Length Generalization in Large Language Models

Ant Research introduces GCA, a causal, retrieval‑based grouped cross‑attention mechanism that learns end‑to‑end to fetch relevant past chunks. It achieves over 1000× length generalization on long‑context language modeling tasks with near‑constant inference memory and linear training cost.
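A toy rendering of the retrieval‑plus‑cross‑attention loop: summarize each past chunk with one key, score the summaries against the query, and attend only inside the top‑k retrieved chunks. All shapes and the pooled chunk summaries below are illustrative, not GCA's trained retriever.

```python
# Toy chunk-retrieval cross-attention; all details are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, chunk_size, n_chunks, top_k = 8, 16, 64, 2   # 1024 tokens of "past"

past_k = rng.normal(size=(n_chunks, chunk_size, d))
past_v = rng.normal(size=(n_chunks, chunk_size, d))
chunk_keys = past_k.mean(axis=1)                 # one summary key per chunk

def gca_step(q):
    """Retrieve top-k chunks for query q, then attend within them only."""
    picked = np.argsort(q @ chunk_keys.T)[-top_k:]
    k = past_k[picked].reshape(-1, d)            # (top_k*chunk_size, d)
    v = past_v[picked].reshape(-1, d)
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ v

q = rng.normal(size=d)
out = gca_step(q)
print(out.shape)  # (8,); attended over 32 tokens, not all 1024
```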

AI research · Grouped Cross Attention · LLM efficiency
11 min read