Geek Labs
Apr 10, 2026 · Artificial Intelligence

Boost AI Smarts and Cut Costs with Open‑Source Memory and Compression Tools

The article analyzes why AI chats are costly (the full conversation context is resent on every turn) and presents two open‑source projects, mempalace and caveman, which together pair a large‑scale memory system with aggressive token compression, sharply cutting token usage and cost while preserving reasoning ability.
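To make the pairing concrete, here is a minimal Python sketch of the pattern the two projects embody: retrieve only the relevant memories and compress them before prompting, instead of replaying the whole chat. Every class and function below is illustrative; none of it is the actual mempalace or caveman API.

```python
# Illustrative sketch only: these are NOT the mempalace or caveman APIs.
from collections import Counter
import math

class MemoryStore:
    """Toy long-term memory: keeps snippets, retrieves by bag-of-words overlap."""
    def __init__(self):
        self.snippets = []

    def add(self, text: str) -> None:
        self.snippets.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = Counter(query.lower().split())
        def score(s: str) -> float:
            c = Counter(s.lower().split())
            return sum((q & c).values()) / math.sqrt(len(c) + 1)
        return sorted(self.snippets, key=score, reverse=True)[:k]

def compress(text: str, budget: int = 32) -> str:
    """Crude stand-in for aggressive token compression: drop filler words
    and truncate to a word budget (real compressors preserve far more)."""
    stop = {"the", "a", "an", "of", "to", "and", "is", "it", "that"}
    kept = [w for w in text.split() if w.lower() not in stop]
    return " ".join(kept[:budget])

store = MemoryStore()
store.add("User prefers concise answers and Python examples.")
store.add("Project uses PostgreSQL 16 with pgvector for embeddings.")

query = "Which database does the project use?"
context = " | ".join(compress(s) for s in store.retrieve(query))
prompt = f"Context: {context}\nQuestion: {query}"
print(prompt)  # far fewer tokens than replaying the whole chat history
```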

AI memory · LLM efficiency · caveman
7 min read
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models at comparable accuracy while cutting compute and memory costs. The design enables million‑token inference on a single RTX 5090 and delivers up to a 3.5× speed‑up over Qwen3‑8B.
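As a rough illustration of what "sparse‑linear hybrid" means, the numpy sketch below combines a sliding‑window (sparse) head with a linear‑attention head. The window size, feature map, and head layout are assumptions for exposition, not SALA's actual design.

```python
# Toy sparse + linear hybrid attention; shapes and phi are illustrative.
import numpy as np

def local_attention(q, k, v, window=4):
    """Sparse head: each position attends only to the last `window` keys."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ v[lo:i + 1]
    return out

def linear_attention(q, k, v):
    """Linear head: phi(q) @ (phi(k)^T v), O(n*d^2) instead of O(n^2*d)."""
    phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                              # (d, d) summary of all keys
    z = qf @ kf.sum(axis=0)                    # normalizer per query
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(3, n, d))
# Hybrid: concatenate a sparse local head with a linear global head.
hybrid = np.concatenate([local_attention(q, k, v), linear_attention(q, k, v)], axis=-1)
print(hybrid.shape)  # (16, 16)
```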

Hybrid Position Encoding · LLM efficiency · Linear Attention
18 min read
Tencent Technical Engineering
Nov 10, 2025 · Artificial Intelligence

How Large Language Models Evolved in 2025: From DeepSeek to Kimi‑K2 and Beyond

This article maps the rapid evolution of open‑source large language models in 2025, explains underlying architectural breakthroughs such as MLA, MoE, and NSA, and compares dozens of models, including DeepSeek‑V3, OLMo2, Gemma3, Llama4, Qwen3, and Kimi‑K2. It also highlights the emergence of powerful AI assistants such as Dola, giving developers a concise technical roadmap.
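Of the breakthroughs named above, Mixture of Experts is the easiest to show in a few lines: a gate picks the top‑k experts per token, so only a fraction of the parameters is active. The sketch below is a generic top‑k routing toy, not any specific model's implementation.

```python
# Generic top-k MoE routing toy; sizes and gating details are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is just a small feed-forward weight matrix here.
experts = rng.normal(size=(n_experts, d_model, d_model))
gate_w = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ gate_w
    idx = np.argsort(logits)[-top_k:]            # chosen experts
    weights = np.exp(logits[idx])
    weights /= weights.sum()                     # softmax over chosen only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, idx))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (8,); only 2 of 4 experts ran
```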

AI Assistant · LLM efficiency · Mixture of Experts
44 min read
Xiaohe Frontend Team
Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework uses a lightweight encoder to compress retrieved text into semantic tags, letting large language models answer queries with far fewer tokens, lower latency, and higher throughput. Core meaning is preserved, and the compressed information can be placed flexibly within prompts.
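The core trick can be sketched in a few lines: if each retrieved chunk collapses to one embedding, a 64‑token passage costs a single input position. The mean‑pool "encoder" and all sizes below are placeholders, not Meta's actual REFRAG components.

```python
# Placeholder sketch of compressing retrieved chunks into single embeddings.
import numpy as np

rng = np.random.default_rng(0)
d_model, chunk_len = 16, 64

def tiny_encoder(chunk_token_embs: np.ndarray) -> np.ndarray:
    """Stand-in for the lightweight encoder: pool a chunk into one vector."""
    return chunk_token_embs.mean(axis=0)

# Three retrieved passages, each 64 tokens of d_model-dim embeddings.
retrieved = [rng.normal(size=(chunk_len, d_model)) for _ in range(3)]
compressed = np.stack([tiny_encoder(c) for c in retrieved])   # (3, d_model)

query_embs = rng.normal(size=(12, d_model))                   # 12-token query
# Flexible placement: compressed chunk vectors are spliced into the
# sequence wherever the prompt template wants them.
llm_input = np.concatenate([compressed, query_embs], axis=0)
print(llm_input.shape)  # (15, d_model) vs. (204, d_model) uncompressed
```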

LLM efficiency · RAG · model compression
8 min read
AI Frontier Lectures
Jun 20, 2025 · Artificial Intelligence

How GCA Achieves 1000× Length Generalization in Large Language Models

Ant Research introduces GCA, a causal, retrieval‑based grouped cross‑attention mechanism that learns end‑to‑end to fetch relevant past chunks. It achieves over 1000× length generalization on long‑context language modeling tasks with near‑constant inference memory and linear training cost.
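A toy rendering of the retrieval‑plus‑cross‑attention loop: summarize each past chunk with one key, score the summaries against the query, and attend only inside the top‑k retrieved chunks. All shapes and the pooled chunk summaries below are illustrative, not GCA's trained retriever.

```python
# Toy chunk-retrieval cross-attention; all details are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, chunk_size, n_chunks, top_k = 8, 16, 64, 2   # 1024 tokens of "past"

past_k = rng.normal(size=(n_chunks, chunk_size, d))
past_v = rng.normal(size=(n_chunks, chunk_size, d))
chunk_keys = past_k.mean(axis=1)                 # one summary key per chunk

def gca_step(q):
    """Retrieve top-k chunks for query q, then attend within them only."""
    picked = np.argsort(q @ chunk_keys.T)[-top_k:]
    k = past_k[picked].reshape(-1, d)            # (top_k*chunk_size, d)
    v = past_v[picked].reshape(-1, d)
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ v

q = rng.normal(size=d)
out = gca_step(q)
print(out.shape)  # (8,); attended over 32 tokens, not all 1024
```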

AI research · Grouped Cross Attention · LLM efficiency
11 min read