Tagged articles

Transformer

416 articles · Page 1 of 5
Data Party THU
Jul 2, 2026 · Artificial Intelligence

Multi-Task Bayesian In-Context Learning: Transformers Adapt to New Priors

The ICML 2026 paper reframes in‑context learning as approximate Bayesian inference, introduces explicit prior datasets as a context prefix for Transformers, and demonstrates through synthetic and real‑world experiments that this multi‑task approach closely matches Bayesian oracles while offering fast, controllable inference.

Bayesian Inference · ICML 2026 · In-Context Learning
0 likes · 15 min read
Lisa Notes
Jul 2, 2026 · Artificial Intelligence

NLP Study Notes: How Pre‑trained Models Transform Language Processing

This article reviews the evolution of pre‑trained models in natural language processing, from early word embeddings to Transformer‑based architectures like BERT and its variants, outlines their wide‑ranging applications such as QA, translation, and dialogue, and discusses remaining challenges and future research directions.

AI · BERT · NLP
0 likes · 6 min read
Machine Heart
Jun 29, 2026 · Artificial Intelligence

Re‑shaping Transformers: Moving Capacity Forward Makes LLMs Smarter

A new study shows that reallocating the feed‑forward network capacity toward the early layers of a Transformer—without adding parameters or FLOPs—lowers perplexity by up to 1.84 points, and the same technique improves performance across several modern LLM architectures.

FFN width · Language Model · Tapered Language Model
0 likes · 9 min read
Machine Heart
Jun 29, 2026 · Artificial Intelligence

Why Nvidia Praises LoopWM: A Chinese Startup’s New Scaling Axis for World Models

LoopWM introduces a looped Transformer architecture that shares parameters across iterations, adds spectral stability, deferred decoding, and early‑exit mechanisms, achieving up to 100× parameter efficiency and superior scores on ScienceWorld and AlfWorld compared with large proprietary models.

AI · Deferred Decoding · LoopWM
0 likes · 10 min read
Machine Heart
Jun 28, 2026 · Industry Insights

Where Have the Eight Transformers' Pioneers Ended Up?

The article traces the post‑Google journeys of the eight "Attention Is All You Need" authors, detailing recent high‑profile exits to OpenAI and Anthropic, market fallout, each researcher’s contributions to the Transformer architecture, and how their divergent paths continue to shape AI beyond the original paper.

AI research · Essential AI · Google DeepMind
0 likes · 21 min read
Geek Labs
Jun 28, 2026 · Industry Insights

Five Practical Open‑Source Projects: FPGA Inference, Agent Alignment, and Multi‑Server SSH Management

This article highlights five active GitHub projects—a Verilog‑based FPGA transformer inference engine, an AI agent personality alignment framework, a Zig‑written multi‑host SSH command tool, an AUR supply‑chain malware detector, and a real‑time phishing domain blacklist API—detailing their purpose, implementation, and key metrics.

AUR · Agent · FPGA
0 likes · 7 min read
Lisa Notes
Jun 25, 2026 · Artificial Intelligence

NLP Study Notes: How Word Vectors Capture Meaning

This article explains the evolution of natural language processing, introduces transformer‑based large models such as BERT, GPT and T5, and details how words are represented through one‑hot vectors and dense word embeddings, illustrating their training and analogy capabilities.

CBOW · Embedding · NLP
0 likes · 7 min read
Lisa Notes
Jun 24, 2026 · Artificial Intelligence

A Brief History of Neural Network Approaches in NLP

From the 1943 McCulloch-Pitts neuron model to modern Transformer-based large language models, this article traces the evolution of neural network techniques in NLP, highlighting key milestones such as early perceptrons, the 1986 back‑propagation breakthrough, statistical methods, LSTM, word2vec, multitask learning, and the rise of GPT.

LSTM · Language Models · NLP
0 likes · 7 min read
Machine Heart
Jun 22, 2026 · Artificial Intelligence

Why Dropping VAE and Private Data Boosts Text-to-Image Generation Performance

MiniT2I, a minimalist pixel-space text-to-image model that discards VAE, AdaLN, and private data, achieves 0.87 GenEval and 84.2 DPG-Bench scores with only 258 M parameters, demonstrating that a stripped-down architecture and public data can outperform larger, more complex systems.

AI research · MiniT2I · Transformer
0 likes · 8 min read
Machine Heart
Jun 19, 2026 · Artificial Intelligence

Beyond SONIC: Humanoid Robot Cerebellum Hits GPT‑Level Performance with 2 B Motion‑Capture Frames

Galaxy General unveils AstraBrain‑WBC 0.5, a transformer‑based humanoid robot control model that scales from 200 K to 2 billion motion‑capture frames, achieving up to 92.58% tracking success, 0.39 ms latency, and a five‑fold speedup over TWIST, thereby confirming a scaling law for robot motion control.

AstraBrain-WBC · DAgger Distillation · Humanoid Robot
0 likes · 16 min read
DeepHub IMBA
Jun 18, 2026 · Artificial Intelligence

From Bayesian Models to Generative Pre‑trained Transformers (GPT): A Brief History of Generative Learning

The article traces generative learning from its probabilistic roots in Bayesian classification, through Gaussian mixture models, hidden Markov models, N‑gram and neural language models, to attention mechanisms, Transformers and GPT, highlighting how each innovation expanded the ability to model data‑generating processes.

Bayesian · GPT · Gaussian Mixture
0 likes · 26 min read
Machine Heart
Jun 17, 2026 · Artificial Intelligence

Why Transformers Struggle with State Tracking and How Recurrence Could Fix It

The DeepMind paper “The Topological Trouble With Transformers” reveals that the Transformer architecture inherently fails at state tracking, making chain‑of‑thought prompting only a costly patch, and proposes returning to recurrent mechanisms—such as looped or sequence‑wise recurrence—to achieve true, continuous memory.

AI research · Chain-of-Thought · DeepMind
0 likes · 9 min read
IT Services Circle
Jun 13, 2026 · Artificial Intelligence

What Interviewers Expect: Understanding Transformers Beyond Codex and AI Code Generation

The article explains why modern interviewers ask about Transformer fundamentals, breaks down its core components such as self‑attention, multi‑head attention, feed‑forward networks, residual connections and positional encodings, and demonstrates a complete PyTorch toy model that predicts the sum‑mod‑10 of integer sequences while visualizing loss curves, attention heatmaps, embedding PCA and early‑stage gradient norms.

Gradient Analysis · Model Visualization · Multi-Head Attention
0 likes · 20 min read
Machine Heart
Jun 12, 2026 · Artificial Intelligence

Can Transformers Solve Any Computable Problem? RUC Study Shows Context Management Sets the Upper Bound

A recent ICML 2026 position paper clarifies that the computational power of a fixed Transformer model is limited by its context‑management strategy, distinguishing fixed‑system and scaling‑family settings and showing how five concrete management approaches range from constant‑space computation to full Turing completeness.

Computational theory · Context Management · Transformer
0 likes · 16 min read
HyperAI Super Neural
Jun 11, 2026 · Artificial Intelligence

UniCM: A Unified Global Climate Mode Prediction Model Paving a New AI‑Driven Path for Climate Science

The UniCM model unifies ocean‑atmosphere climate modes in a dual‑branch transformer, achieving record‑long ENSO forecasts and revealing emergent predictability across seven key global modes, while offering interpretable attention maps that turn AI from a pure predictor into a climate discovery tool.

AI for Science · Transformer · climate modeling
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Jun 10, 2026 · Artificial Intelligence

Bypassing BPTT: MIT’s SMT Puts RNNs on the Parallel Training Path

The article reviews MIT’s Supervised Memory Training (SMT) and its DAgger extension (DMT), which replace traditional back‑propagation through time with a Transformer‑based teacher, enabling one‑step memory supervision for RNNs, achieving parallel‑friendly training and superior long‑sequence performance on synthetic benchmarks, TinyStories and pixel‑wise image generation.

BPTT · DMT · RNN
0 likes · 10 min read
Machine Heart
Jun 7, 2026 · Artificial Intelligence

Can AI Learn Mental Math? Implicit Chain‑of‑Thought Proven Theoretically (Stuart Russell)

The article reviews a new UC Berkeley and Princeton study that mathematically proves the feasibility of Implicit Chain‑of‑Thought (ICoT), showing how a tree‑structured training curriculum lets Transformers internalize reasoning steps, dramatically reducing token cost and training stages while achieving 100% accuracy on the k‑parity task.

Chain-of-Thought · Implicit Reasoning · Theoretical Proof
0 likes · 11 min read
Machine Heart
Jun 5, 2026 · Artificial Intelligence

Stem Sparse Attention Cuts First-Token Latency by 3.6× for Long-Context LLMs

The article introduces Tencent Hunyuan's Stem sparse‑attention algorithm, which reduces first‑token latency by 3.6× on 128K context LLMs by reallocating compute with Token Position Decay and Output‑Aware Metric, and validates the gains with HPC‑optimized operators that outperform existing sparse methods in extensive benchmarks.

HPC Operators · LLM Inference · Output-Aware Metric
0 likes · 11 min read
Data Party THU
Jun 5, 2026 · Artificial Intelligence

A Unified Global Climate Mode Prediction Model (UniCM) Opens New Paths for AI‑Empowered Climate Science

The UniCM model introduced by Tsinghua University's Li Yong team unifies learning of multiple ocean‑atmosphere climate modes with a dual‑branch Transformer, achieving record‑long ENSO forecasts and revealing hidden inter‑modal couplings that turn AI from a fast weather predictor into a climate discovery tool.

AI · ENSO · Multi‑modal Prediction
0 likes · 11 min read
Network Intelligence Research Center (NIRC)
Jun 4, 2026 · Artificial Intelligence

How DeepSeek‑V4 Achieves Million‑Token Context via Aggressive KV‑Cache Compression

DeepSeek‑V4 reaches a million‑token context window by aggressively compressing its KV‑cache and employing a hybrid attention scheme that combines Compressed Sparse Attention (CSA) for selective top‑k retrieval with Heavily Compressed Attention (HCA) for full‑attention over heavily merged entries, alongside mixed‑precision storage and other engineering optimizations.

Compressed Sparse Attention · DeepSeek-V4 · Heavily Compressed Attention
0 likes · 7 min read
AI Architecture Hub
Jun 4, 2026 · Artificial Intelligence

10 Essential AI Concepts Every Developer Must Master

This article explains ten core AI concepts—including tokens, embeddings, attention, the Transformer architecture, large language models, hallucination, temperature, context windows, Retrieval‑Augmented Generation, and AI agents—so developers can understand model behavior, avoid common pitfalls, and build reliable AI applications.

AI Agents · AI Fundamentals · RAG
0 likes · 15 min read
CodePath
Jun 3, 2026 · Artificial Intelligence

A Deliberate Paradigm Shift: How “Attention Is All You Need” Reshaped Deep Learning

The article dissects how the 2017 "Attention Is All You Need" paper sparked a fundamental redesign of sequence modeling by replacing recurrent and convolutional approaches with self‑attention, detailing its mathematical foundations, architectural components, training tricks, limitations, and emerging alternatives such as Mamba.

Attention Mechanism · Mamba · Multi-Head Attention
0 likes · 24 min read
Machine Heart
Jun 2, 2026 · Artificial Intelligence

Training Transformers to Be Compression‑Friendly: A New Memory‑Discard Paradigm

The article analyzes the KV‑Cache memory bottleneck of long‑context Transformers, introduces the KV‑CAT (KV‑Compression Aware Training) approach that simulates cache compression during pre‑training, and presents experiments showing unchanged base abilities while dramatically improving post‑training compression, retrieval and long‑text QA performance.

KV cache · KV-CAT · Memory Efficiency
0 likes · 10 min read
Baidu Geek Talk
May 25, 2026 · Artificial Intelligence

Accelerating Multimodal Model Training: LoongForge's DP Load‑Balancing Optimization Explained

The article analyzes how data‑parallel (DP) load imbalance hampers large‑scale multimodal model training, details LoongForge's two‑stage adaptive data‑reallocation method that builds a precise compute‑cost model and dynamically redistributes samples, and presents experimental results showing up to 10% throughput gains on massive DP clusters.

DP load balancing · Data Parallel · LoongForge
0 likes · 16 min read
Machine Heart
May 24, 2026 · Artificial Intelligence

Can CODA Enable LLMs and Beginners to Write Lightning‑Fast Transformer Kernels?

CODA rewrites Transformer blocks as GEMM‑epilogue programs, exposing five primitive building blocks that let both AI‑generated code and human programmers fuse memory‑intensive operations into the GEMM epilogue, eliminating costly tensor moves and achieving up to 1.8× speed‑ups on H100 GPUs for RMSNorm, SwiGLU, RoPE and other components, while preserving numerical accuracy.

CODA · CUDA · GEMM
0 likes · 11 min read
CodeNotes
May 23, 2026 · Artificial Intelligence

AI Era Arrives: What Everyone Should Know

The article introduces the AI era for laypeople, defines artificial intelligence and generative AI, highlights ChatGPT’s 2022 launch and rapid adoption, lists current AI capabilities across text, image, video, code and voice, explains the three drivers (compute, data, and the Transformer architecture), and advises a balanced, learning‑oriented mindset.

AI capabilities · ChatGPT · Generative AI
0 likes · 6 min read
Mike Chen's Internet Architecture
May 21, 2026 · Artificial Intelligence

Demystifying AI Large Models: Architecture, Principles, and Workflow

The article explains that large language models are massive probability engines built on the Transformer architecture with self‑attention, trained through costly pre‑training on trillions of tokens, then refined by instruction fine‑tuning and RLHF, ultimately predicting the next token to generate text.

Large Language Model · RLHF · Self-Attention
0 likes · 5 min read
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that lightweight L1 regularization can drive over 99% of FFN activations to zero, and that a new tile‑wise ELLPACK (TwELL) format combined with a hybrid routing scheme improves inference speed by up to 30%, cuts memory usage by over 24%, and reduces energy consumption, all with negligible impact on downstream task performance.

CUDA · GPU Optimization · Hybrid Routing
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

The article surveys recent open‑weight LLM releases—Gemma 4, Laguna XS.2, ZAYA1‑8B and DeepSeek V4—detailing how KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed convolutional attention and manifold‑constrained hyper‑connections dramatically reduce memory and compute for ultra‑long contexts while preserving model quality.

Attention optimization · KV cache · LLM
0 likes · 25 min read
Lao Guo's Learning Space
May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key algorithms that power large‑language models—Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought and Retrieval‑Augmented Generation—explaining how each contributes to the performance of ChatGPT, GPT‑4 and DeepSeek.

Chain-of-Thought · Flash Attention · KV cache
0 likes · 10 min read
AI Architecture Path
May 11, 2026 · Artificial Intelligence

OpenMythos: 22‑Year‑Old Recreates Claude Mythos with Recurrent Depth Transformers

A 22‑year‑old developer reverse‑engineered Anthropic’s confidential Claude Mythos, releasing the OpenMythos project that employs a Recurrent Depth Transformer looping a single weight set up to 16 times, matching a 1.3 B‑parameter transformer’s performance with only 770 M parameters while enabling deeper inference and solving gradient instability.

AI · Claude Mythos · OpenMythos
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that applying lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse, and with the TwELL storage format and a hybrid routing scheme this sparsity translates into up to 20.5% inference speedup, 21.9% training‑step acceleration, lower energy consumption and reduced peak memory across 0.5‑2 B‑parameter models while preserving downstream performance.

CUDA · GPU Optimization · Hybrid Routing
0 likes · 9 min read
Xiaomi Tech
May 7, 2026 · Artificial Intelligence

OmniVoice: Open‑Source TTS Model Clones Voices in 600+ Languages with a Single Architecture

OmniVoice, an open‑source TTS system from Xiaomi AI Lab, uses a minimalist bidirectional Transformer and LLM‑enhanced pre‑training to synthesize high‑quality speech in over 600 languages, outperforming commercial systems while offering fine‑grained control and fully public code and models.

Multilingual speech synthesis · OmniVoice · TTS
0 likes · 8 min read
Data Party THU
Apr 30, 2026 · Artificial Intelligence

Turning Transformers into Mamba: How Apple Linearized Inference Costs

Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.

AI research · Linear Attention · Mamba
0 likes · 8 min read
SuanNi
Apr 30, 2026 · Artificial Intelligence

Why Transformers Are Naturally Succinct: Insights from the ICLR Best Paper

The ICLR 2026 best paper reveals that Transformers achieve extreme succinctness—encoding complex concepts with exponentially fewer symbols than RNNs—while proving that analyzing or verifying such models incurs EXPSPACE‑complete computational costs.

Computational Complexity · EXPSPACE · Succinctness
0 likes · 8 min read
Machine Heart
Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Efficient Attention · KV cache reduction · LCA
0 likes · 10 min read
Bighead's Algorithm Notes
Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAries · PPO · Portfolio Management
0 likes · 15 min read
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Linear Attention · Mamba · Transformer
0 likes · 7 min read
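Editor's sketch: the summary above hinges on replacing softmax attention with a linear-attention form whose cost grows linearly with sequence length. The pure-Python example below illustrates that general idea only; it is not Apple's distillation method and not Hedgehog. It assumes a fixed `elu(x) + 1` feature map (Hedgehog learns its feature map instead), and all function names are hypothetical. It checks that a causal linear attention computed by re-scanning the prefix matches the same attention computed from a constant-size running state.

```python
import math
import random

def phi(x):
    # Assumed feature map: elu(x) + 1, elementwise. Always positive,
    # so the normalizer below never reaches zero.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_form(qs, ks, vs):
    # Causal kernel attention that re-scans the whole prefix for each query:
    # O(n^2) kernel evaluations in sequence length n.
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        weights = [dot(fq, phi(k)) for k in ks[: t + 1]]
        total = sum(weights)
        outs.append([sum(w * v[j] for w, v in zip(weights, vs)) / total
                     for j in range(len(vs[0]))])
    return outs

def recurrent_form(qs, ks, vs):
    # Same attention computed from a running state:
    #   S accumulates phi(k) v^T, z accumulates phi(k).
    # Each step costs O(d * dv), independent of how many tokens came before.
    d, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = dot(fq, z)
        outs.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                     for j in range(dv)])
    return outs

rng = random.Random(0)
n, d = 8, 3
qs, ks, vs = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
              for _ in range(3))

quad = quadratic_form(qs, ks, vs)
rec = recurrent_form(qs, ks, vs)
max_diff = max(abs(a - b) for ra, rb in zip(quad, rec) for a, b in zip(ra, rb))
print("max difference between the two forms:", max_diff)
```

The two forms agree up to floating-point rounding because the kernel `phi(q) · phi(k)` factorizes, which lets the sums over past keys and values be carried forward as state. Softmax attention does not factorize this way, which is why the article's method needs a learned approximation first.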
Machine Heart
Apr 17, 2026 · Artificial Intelligence

Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context

Google Research introduces Memory Caching (MC), a technique that gives RNNs growing memory capacity, bridging the gap with Transformers to enable ultra‑long context processing while reducing memory demands, and demonstrates its effectiveness through extensive language‑modeling and recall experiments.

AI Architecture · Google Research · Long Context
0 likes · 7 min read
ZhiKe AI
Apr 15, 2026 · Artificial Intelligence

From Sci‑Fi to Reality: How AI Large Models Are Reshaping Our World

The article explains what AI is, traces its three historical waves—from rule‑based expert systems to statistical learning and deep learning—focuses on the current large‑language‑model era, surveys leading domestic and overseas models, and highlights key trends such as open‑source competition, reasoning capabilities, multimodality, and edge deployment.

AI · Edge deployment · Multimodal
0 likes · 4 min read
Machine Heart
Apr 14, 2026 · Artificial Intelligence

Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes

A developer recreated a 1970s PDP‑11 environment, wrote a single‑layer, single‑head Transformer in assembly, and trained it on a sequence‑reversal task, achieving 100% accuracy after about 350 steps and a total training time of roughly 5.5 minutes.

Assembly · Low-resource AI · PDP-11
0 likes · 16 min read
Lao Guo's Learning Space
Apr 12, 2026 · Artificial Intelligence

Who Wins the AI Video Throne? HappyHorse-1.0 vs ByteDance Seedance 2.0

The article dissects the April 2026 showdown between the anonymous 15‑billion‑parameter HappyHorse‑1.0 and ByteDance’s two‑year‑old Seedance 2.0, detailing Elo score gaps, contrasting single‑stream versus dual‑branch Transformer designs, speed advantages, quality trade‑offs, and offering a decision tree for different production needs.

AI video · Elo ranking · Multimodal
0 likes · 11 min read
LuTiao Programming
Apr 12, 2026 · Artificial Intelligence

Master AI Core in 20 Minutes: 20 Key Concepts That Set You Apart

In just 20 minutes this article walks you through 20 essential AI concepts—from neural networks and transformers to prompt engineering and diffusion models—showing how understanding the underlying mechanisms, rather than merely using tools, can separate you from the majority of practitioners.

LLM · Prompt Engineering · RAG
0 likes · 10 min read
AI Explorer
Apr 11, 2026 · Artificial Intelligence

How Kronos Redefines Quantitative Analysis with a Financial‑Market Language Model

Kronos, an open‑source large model trained on OHLCV data from over 45 exchanges, treats financial time‑series as a specialized language, using a custom tokenizer and a two‑stage Transformer to enable price prediction, market state detection, signal generation, and risk simulation, with easy Hugging Face integration and a live demo for BTC/USDT.

Kronos · Large Language Model · Transformer
0 likes · 6 min read
AI Tech Publishing
Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

LLM · LoRA · Quantization
0 likes · 13 min read
Bighead's Algorithm Notes
Apr 6, 2026 · Artificial Intelligence

STORM: A Bidirectional Spatiotemporal Factor Model Achieving Sharpe Ratio >1

STORM introduces a bidirectional VQ‑VAE‑based spatiotemporal factor model that extracts fine‑grained time‑series and cross‑sectional features, uses discrete codebooks for orthogonal, diverse factor embeddings, and outperforms nine baselines on portfolio management and algorithmic trading tasks, delivering Sharpe ratios exceeding 1.

Algorithmic Trading · Portfolio Management · Transformer
0 likes · 17 min read
AI Programming Lab
Apr 5, 2026 · Artificial Intelligence

Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

KV cache · LLM pricing · Large Language Model
0 likes · 13 min read
Data Party THU
Apr 3, 2026 · Artificial Intelligence

Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

The article reviews the Kimi team's Attention Residuals approach, which substitutes traditional ResNet additive shortcuts with learned attention‑based weighting, explains the theoretical motivation linking depth to time, details full‑attention and block‑wise implementations, and presents experimental results showing up to 1.25× compute efficiency and improved performance on reasoning and knowledge tasks.

Attention Mechanism · Model Efficiency · Residual Networks
0 likes · 11 min read
ShiZhen AI
Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV cache · LLM Inference · Memory optimization
0 likes · 9 min read
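Editor's sketch: the mechanism summarized above, storing previously computed key/value vectors so they are not recomputed at every decoding step, can be illustrated with a toy single-head attention in pure Python. Everything here is an assumption for illustration (function names, random weights, a 4-dimensional model, no batching or multi-head logic). It does not reproduce the article's code or its roughly 5× figure; it only counts how many key/value projections each strategy performs.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over the given keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)                          # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def decode_without_cache(xs, Wq, Wk, Wv):
    # Re-projects every earlier token's key and value at each step.
    outputs, projections = [], 0
    for t in range(len(xs)):
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outputs, projections

def decode_with_cache(xs, Wq, Wk, Wv):
    # Projects only the newest token and appends it to the cache.
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in xs:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        projections += 2
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, projections

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 4
Wq, Wk, Wv = (random_matrix(rng, d, d) for _ in range(3))
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

full, full_cost = decode_without_cache(xs, Wq, Wk, Wv)
cached, cached_cost = decode_with_cache(xs, Wq, Wk, Wv)
print("key/value projections without cache:", full_cost)
print("key/value projections with cache:", cached_cost)
```

For six tokens the uncached loop performs 42 key/value projections and the cached loop 12, while both produce the same outputs. The price is the memory trade-off the summary mentions: the cache holds one key and one value vector per token, so it grows linearly with sequence length.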
ArcThink
Apr 2, 2026 · Artificial Intelligence

Why LLMs Forget You: Uncovering the Limits and Solutions for Long‑Term Memory

The article explains why large language models lack persistent memory due to the stateless Transformer architecture, breaks down the four dimensions of memory loss, surveys seven technical approaches, three product implementations, and emerging research, and discusses security and privacy implications.

AI · LLM · RAG
0 likes · 22 min read
AI Explorer
Apr 1, 2026 · Artificial Intelligence

Google Open‑Sources TimesFM: A Foundation Model for Plug‑and‑Play Time‑Series Forecasting

Google’s open‑source TimesFM is a decoder‑only Transformer foundation model that delivers plug‑and‑play time‑series forecasting with zero‑shot accuracy, larger context windows, quantile predictions, and a simple Hugging Face API, making it suitable for retail, energy, finance, monitoring, and IoT use cases.

Hugging Face · PyTorch · TimesFM
0 likes · 7 min read
Data Party THU
Mar 31, 2026 · Artificial Intelligence

Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Lookup Memory · Model Efficiency · STEM Architecture
0 likes · 8 min read
AI Large-Model Wave and Transformation Guide
Mar 28, 2026 · Artificial Intelligence

From RNNs to Multimodal Agents: A Decade of Transformer Evolution

This article traces the evolution of sequence models from early RNN/LSTM designs through the breakthrough Transformer, its major branches, dense scaling, efficiency‑focused variants, next‑generation linear‑complexity SSMs, and finally multimodal agent architectures, highlighting each stage's strengths, weaknesses, and typical use cases.

AI Architecture · Efficient Attention · LLM
0 likes · 12 min read
Data Party THU
Mar 26, 2026 · Artificial Intelligence

How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture‑of‑Depths Attention (MoDA) mechanism, detailing its novel flash‑compatible KV layout, combined sequence‑depth attention, theoretical analysis, and extensive experiments that show significantly lower validation loss and higher accuracy on downstream tasks than the OLMo2 baseline.

Attention Mechanism · Deep KV · FlashAttention
0 likes · 9 min read
Full-Stack Cultivation Path
Mar 23, 2026 · Artificial Intelligence

What Exactly Is a Token in LLMs? A First‑Principles Explanation

The article explains that a token is the smallest discrete text unit a large language model processes, detailing why tokenization is essential, how tokenizers work, how tokens flow through the transformer, and how token counts affect context windows, cost, latency, and overall model behavior.

Embedding · LLM · Tokenization
0 likes · 20 min read
SuanNi
Mar 17, 2026 · Artificial Intelligence

How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details full and block variants, engineering tricks for distributed training, and shows extensive scaling‑law experiments where the new design consistently improves validation loss and training efficiency across model sizes.

Attention Residuals · Efficient Training · Model Scaling
0 likes · 13 min read
ShiZhen AI
Mar 17, 2026 · Artificial Intelligence

Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections used in Transformers for a decade, enabling selective aggregation of layer histories, reducing hidden‑state growth, and achieving a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than 2% inference latency increase.

Attention Residuals · MoE · Residual Connection
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

HY‑WU: Real‑Time Adaptive AI Model That Generates Parameters On‑The‑Fly

HY‑WU demonstrates that generating model parameters dynamically during inference enables a single foundation model to perform diverse image‑editing tasks, outperforming fixed‑parameter baselines in human and automatic evaluations, benchmark tests, and conflict‑task experiments, highlighting a practical real‑time adaptation approach for AI systems.

HY-WU · LoRA · Transformer
0 likes · 16 min read
Machine Learning Algorithms & Natural Language Processing
Mar 14, 2026 · Artificial Intelligence

Can Large Language Models Get Stronger Without Human Language Training? A New Pre‑Pre‑Training Path

A recent study shows that pre‑training Transformers on synthetic, non‑language data generated by Neural Cellular Automata can boost language‑model performance by up to 6%, accelerate convergence by 40%, and improve downstream reasoning, even outperforming models trained on massive natural‑text corpora.

In-Context Learning · Language Models · Neural Cellular Automata
0 likes · 12 min read
Bighead's Algorithm Notes
Mar 14, 2026 · Artificial Intelligence

Quantitative Finance Paper Digest: AI‑Driven Market Prediction Studies (Mar 7‑13 2026)

This digest summarizes four recent research papers that apply advanced AI techniques—node‑transformer graphs with BERT sentiment analysis, a quantum‑classical LSTM‑Born machine hybrid, large‑language‑model benchmarking for portfolio optimization, and a conditional diffusion model—to improve stock market prediction, volatility forecasting, and investment decision making, providing detailed experimental results and statistical validation.

BERT · Large Language Model · Transformer
0 likes · 10 min read
High Availability Architecture
Mar 12, 2026 · Artificial Intelligence

How Claude Code Hits 92% Prompt Cache Rate and Slashes AI Agent Costs by 81%

This article explains the prompt‑caching mechanism used by Claude Code, showing how separating static prefixes from dynamic tails and leveraging KV‑tensor caching reduces the O(n²) complexity of transformer pre‑fill to O(n), achieving a 92% cache hit rate and up to 81% cost savings in long‑running AI agent sessions.

AI Agents · Claude · LLM Optimization
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 11, 2026 · Artificial Intelligence

Random Parameter Pruning Boosts Transferable Targeted Attacks Across Model Architectures

The RaPA method introduces random parameter pruning during adversarial generation, creating diverse model variants that markedly increase the success rate of targeted transfer attacks across CNN and Transformer architectures, even against defended models and with higher computational budgets, as demonstrated on ImageNet‑compatible benchmarks.

CNN · Transformer · adversarial attacks
0 likes · 14 min read
Machine Learning Algorithms & Natural Language Processing
Mar 10, 2026 · Artificial Intelligence

How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes

InfLLM‑V2 introduces a dense‑sparse switchable attention framework that preserves the original dense‑attention parameters while enabling efficient long‑context training, matching full‑attention performance on benchmarks such as RULER, LongBench, and chain‑reasoning tasks, and delivering up to 2.3× end‑to‑end inference speedup without degrading short‑sequence abilities.

Efficiency · InfLLM-V2 · Long Context
0 likes · 16 min read
Machine Learning Algorithms & Natural Language Processing
Mar 10, 2026 · Artificial Intelligence

Why the First Token Becomes a Value Garbage Bin – LeCun Team Dissects Spike and Attention Sink Mechanics

The paper by Yann LeCun’s team reveals that massive activation spikes and attention sinks in Transformers are not inherently coupled; spikes arise from position‑0 token interactions and specific feed‑forward dynamics, while attention sinks emerge from Pre‑norm normalization and head dimension, offering practical insights for model quantization and long‑context inference.

Attention Sink · LLM · Massive Activations
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Mar 9, 2026 · Artificial Intelligence

Instant LoRA Generation and Long‑Document Internalization: Cost‑Amortized Model Updates via 0.1‑Second Forward Pass

The article analyzes the quadratic attention and KV‑Cache bottlenecks of Transformers on ultra‑long inputs and the heavy compute cost of traditional supervised fine‑tuning. It then presents Sakana AI's Cost Amortization framework (Doc‑to‑LoRA and Text‑to‑LoRA), which shifts weight updates to a meta‑training hypernetwork, achieving sub‑50 MB memory for 128K‑token inference, sub‑GB update memory for long‑document QA, and zero‑shot task adaptation with sub‑second latency.

Cost Amortization · LoRA · Long-context
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
Mar 7, 2026 · Artificial Intelligence

Transformer Hidden States Can Reconstruct Input with 100% Accuracy – New Invertibility Study

A recent paper from Sapienza University's GLADIA Lab shows that mainstream Transformer language models are injective, enabling a novel SIPIT algorithm to recover original text from hidden states with perfect accuracy, while extensive experiments confirm the models retain all input information.

Injective · Invertibility · Language Model
0 likes · 11 min read
Data Party THU
Mar 6, 2026 · Artificial Intelligence

How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge

This article chronicles the AdderBoard competition, detailing how researchers compressed a Transformer for 10‑digit addition down to just 121 parameters, the experimental rules, the contrasting hand‑coded and data‑driven approaches, and the insights gained about model minimalism and discoverability.

AdderBoard · Parameter Efficiency · Transformer
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Identity Constraint Beats DeepSeek mHC After 150B Tokens: A Surprising Reversal

Extensive experiments on DeepSeek's 1.7B and 8B models reveal that replacing the manifold hyper‑connection (mHC) constraint with a simple identity matrix consistently outperforms the original mHC, improves signal flow stability, and avoids the collapse caused by repeated Sinkhorn‑Knopp projections.

DeepSeek · Hyper-Connection · Sinkhorn
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path

The paper introduces JTok and its dynamic variant JTok‑M, a token‑indexed parameter scaling method that decouples model capacity from compute, achieving up to 35% compute reduction while delivering consistent performance gains across a wide range of downstream tasks and model sizes.

JTok · Token-indexed scaling · Transformer
0 likes · 16 min read
Data STUDIO
Feb 25, 2026 · Artificial Intelligence

Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.

GPT · LLM · PyTorch
0 likes · 43 min read
Qborfy AI
Feb 21, 2026 · Artificial Intelligence

How Self-Attention Powers Modern AI: From Theory to Real-World Impact

This article explains the self‑attention mechanism behind transformers, detailing its core components, mathematical formulation, step‑by‑step example, multi‑head extension, industry use cases, and a thorough comparison with RNN and CNN approaches, all supported by concrete numbers and citations.

Attention Mechanism · Self-Attention · Transformer
0 likes · 8 min read
Data Party THU
Feb 21, 2026 · Artificial Intelligence

Unlocking Compositional Generalization: Meta‑Learning Strategies for Neural Networks

This article examines how meta‑learning combined with compositionality enables neural networks to rapidly adapt to new tasks by formalizing hierarchical optimization, leveraging modular architectures with hypernetworks, and exploiting Transformer latent codes for effective compositional generalization.

Bilevel Optimization · Compositional Generalization · Meta Learning
0 likes · 5 min read
Bighead's Algorithm Notes
Feb 18, 2026 · Artificial Intelligence

Which Loss Function Ranks Stocks Best? An Empirical Study with Transformer Models

This paper evaluates point‑wise, pair‑wise, and list‑wise loss functions for Transformer‑based stock‑return prediction on 110 S&P 500 stocks, showing that Margin loss achieves the highest annual return (16.23%) and Sharpe ratio (0.75), ListNet delivers strong returns with low volatility, and BPR minimizes maximum drawdown, highlighting how loss design critically shapes ranking‑driven portfolio performance.

Loss Functions · Transformer · financial time series
0 likes · 15 min read
AI Cyberspace
Feb 15, 2026 · Artificial Intelligence

From GPT-1 to GPT-4o: A Deep Dive into the Evolution of Large Language Models

This article chronicles the rapid progression of GPT models from the 2018 GPT‑1 pre‑training breakthrough through GPT‑2’s multitask learning, GPT‑3’s scaling laws and few‑shot abilities, to GPT‑4’s multimodal capabilities and the 2024 GPT‑4 Turbo, Sora, and GPT‑4o releases, while also explaining core LLM abilities and the decoder‑only architecture of GPT‑2.

AI evolution · GPT · Transformer
0 likes · 20 min read
AI Cyberspace
Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Feed-Forward Network · Multi-Head Attention · Positional Encoding
0 likes · 39 min read
AI Cyberspace
Feb 13, 2026 · Artificial Intelligence

How Attention Mechanisms Revolutionized Computer Vision and Machine Translation

This article traces the evolution of attention mechanisms from their inaugural application in computer vision and machine translation to their central role in modern Transformer models, detailing the underlying RNN‑Attention designs, the breakthrough in sequence alignment, and the innovations that enabled high‑performance, parallelizable deep learning architectures.

Attention Mechanism · Machine Translation · Transformer
0 likes · 14 min read
Data Party THU
Feb 4, 2026 · Artificial Intelligence

How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained

This article analyzes Sakana AI's three recent papers that challenge traditional Transformer long‑sequence handling by removing positional embeddings, reconstructing position awareness, and adding a fast‑weight external memory, showing how each approach improves ultra‑long text understanding.

Long Context · Memory Mechanism · Positional Embedding
0 likes · 12 min read
HyperAI Super Neural
Feb 3, 2026 · Artificial Intelligence

Walrus: 1.3B Transformer Model Beats Prior Foundations Across 19 Physics Domains

Walrus, a 1.3 billion‑parameter Transformer built by Polymathic AI, is pretrained on 19 diverse physics scenarios—including astrophysics, geoscience, rheology, plasma physics and acoustics—using techniques like patch jittering, adaptive compute tokenization and space‑time factorized attention, and consistently outperforms earlier foundation models on both short‑ and long‑term continuum dynamics predictions.

Scientific AI · Transformer · Walrus
0 likes · 13 min read
Tencent Technical Engineering
Feb 2, 2026 · Artificial Intelligence

Why Neural Networks Are the Hidden Engine Behind Modern AI: From Basics to Large Language Models

This comprehensive guide walks through the fundamentals of neural networks, activation functions, training methods, and how they power large language models, while also covering tokenization, self‑attention, transformer architectures, AI infrastructure, and practical usage through agents and retrieval‑augmented generation.

Agent systems · GPU infrastructure · Transformer
0 likes · 75 min read
Network Intelligence Research Center (NIRC)
Jan 31, 2026 · Artificial Intelligence

How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

The article dissects DeepSeek’s new Engram architecture, which separates computation from memory by using a large, cheap‑RAM‑based lookup table to store factual knowledge, allowing the transformer’s compute layers to focus on reasoning, dramatically reducing GPU memory demand while improving code, math, and long‑context performance.

Engram · GPU memory · Large Language Model
0 likes · 7 min read
HyperAI Super Neural
Jan 23, 2026 · Artificial Intelligence

Weekly AI Paper Digest: New Transformer Advances in Sparsity, Memory, and Reasoning

This article reviews five recent Transformer papers—including Engram's conditional memory, STEM's embedding‑based scaling, SeedFold's biomolecular structure prediction, a critique of Transformers for time‑series forecasting, and reasoning models as societies of thought—highlighting their methods, datasets, and performance gains.

Biomolecular Structure Prediction · Memory Mechanisms · Structural Sparsity
0 likes · 7 min read
PaperAgent
Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding Lookup · Mixture of Experts · Model Efficiency
0 likes · 6 min read
Java Tech Enthusiast
Jan 21, 2026 · Artificial Intelligence

Inside X’s Open‑Source Recommendation Engine: How the Grok‑Powered Transformer Works

The X platform has open‑sourced its new "For You" recommendation system, revealing a Grok‑based Transformer architecture, detailed module breakdown, seven‑step content ranking pipeline, and the strategic motivations behind the unprecedented move toward algorithmic transparency and community‑driven improvement.

Social Media · Transformer · X Platform
0 likes · 12 min read
PaperAgent
Jan 20, 2026 · Artificial Intelligence

How X’s Open‑Source “For You” Recommendation Engine Works

X (formerly Twitter) has open‑sourced its “For You” recommendation algorithm, revealing a Grok‑based Transformer that merges on‑platform and off‑platform content, removes manual features, and scores posts through a multi‑stage pipeline with candidate sourcing, hydration, filtering, scoring, and selection.

Grok · Transformer · X Platform
0 likes · 5 min read
Data Party THU
Jan 19, 2026 · Artificial Intelligence

How VersatileFFN Cuts Memory Use While Boosting LLM Performance

The article introduces Huawei's VersatileFFN, an adaptive wide‑and‑deep feed‑forward design for large language models that reuses parameters to slash memory consumption while delivering stronger inference, detailing its dual‑system inspiration, technical mechanisms, experimental gains, and implications for efficient LLM deployment.

Adaptive Computation · LLM · Parameter Efficiency
0 likes · 8 min read
AI Architecture Hub
Jan 19, 2026 · Artificial Intelligence

Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

This article breaks down the core components of the Transformer architecture—including input embedding, positional encoding, multi‑head self‑attention, residual connections with layer normalization, position‑wise feed‑forward networks, and the rationale behind stacking multiple encoder layers—using clear explanations and illustrative diagrams.

Add&Norm · Feed Forward · Input Embedding
0 likes · 12 min read
AI Cyberspace
Jan 13, 2026 · Artificial Intelligence

From Symbolic AI to LLMs: A Complete NLP History and Model Guide

This article provides a comprehensive overview of natural language processing, tracing its evolution from early symbolic and statistical stages through deep learning breakthroughs, detailing sequence models, key NLP tasks, text representation methods, and the development of modern architectures like RNN, LSTM, GRU, Transformer, and GPT series.

GPT · LSTM · NLP
0 likes · 60 min read
PaperAgent
Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.

Efficient Inference · Engram · LLM Sparsity
0 likes · 8 min read
AI Architecture Hub
Jan 7, 2026 · Artificial Intelligence

Why “Attention Is All You Need” Still Shapes AI: A Beginner’s Deep Dive

This article provides a comprehensive, beginner‑friendly walkthrough of the landmark 2017 paper “Attention Is All You Need,” covering its authors, historical context, the shortcomings of RNNs and CNNs, the birth of self‑attention, the Transformer architecture, and its transformative impact on modern AI.

AI history · Attention Mechanism · Transformer
0 likes · 9 min read
Network Intelligence Research Center (NIRC)
Jan 4, 2026 · Artificial Intelligence

How UniCodebook’s Unified 2D‑3D Discrete Priors Boost Noise‑Robust, Calibration‑Free 3D Human Pose Estimation

UniCodebook introduces a unified 2D‑3D discrete prior that combines continuous and discrete representations, enabling calibration‑free multiview 3D human pose estimation with superior noise robustness and higher accuracy, as demonstrated by state‑of‑the‑art results on Human3.6M and MPI‑INF‑3DHP.

3D pose estimation · NeurIPS 2025 · Transformer
0 likes · 8 min read
Tencent Technical Engineering
Dec 24, 2025 · Artificial Intelligence

Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer

This article walks through constructing a small large‑language model from the ground up, covering model architecture, tokenization methods, BPE vocabulary building, embedding, positional encoding, attention mechanisms, multi‑head attention, transformer blocks, training pipelines, inference, and sampling strategies, all with runnable Python code.

LLM · Python · Transformer
0 likes · 34 min read