Tagged articles
383 articles
Page 1 of 4
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that applying lightweight L1 regularization can make over 99% of FFN activations zero, and by using a new tile‑wise ELLPACK (TwELL) format together with a hybrid routing scheme, inference speed improves up to 30% while memory usage drops over 24% and energy consumption is reduced, all with negligible impact on downstream task performance.

CUDAGPU OptimizationHybrid Routing
0 likes · 8 min read
Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

The article surveys recent open‑weight LLM releases—Gemma 4, Laguna XS.2, ZAYA1‑8B and DeepSeek V4—detailing how KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed convolutional attention and manifold‑constrained hyper‑connections dramatically reduce memory and compute for ultra‑long contexts while preserving model quality.

Attention optimizationKV cacheLLM
0 likes · 25 min read
How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs
Lao Guo's Learning Space
Lao Guo's Learning Space
May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key algorithms that power large‑language models—Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought and Retrieval‑Augmented Generation—explaining how each contributes to the performance of ChatGPT, GPT‑4 and DeepSeek.

Flash AttentionKV cacheMixture of Experts
0 likes · 10 min read
Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek
AI Architecture Path
AI Architecture Path
May 11, 2026 · Artificial Intelligence

OpenMythos: 22‑Year‑Old Recreates Claude Mythos with Recurrent Depth Transformers

A 22‑year‑old developer reverse‑engineered Anthropic’s confidential Claude Mythos, releasing the OpenMythos project that employs a Recurrent Depth Transformer looping a single weight set up to 16 times, matching a 1.3 B‑parameter transformer’s performance with only 770 M parameters while enabling deeper inference and solving gradient instability.

AIClaude MythosOpenMythos
0 likes · 9 min read
OpenMythos: 22‑Year‑Old Recreates Claude Mythos with Recurrent Depth Transformers
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that applying lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse, and with the TwELL storage format and a hybrid routing scheme this sparsity translates into up to 20.5% inference speedup, 21.9% training‑step acceleration, lower energy consumption and reduced peak memory across 0.5‑2 B‑parameter models while preserving downstream performance.

CUDAGPU OptimizationHybrid Routing
0 likes · 9 min read
Can 99% Sparse Transformers Run Faster? Insights from the Original Authors
Data Party THU
Data Party THU
Apr 30, 2026 · Artificial Intelligence

Turning Transformers into Mamba: How Apple Linearized Inference Costs

Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.

AI researchLinear AttentionMamba
0 likes · 8 min read
Turning Transformers into Mamba: How Apple Linearized Inference Costs
SuanNi
SuanNi
Apr 30, 2026 · Artificial Intelligence

Why Transformers Are Naturally Succinct: Insights from the ICLR Best Paper

The ICLR 2026 best paper reveals that Transformers achieve extreme succinctness—encoding complex concepts with exponentially fewer symbols than RNNs—while proving that analyzing or verifying such models incurs EXPSPACE‑complete computational costs.

Computational ComplexityEXPSPACESuccinctness
0 likes · 8 min read
Why Transformers Are Naturally Succinct: Insights from the ICLR Best Paper
Machine Heart
Machine Heart
Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Inference AccelerationKV cache reductionLCA
0 likes · 10 min read
LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAriesPPOPortfolio Management
0 likes · 15 min read
How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns
Machine Heart
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Linear AttentionMambaTransformer
0 likes · 7 min read
Apple Turns Transformers into Mamba with Linear‑Cost Distillation
Machine Heart
Machine Heart
Apr 17, 2026 · Artificial Intelligence

Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context

Google Research introduces Memory Caching (MC), a technique that gives RNNs growing memory capacity, bridging the gap with Transformers to enable ultra‑long context processing while reducing memory demands, and demonstrates its effectiveness through extensive language‑modeling and recall experiments.

AI ArchitectureGoogle ResearchMemory Caching
0 likes · 7 min read
Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context
ZhiKe AI
ZhiKe AI
Apr 15, 2026 · Artificial Intelligence

From Sci‑Fi to Reality: How AI Large Models Are Reshaping Our World

The article explains what AI is, traces its three historical waves—from rule‑based expert systems to statistical learning and deep learning—focuses on the current large‑language‑model era, surveys leading domestic and overseas models, and highlights key trends such as open‑source competition, reasoning capabilities, multimodality, and edge deployment.

AITransformeredge deployment
0 likes · 4 min read
From Sci‑Fi to Reality: How AI Large Models Are Reshaping Our World
Machine Heart
Machine Heart
Apr 14, 2026 · Artificial Intelligence

Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes

A developer recreated a 1970s PDP‑11 environment, wrote a single‑layer, single‑head Transformer in assembly, and trained it on a sequence‑reversal task, achieving 100% accuracy after about 350 steps and a total training time of roughly 5.5 minutes.

AssemblyLow-resource AIPDP-11
0 likes · 16 min read
Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes
Lao Guo's Learning Space
Lao Guo's Learning Space
Apr 12, 2026 · Artificial Intelligence

Who Wins the AI Video Throne? HappyHorse-1.0 vs ByteDance Seedance 2.0

The article dissects the April 2026 showdown between the anonymous 15‑billion‑parameter HappyHorse‑1.0 and ByteDance’s two‑year‑old Seedance 2.0, detailing Elo score gaps, contrasting single‑stream versus dual‑branch Transformer designs, speed advantages, quality trade‑offs, and offering a decision tree for different production needs.

AI videoElo rankingTransformer
0 likes · 11 min read
Who Wins the AI Video Throne? HappyHorse-1.0 vs ByteDance Seedance 2.0
AI Explorer
AI Explorer
Apr 11, 2026 · Artificial Intelligence

How Kronos Redefines Quantitative Analysis with a Financial‑Market Language Model

Kronos, an open‑source large model trained on OHLCV data from over 45 exchanges, treats financial time‑series as a specialized language, using a custom tokenizer and a two‑stage Transformer to enable price prediction, market state detection, signal generation, and risk simulation, with easy Hugging Face integration and a live demo for BTC/USDT.

KronosTokenizerTransformer
0 likes · 6 min read
How Kronos Redefines Quantitative Analysis with a Financial‑Market Language Model
AI Tech Publishing
AI Tech Publishing
Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

Fine-tuningInferenceLLM
0 likes · 13 min read
Engineering‑Focused Guide to Training and Inference of Large Language Models
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Apr 6, 2026 · Artificial Intelligence

STORM: A Bidirectional Spatiotemporal Factor Model Achieving Sharpe Ratio >1

STORM introduces a bidirectional VQ‑VAE‑based spatiotemporal factor model that extracts fine‑grained time‑series and cross‑sectional features, uses discrete codebooks for orthogonal, diverse factor embeddings, and outperforms nine baselines on portfolio management and algorithmic trading tasks, delivering Sharpe ratios exceeding 1.

Algorithmic TradingPortfolio ManagementQuantitative Finance
0 likes · 17 min read
STORM: A Bidirectional Spatiotemporal Factor Model Achieving Sharpe Ratio >1
Data Party THU
Data Party THU
Apr 3, 2026 · Artificial Intelligence

Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

The article reviews the Kimi team's Attention Residuals approach, which substitutes traditional ResNet additive shortcuts with learned attention‑based weighting, explains the theoretical motivation linking depth to time, details full‑attention and block‑wise implementations, presents experimental results showing up to 1.25× compute efficiency and improved performance on reasoning and knowledge tasks.

Attention MechanismDeep LearningResidual Networks
0 likes · 11 min read
Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough
ShiZhen AI
ShiZhen AI
Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV cacheLLM inferenceMemory Optimization
0 likes · 9 min read
How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs
ArcThink
ArcThink
Apr 2, 2026 · Artificial Intelligence

Why LLMs Forget You: Uncovering the Limits and Solutions for Long‑Term Memory

The article explains why large language models lack persistent memory due to the stateless Transformer architecture, breaks down the four dimensions of memory loss, surveys seven technical approaches, three product implementations, and emerging research, and discusses security and privacy implications.

AILLMLong-term Memory
0 likes · 22 min read
Why LLMs Forget You: Uncovering the Limits and Solutions for Long‑Term Memory
AI Explorer
AI Explorer
Apr 1, 2026 · Artificial Intelligence

Google Open‑Sources TimesFM: A Foundation Model for Plug‑and‑Play Time‑Series Forecasting

Google’s open‑source TimesFM is a decoder‑only Transformer foundation model that delivers plug‑and‑play time‑series forecasting with zero‑shot accuracy, larger context windows, quantile predictions, and a simple Hugging Face API, making it suitable for retail, energy, finance, monitoring, and IoT use cases.

Hugging FacePyTorchTimesFM
0 likes · 7 min read
Google Open‑Sources TimesFM: A Foundation Model for Plug‑and‑Play Time‑Series Forecasting
Data Party THU
Data Party THU
Mar 31, 2026 · Artificial Intelligence

Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Lookup MemorySTEM ArchitectureTransformer
0 likes · 8 min read
Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture
AI Large-Model Wave and Transformation Guide
AI Large-Model Wave and Transformation Guide
Mar 28, 2026 · Artificial Intelligence

From RNNs to Multimodal Agents: A Decade of Transformer Evolution

This article traces the evolution of sequence models from early RNN/LSTM designs through the breakthrough Transformer, its major branches, dense scaling, efficiency‑focused variants, next‑generation linear‑complexity SSMs, and finally multimodal agent architectures, highlighting each stage's strengths, weaknesses, and typical use cases.

AI ArchitectureLLMTransformer
0 likes · 12 min read
From RNNs to Multimodal Agents: A Decade of Transformer Evolution
Data Party THU
Data Party THU
Mar 26, 2026 · Artificial Intelligence

How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture‑of‑Depths Attention (MoDA) mechanism, detailing its novel flash‑compatible KV layout, combined sequence‑depth attention, theoretical analysis, and extensive experiments that show significant reductions in validation loss and accuracy gains on downstream tasks compared to the OLMo2 baseline.

Attention MechanismDeep KVFlashAttention
0 likes · 9 min read
How Mixture-of-Depths Attention Boosts Large Language Model Efficiency
Full-Stack Cultivation Path
Full-Stack Cultivation Path
Mar 23, 2026 · Artificial Intelligence

What Exactly Is a Token in LLMs? A First‑Principles Explanation

The article explains that a token is the smallest discrete text unit a large language model processes, detailing why tokenization is essential, how tokenizers work, how tokens flow through the transformer, and how token counts affect context windows, cost, latency, and overall model behavior.

Context WindowCost ManagementEmbedding
0 likes · 20 min read
What Exactly Is a Token in LLMs? A First‑Principles Explanation
SuanNi
SuanNi
Mar 17, 2026 · Artificial Intelligence

How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details full and block variants, engineering tricks for distributed training, and shows extensive scaling‑law experiments where the new design consistently improves validation loss and training efficiency across model sizes.

Attention ResidualsDeep LearningModel Scaling
0 likes · 13 min read
How Attention Residuals Boost Transformer Efficiency and Scale
ShiZhen AI
ShiZhen AI
Mar 17, 2026 · Artificial Intelligence

Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections used in Transformers for a decade, enabling selective aggregation of layer histories, reducing hidden‑state growth, and achieving a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than 2% inference latency increase.

Attention ResidualsCompute EfficiencyDeep Learning
0 likes · 10 min read
Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

HY‑WU: Real‑Time Adaptive AI Model That Generates Parameters On‑The‑Fly

HY‑WU demonstrates that generating model parameters dynamically during inference enables a single foundation model to perform diverse image‑editing tasks, outperforming fixed‑parameter baselines in human and automatic evaluations, benchmark tests, and conflict‑task experiments, highlighting a practical real‑time adaptation approach for AI systems.

HY-WULoRATransformer
0 likes · 16 min read
HY‑WU: Real‑Time Adaptive AI Model That Generates Parameters On‑The‑Fly
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 14, 2026 · Artificial Intelligence

Can Large Language Models Get Stronger Without Human Language Training? A New Pre‑Pre‑Training Path

A recent study shows that pre‑training Transformers on synthetic, non‑language data generated by Neural Cellular Automata can boost language‑model performance by up to 6%, accelerate convergence by 40%, and improve downstream reasoning, even outperforming models trained on massive natural‑text corpora.

In-Context LearningNeural Cellular AutomataPre‑training
0 likes · 12 min read
Can Large Language Models Get Stronger Without Human Language Training? A New Pre‑Pre‑Training Path
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Mar 14, 2026 · Artificial Intelligence

Quantitative Finance Paper Digest: AI‑Driven Market Prediction Studies (Mar 7‑13 2026)

This digest summarizes four recent research papers that apply advanced AI techniques—node‑transformer graphs with BERT sentiment analysis, a quantum‑classical LSTM‑Born machine hybrid, large‑language‑model benchmarking for portfolio optimization, and a conditional diffusion model—to improve stock market prediction, volatility forecasting, and investment decision making, providing detailed experimental results and statistical validation.

BERTQuantum ComputingTransformer
0 likes · 10 min read
Quantitative Finance Paper Digest: AI‑Driven Market Prediction Studies (Mar 7‑13 2026)
High Availability Architecture
High Availability Architecture
Mar 12, 2026 · Artificial Intelligence

How Claude Code Hits 92% Prompt Cache Rate and Slashes AI Agent Costs by 81%

This article explains the prompt‑caching mechanism used by Claude Code, showing how separating static prefixes from dynamic tails and leveraging KV‑tensor caching reduces the O(n²) complexity of transformer pre‑fill to O(n), achieving a 92% cache hit rate and up to 81% cost savings in long‑running AI agent sessions.

AI agentsClaudeCost reduction
0 likes · 12 min read
How Claude Code Hits 92% Prompt Cache Rate and Slashes AI Agent Costs by 81%
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 11, 2026 · Artificial Intelligence

Random Parameter Pruning Boosts Transferable Targeted Attacks Across Model Architectures

The RaPA method introduces random parameter pruning during adversarial generation, creating diverse model variants that markedly increase the success rate of targeted transfer attacks across CNN and Transformer architectures, even against defended models and with higher computational budgets, as demonstrated on ImageNet‑compatible benchmarks.

CNNTransformeradversarial attacks
0 likes · 14 min read
Random Parameter Pruning Boosts Transferable Targeted Attacks Across Model Architectures
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 10, 2026 · Artificial Intelligence

How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes

InfLLM‑V2 introduces a dense‑sparse switchable attention framework that preserves the original dense‑attention parameters while enabling efficient long‑context training, matching full‑attention performance on benchmarks such as RULER, LongBench, and chain‑reasoning tasks, and delivering up to 2.3× end‑to‑end inference speedup without degrading short‑sequence abilities.

InfLLM-V2Transformerdense-sparse attention
0 likes · 16 min read
How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 10, 2026 · Artificial Intelligence

Why the First Token Becomes a Value Garbage Bin – LeCun Team Dissects Spike and Attention Sink Mechanics

The paper by Yann LeCun’s team reveals that massive activation spikes and attention sinks in Transformers are not inherently coupled; spikes arise from position‑0 token interactions and specific feed‑forward dynamics, while attention sinks emerge from Pre‑norm normalization and head dimension, offering practical insights for model quantization and long‑context inference.

Attention SinkLLMMassive Activations
0 likes · 9 min read
Why the First Token Becomes a Value Garbage Bin – LeCun Team Dissects Spike and Attention Sink Mechanics
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 9, 2026 · Artificial Intelligence

Instant LoRA Generation and Long‑Document Internalization: Cost‑Amortized Model Updates via 0.1‑Second Forward Pass

The article analyzes the quadratic attention and KV‑Cache bottlenecks of Transformers on ultra‑long inputs and the heavy compute cost of traditional supervised fine‑tuning, then presents Sakana AI's Cost Amortization framework—Doc‑to‑LoRA and Text‑to‑LoRA—that shifts weight updates to a meta‑training hypernetwork, achieving sub‑50 MB memory for 128K‑token inference, sub‑GB update memory for long‑document QA, and zero‑shot task adaptation with sub‑second latency.

Cost AmortizationCross-modalLoRA
0 likes · 13 min read
Instant LoRA Generation and Long‑Document Internalization: Cost‑Amortized Model Updates via 0.1‑Second Forward Pass
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 7, 2026 · Artificial Intelligence

Transformer Hidden States Can Reconstruct Input with 100% Accuracy – New Invertibility Study

A recent paper from Sapienza University's GLADIA Lab shows that mainstream Transformer language models are injective, enabling a novel SIPIT algorithm to recover original text from hidden states with perfect accuracy, while extensive experiments confirm the models retain all input information.

InjectiveInvertibilityLanguage Model
0 likes · 11 min read
Transformer Hidden States Can Reconstruct Input with 100% Accuracy – New Invertibility Study
Data Party THU
Data Party THU
Mar 6, 2026 · Artificial Intelligence

How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge

This article chronicles the AdderBoard competition, detailing how researchers compressed a Transformer for 10‑digit addition down to just 121 parameters, the experimental rules, the contrasting hand‑coded and data‑driven approaches, and the insights gained about model minimalism and discoverability.

AdderBoardTransformermodel compression
0 likes · 13 min read
How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Identity Constraint Beats DeepSeek mHC After 150B Tokens: A Surprising Reversal

Extensive experiments on DeepSeek's 1.7B and 8B models reveal that replacing the manifold hyper‑connection (mHC) constraint with a simple identity matrix consistently outperforms the original mHC, improves signal flow stability, and avoids the collapse caused by repeated Sinkhorn‑Knopp projections.

DeepSeekHyper-ConnectionSinkhorn
0 likes · 12 min read
Identity Constraint Beats DeepSeek mHC After 150B Tokens: A Surprising Reversal
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path

The paper introduces JTok and its dynamic variant JTok‑M, a token‑indexed parameter scaling method that decouples model capacity from compute, achieving up to 35% compute reduction while delivering consistent performance gains across a wide range of downstream tasks and model sizes.

Compute EfficiencyJTokToken-indexed scaling
0 likes · 16 min read
Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path
Data STUDIO
Data STUDIO
Feb 25, 2026 · Artificial Intelligence

Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.

Fine-tuningGPTLLM
0 likes · 43 min read
Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts
Qborfy AI
Qborfy AI
Feb 21, 2026 · Artificial Intelligence

How Self-Attention Powers Modern AI: From Theory to Real-World Impact

This article explains the self‑attention mechanism behind transformers, detailing its core components, mathematical formulation, step‑by‑step example, multi‑head extension, industry use cases, and a thorough comparison with RNN and CNN approaches, all supported by concrete numbers and citations.

Attention MechanismDeep LearningSelf-Attention
0 likes · 8 min read
How Self-Attention Powers Modern AI: From Theory to Real-World Impact
Data Party THU
Data Party THU
Feb 21, 2026 · Artificial Intelligence

Unlocking Compositional Generalization: Meta‑Learning Strategies for Neural Networks

This article examines how meta‑learning combined with compositionality enables neural networks to rapidly adapt to new tasks by formalizing hierarchical optimization, leveraging modular architectures with hypernetworks, and exploiting Transformer latent codes for effective compositional generalization.

Bilevel OptimizationMeta LearningNeural Networks
0 likes · 5 min read
Unlocking Compositional Generalization: Meta‑Learning Strategies for Neural Networks
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Feb 18, 2026 · Artificial Intelligence

Which Loss Function Ranks Stocks Best? An Empirical Study with Transformer Models

This paper evaluates point‑wise, pair‑wise, and list‑wise loss functions for Transformer‑based stock‑return prediction on 110 S&P 500 stocks, showing that Margin loss achieves the highest annual return (16.23%) and Sharpe ratio (0.75), ListNet delivers strong returns with low volatility, and BPR minimizes maximum drawdown, highlighting how loss design critically shapes ranking‑driven portfolio performance.

Loss FunctionsQuantitative TradingStock Ranking
0 likes · 15 min read
Which Loss Function Ranks Stocks Best? An Empirical Study with Transformer Models
AI Cyberspace
AI Cyberspace
Feb 15, 2026 · Artificial Intelligence

From GPT-1 to GPT-4o: A Deep Dive into the Evolution of Large Language Models

This article chronicles the rapid progression of GPT models from the 2018 GPT‑1 pre‑training breakthrough through GPT‑2’s multitask learning, GPT‑3’s scaling laws and few‑shot abilities, to GPT‑4’s multimodal capabilities and the 2024 GPT‑4 Turbo, Sora, and GPT‑4o releases, while also explaining core LLM abilities and the decoder‑only architecture of GPT‑2.

AI evolutionFew‑Shot LearningGPT
0 likes · 20 min read
From GPT-1 to GPT-4o: A Deep Dive into the Evolution of Large Language Models
AI Cyberspace
AI Cyberspace
Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Deep LearningFeed-Forward NetworkPositional Encoding
0 likes · 39 min read
Unpacking the Transformer: From Embeddings to Multi‑Head Attention
AI Cyberspace
AI Cyberspace
Feb 13, 2026 · Artificial Intelligence

How Attention Mechanisms Revolutionized Computer Vision and Machine Translation

This article traces the evolution of attention mechanisms from their inaugural application in computer vision and machine translation to their central role in modern Transformer models, detailing the underlying RNN‑Attention designs, the breakthrough in sequence alignment, and the innovations that enabled high‑performance, parallelizable deep learning architectures.

Attention MechanismComputer VisionDeep Learning
0 likes · 14 min read
How Attention Mechanisms Revolutionized Computer Vision and Machine Translation
Data Party THU
Data Party THU
Feb 4, 2026 · Artificial Intelligence

How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained

This article analyzes Sakana AI's three recent papers that challenge traditional Transformer long‑sequence handling by removing positional embeddings, reconstructing position awareness, and adding a fast‑weight external memory, showing how each approach improves ultra‑long text understanding.

Memory MechanismPositional EmbeddingTransformer
0 likes · 12 min read
How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained
HyperAI Super Neural
HyperAI Super Neural
Feb 3, 2026 · Artificial Intelligence

Walrus: 1.3B Transformer Model Beats Prior Foundations Across 19 Physics Domains

Walrus, a 1.3 billion‑parameter Transformer built by Polymathic AI, is pretrained on 19 diverse physics scenarios—including astrophysics, geoscience, rheology, plasma physics and acoustics—using techniques like patch jittering, adaptive compute tokenization and space‑time factorized attention, and consistently outperforms earlier foundation models on both short‑ and long‑term continuum dynamics predictions.

TransformerWalruscontinuum dynamics
0 likes · 13 min read
Walrus: 1.3B Transformer Model Beats Prior Foundations Across 19 Physics Domains
Tencent Technical Engineering
Tencent Technical Engineering
Feb 2, 2026 · Artificial Intelligence

Why Neural Networks Are the Hidden Engine Behind Modern AI: From Basics to Large Language Models

This comprehensive guide walks through the fundamentals of neural networks, activation functions, training methods, and how they power large language models, while also covering tokenization, self‑attention, transformer architectures, AI infrastructure, and practical usage through agents and retrieval‑augmented generation.

Agent SystemsDeep LearningGPU infrastructure
0 likes · 75 min read
Why Neural Networks Are the Hidden Engine Behind Modern AI: From Basics to Large Language Models
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Jan 31, 2026 · Artificial Intelligence

How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

The article dissects DeepSeek’s new Engram architecture, which separates computation from memory by using a large, cheap‑RAM‑based lookup table to store factual knowledge, allowing the transformer’s compute layers to focus on reasoning, dramatically reducing GPU memory demand while improving code, math, and long‑context performance.

EngramGPU MemoryMemory-Compute Architecture
0 likes · 7 min read
How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge
HyperAI Super Neural
HyperAI Super Neural
Jan 23, 2026 · Artificial Intelligence

Weekly AI Paper Digest: New Transformer Advances in Sparsity, Memory, and Reasoning

This article reviews five recent Transformer papers—including Engram's conditional memory, STEM's embedding‑based scaling, SeedFold's biomolecular structure prediction, a critique of Transformers for time‑series forecasting, and reasoning models as societies of thought—highlighting their methods, datasets, and performance gains.

Biomolecular Structure PredictionMemory MechanismsReasoning Models
0 likes · 7 min read
Weekly AI Paper Digest: New Transformer Advances in Sparsity, Memory, and Reasoning
PaperAgent
PaperAgent
Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding LookupInterpretabilityMixture of Experts
0 likes · 6 min read
How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers
Java Tech Enthusiast
Java Tech Enthusiast
Jan 21, 2026 · Artificial Intelligence

Inside X’s Open‑Source Recommendation Engine: How the Grok‑Powered Transformer Works

X platform has open‑sourced its new "For You" recommendation system, revealing a Grok‑based Transformer architecture, detailed module breakdown, seven‑step content ranking pipeline, and the strategic motivations behind the unprecedented move toward algorithmic transparency and community‑driven improvement.

TransformerX Platformmachine learning
0 likes · 12 min read
Inside X’s Open‑Source Recommendation Engine: How the Grok‑Powered Transformer Works
PaperAgent
PaperAgent
Jan 20, 2026 · Artificial Intelligence

How X’s Open‑Source “For You” Recommendation Engine Works

X (formerly Twitter) has open‑sourced its “For You” recommendation algorithm, revealing a Grok‑based Transformer that merges on‑platform and off‑platform content, removes manual features, and scores posts through a multi‑stage pipeline with candidate sourcing, hydration, filtering, scoring, and selection.

TransformerX Platformgrok
0 likes · 5 min read
How X’s Open‑Source “For You” Recommendation Engine Works
Data Party THU
Data Party THU
Jan 19, 2026 · Artificial Intelligence

How VersatileFFN Cuts Memory Use While Boosting LLM Performance

The article introduces Huawei's VersatileFFN, an adaptive wide‑and‑deep feed‑forward design for large language models that reuses parameters to slash memory consumption while delivering stronger inference, detailing its dual‑system inspiration, technical mechanisms, experimental gains, and implications for efficient LLM deployment.

Adaptive ComputationLLMTransformer
0 likes · 8 min read
How VersatileFFN Cuts Memory Use While Boosting LLM Performance
AI Architecture Hub
AI Architecture Hub
Jan 19, 2026 · Artificial Intelligence

Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

This article breaks down the core components of the Transformer architecture—including input embedding, positional encoding, multi‑head self‑attention, residual connections with layer normalization, position‑wise feed‑forward networks, and the rationale behind stacking multiple encoder layers—using clear explanations and illustrative diagrams.

Add&NormDeep LearningFeed Forward
0 likes · 12 min read
Demystifying the Transformer: From Input Embedding to Multi‑Head Attention
AI Cyberspace
AI Cyberspace
Jan 13, 2026 · Artificial Intelligence

From Symbolic AI to LLMs: A Complete NLP History and Model Guide

This article provides a comprehensive overview of natural language processing, tracing its evolution from early symbolic and statistical stages through deep learning breakthroughs, detailing sequence models, key NLP tasks, text representation methods, and the development of modern architectures like RNN, LSTM, GRU, Transformer, and GPT series.

Deep LearningGPTLSTM
0 likes · 60 min read
From Symbolic AI to LLMs: A Complete NLP History and Model Guide
PaperAgent
PaperAgent
Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.

EngramLLM SparsityTransformer
0 likes · 8 min read
How Engram’s Conditional Memory Redefines Sparsity in Large Language Models
AI Architecture Hub
AI Architecture Hub
Jan 7, 2026 · Artificial Intelligence

Why “Attention Is All You Need” Still Shapes AI: A Beginner’s Deep Dive

This article provides a comprehensive, beginner‑friendly walkthrough of the landmark 2017 paper “Attention Is All You Need,” covering its authors, historical context, the shortcomings of RNNs and CNNs, the birth of self‑attention, the Transformer architecture, and its transformative impact on modern AI.

AI historyAttention MechanismDeep Learning
0 likes · 9 min read
Why “Attention Is All You Need” Still Shapes AI: A Beginner’s Deep Dive
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Jan 4, 2026 · Artificial Intelligence

How UniCodebook’s Unified 2D‑3D Discrete Priors Boost Noise‑Robust, Calibration‑Free 3D Human Pose Estimation

UniCodebook introduces a unified 2D‑3D discrete prior that combines continuous and discrete representations, enabling calibration‑free multiview 3D human pose estimation with superior noise robustness and higher accuracy, as demonstrated by state‑of‑the‑art results on Human3.6M and MPI‑INF‑3DHP.

3D pose estimationNeurIPS 2025Transformer
0 likes · 8 min read
How UniCodebook’s Unified 2D‑3D Discrete Priors Boost Noise‑Robust, Calibration‑Free 3D Human Pose Estimation
Tencent Technical Engineering
Tencent Technical Engineering
Dec 24, 2025 · Artificial Intelligence

Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer

This article walks through constructing a small large‑language model from the ground up, covering model architecture, tokenization methods, BPE vocabulary building, embedding, positional encoding, attention mechanisms, multi‑head attention, transformer blocks, training pipelines, inference, and sampling strategies, all with runnable Python code.

Deep LearningLLMPython
0 likes · 34 min read
Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer
HyperAI Super Neural
HyperAI Super Neural
Dec 22, 2025 · Artificial Intelligence

DA3 Enables Arbitrary‑View 3D Reconstruction with a Single Transformer

The ByteDance‑Seed team introduces Depth Anything 3 (DA3), a minimalist visual‑geometry model that uses a vanilla Transformer backbone and depth‑ray representation to jointly predict depth and camera pose from any number of images, achieving state‑of‑the‑art performance with a 35.7% gain in pose accuracy and a 23.6% improvement in geometric precision over prior methods.

3D visionDA3Depth estimation
0 likes · 6 min read
DA3 Enables Arbitrary‑View 3D Reconstruction with a Single Transformer
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Dec 21, 2025 · Artificial Intelligence

Why KV Caching Is Critical for Efficient LLM Inference

The article breaks down the principles of KV caching in large language models, explaining how Q/K/V behavior differs between training and inference, the role of prompts, cache size trade‑offs, and the complexities of concurrent inference, all backed by concrete examples and references.

Concurrent InferenceKV cacheLLM inference
0 likes · 7 min read
Why KV Caching Is Critical for Efficient LLM Inference
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Dec 19, 2025 · Artificial Intelligence

The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates transformer inference by combining nine techniques—including loss‑less attention, GPU memory‑pyramid optimization, SRAM‑reusing tiling, safe softmax scaling, online buffering, tile‑size constraints, parallel multiplication, reduced KV slicing, and integrated backward‑pass caching—to achieve efficient, high‑throughput computation on modern GPUs.

Attention MechanismFlashAttentionGPU Optimization
0 likes · 8 min read
The 9 Key Ideas Behind FlashAttention
AI Frontier Lectures
AI Frontier Lectures
Dec 17, 2025 · Artificial Intelligence

Can OmniVGGT Unlock Multi‑Modal 3D Vision with Any Number of Inputs?

OmniVGGT introduces a flexible omni‑modality driven transformer that can ingest arbitrary numbers of geometric cues such as depth maps and camera parameters, achieving state‑of‑the‑art performance on diverse 3D tasks while keeping inference speed comparable to its RGB‑only predecessor.

3D visionGeometryOmniVGGT
0 likes · 13 min read
Can OmniVGGT Unlock Multi‑Modal 3D Vision with Any Number of Inputs?
Architect
Architect
Dec 15, 2025 · Artificial Intelligence

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.

EmbeddingLLMMoE
0 likes · 41 min read
Demystifying LLM Architecture: From Transformers to Modern MoE Designs
Tencent Cloud Developer
Tencent Cloud Developer
Dec 9, 2025 · Artificial Intelligence

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of AI large language models, from turning user queries into token matrices via tokenization and embedding, through the Transformer’s self‑attention and multi‑head mechanisms, to decoding logits into human‑readable text, while also covering position encoding, long‑context strategies, generation parameters, and practical engineering tips.

Inference OptimizationSelf-AttentionTransformer
0 likes · 29 min read
How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers
ShiZhen AI
ShiZhen AI
Dec 5, 2025 · Artificial Intelligence

Can AI Achieve Human‑Like Long‑Term Memory? Inside Google’s Titans Architecture

Google’s newly unveiled Titans architecture tackles AI’s “forgetfulness” by embedding a Neural Long‑Term Memory (LMM) module that updates model weights during inference using a test‑time training approach and a MIRAS surprise metric, enabling over 2 million‑token context with linear O(N) computation and superior benchmark results versus GPT‑4 RAG.

AI ArchitectureGoogle TitansLong-term Memory
0 likes · 5 min read
Can AI Achieve Human‑Like Long‑Term Memory? Inside Google’s Titans Architecture
Tencent Technical Engineering
Tencent Technical Engineering
Dec 3, 2025 · Artificial Intelligence

Why Transformers Power Modern LLMs: A Deep Dive into Architecture and Mechanics

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture that underpins large language models, covering tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a detailed translation example, visualized attention weights, and a survey of recent open‑source LLM designs such as DeepSeek V3, OLMo 2, and Gemma 3.

EmbeddingLLMNeural Network
0 likes · 38 min read
Why Transformers Power Modern LLMs: A Deep Dive into Architecture and Mechanics
Wuming AI
Wuming AI
Nov 30, 2025 · Artificial Intelligence

What Exactly Is a Large Language Model? A Simple Guide to AI, Transformers, and How They Work

This article explains the relationship between AI, machine learning, deep learning, and large language models, detailing their evolution, training stages, transformer architecture, attention mechanisms, inference APIs, and practical usage examples, while demystifying common misconceptions about LLM capabilities.

AI fundamentalsDeep LearningRLHF
0 likes · 10 min read
What Exactly Is a Large Language Model? A Simple Guide to AI, Transformers, and How They Work
Huawei Cloud Developer Alliance
Huawei Cloud Developer Alliance
Nov 24, 2025 · Artificial Intelligence

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

This article explains why Transformer models dominate modern AI agents, outlines the challenges of large parameter counts and latency, and presents a comprehensive guide to model compression (parameter sharing, knowledge distillation, quantization, pruning) and inference acceleration (parallel computing, optimized attention, TensorRT deployment), complete with PyTorch code examples and a real‑world case study showing speed‑up and storage savings.

AI AgentInference AccelerationPyTorch
0 likes · 34 min read
How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration
Data Party THU
Data Party THU
Nov 13, 2025 · Artificial Intelligence

What Makes the Free Transformer a Game‑Changer in AI Decoding?

The Free Transformer paper introduces a decoder architecture that injects random latent variables to condition generation, breaking traditional GPT constraints and achieving notable performance gains on reasoning‑heavy benchmarks such as HumanEval+, MBPP, GSM8K, MMLU, and CSQA.

AI researchFree TransformerTransformer
0 likes · 10 min read
What Makes the Free Transformer a Game‑Changer in AI Decoding?
HyperAI Super Neural
HyperAI Super Neural
Nov 7, 2025 · Artificial Intelligence

QiankunNet: A Transformer‑Based Framework for Solving the Many‑Electron Schrödinger Equation

Researchers at the University of Science and Technology of China have introduced QiankunNet, a Transformer‑based network that integrates attention mechanisms with quantum wave‑function construction to solve many‑electron Schrödinger equations, achieving near‑FCI accuracy and outperforming traditional coupled‑cluster methods on benchmark molecules.

Many‑Electron SystemsNature CommunicationsQiankunNet
0 likes · 6 min read
QiankunNet: A Transformer‑Based Framework for Solving the Many‑Electron Schrödinger Equation
Tencent Cloud Developer
Tencent Cloud Developer
Nov 4, 2025 · Artificial Intelligence

From Functions to Transformers: Mastering Neural Networks Step by Step

This article walks you through the evolution from basic mathematical functions to modern large‑scale models, explaining activation functions, forward and backward propagation, loss calculation, gradient descent, regularization, dropout, word embeddings, RNNs, and the core mechanics of the Transformer architecture.

Attention MechanismDeep LearningNeural Networks
0 likes · 15 min read
From Functions to Transformers: Mastering Neural Networks Step by Step
Data Party THU
Data Party THU
Nov 2, 2025 · Artificial Intelligence

From RNN to LLM: How Transformers Power Modern Language Models

This article explains the evolution from RNNs through Encoder‑Decoder models to Transformers, detailing self‑attention, multi‑head attention, and masked attention, and then describes what Large Language Models are, their key components, capabilities, limitations, and common applications.

AIDeep LearningLLM
0 likes · 9 min read
From RNN to LLM: How Transformers Power Modern Language Models
Kuaishou Large Model
Kuaishou Large Model
Oct 31, 2025 · Artificial Intelligence

EMER: End-to-End Multi-Objective Ranking That Transforms Short-Video Recommendations

EMER, Kuaishou’s end‑to‑end multi‑objective ensemble ranking framework, replaces handcrafted scoring formulas with a transformer‑based model that learns comparative preferences, integrates normalized rank features, optimizes relative satisfaction and multi‑dimensional proxy metrics, and dynamically balances objectives via a self‑evolving advantage evaluator, delivering significant online gains.

Recommendation SystemsTransformermachine learning
0 likes · 17 min read
EMER: End-to-End Multi-Objective Ranking That Transforms Short-Video Recommendations
Huawei Cloud Developer Alliance
Huawei Cloud Developer Alliance
Oct 31, 2025 · Artificial Intelligence

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/conv/RNN hybrids, sparse and causal attention mechanisms, and outlines future trends that may complement or replace the classic Transformer architecture for handling ultra‑long sequences.

AIHybrid ArchitectureState Space Model
0 likes · 17 min read
Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling
HyperAI Super Neural
HyperAI Super Neural
Oct 30, 2025 · Artificial Intelligence

OmniCast Achieves 20× Speed Boost and Eliminates Autoregressive Error Accumulation in S2S Weather Forecasting

OmniCast, a novel latent diffusion model from UCLA and Argonne Lab, combines VAE and Transformer to generate high‑precision probabilistic sub‑seasonal to seasonal forecasts, dramatically reducing error accumulation of autoregressive methods and delivering 10‑20× faster inference while surpassing state‑of‑the‑art baselines across accuracy, physical consistency, and probabilistic metrics.

Deep LearningLatent DiffusionOmniCast
0 likes · 15 min read
OmniCast Achieves 20× Speed Boost and Eliminates Autoregressive Error Accumulation in S2S Weather Forecasting
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Oct 23, 2025 · Artificial Intelligence

FinCast: A Foundation Model for Financial Time‑Series Forecasting

FinCast introduces a decoder‑only Transformer foundation model for financial time‑series forecasting that tackles non‑stationarity, multi‑domain diversity, and multi‑resolution challenges through input chunking with frequency embeddings, a sparse MoE decoder, and a PQ‑loss, achieving zero‑shot and supervised gains over state‑of‑the‑art baselines while running five times faster on consumer GPUs.

PQ lossTransformerfinancial time series
0 likes · 12 min read
FinCast: A Foundation Model for Financial Time‑Series Forecasting
Wu Shixiong's Large Model Academy
Wu Shixiong's Large Model Academy
Oct 23, 2025 · Artificial Intelligence

Why the Transformer Core Structure Is the Key to AI Interview Success

This article explains the fundamental purpose, architecture, and variants of the Transformer model—including Encoder‑Decoder, Encoder‑only, and Decoder‑only designs—while detailing how attention mechanisms work and why modern large‑language models favor the Decoder‑only approach, providing a concise framework for answering interview questions.

AI InterviewEncoder-DecoderSelf-Attention
0 likes · 10 min read
Why the Transformer Core Structure Is the Key to AI Interview Success
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Oct 21, 2025 · Artificial Intelligence

KANMixer: A New KAN‑Centric Paradigm for Long‑Term Time Series Forecasting

This article reviews the KANMixer model, which places Kolmogorov‑Arnold Networks at the core of a lightweight architecture for long‑term time series forecasting, detailing its design, extensive benchmark experiments on seven real‑world datasets, ablation analyses, and its computational trade‑offs versus MLP and Transformer baselines.

Ablation StudyKANLong-term Time Series Forecasting
0 likes · 8 min read
KANMixer: A New KAN‑Centric Paradigm for Long‑Term Time Series Forecasting
Data Party THU
Data Party THU
Oct 21, 2025 · Artificial Intelligence

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture versus quadratic‑time Transformers, evaluating parameter‑data loss surfaces, optimal model size under equal FLOP budgets, and inference latency components, and shows that xLSTM consistently offers better cost‑effectiveness across diverse contexts and budgets.

Inference OptimizationLinear Time ComplexityTransformer
0 likes · 11 min read
Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer
AI2ML AI to Machine Learning
AI2ML AI to Machine Learning
Oct 19, 2025 · Artificial Intelligence

Deep Dive into nanochat: Source Code, Model Size Calculations, and Optimization Techniques

This article provides a thorough analysis of nanochat’s source code, detailing transformer component differences, precise parameter‑size formulas, FlashNorm and ReLU² innovations, scaling‑law insights, memory‑usage estimations, and the distributed optimizer and training pipelines used to build the model.

Distributed TrainingLLMTransformer
0 likes · 20 min read
Deep Dive into nanochat: Source Code, Model Size Calculations, and Optimization Techniques
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Oct 17, 2025 · Artificial Intelligence

LucaOne: Unified Nucleic Acid & Protein Language Model Surpasses Other Models

Researchers present LucaOne, a Transformer‑based foundation model that unifies DNA/RNA and protein sequences using a 39‑token vocabulary, rotary positional encoding, and molecule‑type embeddings, and demonstrate through extensive multi‑task benchmarks that it outperforms domain‑specific models across seven biological tasks.

DNATransformerbioinformatics
0 likes · 5 min read
LucaOne: Unified Nucleic Acid & Protein Language Model Surpasses Other Models
Data Party THU
Data Party THU
Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes Q, K, V tensors to drastically reduce KV cache size and attention complexity, and demonstrates superior convergence, lower perplexity, and faster inference on long‑sequence tasks compared with existing attention variants.

KV cacheRoPETensor Product Attention
0 likes · 11 min read
How Tensor Product Attention Redefines Long‑Context Transformers
Bighead's Algorithm Notes
Bighead's Algorithm Notes
Oct 11, 2025 · Artificial Intelligence

Recent Advances in Multivariate Time Series Forecasting: Paper Summaries (Sep 27 – Oct 10 2025)

This article summarizes eight newly released AI papers on multivariate time‑series forecasting and anomaly detection, detailing each work's motivation, proposed methodology, key innovations such as CRIB, TS‑JEPA, DSAT‑HD, DIMIGNN, ASTGI, IndexNet, TsLLM, Moon, TimeSeriesScientist, MLG‑4TS, and Augur, and reports their experimental validation on real‑world datasets.

Deep LearningTransformeranomaly detection
0 likes · 23 min read
Recent Advances in Multivariate Time Series Forecasting: Paper Summaries (Sep 27 – Oct 10 2025)
HyperAI Super Neural
HyperAI Super Neural
Oct 11, 2025 · Artificial Intelligence

Apple’s Flow‑Matching SimpleFold Slashes Compute Cost While Matching AlphaFold2 Accuracy

Apple’s newly released SimpleFold model leverages flow‑matching and a pure Transformer architecture to eliminate costly MSA and triangular updates, achieving performance comparable to AlphaFold2 and RoseTTAFold2 on CAMEO22 and CASP14 benchmarks while dramatically reducing computational requirements, and a step‑by‑step tutorial lets users run it on HyperAI’s platform.

AI modelHyperAISimpleFold
0 likes · 4 min read
Apple’s Flow‑Matching SimpleFold Slashes Compute Cost While Matching AlphaFold2 Accuracy
Data Party THU
Data Party THU
Oct 5, 2025 · Artificial Intelligence

How ImageDDI Boosts Drug‑Drug Interaction Prediction with Motif Sequences and Molecular Images

The ImageDDI framework, introduced by a team from Hunan University, combines molecular motif sequences with 2D/3D molecular images using a Transformer encoder and adaptive feature fusion, achieving significantly higher accuracy and macro‑F1 scores than existing methods on multiple DDI datasets, while also providing interpretable visual explanations.

Deep LearningDrug InteractionImage Fusion
0 likes · 10 min read
How ImageDDI Boosts Drug‑Drug Interaction Prediction with Motif Sequences and Molecular Images
Data Party THU
Data Party THU
Oct 4, 2025 · Artificial Intelligence

Unveiling Transformer Internals: From Theory to PyTorch Code

This article deeply explores the Transformer architecture by combining original paper principles with PyTorch source code, covering encoder‑decoder design, positional encoding assumptions, core parameters, residual connections, attention mechanisms, and detailed implementation snippets to help readers understand and reproduce the model.

Deep LearningNeural NetworksPositional Encoding
0 likes · 22 min read
Unveiling Transformer Internals: From Theory to PyTorch Code
MoonWebTeam
MoonWebTeam
Oct 1, 2025 · Artificial Intelligence

Unlocking ChatGPT: A Deep Dive into Transformers, Tokenization, and Self‑Attention

This tutorial walks through the fundamentals of ChatGPT by explaining language modeling, character‑level tokenization, data preprocessing pipelines, the evolution from simple bigram models to scaled dot‑product self‑attention, multi‑head mechanisms, full Transformer blocks, and how to train and generate Shakespeare‑style text with a GPT model.

ChatGPTGPTLanguage Modeling
0 likes · 50 min read
Unlocking ChatGPT: A Deep Dive into Transformers, Tokenization, and Self‑Attention