Tagged articles

Multi-Head Attention

15 articles · Page 1 of 1

Jun 13, 2026 · Artificial Intelligence

What Interviewers Expect: Understanding Transformers Beyond Codex and AI Code Generation

The article explains why modern interviewers ask about Transformer fundamentals, breaks down its core components such as self‑attention, multi‑head attention, feed‑forward networks, residual connections and positional encodings, and demonstrates a complete PyTorch toy model that predicts the sum‑mod‑10 of integer sequences while visualizing loss curves, attention heatmaps, embedding PCA and early‑stage gradient norms.

Gradient AnalysisModel VisualizationMulti-Head Attention

0 likes · 20 min read

What Interviewers Expect: Understanding Transformers Beyond Codex and AI Code Generation

CodePath

Jun 3, 2026 · Artificial Intelligence

A Deliberate Paradigm Shift: How “Attention Is All You Need” Reshaped Deep Learning

The article dissects how the 2017 "Attention Is All You Need" paper sparked a fundamental redesign of sequence modeling by replacing recurrent and convolutional approaches with self‑attention, detailing its mathematical foundations, architectural components, training tricks, limitations, and emerging alternatives such as Mamba.

Attention MechanismMambaMulti-Head Attention

0 likes · 24 min read

A Deliberate Paradigm Shift: How “Attention Is All You Need” Reshaped Deep Learning

Shi's AI Notebook

Mar 16, 2026 · Artificial Intelligence

What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion

This article walks through MiniMind's Attention.forward implementation, explaining why Q, K, and V are created, how tensors are reshaped for multi‑head attention, the role of masks, KV cache, GQA, and how each token aggregates information from the entire context.

KV cacheMulti-Head AttentionTransformer

0 likes · 21 min read

What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion

AI Cyberspace

Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Feed-Forward NetworkMulti-Head AttentionPositional Encoding

0 likes · 39 min read

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

AI Architecture Hub

Jan 19, 2026 · Artificial Intelligence

Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

This article breaks down the core components of the Transformer architecture—including input embedding, positional encoding, multi‑head self‑attention, residual connections with layer normalization, position‑wise feed‑forward networks, and the rationale behind stacking multiple encoder layers—using clear explanations and illustrative diagrams.

Add&NormFeed ForwardInput Embedding

0 likes · 12 min read

Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

Wu Shixiong's Large Model Academy

Sep 25, 2025 · Artificial Intelligence

Master Self-Attention & Multi-Head Attention for Large Model Interviews

This guide breaks down the core logic, computation steps, formulas, and common interview questions about Self‑Attention and Multi‑Head Attention in Transformers, offering concrete explanations, dimensional examples, and practical answering techniques to help candidates ace large‑model algorithm interviews.

Interview TipsMulti-Head AttentionSelf-Attention

0 likes · 8 min read

Master Self-Attention & Multi-Head Attention for Large Model Interviews

Cognitive Technology Team

Jun 29, 2025 · Artificial Intelligence

Understanding Transformers: Core Mechanics Behind Modern AI Models

This article demystifies the Transformer architecture for beginners, explaining its relationship to large models, the self‑attention and multi‑head attention mechanisms, positional encoding, and the roles of Encoder and Decoder components, using clear analogies and visual diagrams to aid comprehension.

Encoder-DecoderMulti-Head AttentionPositional Encoding

0 likes · 20 min read

Understanding Transformers: Core Mechanics Behind Modern AI Models

Network Intelligence Research Center (NIRC)

Mar 26, 2025 · Artificial Intelligence

Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention LLMs to DeepSeek’s Multi‑Head Latent Attention architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.

LLMLow-Rank ApproximationMulti-Head Attention

0 likes · 8 min read

Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

Tencent Cloud Developer

Mar 5, 2025 · Artificial Intelligence

DeepSeek Series Overview: Core Technologies, Model Innovations, and Product Highlights

The article delivers a PPT‑style deep dive into the DeepSeek series—from the original LLM through DeepSeek‑MoE, Math, V2, V3 and R1—highlighting core innovations such as Multi‑Head Latent Attention, fine‑grained MoE, GRPO reinforcement learning, Multi‑Token Prediction, DualPipe parallelism and FP8 training that together achieve high performance at a fraction of traditional costs, and notes their integration into Tencent’s OlaChat intelligent assistant.

AIDeepSeekFP8 training

0 likes · 21 min read

DeepSeek Series Overview: Core Technologies, Model Innovations, and Product Highlights

Architect

Mar 19, 2024 · Artificial Intelligence

How Transformers Power Modern NLP: A Deep Dive into Encoder‑Decoder Mechanics

This article explains the core principles of Transformer models—covering input embeddings, self‑attention, multi‑head attention, positional encoding, feed‑forward networks, and decoder strategies—using concrete examples like "The cat sat on the mat" and "The quick brown fox jumps over the lazy dog" to illustrate each step.

Encoder-DecoderFeed-Forward NetworkMulti-Head Attention

0 likes · 13 min read

How Transformers Power Modern NLP: A Deep Dive into Encoder‑Decoder Mechanics

Code DAO

Dec 8, 2021 · Artificial Intelligence

Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC

This article walks through the design of Compact Transformers, explaining scaled dot‑product self‑attention, positional embeddings, multi‑head attention, and Vision Transformer architecture, and provides full PyTorch code so readers can train lightweight CV and NLP classifiers on a single PC.

Compact TransformersMulti-Head AttentionPatch Embedding

0 likes · 19 min read

Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC

Douyu Streaming

Oct 20, 2021 · Artificial Intelligence

How DeepXi and MHANet Revolutionize Speech Enhancement with Multi‑Head Attention

DeepXi introduces a two‑stage deep learning framework for speech enhancement, using prior SNR estimation and MMSE gain, while the MHANet extension leverages multi‑head attention to model long‑range dependencies, with detailed training strategies, model compression to GRU, deployment via TFLite, and impressive low‑latency results.

GRUMulti-Head AttentionTFLite

0 likes · 8 min read

How DeepXi and MHANet Revolutionize Speech Enhancement with Multi‑Head Attention

TiPaiPai Technical Team

May 31, 2021 · Artificial Intelligence

Understanding Transformers: Self‑Attention, Multi‑Head Mechanisms, and Positional Encoding

This article explains the Transformer architecture—its self‑attention core, multi‑head attention, positional encoding, encoder‑decoder structure, and how it overcomes RNN limitations, providing a foundation for its use in NLP, image detection, and OCR.

Multi-Head AttentionNLPPositional Encoding

0 likes · 7 min read

Understanding Transformers: Self‑Attention, Multi‑Head Mechanisms, and Positional Encoding

Sohu Tech Products

Nov 11, 2020 · Artificial Intelligence

Illustrated Transformer: Comprehensive Explanation and Code Implementation

This article provides a step‑by‑step illustrated guide to the Transformer architecture, covering its macro structure, detailed self‑attention mechanisms, multi‑head attention, positional encoding, residual connections, decoder operation, training process, loss functions, and includes complete PyTorch and custom Python code examples.

Multi-Head AttentionNLPPyTorch

0 likes · 33 min read

Illustrated Transformer: Comprehensive Explanation and Code Implementation

Sohu Tech Products

Jan 9, 2019 · Artificial Intelligence

Understanding the Transformer Model: Attention, Self‑Attention, and Multi‑Head Mechanisms

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture, covering its encoder‑decoder structure, self‑attention, multi‑head attention, positional encoding, residual connections, and training processes, illustrated with diagrams and code snippets to aid readers new to neural machine translation.

Multi-Head AttentionNeural Machine TranslationPositional Encoding

0 likes · 16 min read

Understanding the Transformer Model: Attention, Self‑Attention, and Multi‑Head Mechanisms