Tagged articles

Self-Attention

61 articles · Page 1 of 1

Jun 13, 2026 · Artificial Intelligence

What Interviewers Expect: Understanding Transformers Beyond Codex and AI Code Generation

The article explains why modern interviewers ask about Transformer fundamentals, breaks down the architecture's core components such as self‑attention, multi‑head attention, feed‑forward networks, residual connections and positional encodings, and demonstrates a complete PyTorch toy model that predicts the sum‑mod‑10 of integer sequences while visualizing loss curves, attention heatmaps, embedding PCA and early‑stage gradient norms.

Deep Learning · Gradient Analysis · Model Visualization

0 likes · 20 min read

CodePath

Jun 3, 2026 · Artificial Intelligence

A Deliberate Paradigm Shift: How “Attention Is All You Need” Reshaped Deep Learning

The article dissects how the 2017 "Attention Is All You Need" paper sparked a fundamental redesign of sequence modeling by replacing recurrent and convolutional approaches with self‑attention, detailing its mathematical foundations, architectural components, training tricks, limitations, and emerging alternatives such as Mamba.

Attention Mechanism · Mamba · Multi-Head Attention

0 likes · 24 min read

Mike Chen's Internet Architecture

May 21, 2026 · Artificial Intelligence

Demystifying AI Large Models: Architecture, Principles, and Workflow

The article explains that large language models are massive probability engines built on the Transformer architecture with self‑attention, trained through costly pre‑training on trillions of tokens, then refined by instruction fine‑tuning and RLHF, ultimately predicting the next token to generate text.

Large Language Model · RLHF · Self-Attention

0 likes · 5 min read

Machine Heart

Apr 14, 2026 · Artificial Intelligence

Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes

A developer recreated a 1970s PDP‑11 environment, wrote a single‑layer, single‑head Transformer in assembly, and trained it on a sequence‑reversal task, achieving 100% accuracy after about 350 steps and a total training time of roughly 5.5 minutes.

Assembly · Low-resource AI · PDP-11

0 likes · 16 min read

AI Frontier Lectures

Mar 19, 2026 · Artificial Intelligence

Why Sharing Parameters in Vision Transformers Hurts Performance—and How Layer Specialization Fixes It

The article analyzes the hidden conflict between [CLS] and patch tokens in Vision Transformers, reveals how shared normalization and linear layers cause computational friction, and demonstrates that layer‑specific parameters dramatically improve dense prediction tasks without increasing inference FLOPs.

Dense Prediction · Layer Specialization · Self-Attention

0 likes · 9 min read

Data STUDIO

Feb 25, 2026 · Artificial Intelligence

Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.

GPT · LLM · PyTorch

0 likes · 43 min read

Qborfy AI

Feb 21, 2026 · Artificial Intelligence

How Self-Attention Powers Modern AI: From Theory to Real-World Impact

This article explains the self‑attention mechanism behind transformers, detailing its core components, mathematical formulation, step‑by‑step example, multi‑head extension, industry use cases, and a thorough comparison with RNN and CNN approaches, all supported by concrete numbers and citations.

Attention Mechanism · Deep Learning · Self-Attention

0 likes · 8 min read

AI Algorithm Path

Feb 16, 2026 · Artificial Intelligence

Why Visual Tokenizers Bridge the Gap Between Pixels and Meaning

Vision‑language models turn continuous images into discrete tokens through patch extraction, encoding, and projection, enabling Transformers to reason jointly over vision and text, but this compression introduces limits in spatial reasoning, counting, and resolution sensitivity that users must understand.

Self-Attention · counting · multimodal fusion

0 likes · 22 min read

AI Cyberspace

Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Deep Learning · Feed-Forward Network · Multi-Head Attention

0 likes · 39 min read

Bighead's Algorithm Notes

Dec 28, 2025 · Artificial Intelligence

Paper Reading: Multi‑Cycle Learning Framework (MLF) for Financial Time‑Series Forecasting

The paper introduces MLF, a multi‑cycle learning framework that integrates three novel modules—inter‑cycle redundancy filtering (IRF), learnable weighted integration (LWI), and multi‑cycle adaptive patch (MAP)—plus a patch‑squeeze component, achieving higher accuracy and efficiency on financial time‑series tasks such as fund‑sales prediction and outperforming strong single‑ and multi‑cycle baselines, with successful deployment in Alipay’s fund inventory system.

Alipay deployment · Self-Attention · Time Series Forecasting

0 likes · 16 min read

Tencent Cloud Developer

Dec 9, 2025 · Artificial Intelligence

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of AI large language models, from turning user queries into token matrices via tokenization and embedding, through the Transformer’s self‑attention and multi‑head mechanisms, to decoding logits into human‑readable text, while also covering position encoding, long‑context strategies, generation parameters, and practical engineering tips.

Inference Optimization · Large Language Models · Self-Attention

0 likes · 29 min read

Wu Shixiong's Large Model Academy

Oct 23, 2025 · Artificial Intelligence

Why the Transformer Core Structure Is the Key to AI Interview Success

This article explains the fundamental purpose, architecture, and variants of the Transformer model—including Encoder‑Decoder, Encoder‑only, and Decoder‑only designs—while detailing how attention mechanisms work and why modern large‑language models favor the Decoder‑only approach, providing a concise framework for answering interview questions.

AI interview · Encoder-Decoder · Large Language Model

0 likes · 10 min read

MoonWebTeam

Oct 1, 2025 · Artificial Intelligence

Unlocking ChatGPT: A Deep Dive into Transformers, Tokenization, and Self‑Attention

This tutorial walks through the fundamentals of ChatGPT by explaining language modeling, character‑level tokenization, data preprocessing pipelines, the evolution from simple bigram models to scaled dot‑product self‑attention, multi‑head mechanisms, full Transformer blocks, and how to train and generate Shakespeare‑style text with a GPT model.

ChatGPT · GPT · Language Modeling

0 likes · 50 min read

Volcano Engine Developer Services

Sep 28, 2025 · Artificial Intelligence

Demystifying AI Jargon: A Beginner’s Guide to Large Language Models

This guide breaks down the complex terminology of large language models—explaining tokens, transformers, self‑attention, RAG, scaling laws, dense vs. sparse architectures, and training stages—using clear analogies and step‑by‑step explanations so readers can confidently understand and work with modern AI systems.

AI Fundamentals · Large Language Models · Model Training

0 likes · 35 min read

Wu Shixiong's Large Model Academy

Sep 25, 2025 · Artificial Intelligence

Master Self-Attention & Multi-Head Attention for Large Model Interviews

This guide breaks down the core logic, computation steps, formulas, and common interview questions about Self‑Attention and Multi‑Head Attention in Transformers, offering concrete explanations, dimensional examples, and practical answering techniques to help candidates ace large‑model algorithm interviews.

Deep Learning · Interview Tips · Multi-Head Attention

0 likes · 8 min read

Qborfy AI

Aug 12, 2025 · Artificial Intelligence

What Powers Large Language Models? A Deep Dive into LLM Architecture and Scaling

This article explains how massive Transformer‑based large language models compress text data into mathematical representations, why scale, self‑attention, and training paradigms enable emergent general intelligence, and walks through tokenization, embedding, multi‑layer attention, architecture choices, energy costs, and hallucination mitigation.

AI · Embedding · LLM

0 likes · 6 min read

Data Party THU

Aug 9, 2025 · Artificial Intelligence

How the MPCT Multiscale Point Cloud Transformer Boosts 3D Classification Accuracy

This article reviews the MPCT framework—a multiscale point‑cloud transformer built on a residual network that leverages permutation‑invariant self‑attention, point‑enhancement, and hierarchical feature aggregation to achieve state‑of‑the‑art results on ModelNet40 and ScanObjectNN datasets.

3D classification · Self-Attention · multiscale

0 likes · 14 min read

Qborfy AI

Aug 8, 2025 · Artificial Intelligence

Why Transformers Revolutionized AI: A Deep Dive into Self‑Attention

This article explains how the Transformer model replaces sequential RNN processing with parallel self‑attention, detailing its core components, positional encoding, encoder‑decoder workflow, industry impact, and surprising facts such as training speed gains and energy efficiency.

AI · Deep Learning · Self-Attention

0 likes · 5 min read

Alibaba Cloud Developer

Aug 6, 2025 · Artificial Intelligence

How Transformers Revolutionize Sequence Modeling: From RNN Limits to Self‑Attention Mastery

This article explains why Transformer models surpass traditional RNN‑based seq2seq architectures by introducing self‑attention, multi‑head attention, and positional encoding, detailing the inner workings of encoders, decoders, and attention mechanisms, and comparing their advantages and limitations across NLP and vision tasks.

GRU · LSTM · RNN

0 likes · 30 min read

Cognitive Technology Team

Jun 29, 2025 · Artificial Intelligence

Understanding Transformers: Core Mechanics Behind Modern AI Models

This article demystifies the Transformer architecture for beginners, explaining its relationship to large models, the self‑attention and multi‑head attention mechanisms, positional encoding, and the roles of Encoder and Decoder components, using clear analogies and visual diagrams to aid comprehension.

Artificial Intelligence · Deep Learning · Encoder-Decoder

0 likes · 20 min read

MaGe Linux Operations

Jun 15, 2025 · Artificial Intelligence

Mastering Transformers: Key Extensions and Optimization Techniques Explained

This comprehensive guide walks you through the Transformer architecture—from its encoder‑decoder structure and self‑attention mechanism to multi‑head attention, positional embeddings, and practical PyTorch implementations—providing clear visualizations and code examples for deep learning practitioners.

Deep Learning · PyTorch · Self-Attention

0 likes · 22 min read

Tencent Technical Engineering

Apr 16, 2025 · Artificial Intelligence

Understanding Transformer Architecture for Chinese‑English Translation: A Practical Guide

This practical guide walks through the full Transformer architecture for Chinese‑to‑English translation, detailing encoder‑decoder structure, tokenization and embeddings, batch handling with padding and masks, positional encodings, parallel teacher‑forcing, self‑ and multi‑head attention, and the complete forward and back‑propagation training steps.

Machine Translation · Positional Encoding · PyTorch

0 likes · 26 min read

AI Algorithm Path

Apr 10, 2025 · Artificial Intelligence

Beginner-Friendly Guide to Understanding Large Language Models

This article walks readers through the fundamentals of large language models, covering what tokens are, how tokenization works, the conversion of tokens to numeric IDs, the transformer architecture—including positional encoding, self‑attention, feed‑forward networks and softmax—and explains how these components enable next‑token prediction.

Artificial Intelligence · Embedding · LLM

0 likes · 9 min read

Cognitive Technology Team

Mar 10, 2025 · Artificial Intelligence

Understanding Transformers: From NLP Challenges to Architecture and Core Mechanisms

This article explains the evolution of natural language processing, the limitations of rule‑based, statistical, and recurrent neural network models, and then introduces the Transformer architecture—covering word and position embeddings, self‑attention, multi‑head attention, Add & Norm, feed‑forward layers, and encoder‑decoder design—to help beginners grasp why Transformers solve key NLP problems.

AI · NLP · Self-Attention

0 likes · 15 min read

Alibaba Cloud Developer

Mar 10, 2025 · Artificial Intelligence

Why Transformers Revolutionized NLP: From Problems to Solutions

This article explains the historical challenges of natural language processing, from rule‑based and statistical models to recurrent networks and their limitations, then introduces the Transformer architecture, its self‑attention mechanism, multi‑head attention, and supporting layers, illustrating how it overcomes previous issues and enables efficient parallel training.

Artificial Intelligence · NLP · Self-Attention

0 likes · 16 min read

AI Large Model Application Practice

Feb 28, 2025 · Artificial Intelligence

How Self-Attention Powers LLMs: A Step‑by‑Step Deep Dive

This article explains the self‑attention mechanism behind large language models, detailing why static word importance fails, how queries, keys, and values are generated, how attention scores are computed, scaled, softmaxed, and used to produce context‑aware word vectors, while noting computational costs.

AI · LLM · Self-Attention

0 likes · 9 min read

AIWalker

Jan 11, 2025 · Artificial Intelligence

CAS-ViT: The Fastest, Strongest Vision Transformer for Mobile Image Classification & Detection

CAS‑ViT introduces a convolutional additive self‑attention mechanism that dramatically reduces the computational cost of Vision Transformers, achieving state‑of‑the‑art accuracy on image classification, object detection, and segmentation while being deployable on mobile devices.

Efficient Models · Self-Attention · Vision Transformer

0 likes · 19 min read

Architect's Alchemy Furnace

Sep 16, 2024 · Artificial Intelligence

Why Transformers Revolutionize AI: From Basics to Advanced Applications

This article explains what AI Transformers are, why they matter, their key components and mechanisms, various applications ranging from language processing to bioinformatics, and how they differ from traditional neural networks, providing a comprehensive overview of Transformer architecture and its impact on modern AI research.

AI · Deep Learning · Self-Attention

0 likes · 20 min read

JavaEdge

Jul 22, 2024 · Artificial Intelligence

What Is a Transformer and Why It’s Transforming AI?

This article explains the fundamentals of transformer models, why they outperform earlier neural networks, their core components such as self‑attention and positional encoding, practical use cases across language and biology, and how they differ from RNNs, CNNs, and other architectures.

AI · Deep Learning · Self-Attention

0 likes · 20 min read

JD Cloud Developers

Jun 25, 2024 · Artificial Intelligence

Why Do Large Language Models Output Text Word‑by‑Word? Inside the Transformer Mechanics

This article explains the fundamental architecture of large language models, from the two‑file nature of a model (parameters and code), through neural network basics, perceptrons, and weight training, to the Transformer’s tokenization, positional encoding, self‑attention, and inference processes, illustrated with diagrams and examples.

Large Language Model · Neural Network · Self-Attention

0 likes · 22 min read

JD Tech Talk

Jun 25, 2024 · Artificial Intelligence

Understanding Large Language Models: From Parameters to Transformer Architecture

This article explains the fundamental concepts behind large language models, including their two-file structure, training process, neural network basics, perceptron examples, weight and threshold calculations, the TensorFlow Playground, and a detailed walkthrough of the Transformer architecture with tokenization, positional encoding, self‑attention, normalization, and feed‑forward layers.

AI · Large Language Models · Self-Attention

0 likes · 20 min read

Rare Earth Juejin Tech Community

Jun 12, 2024 · Artificial Intelligence

A Simple Introduction to the Transformer Model

This article provides a comprehensive, beginner-friendly explanation of the Transformer architecture, covering its encoder‑decoder structure, self‑attention, multi‑head attention, positional encoding, residual connections, decoding process, final linear and softmax layers, and training considerations, illustrated with numerous diagrams and code snippets.

Deep Learning · Machine Translation · Self-Attention

0 likes · 24 min read

JD Tech

Jun 7, 2024 · Artificial Intelligence

Understanding Attention Mechanisms, Self‑Attention, and Multi‑Head Attention in Transformers

This article explains the fundamentals of attention mechanisms, covering their biological inspiration and the evolution from early visual attention to modern self‑attention in Transformers. It details the scaled dot‑product calculations, positional encoding, and multi‑head attention, and illustrates how these concepts enable efficient parallel processing of sequence data.

AI · Positional Encoding · Self-Attention

0 likes · 12 min read

NewBeeNLP

Apr 26, 2024 · Artificial Intelligence

Self-Attention vs Virtual Nodes in Graph Neural Networks: What Really Works?

This article reviews the paper “Distinguished in Uniform: Self-Attention vs. Virtual Nodes,” comparing graph Transformers and MPGNNs with virtual nodes on theoretical consistency and experimental performance, revealing that neither approach universally dominates the other.

Graph Neural Networks · MPGNN · Self-Attention

0 likes · 9 min read

Architect

Mar 19, 2024 · Artificial Intelligence

How Transformers Power Modern NLP: A Deep Dive into Encoder‑Decoder Mechanics

This article explains the core principles of Transformer models—covering input embeddings, self‑attention, multi‑head attention, positional encoding, feed‑forward networks, and decoder strategies—using concrete examples like "The cat sat on the mat" and "The quick brown fox jumps over the lazy dog" to illustrate each step.

Encoder-Decoder · Feed-Forward Network · Multi-Head Attention

0 likes · 13 min read

Ops Development & AI Practice

Mar 17, 2024 · Artificial Intelligence

Why the Transformer Model Revolutionized AI and How It Works

This article explains the Transformer architecture, its self‑attention mechanism, encoder‑decoder design, and the profound impact it has had on natural language processing, computer vision, and large‑scale language models like GPT.

AI Architecture · Deep Learning · NLP

0 likes · 6 min read

Sohu Tech Products

Jul 26, 2023 · Artificial Intelligence

Attention Mechanism, Transformer Architecture, and BERT: An In-Depth Overview

This article provides a comprehensive overview of the attention mechanism, its mathematical foundations, the transformer model architecture—including encoder and decoder components—and the BERT pre‑training model, detailing their principles, implementations, and applications in natural language processing.

Attention Mechanism · BERT · Encoder-Decoder

0 likes · 13 min read

21CTO

Apr 27, 2023 · Artificial Intelligence

Demystifying Transformers: A Step‑by‑Step Guide to Self‑Attention and Architecture

This article explains the Transformer model—from its encoder‑decoder structure and self‑attention mechanism to multi‑head attention, positional encoding, residual connections, training loss, and inference strategies—providing a clear, visual walkthrough for readers new to modern NLP architectures.

Deep Learning · Machine Translation · Self-Attention

0 likes · 21 min read

IT Services Circle

Mar 2, 2023 · Artificial Intelligence

Understanding GPT: Word Vectors, Transformers, and Model Architectures (GPT‑2, GPT‑3)

This article provides a concise technical overview of GPT, explaining how word vectors are constructed, how the Transformer architecture with self‑attention and feed‑forward layers processes these vectors, and how GPT‑2 and GPT‑3 extend the model with decoder‑only and large‑scale designs.

AI · GPT · Self-Attention

0 likes · 8 min read

DataFunSummit

Feb 16, 2023 · Artificial Intelligence

Understanding the Transformer Model and Self‑Attention Mechanism with a Complete PyTorch Implementation

This article introduces the Transformer architecture, explains the self‑attention mechanism with visual illustrations, and provides a full, runnable PyTorch code example that implements the encoder‑decoder structure for sequence‑to‑sequence tasks.

NLP · PyTorch · Self-Attention

0 likes · 11 min read

Rare Earth Juejin Tech Community

Oct 10, 2022 · Artificial Intelligence

A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers

This article introduces the fundamentals of Vision Transformers (ViT) for computer‑vision developers, starting with an overview of the transformer architecture, followed by a detailed explanation of self‑attention and multi‑head attention and step‑by‑step PyTorch code examples that illustrate query, key, and value computation and attention scoring.

PyTorch · Self-Attention · Transformer

0 likes · 12 min read

vivo Internet Technology

Aug 24, 2022 · Frontend Development

Applying Self-Attention Based Machine Learning Model to Design-to-Code Layout Prediction

vivo’s frontend team built a self‑attention‑based machine‑learning model that predicts web‑page layout types (column, row, or absolute) from node dimensions and positions, resolving parent‑child and sibling relationships for design‑to‑code conversion. The model achieves 99.4% accuracy using over 20k labeled, crawled, and generated samples, and the article outlines further enhancements.

D2C · Neural Network · Self-Attention

0 likes · 11 min read

JD Cloud Developers

Aug 15, 2022 · Artificial Intelligence

How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss

This article explains how the Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism reduces BERT’s computational cost by over 50% while keeping accuracy loss under 1%, detailing the method, experimental results, and its significance for efficient large‑scale language models.

BERT · Deep Learning · FCA

0 likes · 8 min read

Baidu Geek Talk

Mar 28, 2022 · Artificial Intelligence

Robust Input Visualization Methods for Vision Transformers

The paper proposes a robust Grad‑CAM‑inspired visualization for Vision Transformers that combines attention weights and gradients to generate class‑specific saliency maps, demonstrates superior alignment with discriminative regions across ViT, Swin, and VOLO models, and shows a 76% false‑positive reduction in Baidu’s porn‑content risk control system.

Deep Learning · Grad-CAM · Input Visualization

0 likes · 11 min read

Baobao Algorithm Notes

Jan 14, 2022 · Artificial Intelligence

BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

An in‑depth Q&A breaks down core BERT concepts—from the purpose of the [CLS] token and masking strategies to self‑attention complexity, sparse attention tricks, subword handling of OOV words, warm‑up learning rates, GPT’s unidirectional nature, and ALBERT’s parameter sharing—providing concise explanations for each.

BERT · Masking · Self-Attention

0 likes · 7 min read

Code DAO

Dec 29, 2021 · Artificial Intelligence

Understanding Stand-Alone Axial-Attention for Panoptic Segmentation

The paper proposes a stand‑alone axial‑attention mechanism that converts 2‑D attention into 1‑D to lower computational cost while preserving global context, introduces position‑sensitive self‑attention, integrates it into Axial‑ResNet and Axial‑DeepLab, and demonstrates strong results on four large segmentation datasets.

Axial Attention · DeepLab · Panoptic Segmentation

0 likes · 7 min read

Code DAO

Dec 8, 2021 · Artificial Intelligence

Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC

This article walks through the design of Compact Transformers, explaining scaled dot‑product self‑attention, positional embeddings, multi‑head attention, and Vision Transformer architecture, and provides full PyTorch code so readers can train lightweight CV and NLP classifiers on a single PC.

Compact Transformers · Multi-Head Attention · Patch Embedding

0 likes · 19 min read

AntTech

Oct 29, 2021 · Artificial Intelligence

Ant Insurance Technology and CASIA Win Two Tracks at MuSe2021 Multimodal Sentiment Challenge (ACM MM 2021)

The Ant Insurance Technology team, together with the Institute of Automation of the Chinese Academy of Sciences, secured first place in both the MuSe‑Wilder and MuSe‑Sent tracks of the MuSe2021 Multimodal Sentiment Challenge held at the 29th ACM International Conference on Multimedia in Chengdu, showcasing advanced multimodal AI techniques.

BiLSTM · Deep Learning · MuSe2021

0 likes · 4 min read

TiPaiPai Technical Team

Jun 11, 2021 · Artificial Intelligence

How Transformers Revolutionize Vision: From DETR to GCNet

This article explores how Transformer architectures, originally designed for NLP, are adapted for visual tasks, detailing pioneering models such as DETR, CBAM, NLNet, SENet, and GCNet, and explains their structures, attention mechanisms, advantages, and experimental findings for image processing.

DETR · Self-Attention · attention mechanisms

0 likes · 13 min read

TiPaiPai Technical Team

May 31, 2021 · Artificial Intelligence

Understanding Transformers: Self‑Attention, Multi‑Head Mechanisms, and Positional Encoding

This article explains the Transformer architecture—its self‑attention core, multi‑head attention, positional encoding, encoder‑decoder structure, and how it overcomes RNN limitations, providing a foundation for its use in NLP, image detection, and OCR.

Multi-Head Attention · NLP · Positional Encoding

0 likes · 7 min read

Cyber Elephant Tech Team

Apr 28, 2021 · Artificial Intelligence

Understanding BERT: From Encoder-Decoder to Transformer and Attention

This article explains the BERT model by first reviewing the Encoder-Decoder framework, then detailing the attention mechanism—including self-attention and multi-head attention—before describing the full Transformer architecture and finally outlining BERT’s encoder-only design, training stages, and fine-tuning applications.

BERT · Encoder-Decoder · NLP

0 likes · 15 min read

Sohu Tech Products

Nov 25, 2020 · Artificial Intelligence

Illustrated Guide to GPT-2: Detailed Explanation of the Decoder‑Only Transformer Model

This article provides a comprehensive, illustrated walkthrough of OpenAI's GPT‑2 language model, covering its decoder‑only Transformer architecture, self‑attention mechanisms, token processing, training data, differences from BERT, and applications beyond language modeling, enriched with visual diagrams and code snippets for deeper understanding.

AI · GPT-2 · Language Model

0 likes · 24 min read

Sohu Tech Products

Nov 11, 2020 · Artificial Intelligence

Illustrated Transformer: Comprehensive Explanation and Code Implementation

This article provides a step‑by‑step illustrated guide to the Transformer architecture, covering its macro structure, detailed self‑attention mechanisms, multi‑head attention, positional encoding, residual connections, decoder operation, training process, loss functions, and includes complete PyTorch and custom Python code examples.

Multi-Head Attention · NLP · PyTorch

0 likes · 33 min read

DataFunTalk

Oct 23, 2020 · Artificial Intelligence

Feedback‑Aware Deep Matching Model for Music Recommendation in Tmall Genie

This article presents DeepMatch, a behavior‑sequence based deep learning recall model enhanced with play‑rate and intent‑type embeddings, describes its self‑attention architecture, factorized embedding parameterization, multitask loss design, distributed TensorFlow training tricks, and demonstrates significant offline and online improvements in music recommendation performance.

Deep Learning · Self-Attention · TensorFlow

0 likes · 15 min read

Alibaba Cloud Developer

May 21, 2020 · Artificial Intelligence

How DeepMatch Boosts Music Recommendations with Play Rate and Intent Signals

This article examines the DeepMatch retrieval model for Tmall Genie music recommendation, detailing how incorporating user feedback such as play‑rate and query intent signals via multi‑task learning and feedback‑aware self‑attention improves recall accuracy and reduces negative recommendations, while also discussing embedding factorization, loss functions, and distributed training optimizations.

Deep Learning · Recommendation Systems · Self-Attention

0 likes · 18 min read

Alibaba Cloud Developer

Jan 7, 2020 · Artificial Intelligence

How Alibaba Boosts Search Relevance with Advanced User Modeling and Self‑Attention

This article details Alibaba’s Taobao search CTR/CVR user modeling approach, covering background, model architecture with self‑attention and attention pooling, handling short‑term, long‑term, and on‑device behavior sequences, experimental results showing AUC improvements, and future directions.

CTR Prediction · Self-Attention · User Modeling

0 likes · 20 min read

Qunar Tech Salon

Sep 12, 2019 · Artificial Intelligence

A Comprehensive Overview of Attention Mechanisms in Deep Learning

This article systematically reviews the history, core concepts, variants, and practical implementations of attention mechanisms—from early additive and multiplicative forms to self‑attention, multi‑head attention, and recent transformer‑based models—highlighting why attention has become fundamental in modern AI research.

Deep Learning · Machine Translation · NLP

0 likes · 16 min read

Alibaba Cloud Developer

Aug 9, 2019 · Artificial Intelligence

How GATNE Advances Heterogeneous Graph Embedding with Edge Types and Node Features

This article introduces GATNE, a graph embedding framework that jointly models heterogeneous nodes, multiple edge types, and rich node attributes using base and edge embeddings, self‑attention, and inductive learning, and demonstrates its superior performance on several real‑world datasets.

GATNE · Self-Attention · graph embedding

0 likes · 8 min read

Alibaba Cloud Developer

Jul 9, 2019 · Artificial Intelligence

Demystifying Attention: A Clear Guide to Its History, Types, and Why It Works

This article systematically reviews the evolution of attention mechanisms—from early additive and multiplicative forms to self‑attention and multi‑head variants—explaining their core three‑step framework, key differences, and why they have become essential across NLP, vision, and broader AI applications.

Deep Learning · NLP · Self-Attention

0 likes · 19 min read

Sohu Tech Products

Jan 9, 2019 · Artificial Intelligence

Understanding the Transformer Model: Attention, Self‑Attention, and Multi‑Head Mechanisms

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture, covering its encoder‑decoder structure, self‑attention, multi‑head attention, positional encoding, residual connections, and training processes, illustrated with diagrams and code snippets to aid readers new to neural machine translation.

Deep Learning · Multi-Head Attention · Neural Machine Translation

0 likes · 16 min read

Alibaba Cloud Developer

Jun 14, 2018 · Artificial Intelligence

Self-Attention Boosts Heterogeneous User Behavior Modeling for Recommendations

This paper proposes a novel attention‑based framework that groups and encodes heterogeneous user behavior sequences into separate semantic subspaces, applies self‑attention to capture inter‑behavior influences, and demonstrates faster training and comparable or improved recommendation performance across multiple tasks and datasets.

Multi-Task Learning · Self-Attention · User Modeling

0 likes · 12 min read
