Tagged articles
58 articles
Machine Heart
Apr 14, 2026 · Artificial Intelligence

Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes

A developer recreated a 1970s PDP‑11 environment, wrote a single‑layer, single‑head Transformer in assembly, and trained it on a sequence‑reversal task, achieving 100% accuracy after about 350 steps and a total training time of roughly 5.5 minutes.

Assembly · Low-resource AI · PDP-11
0 likes · 16 min read
AI Frontier Lectures
Mar 19, 2026 · Artificial Intelligence

Why Sharing Parameters in Vision Transformers Hurts Performance—and How Layer Specialization Fixes It

The article analyzes the hidden conflict between [CLS] and patch tokens in Vision Transformers, reveals how shared normalization and linear layers cause computational friction, and demonstrates that layer‑specific parameters dramatically improve dense prediction tasks without increasing inference FLOPs.

Computer Vision · Dense Prediction · Layer Specialization
0 likes · 9 min read
Data STUDIO
Feb 25, 2026 · Artificial Intelligence

Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.

Fine-tuning · GPT · LLM
0 likes · 43 min read
Qborfy AI
Feb 21, 2026 · Artificial Intelligence

How Self-Attention Powers Modern AI: From Theory to Real-World Impact

This article explains the self‑attention mechanism behind transformers, detailing its core components, mathematical formulation, step‑by‑step example, multi‑head extension, industry use cases, and a thorough comparison with RNN and CNN approaches, all supported by concrete numbers and citations.

Attention Mechanism · Deep Learning · Self-Attention
0 likes · 8 min read
AI Algorithm Path
Feb 16, 2026 · Artificial Intelligence

Why Visual Tokenizers Bridge the Gap Between Pixels and Meaning

Vision‑language models turn continuous images into discrete tokens through patch extraction, encoding, and projection, enabling Transformers to reason jointly over vision and text, but this compression introduces limits in spatial reasoning, counting, and resolution sensitivity that users must understand.

Self-Attention · Vision-Language Models · counting
0 likes · 22 min read
AI Cyberspace
Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Deep Learning · Feed-Forward Network · Positional Encoding
0 likes · 39 min read
Bighead's Algorithm Notes
Dec 28, 2025 · Artificial Intelligence

Paper Reading: Multi‑Cycle Learning Framework (MLF) for Financial Time‑Series Forecasting

The paper introduces MLF, a multi‑cycle learning framework that integrates three novel modules—inter‑cycle redundancy filtering (IRF), learnable weighted integration (LWI), and multi‑cycle adaptive patch (MAP)—plus a patch‑squeeze component, achieving higher accuracy and efficiency on financial time‑series tasks such as fund‑sales prediction and outperforming strong single‑ and multi‑cycle baselines, with successful deployment in Alipay’s fund inventory system.

Alipay deployment · Financial AI · Self-Attention
0 likes · 16 min read
Tencent Cloud Developer
Dec 9, 2025 · Artificial Intelligence

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of AI large language models, from turning user queries into token matrices via tokenization and embedding, through the Transformer’s self‑attention and multi‑head mechanisms, to decoding logits into human‑readable text, while also covering position encoding, long‑context strategies, generation parameters, and practical engineering tips.

Inference Optimization · Self-Attention · Transformer
0 likes · 29 min read
Wu Shixiong's Large Model Academy
Oct 23, 2025 · Artificial Intelligence

Why the Transformer Core Structure Is the Key to AI Interview Success

This article explains the fundamental purpose, architecture, and variants of the Transformer model—including Encoder‑Decoder, Encoder‑only, and Decoder‑only designs—while detailing how attention mechanisms work and why modern large‑language models favor the Decoder‑only approach, providing a concise framework for answering interview questions.

AI Interview · Encoder-Decoder · Self-Attention
0 likes · 10 min read
MoonWebTeam
Oct 1, 2025 · Artificial Intelligence

Unlocking ChatGPT: A Deep Dive into Transformers, Tokenization, and Self‑Attention

This tutorial walks through the fundamentals of ChatGPT by explaining language modeling, character‑level tokenization, data preprocessing pipelines, the evolution from simple bigram models to scaled dot‑product self‑attention, multi‑head mechanisms, full Transformer blocks, and how to train and generate Shakespeare‑style text with a GPT model.

ChatGPT · GPT · Language Modeling
0 likes · 50 min read
Volcano Engine Developer Services
Sep 28, 2025 · Artificial Intelligence

Demystifying AI Jargon: A Beginner’s Guide to Large Language Models

This guide breaks down the complex terminology of large language models—explaining tokens, transformers, self‑attention, RAG, scaling laws, dense vs. sparse architectures, and training stages—using clear analogies and step‑by‑step explanations so readers can confidently understand and work with modern AI systems.

AI fundamentals · Model Training · RAG
0 likes · 35 min read
Wu Shixiong's Large Model Academy
Sep 25, 2025 · Artificial Intelligence

Master Self-Attention & Multi-Head Attention for Large Model Interviews

This guide breaks down the core logic, computation steps, formulas, and common interview questions about Self‑Attention and Multi‑Head Attention in Transformers, offering concrete explanations, dimensional examples, and practical answering techniques to help candidates ace large‑model algorithm interviews.

Deep Learning · Interview Tips · Self-Attention
0 likes · 8 min read
Qborfy AI
Aug 12, 2025 · Artificial Intelligence

What Powers Large Language Models? A Deep Dive into LLM Architecture and Scaling

This article explains how massive Transformer‑based large language models compress text data into mathematical representations, why scale, self‑attention, and training paradigms enable emergent general intelligence, and walks through tokenization, embedding, multi‑layer attention, architecture choices, energy costs, and hallucination mitigation.

AI · Embedding · LLM
0 likes · 6 min read
Data Party THU
Aug 9, 2025 · Artificial Intelligence

How the MPCT Multiscale Point Cloud Transformer Boosts 3D Classification Accuracy

This article reviews the MPCT framework—a multiscale point‑cloud transformer built on a residual network that leverages permutation‑invariant self‑attention, point‑enhancement, and hierarchical feature aggregation to achieve state‑of‑the‑art results on ModelNet40 and ScanObjectNN datasets.

3D classification · Self-Attention · multiscale
0 likes · 14 min read
Qborfy AI
Aug 8, 2025 · Artificial Intelligence

Why Transformers Revolutionized AI: A Deep Dive into Self‑Attention

This article explains how the Transformer model replaces sequential RNN processing with parallel self‑attention, detailing its core components, positional encoding, encoder‑decoder workflow, industry impact, and surprising facts such as training speed gains and energy efficiency.

AI · Deep Learning · Model architecture
0 likes · 5 min read
Alibaba Cloud Developer
Aug 6, 2025 · Artificial Intelligence

How Transformers Revolutionize Sequence Modeling: From RNN Limits to Self‑Attention Mastery

This article explains why Transformer models surpass traditional RNN‑based seq2seq architectures by introducing self‑attention, multi‑head attention, and positional encoding, detailing the inner workings of encoders, decoders, and attention mechanisms, and comparing their advantages and limitations across NLP and vision tasks.

GRU · LSTM · RNN
0 likes · 30 min read
Cognitive Technology Team
Jun 29, 2025 · Artificial Intelligence

Understanding Transformers: Core Mechanics Behind Modern AI Models

This article demystifies the Transformer architecture for beginners, explaining its relationship to large models, the self‑attention and multi‑head attention mechanisms, positional encoding, and the roles of Encoder and Decoder components, using clear analogies and visual diagrams to aid comprehension.

Deep Learning · Encoder-Decoder · Positional Encoding
0 likes · 20 min read
MaGe Linux Operations
Jun 15, 2025 · Artificial Intelligence

Mastering Transformers: Key Extensions and Optimization Techniques Explained

This comprehensive guide walks you through the Transformer architecture—from its encoder‑decoder structure and self‑attention mechanism to multi‑head attention, positional embeddings, and practical PyTorch implementations—providing clear visualizations and code examples for deep learning practitioners.

Deep Learning · PyTorch · Self-Attention
0 likes · 22 min read
Tencent Technical Engineering
Apr 16, 2025 · Artificial Intelligence

Understanding Transformer Architecture for Chinese‑English Translation: A Practical Guide

This practical guide walks through the full Transformer architecture for Chinese‑to‑English translation, detailing encoder‑decoder structure, tokenization and embeddings, batch handling with padding and masks, positional encodings, parallel teacher‑forcing, self‑ and multi‑head attention, and the complete forward and back‑propagation training steps.

Positional Encoding · PyTorch · Self-Attention
0 likes · 26 min read
AI Algorithm Path
Apr 10, 2025 · Artificial Intelligence

Beginner-Friendly Guide to Understanding Large Language Models

This article walks readers through the fundamentals of large language models, covering what tokens are, how tokenization works, the conversion of tokens to numeric IDs, the transformer architecture—including positional encoding, self‑attention, feed‑forward networks and softmax—and explains how these components enable next‑token prediction.

Embedding · LLM · Self-Attention
0 likes · 9 min read
Cognitive Technology Team
Mar 10, 2025 · Artificial Intelligence

Understanding Transformers: From NLP Challenges to Architecture and Core Mechanisms

This article explains the evolution of natural language processing, the limitations of rule‑based, statistical, and recurrent neural network models, and then introduces the Transformer architecture—covering word and position embeddings, self‑attention, multi‑head attention, Add & Norm, feed‑forward layers, and encoder‑decoder design—to help beginners grasp why Transformers solve key NLP problems.

AI · NLP · Self-Attention
0 likes · 15 min read
Alibaba Cloud Developer
Mar 10, 2025 · Artificial Intelligence

Why Transformers Revolutionized NLP: From Problems to Solutions

This article explains the historical challenges of natural language processing, from rule‑based and statistical models to recurrent networks and their limitations, then introduces the Transformer architecture, its self‑attention mechanism, multi‑head attention, and supporting layers, illustrating how it overcomes previous issues and enables efficient parallel training.

NLP · Self-Attention · Transformer
0 likes · 16 min read
AI Large Model Application Practice
Feb 28, 2025 · Artificial Intelligence

How Self-Attention Powers LLMs: A Step‑by‑Step Deep Dive

This article explains the self‑attention mechanism behind large language models, detailing why static word importance fails, how queries, keys, and values are generated, how attention scores are computed, scaled, softmaxed, and used to produce context‑aware word vectors, while noting computational costs.

AI · LLM · Self-Attention
0 likes · 9 min read
Architect's Alchemy Furnace
Sep 16, 2024 · Artificial Intelligence

Why Transformers Revolutionize AI: From Basics to Advanced Applications

This article explains what AI Transformers are, why they matter, their key components and mechanisms, various applications ranging from language processing to bioinformatics, and how they differ from traditional neural networks, providing a comprehensive overview of Transformer architecture and its impact on modern AI research.

AI · Deep Learning · Self-Attention
0 likes · 20 min read
JavaEdge
Jul 22, 2024 · Artificial Intelligence

What Is a Transformer and Why It’s Transforming AI?

This article explains the fundamentals of transformer models, why they outperform earlier neural networks, their core components such as self‑attention and positional encoding, practical use cases across language and biology, and how they differ from RNNs, CNNs, and other architectures.

AI · Deep Learning · Self-Attention
0 likes · 20 min read
JD Cloud Developers
Jun 25, 2024 · Artificial Intelligence

Why Do Large Language Models Output Text Word‑by‑Word? Inside the Transformer Mechanics

This article explains the fundamental architecture of large language models, from the two‑file nature of a model (parameters and code), through neural network basics, perceptrons, and weight training, to the Transformer’s tokenization, positional encoding, self‑attention, and inference processes, illustrated with diagrams and examples.

Neural Network · Self-Attention · Transformer
0 likes · 22 min read
JD Tech Talk
Jun 25, 2024 · Artificial Intelligence

Understanding Large Language Models: From Parameters to Transformer Architecture

This article explains the fundamental concepts behind large language models, including their two-file structure, training process, neural network basics, perceptron examples, weight and threshold calculations, the TensorFlow Playground, and a detailed walkthrough of the Transformer architecture with tokenization, positional encoding, self‑attention, normalization, and feed‑forward layers.

AI · Neural Networks · Self-Attention
0 likes · 20 min read
Rare Earth Juejin Tech Community
Jun 12, 2024 · Artificial Intelligence

A Simple Introduction to the Transformer Model

This article provides a comprehensive, beginner-friendly explanation of the Transformer architecture, covering its encoder‑decoder structure, self‑attention, multi‑head attention, positional encoding, residual connections, decoding process, final linear and softmax layers, and training considerations, illustrated with numerous diagrams and code snippets.

Deep Learning · Neural Networks · Self-Attention
0 likes · 24 min read
JD Tech
Jun 7, 2024 · Artificial Intelligence

Understanding Attention Mechanisms, Self‑Attention, and Multi‑Head Attention in Transformers

This article explains the fundamentals of attention mechanisms, from their biological inspiration and early visual attention to modern self‑attention in Transformers, and details the scaled dot‑product calculation, positional encoding, and multi‑head attention, illustrating how these concepts enable efficient parallel processing of sequence data.

AI · Positional Encoding · Self-Attention
0 likes · 12 min read
NewBeeNLP
Apr 26, 2024 · Artificial Intelligence

Self-Attention vs Virtual Nodes in Graph Neural Networks: What Really Works?

This article reviews the paper “Distinguished in Uniform: Self-Attention vs. Virtual Nodes,” comparing graph Transformers and MPGNNs with virtual nodes on theoretical consistency and experimental performance, revealing that neither approach universally dominates the other.

MPGNN · Self-Attention · graph neural networks
0 likes · 9 min read
Architect
Mar 19, 2024 · Artificial Intelligence

How Transformers Power Modern NLP: A Deep Dive into Encoder‑Decoder Mechanics

This article explains the core principles of Transformer models—covering input embeddings, self‑attention, multi‑head attention, positional encoding, feed‑forward networks, and decoder strategies—using concrete examples like "The cat sat on the mat" and "The quick brown fox jumps over the lazy dog" to illustrate each step.

Encoder-Decoder · Feed-Forward Network · NLP
0 likes · 13 min read
Sohu Tech Products
Jul 26, 2023 · Artificial Intelligence

Attention Mechanism, Transformer Architecture, and BERT: An In-Depth Overview

This article provides a comprehensive overview of the attention mechanism, its mathematical foundations, the transformer model architecture—including encoder and decoder components—and the BERT pre‑training model, detailing their principles, implementations, and applications in natural language processing.

Attention Mechanism · BERT · Encoder-Decoder
0 likes · 13 min read
21CTO
Apr 27, 2023 · Artificial Intelligence

Demystifying Transformers: A Step‑by‑Step Guide to Self‑Attention and Architecture

This article explains the Transformer model—from its encoder‑decoder structure and self‑attention mechanism to multi‑head attention, positional encoding, residual connections, training loss, and inference strategies—providing a clear, visual walkthrough for readers new to modern NLP architectures.

Deep Learning · Self-Attention · Transformer
0 likes · 21 min read
Rare Earth Juejin Tech Community
Oct 10, 2022 · Artificial Intelligence

A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers

This article introduces the fundamentals of Vision Transformers (ViT) for computer‑vision developers, starting with an overview of the transformer architecture, followed by a detailed explanation of self‑attention and multi‑head attention and step‑by‑step PyTorch code examples that illustrate query, key, and value computation and attention scoring.

PyTorch · Self-Attention · Transformer
0 likes · 12 min read
vivo Internet Technology
Aug 24, 2022 · Frontend Development

Applying Self-Attention Based Machine Learning Model to Design-to-Code Layout Prediction

Vivo’s frontend team built a self‑attention‑based machine‑learning model that predicts web‑page layout types (column, row, or absolute) from node dimensions and positions, solving parent‑child and sibling relationships for design‑to‑code conversion, achieving 99.4% accuracy using over 20 k labeled, crawled, and generated samples, while outlining further enhancements.

D2C · Neural Network · Self-Attention
0 likes · 11 min read
JD Cloud Developers
Aug 15, 2022 · Artificial Intelligence

How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss

This article explains how the Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism reduces BERT’s computational cost by over 50% while keeping accuracy loss under 1%, detailing the method, experimental results, and its significance for efficient large‑scale language models.

BERT · Deep Learning · FCA
0 likes · 8 min read
Baidu Geek Talk
Mar 28, 2022 · Artificial Intelligence

Robust Input Visualization Methods for Vision Transformers

The paper proposes a robust Grad‑CAM‑inspired visualization for Vision Transformers that combines attention weights and gradients to generate class‑specific saliency maps, demonstrates superior alignment with discriminative regions across ViT, Swin and Volo models, and shows a 76% false‑positive reduction in Baidu’s porn‑content risk control system.

Deep Learning · Grad-CAM · Input Visualization
0 likes · 11 min read
Baobao Algorithm Notes
Jan 14, 2022 · Artificial Intelligence

BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

An in‑depth Q&A breaks down core BERT concepts—from the purpose of the [CLS] token and masking strategies to self‑attention complexity, sparse attention tricks, subword handling of OOV words, warm‑up learning rates, GPT’s unidirectional nature, and ALBERT’s parameter sharing—providing concise explanations for each.

BERT · Masking · Self-Attention
0 likes · 7 min read
Code DAO
Dec 29, 2021 · Artificial Intelligence

Understanding Stand-Alone Axial-Attention for Panoptic Segmentation

The paper proposes a stand‑alone axial‑attention mechanism that converts 2‑D attention into 1‑D to lower computational cost while preserving global context, introduces position‑sensitive self‑attention, integrates it into Axial‑ResNet and Axial‑DeepLab, and demonstrates strong results on four large segmentation datasets.

Axial Attention · Computer Vision · DeepLab
0 likes · 7 min read
Code DAO
Dec 8, 2021 · Artificial Intelligence

Understanding Compact Transformers: Build and Train Vision & NLP Models on a Personal PC

This article walks through the design of Compact Transformers, explaining scaled dot‑product self‑attention, positional embeddings, multi‑head attention, and Vision Transformer architecture, and provides full PyTorch code so readers can train lightweight CV and NLP classifiers on a single PC.

Compact Transformers · Patch Embedding · Positional Embedding
0 likes · 19 min read
AntTech
Oct 29, 2021 · Artificial Intelligence

Ant Insurance Technology and CASIA Win Two Tracks at MuSe2021 Multimodal Sentiment Challenge (ACM MM 2021)

The Ant Insurance Technology team, together with the Institute of Automation of the Chinese Academy of Sciences, secured first place in both the MuSe‑Wilder and MuSe‑Sent tracks of the MuSe2021 Multimodal Sentiment Challenge held at the 29th ACM International Conference on Multimedia in Chengdu, showcasing advanced multimodal AI techniques.

BiLSTM · Deep Learning · MuSe2021
0 likes · 4 min read
TiPaiPai Technical Team
Jun 11, 2021 · Artificial Intelligence

How Transformers Revolutionize Vision: From DETR to GCNet

This article explores how Transformer architectures, originally designed for NLP, are adapted for visual tasks, detailing pioneering models such as DETR, CBAM, NLNet, SENet, and GCNet, and explains their structures, attention mechanisms, advantages, and experimental findings for image processing.

DETR · Self-Attention · attention mechanisms
0 likes · 13 min read
Cyber Elephant Tech Team
Apr 28, 2021 · Artificial Intelligence

Understanding BERT: From Encoder-Decoder to Transformer and Attention

This article explains the BERT model by first reviewing the Encoder-Decoder framework, then detailing the attention mechanism—including self-attention and multi-head attention—before describing the full Transformer architecture and finally outlining BERT’s encoder-only design, training stages, and fine-tuning applications.

BERT · Encoder-Decoder · NLP
0 likes · 15 min read
Sohu Tech Products
Nov 25, 2020 · Artificial Intelligence

Illustrated Guide to GPT-2: Detailed Explanation of the Decoder‑Only Transformer Model

This article provides a comprehensive, illustrated walkthrough of OpenAI's GPT‑2 language model, covering its decoder‑only Transformer architecture, self‑attention mechanisms, token processing, training data, differences from BERT, and applications beyond language modeling, enriched with visual diagrams and code snippets for deeper understanding.

AI · GPT-2 · Language Model
0 likes · 24 min read
Sohu Tech Products
Nov 11, 2020 · Artificial Intelligence

Illustrated Transformer: Comprehensive Explanation and Code Implementation

This article provides a step‑by‑step illustrated guide to the Transformer architecture, covering its macro structure, detailed self‑attention mechanisms, multi‑head attention, positional encoding, residual connections, decoder operation, training process, loss functions, and includes complete PyTorch and custom Python code examples.

NLP · PyTorch · Self-Attention
0 likes · 33 min read
DataFunTalk
Oct 23, 2020 · Artificial Intelligence

Feedback‑Aware Deep Matching Model for Music Recommendation in Tmall Genie

This article presents DeepMatch, a behavior‑sequence based deep learning recall model enhanced with play‑rate and intent‑type embeddings, describes its self‑attention architecture, factorized embedding parameterization, multitask loss design, distributed TensorFlow training tricks, and demonstrates significant offline and online improvements in music recommendation performance.

Deep Learning · Self-Attention · TensorFlow
0 likes · 15 min read
Alibaba Cloud Developer
May 21, 2020 · Artificial Intelligence

How DeepMatch Boosts Music Recommendations with Play Rate and Intent Signals

This article examines the DeepMatch retrieval model for Tmall Genie music recommendation, detailing how incorporating user feedback such as play‑rate and query intent signals via multi‑task learning and feedback‑aware self‑attention improves recall accuracy and reduces negative recommendations, while also discussing embedding factorization, loss functions, and distributed training optimizations.

Deep Learning · Recommendation Systems · Self-Attention
0 likes · 18 min read
Alibaba Cloud Developer
Jan 7, 2020 · Artificial Intelligence

How Alibaba Boosts Search Relevance with Advanced User Modeling and Self‑Attention

This article details Alibaba’s Taobao search CTR/CVR user modeling approach, covering background, model architecture with self‑attention and attention pooling, handling short‑term, long‑term, and on‑device behavior sequences, experimental results showing AUC improvements, and future directions.

CTR prediction · Self-Attention · behavior sequence
0 likes · 20 min read
Qunar Tech Salon
Sep 12, 2019 · Artificial Intelligence

A Comprehensive Overview of Attention Mechanisms in Deep Learning

This article systematically reviews the history, core concepts, variants, and practical implementations of attention mechanisms—from early additive and multiplicative forms to self‑attention, multi‑head attention, and recent transformer‑based models—highlighting why attention has become fundamental in modern AI research.

Deep Learning · NLP · Self-Attention
0 likes · 16 min read
Alibaba Cloud Developer
Jul 9, 2019 · Artificial Intelligence

Demystifying Attention: A Clear Guide to Its History, Types, and Why It Works

This article systematically reviews the evolution of attention mechanisms—from early additive and multiplicative forms to self‑attention and multi‑head variants—explaining their core three‑step framework, key differences, and why they have become essential across NLP, vision, and broader AI applications.

Deep Learning · NLP · Self-Attention
0 likes · 19 min read
Sohu Tech Products
Jan 9, 2019 · Artificial Intelligence

Understanding the Transformer Model: Attention, Self‑Attention, and Multi‑Head Mechanisms

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture, covering its encoder‑decoder structure, self‑attention, multi‑head attention, positional encoding, residual connections, and training processes, illustrated with diagrams and code snippets to aid readers new to neural machine translation.

Deep Learning · Neural Machine Translation · Positional Encoding
0 likes · 16 min read
Alibaba Cloud Developer
Jun 14, 2018 · Artificial Intelligence

Self-Attention Boosts Heterogeneous User Behavior Modeling for Recommendations

This paper proposes a novel attention‑based framework that groups and encodes heterogeneous user behavior sequences into separate semantic subspaces, applies self‑attention to capture inter‑behavior influences, and demonstrates faster training and comparable or improved recommendation performance across multiple tasks and datasets.

Self-Attention · heterogeneous behavior · multi-task learning
0 likes · 12 min read