Artificial Intelligence 11 min read

Understanding Sequence‑to‑Sequence (seq2seq) Models and Attention Mechanisms

This article explains the fundamentals of seq2seq neural machine translation models, covering encoder‑decoder architecture, word embeddings, context vectors, RNN processing, and the attention mechanism introduced by Bahdanau and Luong, with visual illustrations and reference links for deeper study.


Datawhale presents a detailed tutorial (translated by Zhang Xian, Harbin Engineering University) that is about 4,000 Chinese characters long and takes roughly 11 minutes to read; the reviewer is Jepson, a Datawhale member working on recommendation algorithms at Tencent.

Sequence‑to‑sequence (seq2seq) models are deep‑learning architectures that have achieved success in tasks such as machine translation, text summarization, and image captioning; Google Translate adopted this model at the end of 2016. Two pioneering papers introduced the concept: Sutskever et al. (2014) and Cho et al. (2014).

The article’s goal is to make the layered concepts of seq2seq models easier to grasp by visualizing them, assuming the reader has some basic deep‑learning knowledge.

A seq2seq model receives an input sequence (words, characters, or image features) and produces an output sequence. The trained model is illustrated with animated GIFs showing the flow of information.

A Deeper Look at the Details

The model consists of an encoder and a decoder. The encoder processes each element of the input sequence and compresses the information into a context vector. After the entire input is processed, the context vector is passed to the decoder, which generates the output sequence step by step.

Both encoder and decoder are typically implemented with recurrent neural networks (RNNs). The context vector is a floating‑point vector whose length depends on the hidden‑state size of the encoder RNN (commonly 256, 512, or 1024; the illustration uses length 4 for simplicity).

At each time step an RNN receives two inputs:

One element from the input sequence (e.g., the embedding of the current word).

The hidden state from the previous time step.
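These two inputs can be sketched as a single vanilla-RNN step. The snippet below is a minimal NumPy illustration, not the article's actual implementation; the weight names (`W_xh`, `W_hh`) and the tiny random weights are made up, and the hidden size of 4 matches the article's simplified drawings:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla-RNN time step: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 4            # tiny sizes matching the article's illustrations
W_xh = rng.normal(size=(embed_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                # initial hidden state
for x_t in rng.normal(size=(3, embed_dim)):   # a 3-token input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

context = h                             # the encoder's final hidden state is the context vector
print(context.shape)                    # (4,)
```

In a classic seq2seq model this final `context` vector is all the decoder ever sees of the source sentence.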

Each word must be represented as a vector; this is done via word-embedding techniques, which map words into a continuous vector space that captures semantic relationships such as king − man + woman ≈ queen. You can use pre-trained embeddings or train embeddings on your own dataset; typical dimensionalities are 200–300, but the article shows vectors of length 4 for illustration.
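A toy illustration of an embedding lookup and the famous analogy follows. The four length-4 vectors are hand-picked purely so the arithmetic works out; they are not real trained embeddings, and `nearest` is an illustrative helper, not a library function:

```python
import numpy as np

# A toy embedding table: each word maps to a dense vector (length 4 here,
# 200–300 in practice). These values are invented for the demo.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.9, 0.1, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.1, 0.9]),
}

def nearest(vec):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(emb, key=lambda w: cos(emb[w], vec))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v))  # queen
```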

After covering vectors and tensors, the article revisits the RNN mechanism and provides visualizations of the encoder‑decoder interaction.

In the visualizations, the encoder’s final hidden state becomes the context vector that is fed to the decoder. The decoder also maintains its own hidden state, which is passed from one time step to the next.
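That hand-off can be sketched as follows, assuming toy random weights and a vanilla RNN cell (real systems use LSTM/GRU cells plus a learned output layer; feeding the decoder's hidden state back in as the next input is a stand-in for feeding back the predicted token's embedding):

```python
import numpy as np

rng = np.random.default_rng(1)
hid = 4

def rnn_step(x, h, W_x, W_h):
    return np.tanh(x @ W_x + h @ W_h)

W_ex, W_eh = rng.normal(size=(hid, hid)) * 0.1, rng.normal(size=(hid, hid)) * 0.1  # encoder weights
W_dx, W_dh = rng.normal(size=(hid, hid)) * 0.1, rng.normal(size=(hid, hid)) * 0.1  # decoder weights

# Encoder: run over the whole input; only the final hidden state survives.
h_enc = np.zeros(hid)
for x in rng.normal(size=(5, hid)):     # 5 source-token embeddings
    h_enc = rnn_step(x, h_enc, W_ex, W_eh)

# Decoder: seeded by the context vector, then keeps its own hidden state.
h_dec = h_enc
outputs = []
y = np.zeros(hid)                       # embedding of a <start> token
for _ in range(3):                      # generate 3 output steps
    h_dec = rnn_step(y, h_dec, W_dx, W_dh)
    outputs.append(h_dec)
    y = h_dec                           # stand-in for the predicted token's embedding
print(len(outputs), outputs[0].shape)
```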

Explaining Attention

It has been observed that the fixed‑size context vector becomes a bottleneck for long sentences, prompting the development of attention mechanisms.

Bahdanau et al. (2014) and Luong et al. (2015) introduced attention, allowing the decoder to focus on relevant parts of the input sequence when generating each output token.

Key differences of attention‑based models compared to classic seq2seq:

The encoder passes all hidden states (one per input token) to the decoder instead of only the final hidden state.

Before producing an output, the decoder computes a weighted sum of the encoder hidden states, where the weights are attention scores.

The weighting process (softmax over scores) is performed at every decoding step.
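The weighted-sum step can be sketched with simple dot-product (Luong-style) scoring; Bahdanau-style attention instead uses a small learned network to score, but the softmax-and-weighted-sum part is the same. The function names here are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()

def attention_context(decoder_h, encoder_hs):
    """Score every encoder hidden state against the current decoder hidden
    state, softmax the scores, and return the weighted sum as the context."""
    scores = encoder_hs @ decoder_h     # one dot-product score per source token
    weights = softmax(scores)           # attention distribution over the source
    return weights @ encoder_hs, weights

rng = np.random.default_rng(2)
encoder_hs = rng.normal(size=(5, 4))    # 5 source tokens, hidden size 4
decoder_h = rng.normal(size=4)

context, weights = attention_context(decoder_h, encoder_hs)
print(weights.round(2), weights.sum())  # the weights always sum to 1
```

Because the weights are recomputed from the current decoder state, the context vector changes at every decoding step, which is exactly what lets the decoder "look at" different source words over time.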

The article outlines the full attention decoding pipeline:

The decoder RNN receives an embedding vector and an initialized decoder hidden state.

The RNN processes these inputs, producing a new hidden state (e.g., h₄) and an output that is ignored.

Attention scores are computed using the encoder hidden states and h₄ to produce a context vector C₄.

h₄ and C₄ are concatenated into a single vector.

This vector is fed into a feed‑forward neural network trained jointly with the whole model.

The network’s output is interpreted as the next target word.

The process repeats for the next time step.
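Putting the pipeline above together, one decoding step might look like the minimal sketch below. The weights are random toys, dot-product scoring stands in for the papers' learned score functions, and the single matrix `W_out` stands in for the jointly trained feed-forward network:

```python
import numpy as np

rng = np.random.default_rng(3)
hid, vocab = 4, 6

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W_x, W_h = rng.normal(size=(hid, hid)) * 0.1, rng.normal(size=(hid, hid)) * 0.1
W_out = rng.normal(size=(2 * hid, vocab)) * 0.1   # feed-forward layer on [h; c]

encoder_hs = rng.normal(size=(5, hid))  # all encoder hidden states
h_prev = np.zeros(hid)                  # initialized decoder hidden state
y_embed = rng.normal(size=hid)          # embedding of the previous output token

# Steps 1–2: the RNN step produces the new decoder hidden state h4.
h4 = np.tanh(y_embed @ W_x + h_prev @ W_h)

# Step 3: score encoder states against h4, softmax, weighted sum -> C4.
weights = softmax(encoder_hs @ h4)
c4 = weights @ encoder_hs

# Steps 4–6: concatenate [h4; C4], apply the feed-forward layer, pick a word.
logits = np.concatenate([h4, c4]) @ W_out
next_word = int(np.argmax(softmax(logits)))
print(next_word)                        # index of the predicted target word
```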

Additional visualizations show how attention aligns source and target words (e.g., focusing on the French word "étudiant" when generating its English translation) and how attention distributions differ for phrases with reordered word order.

For readers ready to implement attention, the article points to the TensorFlow Neural Machine Translation (seq2seq) guide:

Neural Machine Translation (seq2seq) Tutorial – https://github.com/tensorflow/nmt

The translation is credited to the original author @JayAlammar ( https://twitter.com/JayAlammar ).

Tags: Deep Learning, Attention, Embedding, seq2seq, RNN, neural machine translation
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
