Image Captioning with Attention in TensorFlow 2.0: An End-to-End Encoder-Decoder Tutorial
This article walks through building an image‑captioning system using a TensorFlow 2.0 encoder‑decoder with Bahdanau attention, covering dataset preparation, feature extraction with InceptionV3, model architecture, training with teacher forcing, and inference on the Flickr8K dataset.
Image captioning generates a textual description for a given image. The article implements an end‑to‑end encoder‑decoder with attention in Keras/TensorFlow 2.0, using the Flickr8K dataset (≈8,000 images, each with five captions).
Data preparation: The caption file Flickr8k.token.txt is read, each line is split into <image_file>#i and its caption, and the results are stored in a dictionary mapping image names to lists of captions. Captions are cleaned (lower-cased, punctuation removed, short or numeric words discarded), then wrapped with startseq and endseq tokens. A Tokenizer learns the vocabulary, providing vocab_size and the maximum caption length.
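A sketch of this preprocessing, assuming the tab-separated Flickr8k.token.txt format described above; the exact cleaning rules (the regex and the short-word filter) are illustrative rather than the article's own code:

import re
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative sketch of the caption preprocessing; cleaning details are assumptions.
captions = {}
with open('Flickr8k.token.txt') as f:
    for line in f:
        img_id, caption = line.strip().split('\t')            # "<image_file>#i<TAB>caption"
        img_name = img_id.split('#')[0]
        caption = re.sub(r'[^a-z ]', '', caption.lower())     # lower-case, drop punctuation/digits
        words = [w for w in caption.split() if len(w) > 1]    # discard very short words
        captions.setdefault(img_name, []).append('startseq ' + ' '.join(words) + ' endseq')

all_captions = [c for caps in captions.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in all_captions)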
Image feature extraction: A pre-trained InceptionV3 model with its classification head removed (include_top=False) produces an 8×8×2048 feature map for each image, which is reshaped to (64, 2048) and saved as a .npy file named after the image (e.g., 1000268201_693b08cb0e.npy).
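A minimal sketch of this extraction step, assuming 299×299 inputs and InceptionV3 preprocessing; image_paths and the save location are illustrative, while load_image and image_features_extract_model match the names used later in the inference code:

import os
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head; the 8x8x2048 output map
# is reshaped to (64, 2048) and cached on disk for training.
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
image_features_extract_model = tf.keras.Model(image_model.input, image_model.output)

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

for path in image_paths:                                   # image_paths: Flickr8K .jpg files (assumed)
    img, _ = load_image(path)
    features = image_features_extract_model(tf.expand_dims(img, 0))
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
    np.save(os.path.splitext(path)[0], features.numpy()[0])  # written as e.g. 1000268201_693b08cb0e.npy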
The model architecture consists of four components:
Encoder: a dense layer that projects the extracted features to the desired embedding dimension.
BahdanauAttention: computes attention weights from the encoder features and the decoder hidden state, returning a context vector.
RNN_Decoder: an embedding layer, a GRU, and two dense layers; it embeds the input token, concatenates it with the context vector, and outputs scores over the vocabulary.
Sequence generator: the decoder's final dense layer, which maps the GRU outputs to vocabulary-sized logits.
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, embedding_dim); hidden: (batch, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # Additive (Bahdanau) score for each of the 64 spatial locations.
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        # Weighted sum over locations -> context vector of shape (batch, embedding_dim).
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
class CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super().__init__()
        # Projects the saved (64, 2048) InceptionV3 features to (64, embedding_dim).
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)   # the "sequence generator" layer
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        # Attend over the image features using the previous hidden state.
        context_vector, attention_weights = self.attention(features, hidden)
        # Embed the input token and prepend the context vector along the feature axis.
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.fc2(x)                                # logits over the vocabulary
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.gru.units))

Training pipeline: Image paths and padded caption sequences are paired in a tf.data.Dataset. A mapping function loads the saved .npy feature vectors; the dataset is then shuffled, batched (size 64), and prefetched.
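A sketch of how these pairs might be assembled from the cleaned captions before building the dataset below; the features/ directory prefix and the naming convention are assumptions, chosen to match how map_func later appends '.npy':

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sketch only: pair each feature-file path with one padded caption sequence.
# The 'features/' prefix is an assumed location for the saved .npy files.
train_X, train_y = [], []
for img_name, caps in captions.items():
    for cap in caps:
        train_X.append('features/' + img_name.split('.')[0])   # map_func appends '.npy'
        train_y.append(tokenizer.texts_to_sequences([cap])[0])
train_y = pad_sequences(train_y, maxlen=max_length, padding='post')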
import numpy as np

BATCH_SIZE = 64
BUFFER_SIZE = 1000

def map_func(img_name, cap):
    # Load the pre-extracted InceptionV3 features saved during preprocessing.
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((train_X, train_y))
dataset = dataset.map(lambda i, c: tf.numpy_function(map_func, [i, c], [tf.float32, tf.int32]),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

Training loop: The loss uses SparseCategoricalCrossentropy with masking for padded tokens. Teacher forcing feeds the ground-truth token at each time step.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out positions where the target is the padding token (id 0).
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
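The training step below assumes the encoder, decoder, and optimizer already exist. A minimal setup sketch; the embedding_dim and units values (256 and 512) are common choices assumed here, not figures stated in the article:

# Assumed hyperparameters; the article does not state these values explicitly.
embedding_dim = 256
units = 512

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)
optimizer = tf.keras.optimizers.Adam()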
@tf.function
def train_step(img_tensor, target):
    loss = 0
    # Initialize the decoder hidden state and feed 'startseq' as the first input.
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['startseq']] * target.shape[0], 1)
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # Teacher forcing: the ground-truth token becomes the next input.
            dec_input = tf.expand_dims(target[:, i], 1)
    total_loss = loss / int(target.shape[1])
    trainable_vars = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    optimizer.apply_gradients(zip(gradients, trainable_vars))
    return loss, total_loss

The model is trained for 20 epochs, printing batch and epoch losses (a sketch of this outer loop appears at the end of the section).

Teacher forcing: during training, the correct next word from the caption is supplied to the decoder at each time step, preventing the error accumulation that would occur if the model's own predictions were fed back in.

Inference: for a test image, features are extracted and the decoder is run step by step with no teacher forcing, drawing each next token from the predicted distribution; generation stops at endseq or the maximum caption length.

def evaluate(image, max_length):
    # attention_features_shape is 64: the number of InceptionV3 spatial locations.
    attention_plot = np.zeros((max_length, attention_features_shape))
    hidden = decoder.reset_state(batch_size=1)
    # load_image and image_features_extract_model come from the feature-extraction step.
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['startseq']], 0)
    result = []
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        attention_plot[i] = tf.reshape(attention_weights, (-1,)).numpy()
        # Draw the next token id from the predicted distribution.
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])
        if tokenizer.index_word[predicted_id] == 'endseq':
            return result, attention_plot
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, attention_plot[:len(result), :]

An example on a test image yields the predicted caption “person climbing from rock to big rock”, while the ground-truth caption is “person climbing between two big cliffs”. Overall, the article demonstrates how to construct, train, and evaluate an attention-based image captioning model in TensorFlow 2.0.
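For completeness, the outer epoch loop referenced in the training section might look roughly like the sketch below; EPOCHS is the article's 20, but num_steps and the printing cadence are assumptions rather than details taken from the article:

# Sketch of the outer training loop; print frequency and num_steps are assumptions.
EPOCHS = 20
num_steps = len(train_X) // BATCH_SIZE

for epoch in range(EPOCHS):
    total_loss = 0
    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        if batch % 100 == 0:
            avg_batch_loss = batch_loss.numpy() / int(target.shape[1])
            print(f'Epoch {epoch + 1} Batch {batch} Loss {avg_batch_loss:.4f}')
    print(f'Epoch {epoch + 1} Loss {float(total_loss) / num_steps:.6f}')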