Image Captioning with Attention in TensorFlow 2.0: An End-to-End Encoder-Decoder Tutorial
This article walks through building an image‑captioning system using a TensorFlow 2.0 encoder‑decoder with Bahdanau attention, covering dataset preparation, feature extraction with InceptionV3, model architecture, training with teacher forcing, and inference on the Flickr8K dataset.
Image captioning generates a textual description for a given image. The article implements an end‑to‑end encoder‑decoder with attention in Keras/TensorFlow 2.0, using the Flickr8K dataset (≈8,000 images, each with five captions).
Data preparation: The caption file Flickr8k.token.txt is read, each line is split into <image_file>#i and its caption, and the results are stored in a dictionary mapping image names to lists of captions. Captions are cleaned (lower-cased, punctuation removed, short or numeric words discarded), then wrapped with startseq and endseq tokens. A Tokenizer learns the vocabulary, providing vocab_size and the maximum caption length.
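A sketch of this preprocessing, assuming the tab-separated Flickr8k.token.txt format described above; the exact cleaning rules (the regex and the short-word filter) are illustrative rather than the article's own code:

import re
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative sketch of the caption preprocessing; cleaning details are assumptions.
captions = {}
with open('Flickr8k.token.txt') as f:
    for line in f:
        img_id, caption = line.strip().split('\t')            # "<image_file>#i<TAB>caption"
        img_name = img_id.split('#')[0]
        caption = re.sub(r'[^a-z ]', '', caption.lower())     # lower-case, drop punctuation/digits
        words = [w for w in caption.split() if len(w) > 1]    # discard very short words
        captions.setdefault(img_name, []).append('startseq ' + ' '.join(words) + ' endseq')

all_captions = [c for caps in captions.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in all_captions)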
Image feature extraction: A pre-trained InceptionV3 model with its classification head removed (include_top=False) produces an 8×8×2048 feature map for each image, which is reshaped to (64, 2048) and saved as a .npy file named after the image (e.g., 1000268201_693b08cb0e.npy).
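A minimal sketch of this extraction step, assuming 299×299 inputs and InceptionV3 preprocessing; image_paths and the save location are illustrative, while load_image and image_features_extract_model match the names used later in the inference code:

import os
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head; the 8x8x2048 output map
# is reshaped to (64, 2048) and cached on disk for training.
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
image_features_extract_model = tf.keras.Model(image_model.input, image_model.output)

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

for path in image_paths:                                   # image_paths: Flickr8K .jpg files (assumed)
    img, _ = load_image(path)
    features = image_features_extract_model(tf.expand_dims(img, 0))
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
    np.save(os.path.splitext(path)[0], features.numpy()[0])  # written as e.g. 1000268201_693b08cb0e.npy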
The model architecture consists of four components:
Encoder: a dense layer that projects the extracted features to the desired embedding dimension.
BahdanauAttention: computes attention weights from the encoder features and the decoder hidden state, returning a context vector.
RNN_Decoder: an embedding layer, a GRU, and two dense layers; it embeds the input token, concatenates it with the context vector, and outputs scores over the vocabulary.
Sequence generator: the decoder's final dense layer, which maps the GRU outputs to vocabulary-sized logits.
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, embedding_dim); hidden: (batch, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # Additive (Bahdanau) score for each of the 64 spatial locations.
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        # Weighted sum over locations -> context vector of shape (batch, embedding_dim).
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights
class CNN_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super().__init__()
        # Projects the saved (64, 2048) InceptionV3 features to (64, embedding_dim).
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)   # the "sequence generator" layer
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        # Attend over the image features using the previous hidden state.
        context_vector, attention_weights = self.attention(features, hidden)
        # Embed the input token and prepend the context vector along the feature axis.
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.fc2(x)                                # logits over the vocabulary
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.gru.units))

Training pipeline: Image paths and padded caption sequences are paired in a tf.data.Dataset. A mapping function loads the saved .npy feature vectors; the dataset is then shuffled, batched (size 64), and prefetched.
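A sketch of how these pairs might be assembled from the cleaned captions before building the dataset below; the features/ directory prefix and the naming convention are assumptions, chosen to match how map_func later appends '.npy':

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sketch only: pair each feature-file path with one padded caption sequence.
# The 'features/' prefix is an assumed location for the saved .npy files.
train_X, train_y = [], []
for img_name, caps in captions.items():
    for cap in caps:
        train_X.append('features/' + img_name.split('.')[0])   # map_func appends '.npy'
        train_y.append(tokenizer.texts_to_sequences([cap])[0])
train_y = pad_sequences(train_y, maxlen=max_length, padding='post')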
import numpy as np

BATCH_SIZE = 64
BUFFER_SIZE = 1000

def map_func(img_name, cap):
    # Load the pre-extracted InceptionV3 features saved during preprocessing.
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((train_X, train_y))
dataset = dataset.map(lambda i, c: tf.numpy_function(map_func, [i, c], [tf.float32, tf.int32]),
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

Training loop: The loss uses SparseCategoricalCrossentropy with masking for padded tokens. Teacher forcing feeds the ground-truth token at each time step.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask out positions where the target is the padding token (id 0).
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
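The training step below assumes the encoder, decoder, and optimizer already exist. A minimal setup sketch; the embedding_dim and units values (256 and 512) are common choices assumed here, not figures stated in the article:

# Assumed hyperparameters; the article does not state these values explicitly.
embedding_dim = 256
units = 512

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)
optimizer = tf.keras.optimizers.Adam()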
@tf.function
def train_step(img_tensor, target):
    loss = 0
    # Initialize the decoder hidden state and feed 'startseq' as the first input.
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['startseq']] * target.shape[0], 1)
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # Teacher forcing: the ground-truth token becomes the next input.
            dec_input = tf.expand_dims(target[:, i], 1)
    total_loss = loss / int(target.shape[1])
    trainable_vars = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    optimizer.apply_gradients(zip(gradients, trainable_vars))
    return loss, total_loss

The model is trained for 20 epochs, printing batch and epoch losses (a sketch of this outer loop appears at the end of the section).

Teacher forcing: during training, the correct next word from the caption is supplied to the decoder at each time step, preventing the error accumulation that would occur if the model's own predictions were fed back in.

Inference: for a test image, features are extracted and the decoder is run step by step with no teacher forcing, drawing each next token from the predicted distribution; generation stops at endseq or the maximum caption length.

def evaluate(image, max_length):
    # attention_features_shape is 64: the number of InceptionV3 spatial locations.
    attention_plot = np.zeros((max_length, attention_features_shape))
    hidden = decoder.reset_state(batch_size=1)
    # load_image and image_features_extract_model come from the feature-extraction step.
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['startseq']], 0)
    result = []
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        attention_plot[i] = tf.reshape(attention_weights, (-1,)).numpy()
        # Draw the next token id from the predicted distribution.
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])
        if tokenizer.index_word[predicted_id] == 'endseq':
            return result, attention_plot
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, attention_plot[:len(result), :]

An example on a test image yields the predicted caption “person climbing from rock to big rock”, while the ground-truth caption is “person climbing between two big cliffs”. Overall, the article demonstrates how to construct, train, and evaluate an attention-based image captioning model in TensorFlow 2.0.
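For completeness, the outer epoch loop referenced in the training section might look roughly like the sketch below; EPOCHS is the article's 20, but num_steps and the printing cadence are assumptions rather than details taken from the article:

# Sketch of the outer training loop; print frequency and num_steps are assumptions.
EPOCHS = 20
num_steps = len(train_X) // BATCH_SIZE

for epoch in range(EPOCHS):
    total_loss = 0
    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        if batch % 100 == 0:
            avg_batch_loss = batch_loss.numpy() / int(target.shape[1])
            print(f'Epoch {epoch + 1} Batch {batch} Loss {avg_batch_loss:.4f}')
    print(f'Epoch {epoch + 1} Loss {float(total_loss) / num_steps:.6f}')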