Building Efficient Data Pipelines with TensorFlow’s tf.data API

This article explains how to use TensorFlow’s tf.data API to construct high‑performance, flexible data pipelines—from loading images or tensors, applying transformations and data augmentation, to batching, shuffling, caching, prefetching, and feeding the pipeline directly into model.fit for training.

The article introduces TensorFlow’s built‑in tf.data API, emphasizing that data handling is the backbone of a machine‑learning pipeline and that modern hardware such as TPUs and GPUs can process data in parallel only if the input pipeline keeps up.

tf.data Overview

tf.data focuses on three goals: performance, flexibility, and ease of use. Its design lets the input pipeline exploit hardware accelerators during both forward and backward passes, supports many data formats without external tools, and provides an ETL-style (extract, transform, load) flow from source to model.
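In code, the three ETL stages correspond to creating a Dataset, applying map-style transformations, and handing the result to the model. A minimal end-to-end sketch; the data here is synthetic and purely illustrative:

import tensorflow as tf

# Extract: build a Dataset from in-memory tensors.
xs = tf.random.uniform((100, 8))
ys = tf.random.uniform((100,), maxval=2, dtype=tf.int32)
ds = tf.data.Dataset.from_tensor_slices((xs, ys))

# Transform: per-element preprocessing, then batching.
ds = ds.map(lambda x, y: (x * 2.0, y)).batch(16)

# Load: iterate the batches (or pass the dataset to model.fit).
for batch_x, batch_y in ds.take(1):
    print(batch_x.shape, batch_y.shape)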

Data Extraction

Typical datasets reside on disk, in memory, or across distributed file systems. tf.data offers several ways to create a Dataset object:

From a list of file paths using tf.data.Dataset.list_files.

import glob
import tensorflow as tf

train_path = '/content/ICLR/train/train/'
train_list = glob.glob(train_path + "*")           # one entry per class subdirectory
train_images_list = glob.glob(train_path + "*/*")  # one entry per image file
train_ds = tf.data.Dataset.list_files(train_images_list)
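Since list_files yields a dataset of file-path strings, it can be inspected directly; a quick sanity check:

# peek at a few file paths (list_files shuffles them by default)
for path in train_ds.take(3):
    print(path.numpy().decode("utf-8"))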

From tensors, NumPy arrays, or Python lists via from_tensor_slices.

import numpy as np

# from a NumPy array
np_array = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 6]])
ds = tf.data.Dataset.from_tensor_slices(np_array)

# from a Python list
py_list = [1, 2, 3, 4, 4]
ds = tf.data.Dataset.from_tensor_slices(py_list)

# from a SparseTensor (from_tensors keeps it as a single dataset element)
sparse_dataset = tf.data.Dataset.from_tensors(
    tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))
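from_tensor_slices also accepts a tuple, slicing features and labels in lockstep; a small illustrative example:

# pair features with labels element-by-element
features = np.array([[1, 2], [3, 4], [5, 6]])
labels = np.array([0, 1, 0])
pairs = tf.data.Dataset.from_tensor_slices((features, labels))
for x, y in pairs:
    print(x.numpy(), y.numpy())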

From a Python generator using from_generator.

def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes=())
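A generator-backed dataset behaves like any other and can be batched and iterated; for example:

# consume the counter in batches of 5 (prints [0 1 2 3 4], then [5 6 7 8 9])
for batch in ds_counter.batch(5).take(2):
    print(batch.numpy())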

Transformation Stage

After creating a dataset, the transformation stage converts raw data into tensors suitable for a neural network. The article provides an image‑preprocessing pipeline that decodes JPEG files, normalizes pixel values, resizes, and applies random flips and brightness adjustments.

IMG_WIDTH = 224
IMG_HEIGHT = 224

def decode_img(img):
    # Decode raw JPEG bytes, scale pixels to [0, 1], resize, then
    # apply random augmentations (flips and a brightness jitter of up to 0.3).
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_flip_up_down(img)
    img = tf.image.random_brightness(img, 0.3)
    return img

def get_label(path):
    # The parent directory name encodes the class; comparing it against
    # class_names (defined below) yields a one-hot boolean label vector.
    part_list = tf.strings.split(path, "/")
    return part_list[-2] == class_names

def process_path(file_path):
    label = get_label(file_path)
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label
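The functions above reference a class_names array that the excerpt never defines. A plausible definition, assuming the directory-per-class layout implied by train_path (this helper is an assumption, not from the article):

import os
import numpy as np

# Assumed: one subdirectory per class under train_path.
class_names = np.array(sorted(os.path.basename(p) for p in glob.glob(train_path + "*")))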

The dataset is then mapped with this function, enabling parallel execution:

num_threads = 5
train_ds = train_ds.map(process_path, num_parallel_calls=num_threads)
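Instead of a fixed thread count, the level of parallelism can be delegated to the tf.data runtime; a common alternative not shown in the article:

# let tf.data tune the number of parallel calls dynamically
train_ds = train_ds.map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)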

Performance Optimizations

Typical pipeline steps include caching, shuffling, repeating, batching, and prefetching. The article notes that caching the entire dataset in memory is not advisable for very large datasets.

train_ds = train_ds.cache()         # keep decoded elements in memory after the first pass
train_ds = train_ds.shuffle(10000)  # shuffle within a 10,000-element buffer
train_ds = train_ds.repeat(num_epochs)
train_ds = train_ds.batch(128)
train_ds = train_ds.prefetch(tf.data.experimental.AUTOTUNE)  # overlap producer and consumer
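The same steps are often written as one fluent chain; an equivalent form (a style choice only, same behavior):

train_ds = (train_ds
            .cache()
            .shuffle(10000)
            .repeat(num_epochs)
            .batch(128)
            .prefetch(tf.data.experimental.AUTOTUNE))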

Prefetching lets the CPU prepare the next batch while the GPU/TPU processes the current one, maximizing hardware utilization. Two diagrams in the original article (kept as images) show the execution order of CPU and GPU/TPU with prefetch.
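To make the effect concrete, one can time an artificial pipeline with and without prefetch. A minimal sketch; the sleep durations are made-up stand-ins for real preprocessing and training work:

import time

def slow_fn(x):
    time.sleep(0.01)  # simulate expensive per-element preprocessing
    return x

base = tf.data.Dataset.range(300).map(
    lambda x: tf.py_function(slow_fn, [x], tf.int64))

def benchmark(ds):
    start = time.perf_counter()
    for _ in ds:
        time.sleep(0.005)  # simulate the accelerator consuming an element
    return time.perf_counter() - start

print("without prefetch:", benchmark(base))
print("with prefetch:   ", benchmark(base.prefetch(tf.data.experimental.AUTOTUNE)))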

Feeding the Pipeline to the Model

There are two ways to consume the dataset during training:

Iterate manually with a for loop over epochs and batches.

batch_size = 128  # must match the batch size used above
steps_per_epc = len(train_images_list) // batch_size  # take() needs an integer count

for epoch in range(num_epochs):
    for images, labels in train_ds.take(steps_per_epc):
        execute_model(images, labels)  # placeholder for a custom training step
        ...

Pass the dataset directly to model.fit, letting the tf.keras API manage the whole pipeline.

model.fit(train_ds, epochs=num_epochs, steps_per_epoch=steps_per_epc)
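For completeness, a minimal sketch of a Keras model that could consume this pipeline; the architecture is a toy example invented here, not taken from the article:

# Hypothetical toy classifier, for illustration only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(class_names), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])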

A diagram in the original article illustrates the tf.keras training process.

Conclusion

The article wraps up by reminding readers that the tf.data API offers many additional built‑in features for constructing robust data input pipelines in TensorFlow.
