
How Gradient Descent Trains Neural Networks: A Blind Hiker’s Journey

This article uses a blindfolded mountain‑climbing analogy to explain how gradient descent trains neural networks, covering cost functions, learning rates, and iterative updates, and closing with a Python implementation of a simple three‑layer network.

Imagine you are climbing a mountain blindfolded, feeling the slope under your feet and always stepping in the steepest direction until you think you have reached the foot of the mountain. This process mirrors the key algorithm in neural networks—gradient descent.

1. Neural Networks: A Digital Brain

First, understand what a neural network is. Imagine a machine’s brain composed of thousands of tiny units, similar to neurons in our brain. These units connect, transmit, and process information, enabling the machine to make decisions.

To teach the machine a task (e.g., recognizing handwritten digits), we adjust the strength of these connections so it better understands the data, akin to tuning audio knobs for harmonious sound.

2. Why Descend?

Returning to the climbing metaphor, why move from the mountain top to the foot? In neural networks, the top represents a poor network and the foot a good one. Our goal is to find a method that transforms a "bad" network into a "good" one.

But how do we measure "bad" versus "good"? This is the role of the cost function, which acts like a compass indicating the network’s current position and the direction to move. A common choice is the mean squared error, which measures the difference between predictions and labels:

J(θ) = (1/2n) Σᵢ (ŷᵢ − yᵢ)²

where ŷ is the network’s predicted output, y is the true label, and n is the number of training examples.
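As a minimal sketch (with made-up numbers, not data from the article), the cost function above can be computed in a few lines:

```python
import numpy as np

# Toy predictions ŷ and labels y for three examples (illustrative values only)
y_true = np.array([1.0, 0.0, 1.0])   # true labels y
y_pred = np.array([0.9, 0.2, 0.6])   # network predictions ŷ

# Mean squared error cost: J = (1/2n) Σ (ŷ − y)²
cost = 0.5 * np.mean((y_pred - y_true) ** 2)
```

A lower cost means the predictions sit closer to the labels, i.e. the hiker is nearer the foot of the mountain.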

3. Finding the Best Descent Path

We know we need to reach the foot, but how? We must sense the slope (the gradient), and at each step move in the steepest downhill direction.

In neural networks we use a mathematical tool called the "gradient" to find this direction. The gradient indicates how to adjust each of the network’s connections for improvement. Mathematically, it collects the partial derivatives of the cost with respect to every weight:

∇J(θ) = (∂J/∂θ₁, ∂J/∂θ₂, …, ∂J/∂θₙ)

where ∇J(θ) is the gradient of the cost function and θ are the network’s weights.
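As a hedged illustration (a toy one-parameter cost, not the article’s network), the gradient can be approximated numerically with finite differences, which is also a handy way to sanity-check analytic gradients:

```python
def cost(theta):
    # Toy cost: J(θ) = (θ − 3)², minimized at θ = 3
    return (theta - 3.0) ** 2

def numerical_gradient(f, theta, eps=1e-6):
    # Central-difference approximation of the derivative dJ/dθ
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

g = numerical_gradient(cost, 5.0)  # analytic gradient is 2(5 − 3) = 4
```

A positive gradient at θ = 5 tells the hiker that downhill lies in the negative direction.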

4. Take One Step After Another

Each adjustment is a small step. We don’t want to move too fast, or we might miss the optimal path, similar to not turning audio knobs too quickly.

The size of each step is called the "learning rate". It determines how large each update is. The update rule is:

θ ← θ − α ∇J(θ)

where α is the learning rate that decides the step size.
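A single update step, sketched on the same toy cost J(θ) = (θ − 3)² (assumed here for illustration, not part of the article):

```python
theta = 5.0   # current position of the hiker
alpha = 0.1   # learning rate α: how big a step to take

grad = 2 * (theta - 3.0)      # gradient of (θ − 3)² at the current θ
theta = theta - alpha * grad  # update rule: θ ← θ − α ∇J(θ)
```

One step moves θ from 5.0 to 4.6: a small, controlled move toward the minimum at θ = 3, not a leap.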

5. Repeat Until Perfect

We repeatedly adjust until the network becomes sufficiently good. This may take a long time because training involves large datasets and complex models.
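Repeating that step many times is the whole algorithm. A minimal loop on the same toy cost (illustrative only) shows the iterates converging to the minimum:

```python
theta, alpha = 5.0, 0.1
for step in range(200):
    grad = 2 * (theta - 3.0)   # gradient of J(θ) = (θ − 3)²
    theta -= alpha * grad      # θ ← θ − α ∇J(θ), repeated until good enough
# theta has now descended to (almost exactly) the minimum at θ = 3
```

Each iteration shrinks the distance to the minimum by a constant factor (here 0.8), so after a few hundred steps the hiker is effectively at the foot of the mountain.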

Ultimately, gradient descent enables us to train powerful neural networks that make accurate decisions.

6. Summary

Gradient descent in neural networks is like a blind hiker feeling the slope to find the best downhill path. By adjusting connections, machines better understand data and make more accurate decisions. Though it sounds complex, it is a simple iterative process aiming for the optimal network.

Appendix – Designing Gradient Descent from Scratch

<code>import numpy as np

def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    """Derivative of the sigmoid, where s is the sigmoid's output (s = sigmoid(x))."""
    return s * (1 - s)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases
        self.weights_input_to_hidden = np.random.rand(input_size, hidden_size)
        self.weights_hidden_to_output = np.random.rand(hidden_size, output_size)
        self.bias_hidden = np.random.rand(1, hidden_size)
        self.bias_output = np.random.rand(1, output_size)
        self.learning_rate = 0.01
    def forward(self, x):
        """Forward pass through the network."""
        self.input = np.atleast_2d(x)  # ensure a 2-D row matrix (works for single samples and batches)
        self.hidden_activation = sigmoid(np.dot(self.input, self.weights_input_to_hidden) + self.bias_hidden)
        self.output = sigmoid(np.dot(self.hidden_activation, self.weights_hidden_to_output) + self.bias_output)
        return self.output
    def backward(self, y):
        """Backward pass to update weights and biases."""
        output_error = y - self.output
        output_delta = output_error * sigmoid_derivative(self.output)
        hidden_error = output_delta.dot(self.weights_hidden_to_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_activation)
        # Update weights
        self.weights_input_to_hidden += self.input.T.dot(hidden_delta) * self.learning_rate
        self.weights_hidden_to_output += self.hidden_activation.T.dot(output_delta) * self.learning_rate
        # Update biases
        self.bias_hidden += np.sum(hidden_delta, axis=0) * self.learning_rate
        self.bias_output += np.sum(output_delta, axis=0) * self.learning_rate
    def train(self, X, y, epochs):
        """Training the neural network using gradient descent."""
        for epoch in range(epochs):
            for xi, yi in zip(X, y):
                self.forward(xi)
                self.backward(yi)
    def predict(self, x):
        """Predicting using the trained neural network."""
        return self.forward(x)

# Sample usage with dummy data
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000)

predictions = nn.predict(X)
predictions</code>

Result:

<code>array([[0.48384395],
       [0.50758334],
       [0.50515687],
       [0.51856939]])
</code>

The code defines the sigmoid activation and its derivative, a neural network class with randomly initialized weights and biases, forward and backward propagation methods, a training loop that applies one gradient-descent update per sample, and a prediction method. Note that with this small learning rate and uniform random initialization, the outputs remain close to 0.5 even after 10,000 epochs; a larger learning rate (e.g., 0.5) or zero-centered weight initialization typically lets the network separate the XOR classes.

Reference: Sanderson, G. (2017, October 16). Gradient descent, how neural networks learn. Adapted by J. Pullen. 3Blue1Brown. Updated 2023, August 30. https://www.3blue1brown.com/lessons/gradient-descent

Tags: machine learning, neural network, Python, AI, gradient descent, backpropagation
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
