How Gradient Descent Trains Neural Networks: A Blind Hiker’s Journey
This article uses a blindfolded mountain‑climbing analogy to explain how gradient descent trains neural networks, covering cost functions, learning rates, iterative updates, and provides a Python implementation for a simple three‑layer network example.
Imagine you are climbing a mountain blindfolded, feeling the slope under your feet and always stepping in the steepest direction until you think you have reached the foot of the mountain. This process mirrors the key algorithm in neural networks—gradient descent.
1. Neural Networks: A Digital Brain
First, understand what a neural network is. Imagine a machine’s brain composed of thousands of tiny units, similar to neurons in our brain. These units connect, transmit, and process information, enabling the machine to make decisions.
To teach the machine a task (e.g., recognizing handwritten digits), we adjust the strength of these connections so it better understands the data, akin to tuning audio knobs for harmonious sound.
2. Why Descend?
Returning to the climbing metaphor, why move from the mountain top to the foot? In neural networks, the top represents a poor network and the foot a good one. Our goal is to find a method that transforms a "bad" network into a "good" one.
But how do we measure "bad" versus "good"? This is the role of the cost function, which acts like a compass indicating the network’s current position and the direction to move. Using the common mean squared error, the difference is expressed as:

J(θ) = (1/n) Σᵢ (ŷᵢ − yᵢ)²

where ŷ is the network’s predicted output and y is the true label.
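As a small illustration (a sketch added here, not part of the original article), the mean squared error cost can be computed in a few lines of Python; `y_pred` and `y_true` are hypothetical example values:

```python
import numpy as np

def mse_cost(y_pred, y_true):
    """Mean squared error: average of (prediction - label)^2."""
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([0.0, 1.0, 1.0, 0.0])  # true labels
y_pred = np.array([0.1, 0.8, 0.9, 0.2])  # network's predictions
print(mse_cost(y_pred, y_true))          # small cost = "good" network
```

A large value means the network is high on the mountain; training tries to push this number toward zero.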
3. Finding the Best Descent Path
We know we need to reach the foot, but how? We must sense the slope: at each step, the blindfolded hiker feels the ground and moves in the steepest downhill direction.
In neural networks we use a mathematical tool called the "gradient" to find this direction. The gradient indicates how to adjust the network’s connections for improvement. Mathematically, it collects the partial derivatives of the cost with respect to every weight:

∇J(θ) = (∂J/∂θ₁, ∂J/∂θ₂, …, ∂J/∂θₙ)

where ∇J is the gradient of the cost function and θ are the network’s weights.
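To make the gradient concrete, here is a minimal sketch (an illustration added here, with made-up function names) that estimates ∂J/∂θ numerically by finite differences for a one-parameter toy cost:

```python
def cost(theta):
    """Toy cost: J(theta) = (theta - 3)^2, minimized at theta = 3."""
    return (theta - 3.0) ** 2

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference estimate of dJ/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

g = numerical_gradient(cost, 5.0)  # analytic gradient is 2*(5-3) = 4
print(g)
```

The sign of the gradient tells the hiker which way is uphill; stepping against it moves downhill. Real frameworks compute these derivatives analytically via backpropagation rather than by finite differences.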
4. Take One Step After Another
Each adjustment is a small step. We don’t want to move too fast, or we might miss the optimal path, similar to not turning audio knobs too quickly.
The size of each step is called the "learning rate". It determines how large each update is. The update rule is:

θ ← θ − α ∇J(θ)

where α is the learning rate that decides the step size.
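The update rule is a one-liner in code. This sketch (an added illustration, reusing the toy cost J(θ) = (θ − 3)² whose gradient is 2(θ − 3)) shows a single step:

```python
def gradient(theta):
    """Analytic gradient of the toy cost J(theta) = (theta - 3)^2."""
    return 2 * (theta - 3.0)

theta = 5.0
alpha = 0.1  # learning rate: too large overshoots, too small crawls
theta = theta - alpha * gradient(theta)  # one gradient descent step
print(theta)  # 5.0 - 0.1 * 4.0 = 4.6, one step closer to the minimum at 3
```

Note that the step is taken against the gradient (the minus sign): the gradient points uphill, and the hiker walks downhill.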
5. Repeat Until Perfect
We repeatedly adjust until the network becomes good enough. This can take many iterations, because real training involves large datasets and models with many parameters.
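Putting the pieces together, the whole "repeat until good enough" loop can be sketched in a few lines (an added toy example, not the article’s network): step downhill until the ground feels flat.

```python
def cost(theta):
    return (theta - 3.0) ** 2  # toy cost, minimum at theta = 3

def gradient(theta):
    return 2 * (theta - 3.0)   # its derivative

theta = 10.0                   # start high on the "mountain"
alpha = 0.1
for step in range(1000):
    g = gradient(theta)
    if abs(g) < 1e-8:          # flat ground: stop descending
        break
    theta -= alpha * g         # one step against the gradient
print(theta)                   # converges to ~3.0
```

Each pass shrinks the distance to the minimum by a constant factor here; with real networks the terrain is far bumpier, which is why training can take many epochs.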
Ultimately, gradient descent enables us to train powerful neural networks that make accurate decisions.
6. Summary
Gradient descent in neural networks is like a blind hiker feeling the slope to find the best downhill path. By adjusting connections, machines better understand data and make more accurate decisions. Though it sounds complex, it is a simple iterative process aiming for the optimal network.
Appendix – Designing Gradient Descent from Scratch
<code>import numpy as np

def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid, where x is the sigmoid output."""
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases with random values
        self.weights_input_to_hidden = np.random.rand(input_size, hidden_size)
        self.weights_hidden_to_output = np.random.rand(hidden_size, output_size)
        self.bias_hidden = np.random.rand(1, hidden_size)
        self.bias_output = np.random.rand(1, output_size)
        self.learning_rate = 0.01

    def forward(self, x):
        """Forward pass through the network."""
        self.input = np.array([x])
        self.hidden_activation = sigmoid(np.dot(self.input, self.weights_input_to_hidden) + self.bias_hidden)
        self.output = sigmoid(np.dot(self.hidden_activation, self.weights_hidden_to_output) + self.bias_output)
        return self.output

    def backward(self, y):
        """Backward pass to update weights and biases."""
        output_error = y - self.output
        output_delta = output_error * sigmoid_derivative(self.output)
        hidden_error = output_delta.dot(self.weights_hidden_to_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_activation)
        # Update weights
        self.weights_input_to_hidden += self.input.T.dot(hidden_delta) * self.learning_rate
        self.weights_hidden_to_output += self.hidden_activation.T.dot(output_delta) * self.learning_rate
        # Update biases
        self.bias_hidden += np.sum(hidden_delta, axis=0) * self.learning_rate
        self.bias_output += np.sum(output_delta, axis=0) * self.learning_rate

    def train(self, X, y, epochs):
        """Train the neural network using gradient descent."""
        for epoch in range(epochs):
            for xi, yi in zip(X, y):
                self.forward(xi)
                self.backward(yi)

    def predict(self, x):
        """Predict using the trained neural network."""
        return self.forward(x)

# Sample usage with dummy data (XOR)
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000)
predictions = nn.predict(X)
predictions
</code>

Result:
<code>array([[[0.48384395],
[0.50758334],
[0.50515687],
[0.51856939]]])
</code>
The code above covers the sigmoid activation and its derivative, the neural network class definition, initialization of weights and biases, the forward and backward propagation methods, the training loop, and the prediction method.
Reference: Sanderson, G. (2017, October 16; updated 2023, August 30). Gradient descent, how neural networks learn. Adapted by J. Pullen. 3Blue1Brown. https://www.3blue1brown.com/lessons/gradient-descent
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".