Understanding Backpropagation: From Simple to Advanced Neural Network Implementations in Python
This article explains the back‑propagation algorithm in neural networks, starting with a simple single‑neuron example using ReLU, Sigmoid and MSE, then extending to multi‑layer matrix‑based networks, providing detailed Python code, gradient calculations, and comparisons with TensorFlow implementations.
The article begins with a brief introduction to the concept of back‑propagation, outlining the four main steps: forward pass to compute predictions, error calculation, backward error propagation, and parameter updates.
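Those four steps can be sketched in a few lines before any neural-network machinery is introduced. The example below is a hypothetical one-weight linear neuron with a squared-error loss, trained by plain gradient descent; the values of x, true, and lr are made up for illustration.

```python
# Minimal sketch of the four steps for one linear neuron pred = w * x,
# loss = (pred - true)^2, trained by plain gradient descent.
x, true = 2.0, 1.0
w = 0.0
lr = 0.05

for epoch in range(100):
    pred = w * x                      # 1. forward pass
    loss = (pred - true) ** 2         # 2. error calculation
    grad_w = 2 * (pred - true) * x    # 3. backward error propagation (chain rule)
    w -= lr * grad_w                  # 4. parameter update

print(w, loss)
```

After enough iterations w converges to true / x, the weight that drives the loss to zero.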
Definition of Activation and Loss Functions
ReLU, Sigmoid, and Mean Squared Error (MSE) are defined along with their derivatives to facilitate gradient computation. The implementations are provided as Python classes with __call__ and diff methods.
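These derivatives are easy to sanity-check numerically before using them. A standalone sketch (redefining sigmoid locally so it runs on its own) compares the analytic form against a centered finite difference; note that Sigmoid.diff is written in terms of the already-activated value s, since sigmoid'(z) = s·(1 − s).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 0.3
eps = 1e-6

# centered finite difference of sigmoid at z
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# analytic derivative expressed via the activated output, as in Sigmoid.diff
s = sigmoid(z)
analytic = s * (1 - s)

print(numeric, analytic)
```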
import random
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
# ReLU activation
class ReLU:
    def __call__(self, x):
        return np.maximum(0, x)

    def diff(self, x):
        # Derivative of ReLU: 1 where x > 0, 0 elsewhere.
        # (The original in-place version left negative entries unchanged,
        # which only worked when called on already-activated outputs.)
        return (x > 0).astype(float)

# Sigmoid activation
class Sigmoid:
    def __call__(self, x):
        return 1 / (1 + np.exp(-x))

    def diff(self, x):
        # Expects the *activated* output s = sigmoid(z), since s' = s * (1 - s)
        return x * (1 - x)

# MSE loss
class MSE:
    def __call__(self, true, pred):
        return np.mean(np.power(pred - true, 2), keepdims=True)

    def diff(self, true, pred):
        return pred - true

relu = ReLU()
sigmoid = Sigmoid()
mse = MSE()

Simple Backpropagation Example
A single‑neuron network with a sigmoid activation is trained on a randomly generated scalar input x, weight w, bias b, and target true. The forward computation follows x → w·x+b → sigmoid(w·x+b) → MSE(true, sigmoid(w·x+b)). The backward pass uses the chain rule to compute gradients of the loss with respect to w and b, updating them with w -= lr * x * sigmoid.diff(pred) * mse.diff(true, pred) and b -= lr * sigmoid.diff(pred) * mse.diff(true, pred). Training for 520 epochs shows a decreasing loss curve and predictions converging toward the target.
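One detail worth flagging before running the loop: MSE.diff returns pred − true rather than the full derivative 2·(pred − true); the constant factor is simply absorbed into the learning rate. A standalone finite-difference check (with hypothetical fixed values for x, w, b, and true) confirms the chain-rule expression up to that factor:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# fixed example values (hypothetical, for the check only)
x, w, b, true = 0.4, 0.7, 0.1, 0.9

def loss_of_w(w_):
    pred = sigmoid(w_ * x + b)
    return (pred - true) ** 2

# analytic gradient as used in the training loop (factor 2 dropped)
pred = sigmoid(w * x + b)
analytic = x * pred * (1 - pred) * (pred - true)

# centered finite difference of the loss with respect to w
eps = 1e-6
numeric = (loss_of_w(w + eps) - loss_of_w(w - eps)) / (2 * eps)
print(numeric, 2 * analytic)
```

The numeric estimate matches twice the analytic expression, which is exactly the dropped MSE factor.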
x = random.random()
w = random.random()
b = random.random()
true = random.random()
print(f'x={x} true={true}')
lr = 0.3
epochs = 520
loss_history = []
for epoch in range(epochs):
    pred = sigmoid(w * x + b)
    loss = mse(true, pred)
    # chain rule: dL/dw = dL/dpred * dpred/dz * dz/dw, with dz/dw = x
    w -= lr * x * sigmoid.diff(pred) * mse.diff(true, pred)
    b -= lr * sigmoid.diff(pred) * mse.diff(true, pred)
    if epoch % 100 == 0:
        print(f'epoch {epoch}, loss={loss}, pred={pred}')
    loss_history.append(loss)
print(f'epoch {epoch+1}, loss={loss}, pred={pred}')
plt.plot(loss_history)
plt.show()

Advanced Backpropagation with Matrices
The article extends the concept to a three‑neuron hidden layer, using matrix multiplication (@) instead of scalar products. The forward pass becomes x → x@w1+b1 → sigmoid → y@w2+b2 → sigmoid → MSE. Gradient updates for both layers are derived using the chain rule and implemented with NumPy operations such as w2 -= lr * y.T @ sigmoid.diff(pred) * mse.diff(true, pred).
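The hidden-layer expression x.T @ (sigmoid.diff(y) * ((sigmoid.diff(pred) * mse.diff(true, pred)) @ w2.T)) can be verified numerically as well. A standalone sketch (seeded random values are assumptions for the check) perturbs one entry of w1 and compares against the hand-derived gradient, again up to the dropped MSE factor of 2:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((1, 1))
w1, b1 = rng.random((1, 3)), rng.random((1, 3))
w2, b2 = rng.random((3, 1)), rng.random((1, 1))
true = np.array([[0.1]])

def loss(w1_):
    y = sigmoid(x @ w1_ + b1)
    pred = sigmoid(y @ w2 + b2)
    return np.mean((pred - true) ** 2)

# analytic gradient, as in the training loop (factor 2 of MSE dropped)
y = sigmoid(x @ w1 + b1)
pred = sigmoid(y @ w2 + b2)
grad_out = pred * (1 - pred) * (pred - true)
grad_w1 = x.T @ (y * (1 - y) * (grad_out @ w2.T))

# centered finite difference for entry (0, 0) of w1
eps = 1e-6
w1_p, w1_m = w1.copy(), w1.copy()
w1_p[0, 0] += eps
w1_m[0, 0] -= eps
numeric = (loss(w1_p) - loss(w1_m)) / (2 * eps)
print(numeric, 2 * grad_w1[0, 0])
```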
x = np.random.rand(1, 1)
# weights and biases for two layers
w1 = np.random.rand(1, 3)
b1 = np.random.rand(1, 3)
w2 = np.random.rand(3, 1)
b2 = np.random.rand(1, 1)
true = np.array([[0.1]])
lr = 0.1
epochs = 520
loss_history = []
for epoch in range(epochs):
    y = sigmoid(x @ w1 + b1)
    pred = sigmoid(y @ w2 + b2)
    loss = mse(true, pred)
    # update hidden layer first, so its gradient uses the pre-update w2
    w1 -= lr * x.T @ (sigmoid.diff(y) * ((sigmoid.diff(pred) * mse.diff(true, pred)) @ w2.T))
    b1 -= lr * (sigmoid.diff(y) * ((sigmoid.diff(pred) * mse.diff(true, pred)) @ w2.T))
    # update output layer
    w2 -= lr * y.T @ sigmoid.diff(pred) * mse.diff(true, pred)
    b2 -= lr * sigmoid.diff(pred) * mse.diff(true, pred)
    if epoch % 100 == 0:
        print(f'epoch {epoch}, loss={loss}, pred={pred}')
    loss_history.append(loss[0])
print(f'epoch {epoch+1}, loss={mse(true, pred)}, pred={pred}')
plt.plot(loss_history)
plt.show()

Hand‑crafted Neural Network Framework
A minimal neural‑network library is built from scratch. A Linear layer stores weights, bias, and activation, records intermediate values for back‑propagation, and provides an update method that applies gradient descent. The NetWork class assembles three layers (two ReLU, one Sigmoid) and implements fit and backward methods to train on a small dataset.
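Before walking through the classes, here is a condensed, self-contained preview of that design in action. The definitions are compressed versions of the ones below, and the four-feature dataset, targets, seed, learning rate, and epoch count are all made up for the demonstration:

```python
import numpy as np

class ReLU:
    def __call__(self, x): return np.maximum(0, x)
    def diff(self, x): return (x > 0).astype(float)

class Sigmoid:
    def __call__(self, x): return 1 / (1 + np.exp(-x))
    def diff(self, x): return x * (1 - x)   # expects the activated output

class MSE:
    def __call__(self, true, pred): return np.mean((pred - true) ** 2)
    def diff(self, true, pred): return pred - true

relu, sigmoid, mse = ReLU(), Sigmoid(), MSE()
lr = 0.1   # global learning rate read by Linear.update

class Linear:
    def __init__(self, inputs, outputs, activation):
        self.weight = np.random.rand(inputs, outputs) / 10
        self.weight /= self.weight.sum()
        self.bias = np.random.rand(outputs) / 10
        self.bias /= self.bias.sum()
        self.activation = activation

    def __call__(self, x, parent):
        self.x_temp = x                      # cache input for the backward pass
        self.t_temp = self.activation(x @ self.weight + self.bias)
        if self not in parent.layers:
            parent.layers.append(self)
        return self.t_temp

    def update(self, grad):
        d = self.activation.diff(self.t_temp) * grad
        new_grad = d @ self.weight.T         # uses pre-update weights
        self.weight -= lr * self.x_temp.T @ d
        self.bias -= lr * d.mean(axis=0)
        return new_grad

class NetWork:
    def __init__(self):
        self.layers = []
        self.l1 = Linear(4, 16, relu)
        self.l2 = Linear(16, 8, relu)
        self.l3 = Linear(8, 3, sigmoid)

    def __call__(self, x):
        return self.l3(self.l2(self.l1(x, self), self), self)

    def backward(self, true, pred):
        grad = mse.diff(true, pred)
        for layer in reversed(self.layers):
            grad = layer.update(grad)

np.random.seed(0)
x = np.random.rand(8, 4)   # made-up dataset: 8 samples, 4 features
y = np.random.rand(8, 3)   # made-up targets in (0, 1)

net = NetWork()
loss_before = mse(y, net(x))
for _ in range(500):
    net.backward(y, net(x))
loss_after = mse(y, net(x))
print(loss_before, '->', loss_after)
```

The loss falls over training, which is all this toy setup is meant to show; the full, commented classes follow.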
class Linear:
    def __init__(self, inputs, outputs, activation):
        # small, normalized random initialization
        self.weight = np.random.rand(inputs, outputs) / 10
        self.weight = self.weight / self.weight.sum()
        self.bias = np.random.rand(outputs) / 10
        self.bias = self.bias / self.bias.sum()
        self.activation = activation
        self.x_temp = None   # cached layer input for the backward pass
        self.t_temp = None   # cached activated output

    def __call__(self, x, parent):
        self.x_temp = x
        self.t_temp = self.activation(x @ self.weight + self.bias)
        if self not in parent.layers:
            parent.layers.append(self)
        return self.t_temp

    def update(self, grad):
        # gradient through the activation, then through the affine map;
        # new_grad is computed before the weights change
        activation_diff_grad = self.activation.diff(self.t_temp) * grad
        new_grad = activation_diff_grad @ self.weight.T
        self.weight -= lr * self.x_temp.T @ activation_diff_grad  # uses the global lr
        self.bias -= lr * activation_diff_grad.mean(axis=0)
        return new_grad
class NetWork:
    def __init__(self):
        self.layers = []
        self.linear_1 = Linear(4, 16, activation=relu)
        self.linear_2 = Linear(16, 8, activation=relu)
        self.linear_3 = Linear(8, 3, activation=sigmoid)

    def __call__(self, x):
        x = self.linear_1(x, self)
        x = self.linear_2(x, self)
        x = self.linear_3(x, self)
        return x

    def fit(self, x, y, epochs, step=100):
        for epoch in range(epochs):
            pred = self(x)
            self.backward(y, pred)
            if epoch % step == 0:
                print(f'epoch {epoch}, loss={mse(y, pred)}, pred={pred}')
        print(f'epoch {epoch+1}, loss={mse(y, pred)}, pred={pred}')

    def backward(self, true, pred):
        grad = mse.diff(true, pred)
        # propagate the error from the last layer back to the first
        for layer in reversed(self.layers):
            grad = layer.update(grad)

TensorFlow Verification
The custom implementation is validated against TensorFlow. Random inputs and parameters are used to compute a forward pass with tf.nn.relu and tf.nn.sigmoid, followed by tf.keras.losses.mse. Gradients of the loss with respect to the first‑layer weights are obtained via tf.GradientTape and shown to match the gradients produced by the hand‑crafted network.
# x, true, and the parameters w1, b1, w2, b2, w3, b3 are assumed to hold
# the same random values used by the hand-crafted network above
with tf.GradientTape() as tape_1:
    tape_1.watch(w1)
    y = tf.nn.relu(x @ w1 + b1)
    y = tf.nn.sigmoid(y @ w2 + b2)
    y = tf.nn.sigmoid(y @ w3 + b3)
    loss = tf.keras.losses.mse(true, y)

dLoss_dW1 = tape_1.gradient(loss, w1)
print('loss on w1 gradient:', dLoss_dW1.numpy())

Both the TensorFlow and the custom network produce identical gradient values, confirming the correctness of the manual back‑propagation implementation.
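If TensorFlow is not at hand, the same verification can be sketched with plain NumPy finite differences. The shapes follow the hand-crafted network (4 → 16 → 8 → 3, relu → sigmoid → sigmoid); the seed and the /10 weight scaling are assumptions made only to keep the check well-conditioned:

```python
import numpy as np

np.random.seed(1)
x = np.random.rand(1, 4)
true = np.random.rand(1, 3)
w1, b1 = np.random.rand(4, 16) / 10, np.random.rand(1, 16) / 10
w2, b2 = np.random.rand(16, 8) / 10, np.random.rand(1, 8) / 10
w3, b3 = np.random.rand(8, 3) / 10, np.random.rand(1, 3) / 10

def sigmoid(z): return 1 / (1 + np.exp(-z))

def forward(w1_):
    h1 = np.maximum(0, x @ w1_ + b1)
    h2 = sigmoid(h1 @ w2 + b2)
    out = sigmoid(h2 @ w3 + b3)
    return h1, h2, out

h1, h2, out = forward(w1)

# backward pass by hand; the mean over 3 outputs contributes a 1/3,
# and here the MSE derivative keeps its factor 2
g = 2 * (out - true) / true.shape[1]    # dLoss/dout
g = g * out * (1 - out)                 # through the final sigmoid
g = (g @ w3.T) * h2 * (1 - h2)          # through the middle sigmoid layer
g = (g @ w2.T) * (h1 > 0)               # through the ReLU layer
dloss_dw1 = x.T @ g

# finite-difference estimate for entry (0, 0) of w1
eps = 1e-6
w1_p, w1_m = w1.copy(), w1.copy()
w1_p[0, 0] += eps; w1_m[0, 0] -= eps
numeric = (np.mean((forward(w1_p)[2] - true) ** 2)
           - np.mean((forward(w1_m)[2] - true) ** 2)) / (2 * eps)
print(numeric, dloss_dw1[0, 0])
```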
Rare Earth Juejin Tech Community