Why Static Graphs Outperform Dynamic Graphs in AutoDiff: A Deep Dive
This article explains the fundamental differences between static and dynamic computation graphs, compares their memory and performance characteristics, shows how automatic differentiation works in each paradigm, and provides a step‑by‑step implementation of a toy static‑graph AutoDiff engine with Python code examples.
Static vs. Dynamic Graphs
Static graphs (e.g., TensorFlow, Paddle Fluid) use declarative programming: the computation graph is constructed once before any data is fed, and the same graph is executed repeatedly. Dynamic graphs (e.g., PyTorch, Paddle) use imperative programming: the graph is built on‑the‑fly during each forward pass. Dynamic graphs are easier to prototype but typically consume more GPU memory because the system cannot predict which intermediate tensors will become obsolete. They also require rebuilding the graph for every iteration, which reduces computational efficiency and makes deployment on embedded devices difficult; conversion to a static format such as ONNX or TensorRT is usually required.
Dynamic‑Graph AutoDiff
In a dynamic graph the forward and backward computations happen concurrently. During the forward pass a forward cache stores intermediate results, while a backward cache records gradients that can be computed immediately. When output.backward() is called, the runtime traverses the graph, stitches together the cached gradients, and produces the final gradients for each leaf node.
Static‑Graph AutoDiff
Static graphs construct both the forward sub‑graph and the backward sub‑graph during the build phase. Placeholders represent inputs that are fed at runtime, and any node (forward value or gradient) can be fetched via sess.run(). The following TensorFlow example builds a simple graph, computes the output, and obtains the gradients with respect to the inputs:
import tensorflow as tf
X1 = tf.placeholder(tf.float32, shape=(1,), name="X1")
X2 = tf.placeholder(tf.float32, shape=(1,), name="X2")
h1 = tf.multiply(X1, X2)
h2 = tf.add(h1, X1)
output = tf.div(h2, X2)
grad = tf.gradients(output, [X1, X2])
feed_dict = {"X1": 0.6, "X2": 0.2}
sess = tf.Session()
output_v = sess.run(output, feed_dict)
grad_v = sess.run(grad, feed_dict)Because the static graph knows the entire topology beforehand, it can perform memory‑usage optimizations, node fusion, and parallel execution planning, resulting in higher computational efficiency and lower memory consumption. Debugging is harder because ordinary Python print statements cannot be inserted; TensorFlow provides tf.Print nodes or fetching intermediate tensors via the fetch_list argument.
Control Flow in Static Graphs
All possible branches must be declared before graph construction. Conditional logic therefore requires building sub‑graphs for every branch. The following C‑style pseudo‑code illustrates this requirement:
if (X > 2) {
return X * X3;
} else {
return X4 - X;
}Toy Static‑Graph AutoDiff Engine
The minimal AutoDiff framework is open‑sourced at https://github.com/FesianXu/ToyAutoDiff. Its core consists of two classes:
Node : stores a list of input nodes, an optional constant attribute, and a name. It represents a vertex in the directed acyclic graph (DAG).
Op (abstract): defines compute for forward evaluation and gradient for building the backward sub‑graph.
Example implementation of Node:
class Node(object):
def __init__(self):
self.inputs = []
self.op = None
self.const_attr = None
self.name = ""
def __add__(self, other):
if isinstance(other, Node):
new_node = add_op(self, other)
else:
new_node = add_byconst_op(self, other)
return new_node
def __mul__(self, other):
if isinstance(other, Node):
new_node = mul_op(self, other)
else:
new_node = mul_byconst_op(self, other)
return new_node
__radd_ = __add_
__rmul_ = __mul_
def __str__(self):
return self.name
__repr_ = __str_Abstract Op class:
class Op(object):
def __call__(self):
new_node = Node()
new_node.op = self
return new_node
def compute(self, node, input_vals):
raise NotImplementedError
def gradient(self, node, output_grad):
raise NotImplementedErrorConcrete matrix‑multiplication operator:
class MatMulOp(Op):
"""Op to matrix multiply two nodes."""
def __call__(self, node_A, node_B, trans_A=False, trans_B=False):
new_node = Op.__call__(self)
new_node.matmul_attr_trans_A = trans_A
new_node.matmul_attr_trans_B = trans_B
new_node.inputs = [node_A, node_B]
new_node.name = "MatMul(%s,%s,%s,%s)" % (node_A.name, node_B.name, str(trans_A), str(trans_B))
return new_node
def compute(self, node, input_vals):
assert len(input_vals) == 2
assert isinstance(input_vals[0], np.ndarray) and isinstance(input_vals[1], np.ndarray)
return np.dot(input_vals[0], input_vals[1])
def gradient(self, node, output_grad):
"""If Y = A B, then dA = dY B^T, dB = A^T dY"""
return [matmul_op(output_grad, transpose_op(node.inputs[1])),
matmul_op(transpose_op(node.inputs[0]), output_grad)]Gradient computation traverses the graph in reverse topological order, invoking each node’s gradient method and accumulating results:
def gradients(output_node, node_list):
node_to_output_grads = {output_node: [oneslike_op(output_node)]}
reverse_topo = reversed(find_topo_sort([output_node]))
for idx, node in enumerate(reverse_topo):
if idx == 0:
grads = node.op.gradient(node, oneslike_op(output_node))
else:
grads = node.op.gradient(node, node_to_output_grads[node])
if grads is None:
continue
for i, g in enumerate(grads):
inp = node.inputs[i]
if inp in node_to_output_grads:
node_to_output_grads[inp] += g
else:
node_to_output_grads[inp] = g
return [node_to_output_grads[n] for n in node_list]
def find_topo_sort(node_list):
visited = set()
order = []
for n in node_list:
topo_sort_dfs(n, visited, order)
return order
def topo_sort_dfs(node, visited, order):
if node in visited:
return
visited.add(node)
for inp in node.inputs:
topo_sort_dfs(inp, visited, order)
order.append(node)The Executor evaluates a list of target nodes by performing a topological traversal and calling each node’s op.compute method:
class Executor:
"""Executor computes values for a given subset of nodes in a computation graph."""
def __init__(self, eval_node_list):
self.eval_node_list = eval_node_list
def run(self, feed_dict):
node_to_val = dict(feed_dict)
topo_order = find_topo_sort(self.eval_node_list)
for node in topo_order:
if node.inputs:
input_vals = [node_to_val[inp] for inp in node.inputs]
node_to_val[node] = node.op.compute(node=node, input_vals=input_vals)
return [node_to_val[n] for n in self.eval_node_list]With these components a functional static‑graph AutoDiff system is demonstrated. Possible future extensions include shape inference, layer abstractions, parameter initializers, optimizers, loss functions, and ultimately a toy TensorFlow‑like framework.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
