
Understanding PyTorch's Backward Propagation Engine (BP Engine)

This article explains how PyTorch's BP Engine dynamically builds the computation graph for back‑propagation, detailing its C++ class structure, thread management, task queues, and key functions such as start_threads, compute_dependencies, execute, and evaluate_function, with illustrative code examples.

Python Programming Learning Circle

After introducing the three pillars of PyTorch autograd (Variable, Function, Engine), the article focuses on the BP Engine class that dynamically constructs the computation graph used during back‑propagation.

An example Python snippet demonstrates tensor operations and a backward call:

import torch

gemfield = torch.ones(2, 2, requires_grad=True)
syszux = gemfield + 2
civilnet = syszux * syszux * 3
gemfieldout = civilnet.mean()
gemfieldout.backward()
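Before looking at the engine, it helps to know the answer it must produce. For gemfieldout = mean(3·(gemfield + 2)²) with gemfield all ones, the chain rule gives ∂gemfieldout/∂gemfield_i = 6·(gemfield_i + 2)/4 = 4.5. A plain-Python sanity check of that derivative (the helper name is illustrative; no torch required):

```python
# Hand-derived gradient of gemfieldout = mean(3 * (x + 2)**2):
# d(out)/d(x_i) = 3 * 2 * (x_i + 2) / N, with N = 4 elements here.
def expected_grad(x_i, n_elements=4):
    """Analytic gradient of mean(3 * (x + 2)**2) w.r.t. one element."""
    return 3 * 2 * (x_i + 2) / n_elements

print(expected_grad(1.0))  # 4.5
```

After gemfieldout.backward(), every entry of gemfield.grad should equal this value.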

The Engine class is defined in C++ as a struct with several type aliases and virtual methods:

struct Engine {
  using ready_queue_type = std::deque<std::pair<std::shared_ptr<Function>, InputBuffer>>;
  using dependencies_type = std::unordered_map<Function*, int>;
  virtual variable_list execute(const edge_list& roots, const variable_list& inputs, ... const edge_list& outputs = {});
  void queue_callback(std::function<void()> callback);
protected:
  void compute_dependencies(Function* root, GraphTask& task);
  void evaluate_function(FunctionTask& task);
  void start_threads();
  virtual void thread_init(int device);
  virtual void thread_main(GraphTask* graph_task);
  std::vector<std::shared_ptr<ReadyQueue>> ready_queues;
};

The start_threads member creates one ReadyQueue per worker thread and launches the threads: one per CUDA device plus one for the CPU, which the engine identifies by device index -1 (hence the i - 1 below). Threads are started with std::thread, passing the Engine instance (this) so all threads share the same Engine object.

ready_queues = std::vector<std::shared_ptr<ReadyQueue>>(num_threads);
for (auto& queue : ready_queues) {
    queue.reset(new ReadyQueue());
}
for (int i = 0; i < num_threads; ++i) {
    std::thread t(&Engine::thread_init, this, i - 1);
    t.detach();
}

The ReadyQueue holds FunctionTask objects in a priority queue, uses a std::condition_variable for synchronization, and provides push and pop methods:

auto ReadyQueue::push(FunctionTask item) -> void {
  {
    std::lock_guard<std::mutex> lock(mutex);
    ++item.base->outstanding_tasks;
    heap.push(std::move(item));
  }
  not_empty.notify_one();
}

auto ReadyQueue::pop() -> FunctionTask {
  std::unique_lock<std::mutex> lock(mutex);
  not_empty.wait(lock, [this]{ return !heap.empty(); });
  auto task = std::move(const_cast<FunctionTask&>(heap.top()));
  heap.pop();
  return task;
}

The thread_init function sets the worker's device ID, creates device guards for CUDA devices, stores the device in worker_device, and then calls thread_main to start processing tasks.

auto Engine::thread_init(int device) -> void {
  at::init_num_threads();
  // create device guards ...
  worker_device = device;
  thread_main(nullptr);
}

The GraphTask struct tracks the state of a backward pass, including outstanding_tasks, keep_graph, grad_mode, a mutex, condition variables, and maps for pending functions and dependencies.

struct GraphTask {
  std::atomic<uint64_t> outstanding_tasks;
  bool keep_graph;
  bool grad_mode;
  std::mutex mutex;
  std::condition_variable not_done;
  std::unordered_map<Function*, InputBuffer> not_ready;
  std::unordered_map<Function*, int> dependencies;
  // ... other members ...
};

A FunctionTask bundles a pointer to the owning GraphTask, a shared pointer to a Function, and its input buffer; these tasks are queued and later executed by worker threads.

struct FunctionTask {
  GraphTask* base;
  std::shared_ptr<Function> fn;
  InputBuffer inputs;
  FunctionTask(GraphTask* base, std::shared_ptr<Function> fn, InputBuffer inputs)
    : base(base), fn(std::move(fn)), inputs(std::move(inputs)) {}
};

The compute_dependencies function walks the graph from the root and increments a dependency count for every function that appears in another function's next_edges(); the resulting map tells the engine how many producers each node must wait for before it is ready to run.
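As a rough model of that counting pass, here is a minimal Python sketch. It is not the C++ implementation: the graph is a plain dict mapping a node name to the nodes its next_edges() point at, and all names are illustrative.

```python
from collections import deque

def compute_dependencies(root, next_edges):
    """Simplified model of Engine::compute_dependencies: breadth-first
    walk from the root, counting how many edges point at each node."""
    dependencies = {}
    seen = {root}
    queue = deque([root])
    while queue:
        fn = queue.popleft()
        for next_fn in next_edges.get(fn, []):
            dependencies[next_fn] = dependencies.get(next_fn, 0) + 1
            if next_fn not in seen:
                seen.add(next_fn)
                queue.append(next_fn)
    return dependencies

# Toy graph: a mean node feeds a mul node, which feeds an add node twice.
edges = {
    "MeanBackward0": ["MulBackward0"],
    "MulBackward0": ["AddBackward0", "AddBackward0"],
}
deps = compute_dependencies("MeanBackward0", edges)
print(deps)  # {'MulBackward0': 1, 'AddBackward0': 2}
```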

The core execute method performs the following steps: (1) start worker threads, (2) create a GraphTask instance, (3) set up local mutexes, (4) build the graph root, (5) compute dependencies, (6) push the initial FunctionTask into the appropriate queue, and (7) wait until all tasks finish.
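Abstracting away the threads and queues, the scheduling that execute sets in motion can be sketched as a single-threaded Python model (illustrative names; the real engine distributes this loop across per-device worker threads and ReadyQueues):

```python
from collections import deque

def execute(root, next_edges, dependencies):
    """Single-threaded model of the execute/drain cycle: seed the ready
    queue with the root task, then release each node into the queue once
    its dependency count (from compute_dependencies) reaches zero."""
    ready = deque([root])          # step 6: push the initial task
    order = []                     # record of execution order
    while ready:                   # step 7: run until no tasks remain
        fn = ready.popleft()
        order.append(fn)           # "call" the backward function
        for next_fn in next_edges.get(fn, []):
            dependencies[next_fn] -= 1
            if dependencies[next_fn] == 0:   # all producers have run
                ready.append(next_fn)
    return order

edges = {"MeanBackward0": ["MulBackward0"],
         "MulBackward0": ["AddBackward0", "AddBackward0"]}
deps = {"MulBackward0": 1, "AddBackward0": 2}
order = execute("MeanBackward0", edges, deps)
print(order)  # ['MeanBackward0', 'MulBackward0', 'AddBackward0']
```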

The thread_main loop repeatedly pops tasks from the thread-specific ReadyQueue, checks that the function is valid and no error has occurred, sets the gradient mode, calls evaluate_function, and updates outstanding_tasks. When a task belongs to the thread that called execute and its counter reaches zero, that thread is notified via not_done.

while (!graph_task || graph_task->outstanding_tasks > 0) {
  FunctionTask task = queue->pop();
  if (task.fn && !task.base->has_error.load()) {
    GradMode::set_enabled(task.base->grad_mode);
    evaluate_function(task);
  }
  // update outstanding_tasks and notify as needed
}

The evaluate_function routine checks whether a node’s dependencies have dropped to zero; if so, the node is sent to the queue, otherwise it is stored in not_ready until all its inputs arrive.
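That bookkeeping can be sketched for a single output edge in Python (a simplified model with illustrative names; the real InputBuffer accumulates gradients into fixed input slots rather than appending to a list):

```python
def on_output(next_fn, grad, state):
    """Sketch of evaluate_function's bookkeeping for one output edge.
    `state` holds `dependencies` (remaining producers per node) and
    `not_ready` (partially filled input buffers). Returns the task to
    queue once the node becomes ready, else None."""
    state["dependencies"][next_fn] -= 1
    buffer = state["not_ready"].setdefault(next_fn, [])
    buffer.append(grad)                    # accumulate into the input buffer
    if state["dependencies"][next_fn] == 0:
        del state["not_ready"][next_fn]    # inputs complete: release the node
        return (next_fn, buffer)           # would be pushed to a ReadyQueue
    return None                            # still waiting in not_ready

# A node with two producers must wait for both gradients:
state = {"dependencies": {"AddBackward0": 2}, "not_ready": {}}
print(on_output("AddBackward0", 1.0, state))  # None: one producer still pending
print(on_output("AddBackward0", 2.0, state))  # ('AddBackward0', [1.0, 2.0])
```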

Finally, call_function runs any pre‑hooks, invokes the function itself (e.g., MeanBackward0, MulBackward0), and then runs post‑hooks, returning the produced variables.

static variable_list call_function(FunctionTask& task) {
  auto& fn = *task.fn;
  auto inputs = call_pre_hooks(fn, InputBuffer::variables(std::move(task.inputs)));
  variable_list outputs = fn(std::move(inputs));
  if (!fn.post_hooks().empty()) {
    return call_post_hooks(fn, std::move(outputs), inputs);
  }
  return outputs;
}

In summary, the article details how PyTorch's Engine creates a multi‑threaded, device‑aware back‑propagation pipeline: it launches one worker thread per device, each thread owns a ready queue, tasks flow between the calling thread and the workers, and the engine coordinates dependencies, synchronization, and resource cleanup to compute gradients efficiently.

Tags: C++, multithreading, PyTorch, engine, backpropagation, GraphTask
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
