Unlocking 3D Scene Synthesis: A Deep Dive into Neural Radiance Fields (NeRF)
This article explains the core principles of Neural Radiance Fields, covering how a fully‑connected network maps 5‑D coordinates to color and density and the roles of positional encoding and hierarchical sampling, and provides a complete PyTorch implementation with training and rendering examples.
Neural Radiance Fields (NeRF) represent a 3D scene with a fully‑connected neural network that maps a 5‑dimensional input—spatial coordinates (x, y, z) and view direction (θ, φ)—to RGB color and volume density (σ). Training uses multiple photographs of a single scene, deliberately over‑fitting the network so it becomes an expert for that specific environment.
Core Concepts
The network treats color as a function of both position and view direction, while density depends only on position, assuming material opacity does not change with viewpoint. This separation reduces model complexity.
Related Work
Before NeRF, discrete representations such as voxels and meshes outperformed neural scene encodings. Early attempts mapped coordinates to occupancy or distance fields but were limited to synthetic datasets like ShapeNet. NeRF introduced differentiable volumetric rendering to achieve high‑quality view synthesis.
Scene Representation Mechanism
Input vectors are split into a position x = (x, y, z) and a viewing direction d = (θ, φ), represented in practice as a 3‑D unit vector. An MLP with eight fully‑connected layers processes the encoded position and outputs the density σ plus a 256‑dimensional feature vector, which is concatenated with the encoded direction before two final layers produce the RGB value.
Volume Rendering
Rendering proceeds in three steps: sample points along each ray, predict color and density with the MLP, and integrate these values using volumetric rendering to obtain a 2‑D pixel color. The integration has no closed form because both color and density are outputs of the network, so numerical quadrature with hierarchical sampling is used.
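Concretely, the expected color of a ray with samples t_1, …, t_N is approximated by the quadrature rule from the original paper:

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),
\qquad
\delta_i = t_{i+1} - t_i
```

Here T_i is the accumulated transmittance (the probability the ray travels to sample i without being absorbed), and the alpha term 1 − exp(−σ_i δ_i) is the contribution of sample i. This is exactly what the `render_rays` function below computes.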
Positional Encoding
Directly feeding raw 5‑D coordinates limits the network to low‑frequency detail. NeRF therefore applies a Fourier feature mapping (sin/cos at geometrically increasing frequencies) to each scalar coordinate, expanding it into 2L values: with L=10 for position this yields 60 dimensions across the three spatial axes, and with L=4 for direction it yields 24. This mapping enables the network to capture fine detail.
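Applied to each scalar coordinate p, the encoding is:

```latex
\gamma(p) = \bigl(\sin(2^{0}\pi p),\ \cos(2^{0}\pi p),\ \ldots,\ \sin(2^{L-1}\pi p),\ \cos(2^{L-1}\pi p)\bigr)
```

Each scalar thus maps to 2L features, sampled at frequencies that double at each level.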
Hierarchical Sampling
Uniform sampling wastes computation on empty space. NeRF trains a coarse network to estimate where density is high, then performs importance sampling (inverse‑transform) in those regions with a fine network, reducing variance and focusing compute on important parts of the scene.
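The fine network draws its extra samples from the distribution defined by the coarse rendering weights via inverse‑transform sampling. A minimal sketch of that step (the function name and interface are illustrative; the implementation below in this article uses only the coarse network):

```python
import torch

def sample_pdf(bins, weights, n_samples):
    # bins:    (B, N+1) edges of the coarse sample intervals along each ray
    # weights: (B, N)   coarse rendering weights, treated as an unnormalized pdf
    weights = weights + 1e-5                       # avoid zero-probability bins
    pdf = weights / weights.sum(-1, keepdim=True)
    cdf = torch.cumsum(pdf, -1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], -1)      # (B, N+1)
    # Uniform draws in [0, 1), mapped through the inverse CDF.
    u = torch.rand(weights.shape[0], n_samples, device=weights.device)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    lo, hi = idx - 1, idx
    cdf_lo, cdf_hi = torch.gather(cdf, -1, lo), torch.gather(cdf, -1, hi)
    bins_lo, bins_hi = torch.gather(bins, -1, lo), torch.gather(bins, -1, hi)
    denom = (cdf_hi - cdf_lo).clamp(min=1e-8)
    t = (u - cdf_lo) / denom                       # linear interpolation in the bin
    return bins_lo + t * (bins_hi - bins_lo)
```

Samples concentrate in intervals where the coarse weights are large, which is what focuses the fine network's capacity on occupied space.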
Implementation
The following PyTorch code implements the key components.
import os, json, math
import numpy as np
from PIL import Image
import torch
import torch.nn as nn
import torch.nn.functional as F
def positional_encoding(x, L):
    # Frequencies pi * 2^0 ... pi * 2^(L-1), applied to every input component.
    freqs = (2.0 ** torch.arange(L, device=x.device)) * math.pi
    xb = x[..., None, :] * freqs[:, None]                 # (..., L, D)
    xb = xb.reshape(*x.shape[:-1], L * x.shape[-1])       # flatten frequency axis
    return torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)
def get_rays(H, W, camera_angle_x, c2w, device):
    # Pinhole intrinsics: focal length derived from the horizontal field of view.
    fx = 0.5 * W / math.tan(0.5 * camera_angle_x)
    cx = (W - 1) * 0.5
    cy = (H - 1) * 0.5
    i, j = torch.meshgrid(torch.arange(W, device=device),
                          torch.arange(H, device=device), indexing="xy")
    i, j = i.float(), j.float()
    # Camera-space ray directions (OpenGL convention: camera looks down -z).
    x = (i - cx) / fx
    y = -(j - cy) / fx
    z = -torch.ones_like(x)
    dirs = torch.stack([x, y, z], dim=-1)
    dirs = dirs / torch.norm(dirs, dim=-1, keepdim=True)
    # Rotate directions into world space and broadcast the camera origin.
    R, t = c2w[:3, :3], c2w[:3, 3]
    rd = dirs @ R.T
    ro = t.expand_as(rd)
    return ro, rd
class NeRF(nn.Module):
    def __init__(self, L_pos=10, L_dir=4, hidden=256):
        super().__init__()
        in_pos = 3 + 2 * L_pos * 3   # raw coordinate plus sin/cos features
        in_dir = 3 + 2 * L_dir * 3
        self.fc1 = nn.Linear(in_pos, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.fc4 = nn.Linear(hidden, hidden)
        self.fc5 = nn.Linear(hidden + in_pos, hidden)  # receives the skip connection
        self.fc6 = nn.Linear(hidden, hidden)
        self.fc7 = nn.Linear(hidden, hidden)
        self.fc8 = nn.Linear(hidden, hidden)
        self.sigma = nn.Linear(hidden, 1)
        self.feat = nn.Linear(hidden, hidden)
        self.rgb1 = nn.Linear(hidden + in_dir, 128)
        self.rgb2 = nn.Linear(128, 3)
        self.L_pos, self.L_dir = L_pos, L_dir

    def forward(self, x, d):
        x_enc = torch.cat([x, positional_encoding(x, self.L_pos)], dim=-1)
        d_enc = torch.cat([d, positional_encoding(d, self.L_dir)], dim=-1)
        h = F.relu(self.fc1(x_enc))
        h = F.relu(self.fc2(h))
        h = F.relu(self.fc3(h))
        h = F.relu(self.fc4(h))
        h = torch.cat([h, x_enc], dim=-1)  # skip connection
        h = F.relu(self.fc5(h))
        h = F.relu(self.fc6(h))
        h = F.relu(self.fc7(h))
        h = F.relu(self.fc8(h))
        sigma = F.relu(self.sigma(h))      # density depends on position only
        feat = self.feat(h)
        h = torch.cat([feat, d_enc], dim=-1)
        h = F.relu(self.rgb1(h))
        rgb = torch.sigmoid(self.rgb2(h))  # color also depends on view direction
        return rgb, sigma
def render_rays(model, ro, rd, near=2.0, far=6.0, N=64):
    t = torch.linspace(near, far, N, device=ro.device)
    pts = ro[:, None, :] + rd[:, None, :] * t[None, :, None]   # (B, N, 3)
    dirs = rd[:, None, :].expand_as(pts)
    rgb, sigma = model(pts.reshape(-1, 3), dirs.reshape(-1, 3))
    rgb = rgb.reshape(ro.shape[0], N, 3)
    sigma = sigma.reshape(ro.shape[0], N)
    # Distances between adjacent samples; the last interval is effectively infinite.
    delta = t[1:] - t[:-1]
    delta = torch.cat([delta, torch.tensor([1e10], device=ro.device)])
    alpha = 1 - torch.exp(-sigma * delta)
    # Accumulated transmittance: probability the ray reaches each sample unoccluded.
    T = torch.cumprod(torch.cat([torch.ones((ro.shape[0], 1), device=ro.device),
                                 1 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = T * alpha
    return (weights[..., None] * rgb).sum(dim=1)

The training loop samples random rays from the dataset, renders them with the coarse network, and minimizes the mean‑squared error against the ground‑truth RGB values. PSNR is logged every 200 iterations.
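The loop calls a load_dataset helper that is not defined above. A minimal sketch, assuming the common NeRF‑synthetic layout (a transforms_train.json holding camera_angle_x and per‑frame transform_matrix entries alongside the PNG frames; the exact layout of the article's cube‑and‑sphere dataset is an assumption):

```python
import json
import os

import numpy as np
import torch
from PIL import Image

def load_dataset(root, split="train"):
    # Read a NeRF-synthetic-style transforms file: a global camera_angle_x plus
    # a list of frames, each with a file_path and a 4x4 camera-to-world matrix.
    with open(os.path.join(root, f"transforms_{split}.json")) as f:
        meta = json.load(f)
    images, c2ws = [], []
    for frame in meta["frames"]:
        img = Image.open(os.path.join(root, frame["file_path"] + ".png")).convert("RGB")
        images.append(torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0))
        c2ws.append(torch.tensor(frame["transform_matrix"], dtype=torch.float32))
    images = torch.stack(images)   # (num_frames, H, W, 3), values in [0, 1]
    c2ws = torch.stack(c2ws)       # (num_frames, 4, 4)
    H, W = images.shape[1:3]
    return images, c2ws, H, W, float(meta["camera_angle_x"])
```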
device = "cuda" if torch.cuda.is_available() else "cpu"
images, c2ws, H, W, fov = load_dataset("nerf_synth_cube_sphere")
images, c2ws = images.to(device), c2ws.to(device)
model = NeRF().to(device)
opt = torch.optim.Adam(model.parameters(), lr=5e-4)
for it in range(1, 5001):
    # Pick a random training image and sample 2048 of its rays.
    idx = torch.randint(0, images.shape[0], (1,)).item()
    ro, rd = get_rays(H, W, fov, c2ws[idx], device)
    gt = images[idx].reshape(-1, 3)
    sel = torch.randint(0, ro.numel() // 3, (2048,), device=device)
    pred = render_rays(model, ro.reshape(-1, 3)[sel], rd.reshape(-1, 3)[sel])
    loss = F.mse_loss(pred, gt[sel])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if it % 200 == 0:
        # PSNR = -10 * log10(MSE), valid because pixel values lie in [0, 1].
        psnr = -10 * torch.log10(loss).item()
        print(f"Iter {it} | Loss {loss.item():.6f} | PSNR {psnr:.2f} dB")
torch.save(model.state_dict(), "nerf_cube_sphere_coarse.pth")

Novel‑View Synthesis
After training, the model can render unseen viewpoints by constructing camera poses with a simple look_at function and feeding the generated rays through the rendering pipeline.
def look_at(eye):
    # Build a camera-to-world matrix that looks from `eye` toward the origin.
    eye = torch.tensor(eye, dtype=torch.float32)
    target = torch.tensor([0.0, 0.0, 0.0])
    up = torch.tensor([0.0, 1.0, 0.0])
    f = target - eye
    f = f / torch.norm(f)
    r = torch.linalg.cross(f, up)
    r = r / torch.norm(r)
    u = torch.linalg.cross(r, f)
    c2w = torch.eye(4)
    # Columns: right, up, backward (-f), translation.
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = r, u, -f, eye
    return c2w
os.makedirs("novel_views", exist_ok=True)
for i in range(120):
    # Orbit the camera around the scene at radius 4 and height 1.
    angle = 2 * math.pi * i / 120
    eye = [4 * math.cos(angle), 1.0, 4 * math.sin(angle)]
    c2w = look_at(eye).to(device)
    with torch.no_grad():
        ro, rd = get_rays(H, W, fov, c2w, device)
        rgb = render_rays(model, ro.reshape(-1, 3), rd.reshape(-1, 3))
    img = rgb.reshape(H, W, 3).clamp(0, 1).cpu().numpy()
    Image.fromarray((img * 255).astype(np.uint8)).save(f"novel_views/view_{i:03d}.png")
    print("Rendered view", i)

Rendered results show accurate geometry for the cube and sphere, while empty regions exhibit speckle noise caused by density estimation errors, an artifact of using only the coarse network without fine‑level sampling.
References
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV 2020.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.