Unlocking 3D Scene Synthesis: A Deep Dive into Neural Radiance Fields (NeRF)

This article explains the core principles of Neural Radiance Fields, detailing how a fully-connected network maps 5-D coordinates to color and density and the role of positional encoding and hierarchical sampling, and walks through a complete PyTorch implementation with training and rendering examples.

Neural Radiance Fields (NeRF) represent a 3D scene with a fully‑connected neural network that maps a 5‑dimensional input—spatial coordinates (x, y, z) and view direction (θ, φ)—to RGB color and volume density (σ). Training uses multiple photographs of a single scene, deliberately over‑fitting the network so it becomes an expert for that specific environment.

Core Concepts

The network treats color as a function of both position and view direction, while density depends only on position, on the assumption that a material's opacity does not change with viewpoint. This separation reduces model complexity and keeps the recovered geometry consistent across views.
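
In symbols, the separation above reads (in the notation of the original paper):

F_\Theta(\mathbf{x}, \mathbf{d}) = (\mathbf{c}, \sigma), \qquad \sigma = \sigma(\mathbf{x}), \quad \mathbf{c} = \mathbf{c}(\mathbf{x}, \mathbf{d}).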

Related Work

Before NeRF, discrete representations such as voxels and meshes outperformed neural scene encodings. Early attempts mapped coordinates to occupancy or distance fields but were limited to synthetic datasets like ShapeNet. NeRF introduced differentiable volumetric rendering to achieve high‑quality view synthesis.

Scene Representation Mechanism

Input vectors are split into position x = (x, y, z) and viewing direction d, expressed as (θ, φ) in the paper and as a 3-D unit vector in the code below. An MLP with eight fully-connected layers processes the encoded position and outputs the density σ together with a 256-dimensional feature vector; this feature is concatenated with the encoded direction and passed through a final branch to produce the RGB value.

Volume Rendering

Rendering proceeds in three steps: sample points along each ray, predict color and density with the MLP, and composite these values with the volume-rendering equation to obtain the final pixel color. The integral has no closed form because both color and density are network outputs, so numerical quadrature with hierarchical sampling is used.
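
Concretely, with sample depths t_i along a ray \mathbf{r}, spacings \delta_i = t_{i+1} - t_i, and per-sample network outputs (\mathbf{c}_i, \sigma_i), the paper's quadrature rule is

\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),

which is exactly what render_rays computes in the implementation below.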

Positional Encoding

Directly feeding raw 5-D coordinates limits the network to low-frequency detail. NeRF therefore applies a high-frequency Fourier mapping (sin/cos at geometrically spaced frequencies) to each coordinate, expanding it to a higher-dimensional space (e.g., L = 4 on a 3-D input yields 24 encoded dimensions in total), which lets the network capture fine detail.
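
Each scalar coordinate p is mapped through

\gamma(p) = \big(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\big),

so a 3-D input encoded with L frequencies expands to 2 \cdot 3 \cdot L dimensions; the implementation below additionally concatenates the raw coordinates.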

Hierarchical Sampling

Uniform sampling wastes computation on empty space. NeRF trains a coarse network to estimate where density is high, then performs importance sampling (inverse-transform sampling of the coarse weights) in those regions with a fine network, reducing variance and focusing compute on the important parts of the scene; a minimal sketch of the sampler appears after render_rays below.

Implementation

The following PyTorch code implements the key components.

import os, json, math
import numpy as np
from PIL import Image
import torch
import torch.nn as nn
import torch.nn.functional as F

def positional_encoding(x, L):
    # Fourier feature mapping: scale each coordinate by frequencies
    # 2^0 * pi, ..., 2^(L-1) * pi, then take sin and cos of every product.
    freqs = (2.0 ** torch.arange(L, device=x.device)) * math.pi
    xb = x[..., None, :] * freqs[:, None]      # (..., L, C)
    xb = xb.reshape(*x.shape[:-1], -1)         # (..., L * C)
    return torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)  # (..., 2 * L * C)

def get_rays(H, W, camera_angle_x, c2w, device):
    # Focal length in pixels from the horizontal field of view.
    fx = 0.5 * W / math.tan(0.5 * camera_angle_x)
    cx = (W - 1) * 0.5
    cy = (H - 1) * 0.5
    i, j = torch.meshgrid(torch.arange(W, device=device), torch.arange(H, device=device), indexing="xy")
    i, j = i.float(), j.float()
    # Per-pixel ray directions in camera space; the camera looks down its -z axis.
    x = (i - cx) / fx
    y = -(j - cy) / fx
    z = -torch.ones_like(x)
    dirs = torch.stack([x, y, z], dim=-1)
    dirs = dirs / torch.norm(dirs, dim=-1, keepdim=True)
    # Rotate directions into world space; every ray originates at the camera position.
    R, t = c2w[:3, :3], c2w[:3, 3]
    rd = dirs @ R.T
    ro = t.expand_as(rd)
    return ro, rd

class NeRF(nn.Module):
    def __init__(self, L_pos=10, L_dir=4, hidden=256):
        super().__init__()
        in_pos = 3 + 2 * L_pos * 3  # raw xyz plus sin/cos pairs for L_pos frequencies
        in_dir = 3 + 2 * L_dir * 3  # direction is a 3-D unit vector, encoded likewise
        self.fc1 = nn.Linear(in_pos, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.fc4 = nn.Linear(hidden, hidden)
        self.fc5 = nn.Linear(hidden + in_pos, hidden)
        self.fc6 = nn.Linear(hidden, hidden)
        self.fc7 = nn.Linear(hidden, hidden)
        self.fc8 = nn.Linear(hidden, hidden)
        self.sigma = nn.Linear(hidden, 1)            # view-independent density head
        self.feat = nn.Linear(hidden, hidden)        # feature passed to the color branch
        self.rgb1 = nn.Linear(hidden + in_dir, 128)  # view-dependent color branch
        self.rgb2 = nn.Linear(128, 3)
        self.L_pos, self.L_dir = L_pos, L_dir
    def forward(self, x, d):
        x_enc = torch.cat([x, positional_encoding(x, self.L_pos)], dim=-1)
        d_enc = torch.cat([d, positional_encoding(d, self.L_dir)], dim=-1)
        h = F.relu(self.fc1(x_enc))
        h = F.relu(self.fc2(h))
        h = F.relu(self.fc3(h))
        h = F.relu(self.fc4(h))
        h = torch.cat([h, x_enc], dim=-1)  # skip connection
        h = F.relu(self.fc5(h))
        h = F.relu(self.fc6(h))
        h = F.relu(self.fc7(h))
        h = F.relu(self.fc8(h))
        sigma = F.relu(self.sigma(h))         # density is non-negative
        feat = self.feat(h)
        h = torch.cat([feat, d_enc], dim=-1)  # direction influences color only
        h = F.relu(self.rgb1(h))
        rgb = torch.sigmoid(self.rgb2(h))     # color constrained to [0, 1]
        return rgb, sigma

def render_rays(model, ro, rd, near=2.0, far=6.0, N=64):
    # Uniform (coarse) sampling of depths along every ray.
    t = torch.linspace(near, far, N, device=ro.device)
    pts = ro[:, None, :] + rd[:, None, :] * t[None, :, None]
    dirs = rd[:, None, :].expand_as(pts)
    rgb, sigma = model(pts.reshape(-1, 3), dirs.reshape(-1, 3))
    rgb = rgb.reshape(ro.shape[0], N, 3)
    sigma = sigma.reshape(ro.shape[0], N)
    # Spacing between adjacent samples; the last interval is effectively infinite.
    delta = t[1:] - t[:-1]
    delta = torch.cat([delta, torch.tensor([1e10], device=ro.device)])
    # Alpha compositing: per-segment opacity and accumulated transmittance.
    alpha = 1 - torch.exp(-sigma * delta)
    T = torch.cumprod(torch.cat([torch.ones((ro.shape[0], 1), device=ro.device), 1 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = T * alpha
    # Numerical quadrature of the rendering integral: expected color per ray.
    return (weights[..., None] * rgb).sum(dim=1)
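
The code above implements only the coarse, uniform pass. A minimal sketch of the inverse-transform sampler described under Hierarchical Sampling might look as follows; sample_pdf, bins, and N_fine are names introduced here for illustration, and render_rays would additionally have to return its weights and sample depths t to drive a fine pass:

def sample_pdf(bins, weights, N_fine):
    # bins: (B, M+1) depth-bin edges; weights: (B, M) coarse rendering weights.
    weights = weights + 1e-5                                        # avoid zero-probability bins
    pdf = weights / weights.sum(dim=-1, keepdim=True)               # piecewise-constant PDF
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # (B, M+1), starts at 0
    u = torch.rand(cdf.shape[0], N_fine, device=cdf.device)         # uniform draws in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    lo, hi = idx - 1, idx
    cdf_lo, cdf_hi = torch.gather(cdf, -1, lo), torch.gather(cdf, -1, hi)
    b_lo, b_hi = torch.gather(bins, -1, lo), torch.gather(bins, -1, hi)
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo).clamp(min=1e-5)         # invert the CDF linearly
    return b_lo + frac * (b_hi - b_lo)                              # (B, N_fine) fine-sample depths

The fine network would then be evaluated at the union of coarse and fine depths, with the same compositing as in render_rays producing the final color.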

The training loop picks a random training view, samples a random batch of its rays, renders them with the coarse model, and minimizes the mean-squared error against the ground-truth RGB values. PSNR, computed as -10 * log10(MSE) for pixel values in [0, 1], is logged every 200 iterations.

device = "cuda" if torch.cuda.is_available() else "cpu"
# load_dataset is assumed (not shown) to return images (N, H, W, 3) in [0, 1],
# camera-to-world poses (N, 4, 4), the image size, and the horizontal FOV in radians.
images, c2ws, H, W, fov = load_dataset("nerf_synth_cube_sphere")
images, c2ws = images.to(device), c2ws.to(device)
model = NeRF().to(device)
opt = torch.optim.Adam(model.parameters(), lr=5e-4)
for it in range(1, 5001):
    idx = torch.randint(0, images.shape[0], (1,)).item()  # random training view
    ro, rd = get_rays(H, W, fov, c2ws[idx], device)
    gt = images[idx].reshape(-1, 3)
    sel = torch.randint(0, H * W, (2048,), device=device)  # random batch of rays
    pred = render_rays(model, ro.reshape(-1, 3)[sel], rd.reshape(-1, 3)[sel])
    loss = F.mse_loss(pred, gt[sel])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if it % 200 == 0:
        psnr = -10 * torch.log10(loss).item()
        print(f"Iter {it} | Loss {loss.item():.6f} | PSNR {psnr:.2f} dB")
torch.save(model.state_dict(), "nerf_cube_sphere_coarse.pth")

Novel‑View Synthesis

After training, the model can render unseen viewpoints by constructing camera poses with a simple look_at function and feeding the generated rays through the rendering pipeline.

def look_at(eye):
    # Camera-to-world pose whose -z axis points from eye toward the origin,
    # matching the convention used in get_rays.
    eye = torch.tensor(eye, dtype=torch.float32)
    target = torch.tensor([0.0, 0.0, 0.0])
    up = torch.tensor([0.0, 1.0, 0.0])
    f = target - eye; f = f / torch.norm(f)               # forward
    r = torch.linalg.cross(f, up); r = r / torch.norm(r)  # right
    u = torch.linalg.cross(r, f)                          # camera up
    c2w = torch.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = r, u, -f, eye
    return c2w

os.makedirs("novel_views", exist_ok=True)
for i in range(120):
    angle = 2 * math.pi * i / 120
    eye = [4 * math.cos(angle), 1.0, 4 * math.sin(angle)]
    c2w = look_at(eye).to(device)
    with torch.no_grad():
        ro, rd = get_rays(H, W, fov, c2w, device)
        # All H*W rays are rendered in one call; chunk them for larger images.
        rgb = render_rays(model, ro.reshape(-1, 3), rd.reshape(-1, 3))
    img = rgb.reshape(H, W, 3).clamp(0, 1).cpu().numpy()
    Image.fromarray((img * 255).astype(np.uint8)).save(f"novel_views/view_{i:03d}.png")
    print("Rendered view", i)

Rendered results show accurate geometry for the cube and sphere, while empty regions exhibit speckle noise due to density estimation errors—an artifact of using only the coarse network without fine‑level sampling.

References

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV).
