
Representing Scenes as Neural Radiance Fields for View Synthesis

My notes on the core ideas of the NeRF technique introduced by Mildenhall et al. (2020).

Core Idea

A NeRF represents a scene as a function $f: (x, y, z, \theta, \phi) \to (\sigma, \mathbf{c})$, where $f$ is an MLP, $\mathbf{c} = (r, g, b)$ is color, and $\sigma$ is volume density. In practice, the viewing direction $(\theta, \phi)$ is represented as a 3D Cartesian unit vector $\mathbf{d} = (\sin\theta\cos\phi, \sin\theta\sin\phi, \cos\theta)$.
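
As a tiny illustration of the input convention, here is a sketch (PyTorch; the function name is mine) of converting the angular viewing direction into the Cartesian unit vector $\mathbf{d}$ fed to the network:

```python
import torch

def direction_to_unit_vector(theta: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    """Convert viewing angles (theta, phi) into the unit vector
    d = (sin(theta)cos(phi), sin(theta)sin(phi), cos(theta)) used as MLP input."""
    return torch.stack(
        (
            torch.sin(theta) * torch.cos(phi),
            torch.sin(theta) * torch.sin(phi),
            torch.cos(theta),
        ),
        dim=-1,
    )
```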

Once we have a trained $f$, we render a 2D view with the following process:

  1. March rays through the scene to get a sampled set of 3D points (ray generation is sketched after this list).
  2. Use the $(x, y, z)$ position and $(\theta, \phi)$ direction at each point as input to $f$.
  3. Use classic volume rendering techniques to accumulate the output colors and densities into a 2D image.
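
For step 1, a minimal ray-generation sketch, assuming a pinhole camera with known focal length in pixels and a 3×4 camera-to-world pose matrix, following the usual NeRF convention in which the camera looks down the $-z$ axis (the function name is mine):

```python
import torch

def get_rays(H, W, focal, c2w):
    """Return one ray origin and direction per pixel of an H x W image.

    c2w: 3x4 camera-to-world matrix; focal: focal length in pixels (assumed known)."""
    i, j = torch.meshgrid(
        torch.arange(W, dtype=torch.float32),
        torch.arange(H, dtype=torch.float32),
        indexing="xy",
    )
    # Per-pixel directions in camera coordinates (camera looks down -z).
    dirs = torch.stack(
        [(i - 0.5 * W) / focal, -(j - 0.5 * H) / focal, -torch.ones_like(i)], dim=-1
    )
    # Rotate directions into world coordinates; the origin is the camera center.
    rays_d = torch.sum(dirs[..., None, :] * c2w[:3, :3], dim=-1)
    rays_o = c2w[:3, -1].expand(rays_d.shape)
    return rays_o, rays_d
```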

Training is straightforward: optimize a fresh MLP for each scene using gradient descent, given a set of observed images and known camera poses.

Why NeRF over Previous Approaches?

Previous implicit 3D representations (signed distance functions, occupancy fields) struggle to capture scenes with complex geometry and tend to produce oversmoothed renderings. NeRF addresses this by modeling a full 5D radiance field rather than just 3D geometry.

Differentiable rasterizers for mesh representations struggle due to local minima and poor conditioning of the loss landscape. Volumetric approaches that sample onto finite grids are limited in resolution by their discrete sampling. NeRF instead uses continuous sampling through an MLP.

Rendering with a NeRF

Let a camera ray with origin $\mathbf{o}$ and direction $\mathbf{d}$ be parametrized by $t$, so that $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$. The expected color of a ray with near bound $t_n$ and far bound $t_f$ is

$$ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt $$

where $\sigma(\mathbf{r}(t))$ and $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$ are the outputs of the MLP $f$ at that point, and $T(t)$ is the accumulated transmittance from $t_n$ to $t$, i.e., the probability that the ray travels this far without hitting any other particle:

$$ T(t) = \exp\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right) $$

Intuitively, this is a density-weighted average of colors along the ray.

The representation is encouraged to be multiview consistent by restricting the network to predict $\sigma$ as a function of only the 3D location $(x, y, z)$, while allowing color $\mathbf{c}$ to depend on both location and viewing direction. Volume density should not change with viewing angle, but color may (e.g., specular reflections).
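
A minimal sketch of this constraint (PyTorch; the layer counts, widths, and default input sizes are my own simplifications, not the paper's exact architecture). The density head sees only position features, while the color head additionally receives the encoded viewing direction:

```python
import torch
import torch.nn as nn

class RadianceFieldMLP(nn.Module):
    """Density depends only on position; color depends on position and viewing direction."""

    def __init__(self, pos_dim=63, dir_dim=27, width=256):
        # Default input sizes assume encoded inputs with the raw values concatenated:
        # 3 + 2*3*10 = 63 for position, 3 + 2*3*4 = 27 for direction (a common choice).
        super().__init__()
        # Trunk processes only the (encoded) 3D position.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)  # volume density: position only
        # Color head additionally receives the (encoded) viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        h = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)  # density is non-negative
        rgb = self.color_head(torch.cat([h, dir_enc], dim=-1))
        return sigma, rgb
```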

Discrete Approximation

In practice (during both training and rendering), the integral is estimated by quadrature over a discrete set of samples:

$$ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \cdot (1 - \exp(-\sigma_i \delta_i)) \cdot \mathbf{c}_i, \quad T_i = \exp\left(-\sum_{j=1}^{i-1}\sigma_j\delta_j\right) $$

where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples. This is trivially differentiable with respect to $(\mathbf{c}_i, \sigma_i)$ and reduces to standard alpha compositing with $\alpha_i = 1 - \exp(-\sigma_i\delta_i)$.
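
A sketch of this quadrature as alpha compositing (PyTorch; tensor shapes and the padding of the last interval are assumptions in the style of common implementations, not part of the equation above). It also returns the per-sample weights $T_i \alpha_i$, which the hierarchical sampling described below reuses:

```python
import torch

def composite(sigma, rgb, t_vals):
    """Accumulate per-sample densities and colors along each ray into a pixel color.

    sigma: (num_rays, N), rgb: (num_rays, N, 3), t_vals: (num_rays, N) distances along the ray."""
    deltas = t_vals[..., 1:] - t_vals[..., :-1]                  # delta_i = t_{i+1} - t_i
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)  # pad last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # alpha_i = 1 - exp(-sigma_i * delta_i)
    # T_i = prod_{j < i} (1 - alpha_j): transmittance up to sample i (exclusive cumulative product).
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha], dim=-1), dim=-1
    )[..., :-1]
    weights = trans * alpha                                      # w_i = T_i * alpha_i
    color = torch.sum(weights[..., None] * rgb, dim=-2)          # C_hat(r) = sum_i w_i c_i
    return color, weights
```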

To avoid the network only learning at a fixed set of locations from deterministic sampling, stratified sampling is used: the interval $[t_n, t_f]$ is partitioned into bins, and one sample is drawn uniformly at random from each bin.
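
A sketch of the stratified sampler (PyTorch; argument names are mine):

```python
import torch

def stratified_samples(t_near, t_far, num_samples, num_rays):
    """Partition [t_near, t_far] into num_samples bins and draw one uniform sample per bin."""
    edges = torch.linspace(t_near, t_far, num_samples + 1)  # bin edges, shape (num_samples + 1,)
    lower, upper = edges[:-1], edges[1:]
    u = torch.rand(num_rays, num_samples)                    # fresh jitter for every ray and bin
    return lower + (upper - lower) * u                       # shape (num_rays, num_samples)
```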

Optimization 1: Positional Encoding

A positional encoding $\gamma$ is applied to each of the normalized $(x, y, z)$ coordinates and the viewing direction components before they are passed to the MLP. The encoding maps each scalar $p$ to:

$$ \gamma(p) = \left(\sin(2^0\pi p),\,\cos(2^0\pi p),\,\dots,\,\sin(2^{L-1}\pi p),\,\cos(2^{L-1}\pi p)\right) $$

The motivation is that deep networks are biased toward learning lower-frequency functions (sidenote: discussed by Rahaman et al., 2019 in 'On the Spectral Bias of Neural Networks'). Mapping inputs to a higher-dimensional space using high-frequency functions before passing them to the network enables better fitting of data with high-frequency variation.

This encoding is similar to the one used in Transformers, but serves a different purpose. Transformers use positional encoding to break permutation invariance in sequence models. Here, it maps continuous coordinates into a higher-dimensional space so the MLP can more easily approximate high-frequency scene content.
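
A sketch of $\gamma$ (PyTorch). The paper uses $L = 10$ for the position and $L = 4$ for the viewing direction; whether the raw input is concatenated alongside the encoding is an implementation choice:

```python
import math
import torch

def positional_encoding(p, num_freqs):
    """Map each scalar coordinate to sin(2^k pi p) and cos(2^k pi p) for k = 0 .. L-1.

    The output groups all sines before all cosines per coordinate, a reordering of the
    written formula that carries the same information.
    p: (..., D) normalized coordinates; returns (..., 2 * D * num_freqs)."""
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi               # 2^k * pi for k = 0 .. L-1
    angles = p[..., None] * freqs                                    # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., D, 2L)
    return enc.flatten(start_dim=-2)                                 # (..., 2 * D * L)
```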

(Figure: renderings with and without high-frequency positional encodings.)

Optimization 2: Hierarchical Sampling

Naively querying points along each camera ray is inefficient, since many samples fall in free space or in occluded regions that contribute nothing to the rendered image. A prior over which points to sample is built through a hierarchical approach.

Two networks are optimized simultaneously, one coarse and one fine. First, the coarse network is evaluated at $N_c$ stratified sample locations along each ray. The resulting color is written as a weighted sum:

$$ \hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i \mathbf{c}_i $$

where

$$ w_i = T_i(1 - \exp(-\sigma_i \delta_i)). $$

The weights $w_i$ are normalized into a PDF over the ray. A second set of $N_f$ sample locations is drawn from this distribution via inverse transform sampling. The fine network is then evaluated at the union of the coarse and fine sample sets.
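
A sketch of the inverse-transform sampler, modeled loosely on the `sample_pdf` routine found in typical NeRF implementations (PyTorch; argument names and shapes are assumptions, and the bin boundaries along each ray are assumed given):

```python
import torch

def sample_pdf(bin_edges, weights, num_fine, eps=1e-5):
    """Draw num_fine samples per ray from the piecewise-constant PDF defined by the coarse weights.

    bin_edges: (num_rays, N_c + 1) bin boundaries along each ray,
    weights:   (num_rays, N_c) coarse weights w_i (need not be normalized)."""
    pdf = (weights + eps) / torch.sum(weights + eps, dim=-1, keepdim=True)   # normalize into a PDF
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)           # prepend 0 -> (num_rays, N_c + 1)
    u = torch.rand(cdf.shape[0], num_fine)                                   # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1) # locate each u in the CDF
    # Invert the CDF by linear interpolation inside the selected bin.
    cdf_lo = torch.gather(cdf, -1, idx - 1)
    cdf_hi = torch.gather(cdf, -1, idx)
    edge_lo = torch.gather(bin_edges, -1, idx - 1)
    edge_hi = torch.gather(bin_edges, -1, idx)
    frac = (u - cdf_lo) / torch.clamp(cdf_hi - cdf_lo, min=eps)
    return edge_lo + frac * (edge_hi - edge_lo)
```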

Training

The loss is the total squared error between rendered and true pixel colors, summed over both the coarse and fine renderings:

$$ \mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[\left\|\hat{C}_c(\mathbf{r}) - C(\mathbf{r})\right\|_2^2 + \left\|\hat{C}_f(\mathbf{r}) - C(\mathbf{r})\right\|_2^2\right] $$

Both terms are included even though only the fine model produces the final rendering, because the coarse network’s weight distribution is used to allocate samples for the fine network.
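
Transcribed directly (PyTorch; tensor shapes are assumptions), for one batch of sampled rays:

```python
import torch

def nerf_loss(rgb_coarse, rgb_fine, target_rgb):
    """Total squared error of coarse and fine renderings against the true pixel colors.

    All inputs: (num_rays, 3) tensors for one batch of rays."""
    return (torch.sum((rgb_coarse - target_rgb) ** 2)
            + torch.sum((rgb_fine - target_rgb) ** 2))
```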

Camera poses are estimated beforehand with COLMAP's structure-from-motion pipeline.