3-D Gaussian Splatting for Real-Time Radiance Field Rendering
My notes on the method introduced by Kerbl & Kopanas et al. 2023
Disclaimers:
1. These are my personal notes and may contain errors. The original paper by Kerbl & Kopanas et al. is available here and should be treated as the authoritative source on this topic.
2. I used Claude Opus 4.6 to directly apply light edits to my original notes on the paper, to suggest points where additional content was required for clarity, and to fact-check my claims for technical accuracy. You can view my original notes, prepared without any use of LLMs, here, and the conversation logs associated with this process here.
The big idea
Gaussian splatting takes the same inputs as comparable methods like NeRF (a set of images calibrated with Structure from Motion) and initializes a 3D Gaussian at each SfM point $p_i$ to represent the scene. 2-D views are produced by rasterizing 2-D projections of optimized versions of these Gaussians.

3D Gaussians are a good representation for a scene because they are:
- Differentiable, making their properties easy to optimize
- Easy to rasterize via 2D projection and alpha compositing
- Efficient on GPUs— (sidenote: The point clouds utilized here are anisotropic, meaning the points are not spheres but Gaussians that can be stretched into ellipsoids. ) splatting is straightforward thanks to sorting and alpha blending.
The authors claim that this method is preferable to NeRF since it avoids the costly and potentially noisy stochastic sampling involved in NeRF.
About 1–5 million Gaussians are sufficient to represent a scene.
Representing Gaussians for Differentiability
Define a Gaussian with covariance $\Sigma$ in world space centered at $\mu$ as
$$ G(x) = e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)} $$
Ideally, we’d just optimize $\Sigma$ directly to get 3D Gaussians representing the radiance field. But optimizing $\Sigma$ directly with gradient descent might break positive semi-definiteness and produce invalid covariance matrices.
(sidenote: In the original paper, the authors state that the covariance matrix of a 3-D Gaussian is analogous to describing the configuration of an ellipsoid, and immediately present its form in terms of rotation and scaling. While this formulation is intuitive, I was only able to convince myself of its validity when I understood it in terms of the spectral decomposition. ) Instead, we can exploit a nice property of the covariance matrix: it's symmetric! This means it admits a spectral decomposition:
$$ \Sigma = Q \Lambda Q^T $$
where the columns of the orthogonal matrix $Q$ are the eigenvectors of the covariance matrix (representing the principal axes of the 3D ellipsoid), and $\Lambda$ is a diagonal matrix of its eigenvalues (the variances along those axes).
For physical validity, we define scaling and rotation matrices $S$ and $R$ that we can optimize:
$$ \Sigma = R S S^T R^T $$
- We want $\Lambda$ to have only positive values, so we square the entries of the scaling matrix $S$ (i.e. $\Lambda = S S^T$), which keeps $\Sigma$ positive semi-definite by construction.
- The rotation matrix $R$ corresponds to $Q$.
We store $R$ as a unit quaternion $q$ and $S$ as a 3D scaling vector $s$ with $S = \text{diag}(s)$.
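Here's a minimal numpy sketch of this construction; the function names and the $(w, x, y, z)$ quaternion convention are my own choices, not taken from the paper's code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)  # re-normalize so R stays a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance_from_params(q, s):
    """Build Sigma = R S S^T R^T from a quaternion q and a 3D scale vector s."""
    R = quat_to_rotmat(q)
    S = np.diag(s)      # diagonal scaling matrix
    M = R @ S
    return M @ M.T      # symmetric and positive semi-definite by construction
```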
The paper explicitly derives the gradients of the 2D projection (defined below) with respect to $s$ and $q$, rather than relying on automatic differentiation.
Projecting Gaussians to 2D
The covariance matrix in camera coordinates will be:
$$ \Sigma' = J W \Sigma W^T J^T $$
This formula arises because we need to apply two transformations to the variable $x$: a viewing transformation $W$ that converts coordinates from world space to camera space, and the Jacobian $J$ of the affine approximation of the projective transformation to 2-D space.
If we apply a combined transformation $A$ to a random vector $x$ with covariance matrix $\Sigma$, its new covariance matrix is $A \Sigma A^T$. Therefore, the covariance matrix of our Gaussian in camera coordinates will be $(JW) \Sigma (JW)^T = JW \Sigma W^T J^T$.
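A rough numpy sketch of this projection, assuming a plain pinhole camera with focal lengths `fx`, `fy` (the real implementation additionally clamps the projected center to the frustum; the $2 \times 3$ Jacobian below directly yields the $2 \times 2$ image-space covariance):

```python
import numpy as np

def project_covariance(Sigma, mean_world, W, t, fx, fy):
    """Compute Sigma' = J W Sigma W^T J^T for one Gaussian.

    W, t: rotation and translation of the world-to-camera transform.
    fx, fy: focal lengths in pixels (assumed pinhole camera model).
    """
    cx, cy, cz = W @ mean_world + t          # Gaussian center in camera coordinates
    # Jacobian of the perspective projection (u, v) = (fx * x / z, fy * y / z),
    # linearized at the Gaussian center (the affine approximation).
    J = np.array([
        [fx / cz, 0.0,     -fx * cx / cz**2],
        [0.0,     fy / cz, -fy * cy / cz**2],
    ])
    M = J @ W
    return M @ Sigma @ M.T                   # 2x2 covariance of the splatted 2D Gaussian
```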
Optimizing 3D Gaussians
Optimization combines gradient-based updates of existing Gaussian parameters with adaptive density control.
For each Gaussian, we optimize:
- $\Sigma$ (represented by $q$ and $s$)
- Position $p$
- Opacity $\alpha$
- Spherical harmonic coefficients for the color $c$

Implementation details:
- $\alpha$ is obtained by passing a learnable value through a sigmoid, constraining it to the $[0, 1)$ range and giving smooth gradients.
- Scale values are exponentiated to keep them positive.
- $q$ is normalized to remain a valid rotation.
- The learning rate for $p$ uses exponential decay scheduling.
- The loss is $\mathcal{L} = (1 - \lambda)\mathcal{L}_1 + \lambda\mathcal{L}_{\text{D-SSIM}}$ with $\lambda = 0.2$.
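A small PyTorch sketch of this objective. `ssim_fn` is a stand-in for any differentiable SSIM implementation (not specified here), and I'm treating D-SSIM as $1 - \text{SSIM}$:

```python
import torch.nn.functional as F

def training_loss(rendered, target, ssim_fn, lam=0.2):
    """L = (1 - lambda) * L1 + lambda * D-SSIM, with lambda = 0.2."""
    l1 = F.l1_loss(rendered, target)
    d_ssim = 1.0 - ssim_fn(rendered, target)   # ssim_fn: assumed differentiable SSIM helper
    return (1.0 - lam) * l1 + lam * d_ssim
```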
Color $c$ with spherical harmonics:
- Color is view-dependent, so we can’t just store RGB scalars. Instead, a set of spherical harmonic coefficients is stored per color channel.
- To get the RGB value for a given viewing angle, we take the linear combination of the harmonic basis functions for that angle with the learned coefficients.
- Intuitively, the 0th-order term (constant over all viewing directions) represents diffuse color; higher-order terms capture view-dependent appearance.
- It doesn’t make sense to learn higher-order terms before lower-order ones, so one band of coefficients is introduced every 1000 iterations until all bands are represented.
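To make the linear-combination step concrete, here's a numpy sketch restricted to degrees 0 and 1. The constants are the standard real spherical-harmonic values; the signs on the degree-1 terms follow one common convention, and the trailing offset/clamp is typical of implementations rather than something stated in the paper.

```python
import numpy as np

C0 = 0.28209479177387814   # degree-0 basis constant
C1 = 0.4886025119029199    # magnitude of the degree-1 basis constants

def sh_to_rgb(coeffs, view_dir):
    """Evaluate view-dependent color from SH coefficients (degrees 0-1 only).

    coeffs: (4, 3) array, one row of RGB coefficients per basis function.
    """
    x, y, z = np.asarray(view_dir, dtype=float) / np.linalg.norm(view_dir)
    color = C0 * coeffs[0]                                                 # diffuse, view-independent part
    color = color + C1 * (-y * coeffs[1] + z * coeffs[2] - x * coeffs[3])  # degree-1, view-dependent part
    return np.clip(color + 0.5, 0.0, 1.0)                                  # typical offset and clamp
```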
Adaptive Density Control
SfM gives us only a sparse set of initial Gaussians, so we need to densify. After an initial warm-up, every 100 iterations the method densifies the scene and removes Gaussians that are effectively transparent (very low $\alpha$).
Large gradients (indicating bad reconstructions) emerge in view space in two kinds of regions:
- Under-reconstruction: missing geometric features. These are addressed by cloning the Gaussian and moving the copy in the direction of the positional gradient.
- Over-reconstruction: a single Gaussian covers too large an area. These are split into two new Gaussians (with positions sampled from the PDF of the original Gaussian), and their scale is divided by an experimentally determined factor $\phi = 1.6$.

Specifically, we apply these updates to Gaussians whose average positional gradient exceeds an experimentally determined threshold $\tau_\text{pos} = 0.0002$.
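Here's a Python sketch of the clone-or-split decision for a single Gaussian. The `gaussian` object, its methods, and the `scale_threshold` argument are hypothetical stand-ins for whatever bookkeeping a real implementation uses; only the values $\tau_\text{pos}$ and $\phi$ come from the paper.

```python
import numpy as np

TAU_POS = 0.0002   # average positional-gradient threshold
PHI = 1.6          # scale reduction factor used when splitting

def densify_one(gaussian, avg_pos_grad, scale_threshold):
    """Return the Gaussians that should replace `gaussian` after one densification step."""
    if np.linalg.norm(avg_pos_grad) <= TAU_POS:
        return [gaussian]                      # reconstruction looks fine; leave it alone
    if gaussian.scale.max() <= scale_threshold:
        # Under-reconstruction: clone, nudging the copy along the positional gradient.
        copy = gaussian.cloned()
        copy.position = gaussian.position + avg_pos_grad
        return [gaussian, copy]
    # Over-reconstruction: split into two smaller Gaussians whose positions are
    # sampled from the original Gaussian's own PDF.
    children = []
    for _ in range(2):
        child = gaussian.cloned()
        child.position = np.random.multivariate_normal(gaussian.position, gaussian.covariance())
        child.scale = gaussian.scale / PHI
        children.append(child)
    return children
```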
Practical fixes during training
- Gaussians close to the camera can get stuck, causing spurious increases in density. To address this, $\alpha$ is reset to near 0 every 3000 iterations; the optimization then increases $\alpha$ again for the Gaussians that matter, while those that stay transparent are culled.
- Gaussians that are very large in world space, or that have a large footprint in view space, are periodically removed.
- For stability, training starts at 4 $\times$ smaller image resolution and upsamples twice (after 250 and 500 iterations).
Differentiable Rasterization
The screen is split into $16 \times 16$ tiles. 3D Gaussians are culled against the view frustum and each tile. Additionally, a guard band is used to reject Gaussians at extreme positions (means close to the near plane and far outside the frustum), since computing their projected 2D covariance would be unstable.
Each Gaussian is instantiated once for every tile it overlaps. Each instance is assigned a sort key combining its tile ID (high bits) and view-space depth (low bits), and the full set is sorted with a single GPU radix sort. After sorting, a per-tile list is obtained by identifying the first and last depth-sorted entries for each tile.
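A toy illustration of the key construction; the bit widths and the depth encoding here are mine (the real implementation packs the raw floating-point depth bits):

```python
def make_sort_key(tile_id, depth, depth_bits=32):
    """Pack the tile ID into the high bits and depth into the low bits of one key.

    A single radix sort over all (instance, key) pairs then groups instances by
    tile and orders them front to back within each tile. Assumes depth has been
    normalized to [0, 1].
    """
    depth_key = int(depth * ((1 << depth_bits) - 1))
    return (tile_id << depth_bits) | depth_key
```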
One thread block is launched per tile for rasterization. Packets of Gaussians are loaded into shared memory, and for each pixel, color and $\alpha$ are accumulated by traversing the list front to back. A thread stops when its pixel reaches a target saturation of alpha. At regular intervals, threads in a tile are queried, and the entire tile’s thread block terminates when all pixels have saturated.
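A serial CPU sketch of the per-pixel front-to-back accumulation (the real version runs per tile in CUDA with shared memory; `g.alpha(pixel)` and `g.color(pixel)` are hypothetical helpers that evaluate the projected Gaussian at that pixel):

```python
import numpy as np

def blend_pixel(sorted_gaussians, pixel):
    """Front-to-back alpha compositing for a single pixel."""
    color = np.zeros(3)
    T = 1.0                                 # transmittance = 1 - accumulated opacity
    for g in sorted_gaussians:              # already sorted front to back by depth
        a = min(g.alpha(pixel), 0.99)       # clamp for numerical stability (see backward pass)
        if a < 1.0 / 255.0:                 # skip negligible contributions
            continue
        if T * (1.0 - a) < 1e-4:            # blending would push accumulated opacity past 0.9999
            break
        color += T * a * g.color(pixel)
        T *= 1.0 - a
    return color
```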

Backward pass: the per-tile lists are traversed back to front, starting from the last point that affected any pixel. Each pixel only begins processing points whose depth is less than or equal to the depth of the last point that contributed to its color during the forward pass.
Gradient computation requires intermediate accumulated opacity values from each step of the forward blending. Rather than storing explicit per-step opacity lists, the method stores only the total accumulated opacity at the end of the forward pass and recovers intermediate values by dividing by each point's $\alpha$ during the back-to-front traversal. They avoid numerical instabilities by clamping $\alpha$ at a maximum of 0.99 and skipping any blending updates with $\alpha < \epsilon = \frac{1}{255}$. During forward rasterization, they do not include a Gaussian if blending it in would cause accumulated opacity to exceed 0.9999.
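The way I picture this recovery, phrased in terms of transmittance (one minus accumulated opacity) rather than the paper's exact wording; a sketch of the idea, not their implementation:

```python
def recover_transmittances(final_T, alphas_front_to_back):
    """Walk back to front, undoing each blending step to recover the transmittance
    each point saw during the forward pass.

    final_T is the only value stored by the forward pass; clamping alpha at 0.99
    keeps the division well-conditioned.
    """
    T = final_T
    per_point_T = []
    for a in reversed(alphas_front_to_back):
        T = T / (1.0 - a)        # transmittance just before this point was blended
        per_point_T.append(T)
    return list(reversed(per_point_T))
```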