Scalable Backpropagation for Gaussian Processes using Celerite

Published February 2018. © 2018. The American Astronomical Society. All rights reserved.
Citation: Daniel Foreman-Mackey 2018, Res. Notes AAS, 2, 31. DOI: 10.3847/2515-5172/aaaf6c



1. Introduction

This research note presents a derivation and implementation of efficient and scalable gradient computations using the celerite algorithm for Gaussian Process (GP) modeling. The algorithms are derived in a "reverse accumulation" or "backpropagation" framework, and they can be easily integrated into automatic differentiation frameworks to provide a scalable method for evaluating gradients of the GP likelihood. The algorithms derived in this note use less memory and are more efficient than versions generated by automatic differentiation, and their cost scales linearly with the number of data points.

GPs (Rasmussen & Williams 2006) are a class of models used extensively in the astrophysical literature to model stochastic processes. The applications are broad-ranging with examples from time domain variability to the cosmic microwave background (see the references in Foreman-Mackey et al. 2017). In these applications, the calculation and optimization of the GP marginalized likelihood function

$\mathrm{ln}\,L({\boldsymbol{\theta }},\,{\boldsymbol{\alpha }})=-\frac{1}{2}\,{({\boldsymbol{y}}-{{\boldsymbol{\mu }}}_{{\boldsymbol{\theta }}})}^{{\rm{T}}}\,{{K}_{{\boldsymbol{\alpha }}}}^{-1}\,({\boldsymbol{y}}-{{\boldsymbol{\mu }}}_{{\boldsymbol{\theta }}})-\frac{1}{2}\,\mathrm{ln}\,\det {K}_{{\boldsymbol{\alpha }}}-\frac{N}{2}\,\mathrm{ln}\,2\pi \quad (1)$

is the computational bottleneck; here ${\boldsymbol{y}}$ is the vector of N data points, ${{\boldsymbol{\mu }}}_{{\boldsymbol{\theta }}}$ is the mean model with parameters ${\boldsymbol{\theta }}$, and ${K}_{{\boldsymbol{\alpha }}}$ is the N × N covariance matrix parameterized by ${\boldsymbol{\alpha }}$. Every likelihood evaluation requires the log-determinant and inverse of ${K}_{{\boldsymbol{\alpha }}}$. The cost scales as ${ \mathcal O }({N}^{3})$ in general, but the celerite method can be used to compute Equation (1) for a class of models with ${ \mathcal O }(N)$ scaling (Foreman-Mackey et al. 2017).
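For reference, a direct evaluation of Equation (1) with a dense Cholesky factorization takes only a few lines of numpy, but inherits the ${ \mathcal O }({N}^{3})$ cost; the following is a minimal, illustrative sketch (the function and variable names are not part of the celerite API):

import numpy as np

def gp_log_likelihood_dense(K, y, mu):
    # Direct evaluation of Equation (1) with a dense Cholesky factorization.
    # K is the N x N covariance matrix, y the data vector, mu the mean model.
    r = y - mu
    L = np.linalg.cholesky(K)                      # O(N^3) factorization
    alpha = np.linalg.solve(L, r)                  # solve L alpha = r
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # ln det K from the Cholesky factor
    return -0.5 * (alpha @ alpha + log_det + len(y) * np.log(2 * np.pi))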

The details can be found in Foreman-Mackey et al. (2017), and the only difference in notation is that all the matrices in what follows are the "pre-conditioned" matrices, indicated with a tilde by Foreman-Mackey et al. (2017). The tilde is not included here for simplicity. I use the symbol P for the (N − 1) × J pre-conditioning matrix that was called ϕ by Foreman-Mackey et al. (2017). The factorization algorithm derived by Foreman-Mackey et al. (2017, their Equation (46)) is as follows:

function celerite_factor(U, P, ${\boldsymbol{d}}$, W)
    # Initially ${\boldsymbol{d}}={\boldsymbol{a}}$ and W = V
    $S\,\leftarrow $ zeros(J, J)
    ${{\boldsymbol{w}}}_{1}\leftarrow {{\boldsymbol{w}}}_{1}/{d}_{1}$
    $\mathrm{for}\,n=2,\,\ldots ,\,N$:
      $S\leftarrow {\mathtt{diag}}({{\boldsymbol{p}}}_{n-1})\,[S+{d}_{n-1}\,{{{\boldsymbol{w}}}_{n-1}}^{{\rm{T}}}\,{{\boldsymbol{w}}}_{n-1}]\,{\mathtt{diag}}({{\boldsymbol{p}}}_{n-1})$
      ${d}_{n}\leftarrow {d}_{n}-{{\boldsymbol{u}}}_{n}\,S\,{{{\boldsymbol{u}}}_{n}}^{{\rm{T}}}$
      ${{\boldsymbol{w}}}_{n}\leftarrow \left[{{\boldsymbol{w}}}_{n}-{{\boldsymbol{u}}}_{n}\,S\right]/{d}_{n}$
    return ${\boldsymbol{d}}$, W, S
Here, zeros $(P,Q)$ creates a P × Q matrix of zeros, diag creates a diagonal matrix from a vector, and ${{\boldsymbol{x}}}_{n}$ is a row vector made from the nth row of the matrix X. The cost of this algorithm scales as ${ \mathcal O }(N\,{J}^{2})$. Using this, the log-determinant of ${K}_{{\boldsymbol{\alpha }}}$ is

$\mathrm{ln}\,\det {K}_{{\boldsymbol{\alpha }}}=\displaystyle \sum _{n=1}^{N}\mathrm{ln}\,{d}_{n}\quad (2)$

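For concreteness, the factorization pseudocode translates nearly line-for-line into numpy. The following sketch (illustrative only, not the reference implementation; it assumes numpy imported as np) takes ${\boldsymbol{a}}$ and V as inputs and returns ${\boldsymbol{d}}$, W, and S; the log-determinant in Equation (2) is then np.sum(np.log(d)):

def celerite_factor(U, P, a, V):
    # numpy translation of the celerite_factor pseudocode above.
    # U, V are (N, J); P is (N - 1, J); a is the length-N diagonal vector.
    N, J = U.shape
    d = np.array(a, dtype=float)       # d is initialized to a
    W = np.array(V, dtype=float)       # W is initialized to V
    S = np.zeros((J, J))
    W[0] /= d[0]
    for n in range(1, N):
        S = P[n - 1][:, None] * (S + d[n - 1] * np.outer(W[n - 1], W[n - 1])) * P[n - 1][None, :]
        d[n] = d[n] - U[n] @ S @ U[n]
        W[n] = (W[n] - U[n] @ S) / d[n]
    return d, W, S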
Similarly, the algorithm for applying the inverse of K is (Equations (47)–(48) in Foreman-Mackey et al. 2017):

function celerite_solve$(U,P,{\boldsymbol{d}},W,Z)$
    # Initially Z = Y
    $F\,\leftarrow $ zeros(J, Nrhs)
    $\mathrm{for}\,n=2,\,\ldots ,\,N$:
      $F\leftarrow {\mathtt{diag}}({{\boldsymbol{p}}}_{n-1})\,[F+{{{\boldsymbol{w}}}_{n-1}}^{{\rm{T}}}\,{{\boldsymbol{z}}}_{n-1}]$
      ${{\boldsymbol{z}}}_{n}\leftarrow {{\boldsymbol{z}}}_{n}-{{\boldsymbol{u}}}_{n}\,F$
    $\mathrm{for}\,n=1,\ldots ,N$:
      ${{\boldsymbol{z}}}_{n}\leftarrow {{\boldsymbol{z}}}_{n}/{d}_{n}$
    $G\,\leftarrow $ zeros(J, Nrhs)
    $\mathrm{for}\,n=N-1,\ldots ,1$:
      $G\leftarrow {\mathtt{diag}}({{\boldsymbol{p}}}_{n})\,[G+{{{\boldsymbol{u}}}_{n+1}}^{{\rm{T}}}\,{{\boldsymbol{z}}}_{n+1}]$
      ${{\boldsymbol{z}}}_{n}\leftarrow {{\boldsymbol{z}}}_{n}-{{\boldsymbol{w}}}_{n}\,G$
    return Z, F, G
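A matching numpy sketch of the solve (again illustrative, not the reference implementation) is given below; called with Z = Y, it returns ${{K}_{{\boldsymbol{\alpha }}}}^{-1}\,Y$ together with the intermediate matrices F and G that the reverse pass will need:

def celerite_solve(U, P, d, W, Z):
    # numpy translation of the celerite_solve pseudocode above; Z = Y on input.
    N, J = U.shape
    Z = np.array(Z, dtype=float)
    nrhs = Z.shape[1]
    F = np.zeros((J, nrhs))
    for n in range(1, N):                          # forward substitution
        F = P[n - 1][:, None] * (F + np.outer(W[n - 1], Z[n - 1]))
        Z[n] = Z[n] - U[n] @ F
    Z /= d[:, None]                                # diagonal solve
    G = np.zeros((J, nrhs))
    for n in range(N - 2, -1, -1):                 # backward substitution
        G = P[n][:, None] * (G + np.outer(U[n + 1], Z[n + 1]))
        Z[n] = Z[n] - W[n] @ G
    return Z, F, G

With these two functions, Equation (1) can be evaluated in ${ \mathcal O }(N\,{J}^{2})$ operations: factor once, compute ${\boldsymbol{z}}={{K}_{{\boldsymbol{\alpha }}}}^{-1}\,({\boldsymbol{y}}-{{\boldsymbol{\mu }}}_{{\boldsymbol{\theta }}})$ with a single solve, and combine $-\tfrac{1}{2}\,{({\boldsymbol{y}}-{{\boldsymbol{\mu }}}_{{\boldsymbol{\theta }}})}^{{\rm{T}}}{\boldsymbol{z}}$ with the log-determinant from Equation (2).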
The empirical scaling of these algorithms is shown in Figure 1.

Figure 1. The empirical computational scaling of the algorithms. (Top row): The cost of computing Equation (1) as a function of the number of data points (N; left) and the model complexity (J; right). (Bottom row): The cost of computing the gradient of Equation (1) with respect to the vector ${\boldsymbol{a}}$ and the matrices U, V, and P. In the left panels, each line corresponds to a different value of J as indicated in the legend (with J increasing from bottom to top). Similarly, in the right panels, the lines correspond to different values of N increasing from bottom to top. In both cases, the theoretical scaling is ${ \mathcal O }(N\,{J}^{2})$.


2. Algorithms

Many numerical inference methods can benefit from efficient calculation of the gradient of Equation (1). This is generally computed following Section 5.4.1 in Rasmussen & Williams (2006), but even with a scalable method for applying ${{K}_{{\boldsymbol{\alpha }}}}^{-1}$, the cost of this method scales as ${ \mathcal O }({N}^{2})$. It was recently demonstrated that higher performance can be achieved by directly differentiating factorization algorithms (Murray 2016).
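For reference, the expression from that section (their Equation (5.9)), written in the notation of Equation (1), is

$\frac{\partial \,\mathrm{ln}\,L}{\partial {\alpha }_{k}}=\frac{1}{2}\,{{\boldsymbol{r}}}^{{\rm{T}}}\,{{K}_{{\boldsymbol{\alpha }}}}^{-1}\,\frac{\partial {K}_{{\boldsymbol{\alpha }}}}{\partial {\alpha }_{k}}\,{{K}_{{\boldsymbol{\alpha }}}}^{-1}\,{\boldsymbol{r}}-\frac{1}{2}\,\mathrm{tr}\left({{K}_{{\boldsymbol{\alpha }}}}^{-1}\,\frac{\partial {K}_{{\boldsymbol{\alpha }}}}{\partial {\alpha }_{k}}\right)$

where ${\boldsymbol{r}}={\boldsymbol{y}}-{{\boldsymbol{\mu }}}_{{\boldsymbol{\theta }}}$. Even with an ${ \mathcal O }(N)$ solver, the trace term requires applying ${{K}_{{\boldsymbol{\alpha }}}}^{-1}$ to all N columns of $\partial {K}_{{\boldsymbol{\alpha }}}/\partial {\alpha }_{k}$ (or forming ${{K}_{{\boldsymbol{\alpha }}}}^{-1}$ explicitly), which is the origin of the ${ \mathcal O }({N}^{2})$ scaling.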

Following this reasoning and using the notation of Giles (2008), I present the reverse-mode gradients of the celerite method. After some tedious algebra, the reverse accumulation function for celerite_factor is:

function celerite_factor_rev$(U,P,{\boldsymbol{d}},W,S,\bar{S},\bar{{\boldsymbol{a}}},\bar{V})$
    # Initially $\bar{{\boldsymbol{a}}}=\bar{{\boldsymbol{d}}}$ and $\bar{V}=\bar{W}$
    $\bar{U}\leftarrow \mathrm{zeros}(N,J)$
    $\bar{P}\leftarrow \mathrm{zeros}(N-1,J)$
    ${\bar{{\boldsymbol{v}}}}_{N}\leftarrow {\bar{{\boldsymbol{v}}}}_{N}/{d}_{N}$
    $\mathrm{for}\,n=N,\ldots ,2$:
      ${\bar{a}}_{n}\leftarrow {\bar{a}}_{n}-{{\boldsymbol{w}}}_{n}\,{\bar{{\boldsymbol{v}}}}_{n}^{{\rm{T}}}$
      ${\bar{{\boldsymbol{u}}}}_{n}\leftarrow -[{\bar{{\boldsymbol{v}}}}_{n}+2\,{\bar{a}}_{n}\,{{\boldsymbol{u}}}_{n}]\,S$
      $\bar{S}\leftarrow \bar{S}-{{{\boldsymbol{u}}}_{n}}^{{\rm{T}}}\,[{\bar{{\boldsymbol{v}}}}_{n}+{\bar{a}}_{n}\,{{\boldsymbol{u}}}_{n}]$
      ${\bar{{\boldsymbol{p}}}}_{n-1}\leftarrow \mathrm{diag}(\bar{S}\,S\,\mathrm{diag}{({{\boldsymbol{p}}}_{n-1})}^{-1}+\mathrm{diag}{({{\boldsymbol{p}}}_{n-1})}^{-1}\,S\,\bar{S})$
      $\bar{S}\leftarrow \mathrm{diag}({{\boldsymbol{p}}}_{n-1})\,\bar{S}\,\mathrm{diag}({{\boldsymbol{p}}}_{n-1})$
      ${\bar{d}}_{n-1}\leftarrow {\bar{d}}_{n-1}+{{\boldsymbol{w}}}_{n-1}\,\bar{S}\,{{{\boldsymbol{w}}}_{n-1}}^{{\rm{T}}}$
      ${\bar{{\boldsymbol{v}}}}_{n-1}\leftarrow {\bar{{\boldsymbol{v}}}}_{n-1}/{d}_{n-1}+{{\boldsymbol{w}}}_{n-1}\,[\bar{S}+{\bar{S}}^{{\rm{T}}}]$
      $S\leftarrow \mathrm{diag}{({{\boldsymbol{p}}}_{n-1})}^{-1}\,S\,\mathrm{diag}{({{\boldsymbol{p}}}_{n-1})}^{-1}-{d}_{n-1}\,{{{\boldsymbol{w}}}_{n-1}}^{{\rm{T}}}\,{{\boldsymbol{w}}}_{n-1}$
    ${\bar{a}}_{1}\leftarrow {\bar{a}}_{1}-{\bar{{\boldsymbol{v}}}}_{1}\,{{{\boldsymbol{w}}}_{1}}^{{\rm{T}}}$
    return $\bar{U}$, $\bar{P}$, $\bar{{\boldsymbol{a}}}$, $\bar{V}$
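A simple way to exercise this function is to compare the reverse-mode gradient of the log-determinant, $\sum_{n}\mathrm{ln}\,{d}_{n}$, against finite differences. The sketch below assumes numpy translations celerite_factor (as above) and celerite_factor_rev with the pseudocode signature, operating on copies of their inputs; the seed adjoints are ${\bar{d}}_{n}=1/{d}_{n}$ with $\bar{W}=0$ and $\bar{S}=0$:

def check_logdet_gradient(U, P, a, V, eps=1e-7):
    # Compare the reverse-mode gradient of sum(log(d)) with central finite differences.
    d, W, S = celerite_factor(U, P, a, V)
    a_bar = 1.0 / d                        # d(logdet)/dd_n = 1/d_n seeds the reverse pass
    V_bar = np.zeros_like(W)
    S_bar = np.zeros_like(S)
    U_bar, P_bar, a_bar, V_bar = celerite_factor_rev(U, P, d, W, S, S_bar, a_bar, V_bar)
    num = np.zeros(len(a))
    for n in range(len(a)):                # brute-force derivative with respect to a_n
        a_p = np.array(a, dtype=float); a_p[n] += eps
        a_m = np.array(a, dtype=float); a_m[n] -= eps
        num[n] = (np.sum(np.log(celerite_factor(U, P, a_p, V)[0]))
                  - np.sum(np.log(celerite_factor(U, P, a_m, V)[0]))) / (2 * eps)
    return np.allclose(a_bar, num, rtol=1e-5, atol=1e-8)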
Similarly, the reverse accumulation function for celerite_solve is:

function celerite_solve_rev$(U,P,{\boldsymbol{d}},W,Z,F,G,\bar{F},\bar{G},\bar{Y})$
    # Initially $\bar{Y}=\bar{Z}$
    $\bar{U}\leftarrow \mathrm{zeros}(N,J)$
    $\bar{P}\leftarrow \mathrm{zeros}(N-1,J)$
    $\bar{{\boldsymbol{d}}}\leftarrow \mathrm{zeros}(N)$
    $\bar{W}\leftarrow \mathrm{zeros}(N,J)$
    $\mathrm{for}\,n=1,\ldots ,N-1$:
      ${\bar{{\boldsymbol{w}}}}_{n}\leftarrow -{\bar{{\boldsymbol{y}}}}_{n}\,{G}^{{\rm{T}}}$
      $\bar{G}\leftarrow \bar{G}-{{{\boldsymbol{w}}}_{n}}^{{\rm{T}}}\,{\bar{{\boldsymbol{y}}}}_{n}$
      ${{\boldsymbol{z}}}_{n}\leftarrow {{\boldsymbol{z}}}_{n}+{{\boldsymbol{w}}}_{n}\,G$
      $G\leftarrow \mathrm{diag}{({{\boldsymbol{p}}}_{n})}^{-1}\,G$
      ${\bar{{\boldsymbol{p}}}}_{n}\leftarrow \mathrm{diag}(\bar{G}\,{G}^{{\rm{T}}})$
      $\bar{G}\leftarrow \mathrm{diag}({{\boldsymbol{p}}}_{n})\,\bar{G}$
      $G\leftarrow G-{{{\boldsymbol{u}}}_{n+1}}^{{\rm{T}}}\,{{\boldsymbol{z}}}_{n+1}$
      ${\bar{{\boldsymbol{u}}}}_{n+1}\leftarrow {{\boldsymbol{z}}}_{n+1}\,{\bar{G}}^{{\rm{T}}}$
      ${\bar{{\boldsymbol{y}}}}_{n+1}\leftarrow {{\boldsymbol{u}}}_{n+1}\,\bar{G}$
    $\mathrm{for}\,n=1,\ldots ,N$:
      ${\bar{{\boldsymbol{y}}}}_{n}\leftarrow {\bar{{\boldsymbol{y}}}}_{n}/{d}_{n}$
      ${\bar{d}}_{n}\leftarrow -{{\boldsymbol{z}}}_{n}\,{\bar{{\boldsymbol{y}}}}_{n}^{{\rm{T}}}$
    $\mathrm{for}\,n=N,\ldots ,2$:
      ${\bar{{\boldsymbol{u}}}}_{n}\leftarrow {\bar{{\boldsymbol{u}}}}_{n}-{\bar{{\boldsymbol{y}}}}_{n}\,{F}^{{\rm{T}}}$
      $\bar{F}\leftarrow \bar{F}-{{{\boldsymbol{u}}}_{n}}^{{\rm{T}}}\,{\bar{{\boldsymbol{y}}}}_{n}$
      $F\leftarrow \mathrm{diag}{({{\boldsymbol{p}}}_{n-1})}^{-1}\,F$
      ${\bar{{\boldsymbol{p}}}}_{n-1}\leftarrow {\bar{{\boldsymbol{p}}}}_{n-1}+\mathrm{diag}(\bar{F}\,{F}^{{\rm{T}}})$
      $\bar{F}\leftarrow \mathrm{diag}({{\boldsymbol{p}}}_{n-1})\,\bar{F}$
      $F\leftarrow F-{{{\boldsymbol{w}}}_{n-1}}^{{\rm{T}}}\,{{\boldsymbol{z}}}_{n-1}$
      ${\bar{{\boldsymbol{w}}}}_{n-1}\leftarrow {\bar{{\boldsymbol{w}}}}_{n-1}+{{\boldsymbol{z}}}_{n-1}\,{\bar{F}}^{{\rm{T}}}$
      ${\bar{{\boldsymbol{y}}}}_{n-1}\leftarrow {\bar{{\boldsymbol{y}}}}_{n-1}+{{\boldsymbol{w}}}_{n-1}\,\bar{F}$
    return $\bar{U}$, $\bar{P}$, $\bar{{\boldsymbol{d}}}$, $\bar{W}$, $\bar{Y}$
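To assemble the gradient of Equation (1), the reverse passes are composed in the opposite order of the forward computation: the log-determinant term seeds $\bar{{\boldsymbol{d}}}$, the quadratic term seeds $\bar{Z}$, celerite_solve_rev propagates these to $\bar{{\boldsymbol{d}}}$ and $\bar{W}$, and celerite_factor_rev propagates the result back to the inputs. The following is a minimal sketch assuming numpy translations of all four routines with the pseudocode signatures and a zero-mean model; it mirrors the structure of the reference implementation but is not taken from it:

def grad_log_likelihood(U, P, a, V, y):
    # Forward pass: factor, solve, and evaluate Equation (1) (zero mean assumed here).
    N = len(y)
    d, W, S = celerite_factor(U, P, a, V)
    Z, F, G = celerite_solve(U, P, d, W, y[:, None])
    loglike = -0.5 * (y @ Z[:, 0] + np.sum(np.log(d)) + N * np.log(2 * np.pi))
    # Reverse pass: seed the adjoints of the scalar log-likelihood.
    Z_bar = -0.5 * y[:, None]              # from the quadratic term -0.5 y^T K^{-1} y
    d_bar = -0.5 / d                       # from the log-determinant term
    F_bar = np.zeros_like(F)
    G_bar = np.zeros_like(G)
    U_bar, P_bar, d_bar_s, W_bar, Y_bar = celerite_solve_rev(U, P, d, W, Z, F, G,
                                                             F_bar, G_bar, Z_bar)
    d_bar = d_bar + d_bar_s                # accumulate both contributions to d_bar
    S_bar = np.zeros_like(S)
    U_bar_f, P_bar_f, a_bar, V_bar = celerite_factor_rev(U, P, d, W, S,
                                                         S_bar, d_bar, W_bar)
    return loglike, a_bar, U_bar + U_bar_f, V_bar, P_bar + P_bar_f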
A reference C++ implementation can be found online (Foreman-Mackey 2018), and Figure 1 shows the performance of this implementation.

3. Discussion

This research note presents the algorithms needed to efficiently compute gradients of GP models applied to large data sets using the celerite method. These developments increase the performance of inference methods based on celerite and improve the convergence properties of nonlinear optimization routines. Furthermore, the derivation of reverse accumulation algorithms for celerite allows its integration into popular model building and automatic differentiation libraries like Stan (Carpenter et al. 2015), TensorFlow (Abadi et al. 2016), and others.
