Paper | Open access

Robust online Hamiltonian learning

Christopher E Granade, Christopher Ferrie, Nathan Wiebe and D G Cory

Published 3 October 2012. © IOP Publishing and Deutsche Physikalische Gesellschaft

Citation: Christopher E Granade et al 2012 New J. Phys. 14 103013. DOI: 10.1088/1367-2630/14/10/103013

Abstract

In this work we combine two distinct machine learning methodologies, sequential Monte Carlo and Bayesian experimental design, and apply them to the problem of inferring the dynamical parameters of a quantum system. We design the algorithm with practicality in mind by including parameters that control trade-offs between the requirements on computational and experimental resources. The algorithm can be implemented online (during experimental data collection), avoiding the need for storage and post-processing. Most importantly, our algorithm is capable of learning Hamiltonian parameters even when the parameters change from experiment to experiment, and also when additional noise processes are present and unknown. The algorithm also numerically estimates the Cramer–Rao lower bound, certifying its own performance.


Content from this work may be used under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Building a large-scale quantum information processor is a significant challenge. First, we require an accurate characterization of the dynamics experienced by the device to allow for the application of error correcting codes and other tools for implementing useful quantum algorithms. Characterization is carried out through the statistical estimation of parameters describing the quantum states and processes involved (also called tomography) [1]. This characterization problem is especially timely, since quantum simulation experiments are approaching a complexity where classical computers are unable to simulate their evolution [2–4]. In cases where so-called analogue quantum simulation is applied, the validity of the simulation directly depends on the accuracy with which the Hamiltonian of the quantum simulator conforms to the dynamical model that the experimenter believes describes it. At present, such simulators are certified by comparing their outputs to those expected from a classical-computer simulation [3, 4]. This means that new methods for Hamiltonian characterization will be vital for certifying the next generation of quantum simulators.

In quantum tomography, the usual scenario discussed has been full quantum process estimation—there really is a 'black box' that the experimenter knows nothing about. While conservative, this scenario is unlikely in practice because physics typically gives some insight into the form of the Hamiltonian which gives rise to the process. This additional knowledge reduces the number of parameters of the process, and protocols for learning these parameters are sometimes called partial process estimation or partial tomography [5–10]. If our goal is to build a quantum information processing device, we must also consider an additional complication: a characterization of a process at a 'snap-shot' in time is not nearly as useful as a characterization of the dynamics a quantum system undergoes. The evolution of closed systems is given by a Hamiltonian operator, and hence this process is usually called Hamiltonian estimation in such cases.

One way to adapt the above schemes to Hamiltonian estimation is by stroboscopically estimating snap-shots of the process at fixed times and then using various algorithms to invert these to find the Hamiltonian via post-processing. The key additional freedom we consider, which has received little attention to date, is that the controls for an experiment can depend on the outcomes of the previous measurements. Hence, we call our derived strategies adaptive or online. Under its very broad definition, our method can be called machine learning; however, a more descriptive name is sequential Monte Carlo (SMC) Bayesian experimental design. The marriage of SMC methods [12] and Bayesian experimental design methods [13] has been considered very recently in a wide variety of classical contexts [14–18] and also for measurement-adaptive quantum state tomography [19]. Other machine learning ideas have also been generalized to the quantum domain [21–27]. In this paper we present such an algorithm for learning dynamical parameters of quantum mechanical systems.

We make this learning process tractable by utilizing information about a system, rather than starting from worst-case assumptions such as those made in traditional quantum process and state tomography. In practice we often have knowledge about the dynamical model that describes a system of interest, and wish to improve that knowledge by estimating specific model parameters. Thus, practical Hamiltonian finding can often be achieved via a suitable parameterization of the Hamiltonian, $H(x_1, \dots , x_d)$ , reducing the problem to estimating the vector of parameters x = (x1,...,xd). The task we consider is the design of experiments for the purpose of deducing these parameters in the smallest number of experiments possible. Our algorithm also provides a region estimator for the Hamiltonian parameters, enclosing a volume of parameter space in which the true parameters are expected to be found with high probability. We also generalize this concept to allow the algorithm to learn hyperparameters, which describe the distribution of the Hamiltonian parameters in cases where the parameters vary randomly between experiments.

This paper is organized as follows. In section 2 we review the formalism of Bayesian experimental design. Next, we discuss the statistical metrics we employ and the Cramer–Rao bound in section 3. Section 4 introduces the SMC algorithm. In sections 5 and 6, we discuss the application of our algorithm to region estimation and hyperparameter estimation, respectively. The test cases we use for the numerical experiments are described in section 7, and we explore the implications of the numerical benchmarking results in section 8.

2. Experimental design formalism

The key element which interfaces quantum theory and machine learning is the equivalence of the Born rule from quantum theory and the likelihood function from statistics [28]. Each quantum mechanical problem specification produces a probability distribution Pr(dk|x;ck), where dk is the data obtained and ck are the experimental designs (or controls) chosen for measurement k, and where x is a vector parameterizing the system of interest.

Suppose we have performed experiments with control settings C := {c1,c2,...,cN} and obtained data D := {d1,d2,...,dN}. The model specifies the likelihood function

$$\Pr(D|\boldsymbol{x};C) = \prod_{k=1}^{N} \Pr(d_k|\boldsymbol{x};c_k).$$

However, we are ultimately interested in Pr(x|D;C), the probability distribution of the model parameters x given the experimental data. We achieve this using Bayes' rule,

$$\Pr(\boldsymbol{x}|D;C) = \frac{\Pr(D|\boldsymbol{x};C)\,\Pr(\boldsymbol{x})}{\Pr(D|C)},$$

where Pr(x) is the prior, which encodes any a priori knowledge of the model parameters. The final term Pr(D|C) can simply be thought of as a normalization factor. We refer to this update procedure as batch processing because the experimental controls C do not depend on the observed data, and hence the processing for the experiment design can be done offline (meaning after all data are collected).

An alternative to batch processing of given data is to adaptively choose the controls for the next experiment given the data from the past experiments. This idea can be formalized in various ways—the most natural for our purposes being called Bayesian experimental design [13]. For this we conceive of possible future data dN+1 obtained from a set of, possibly different, experimental controls cN+1. The probability of obtaining this data can be computed from the distributions at hand by marginalizing over the model parameters,

$$\Pr(d_{N+1}|c_{N+1}) = \int \Pr(d_{N+1}|\boldsymbol{x};c_{N+1})\,\Pr(\boldsymbol{x}|D;C)\,\mathrm{d}\boldsymbol{x}.$$

Note that in the remainder we will use the following abbreviated notation for expectation values:

Equation (1): $\Pr(d_{N+1}|c_{N+1}) = \mathbb{E}_{\boldsymbol{x}}\left[\Pr(d_{N+1}|\boldsymbol{x};c_{N+1})\right],$

where the subscript on $\mathbb {E}$ denotes the variable for the expectation to be taken over.

The expectation value in (1) can be used to inform the algorithm about the choices of experimental parameters that are more useful than others. This usefulness is quantified, for a given choice of a utility function U(D;C), by the expected utility of an experiment,

$$U(c_{N+1}) = \mathbb{E}_{d_{N+1}}\left[U(d_{N+1};c_{N+1})\right] = \sum_{d_{N+1}} \Pr(d_{N+1}|c_{N+1})\,U(d_{N+1};c_{N+1}),$$

where U(dN+1;cN+1) is the utility we would derive if experiment cN+1 yielded result dN+1. The choice of the utility function is motivated by the figure of merit that we want to optimize. We will consider two canonical choices, information gain and the negative variance, and discuss them in detail in the subsequent section.
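
To make the design loop concrete, the following sketch evaluates the expected utility of a few candidate controls for a toy one-parameter model with binary outcomes, taking the utility to be the negative posterior variance (see section 3). The toy likelihood, the grid discretization and all names are our own illustration, not the implementation used in this work.

```python
import numpy as np

# Toy sketch of Bayesian experimental design: a one-parameter model with
# binary outcomes d in {0, 1}, a discretized prior, and the negative
# posterior variance as the utility. Illustrative only.

xs = np.linspace(0.0, 1.0, 201)       # discretized parameter grid for x
prior = np.ones_like(xs) / len(xs)    # uniform prior Pr(x)

def likelihood(d, x, c):
    """Toy likelihood Pr(d | x; c): outcome 1 occurs with probability x**c."""
    p1 = x ** c
    return p1 if d == 1 else 1.0 - p1

def expected_utility(c):
    """U(c) = sum_d Pr(d | c) U(d; c), with U(d; c) = -Var[x | d; c]."""
    u = 0.0
    for d in (0, 1):
        joint = likelihood(d, xs, c) * prior
        pr_d = joint.sum()                         # Pr(d | c), marginalized over x
        posterior = joint / pr_d
        mean = (xs * posterior).sum()
        var = ((xs - mean) ** 2 * posterior).sum()
        u += pr_d * (-var)                         # weight utility by Pr(d | c)
    return u

controls = [1, 2, 3, 4]
best = max(controls, key=expected_utility)
print("control with highest expected utility:", best)
```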

3. Utility functions and the Cramer–Rao lower bound

Given a set of observed outcomes, the utility function determines which choice of subsequent experimental parameters informs us most about the model parameters. A generally well motivated measure of utility for scientific inference is information gain [31, 32]. In information theory, information is measured by the entropy,

$$H[\Pr(\boldsymbol{x})] = -\int \Pr(\boldsymbol{x}) \log \Pr(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}.$$

The information gain utility of an experiment is then the corresponding reduction in the entropy of the posterior.

Maximizing the expected value of this utility function is equivalent to minimizing the expected entropy of the posterior distribution, Pr(x|dN+1,D;cN+1,C). We also test our method with a utility function that minimizes the expected variance in Pr(x|dN+1,D;cN+1,C). We show that this choice is optimal for minimizing the mean squared error of the protocol.

Figure 1. Overview of the combined experimental design and parameter estimation algorithm, showing the posterior distribution as a feedback into the next iteration. Blocks with a white background indicate model-dependent steps.

An estimator is a function $\hat {\boldsymbol{x}}$ that takes a set of observed data D collected from a set of experiments with controls C and produces an estimate for the unknown parameters x. Here, we evaluate the quality of an estimator $\hat {\boldsymbol{x}}$ by using a generalization of the squared error loss called the quadratic loss as our figure of merit. The quadratic loss is defined for a vector of parameters x, data D and experiment designs C, as

Equation (2): $L_{\boldsymbol{Q}}(\boldsymbol{x}, \hat{\boldsymbol{x}}(D)) = (\boldsymbol{x} - \hat{\boldsymbol{x}}(D))^{\mathrm{T}}\,{\boldsymbol{Q}}\,(\boldsymbol{x} - \hat{\boldsymbol{x}}(D)),$

where Q is a positive definite matrix on the space of unknown parameters that defines the relative scale between the various parameters of interest. The quadratic loss function is useful to us in that it is computationally inexpensive to calculate and may be analyzed by well-known statistical techniques. In particular, the Cramer–Rao bound can be used to lower-bound the mean quadratic loss incurred by an estimator, under the hypothesis of a given true model x [33].

Following a decision theoretic methodology [34], the risk of an estimator given a set of experiment designs C is its expected performance over all possible outcomes D with respect to the loss function:

$$R(\boldsymbol{x}, \hat{\boldsymbol{x}}; C) = \mathbb{E}_{D|\boldsymbol{x};C}\left[L_{\boldsymbol{Q}}(\boldsymbol{x}, \hat{\boldsymbol{x}}(D))\right].$$

The Bayes risk is the average of this quantity with respect to a prior distribution on x (denoted π) and is given explicitly by

$$r(\pi; C) = \mathbb{E}_{\boldsymbol{x}\sim\pi}\left[R(\boldsymbol{x}, \hat{\boldsymbol{x}}; C)\right],$$

where $\hat {\boldsymbol{x}}$ is assumed to be a Bayes estimator, that is, the estimator which minimizes the Bayes risk. When the loss function is taken to be the squared error (in the single-parameter case) or the quadratic loss (in the multi-parameter case), the Bayes risk is more familiarly known as the mean squared error (MSE).

For quadratic loss (and many others [35]) the unique Bayes estimator is the mean of the posterior distribution,

$$\hat{\boldsymbol{x}}(D) = \mathbb{E}[\boldsymbol{x}|D;C] = \int \boldsymbol{x}\,\Pr(\boldsymbol{x}|D;C)\,\mathrm{d}\boldsymbol{x}.$$

Minimizing the Bayes risk is therefore a natural goal, and so we take the negative Bayes risk as our utility function, U(cN+1) = −r(π;cN+1,C). This choice also has the theoretical benefit of making it easy to compare the performance of different algorithms.

The question of how well we can estimate x becomes the question of how low we can make the Bayes risk r(π;C). We lower bound the achievable risk via the Cramer–Rao bound [33] and its Bayesian variant [36]. Both require finding the Fisher information. In the case of multiple parameters, the Fisher information is a matrix defined by

$${\boldsymbol{I}}(\boldsymbol{x};C)_{ij} = \mathbb{E}_{D|\boldsymbol{x};C}\left[\frac{\partial \log \Pr(D|\boldsymbol{x};C)}{\partial x_i}\,\frac{\partial \log \Pr(D|\boldsymbol{x};C)}{\partial x_j}\right].$$

The Fisher information does not depend at all on the prior distribution, and thus is calculated in the same way regardless of how many experiments have already been performed.

The standard Cramer–Rao bound is then given by ${\rm Cov}(\boldsymbol{\hat {x}}) \geqslant {\boldsymbol {I}}(\boldsymbol{x};C)^{-1}$ , where X ⩾ Y means that X − Y is positive semi-definite. If we choose the matrix Q associated with the quadratic loss to be ${\boldsymbol {Q}}= \mathbbm 1$ , then $R({\boldsymbol{x}}, \boldsymbol{\hat {x}};C) = \mathrm {Tr}({\rm Cov}(\hat {\boldsymbol{x}})) \geqslant \mathrm {Tr}({\boldsymbol {I}}(\boldsymbol{x};C)^{-1})$ . Clearly, this statement of the multivariate Cramer–Rao bound assumes that I is non-singular. Singular Fisher information matrices arise when there are experiments that provide no information about at least one of the model parameters. Unfortunately, that assumption is not met in general. We avoid this problem by considering the Bayesian information matrix ${\boldsymbol {J}}(\pi ;C) = \mathbb {E}_{\boldsymbol{x}}[{\boldsymbol {I}}(\boldsymbol{x};C)] $ . Then, the Bayesian Cramer–Rao bound (BCRB) is given by [36]

$$\mathbb{E}_{\pi}\left[{\rm Cov}(\hat{\boldsymbol{x}})\right] \geqslant {\boldsymbol{J}}(\pi;C)^{-1}.$$

Lower bounds can be found for specific values of C using numerical integration. Here we apply an iterative algorithm to sequentially monitor the lower bound. We will subscript the various quantities of interest by N, meaning 'at the Nth measurement'. Then, the BCRB is given by

Equation (3): $\mathbb{E}_{\pi}\left[{\rm Cov}_N(\hat{\boldsymbol{x}})\right] \geqslant {\boldsymbol{J}}_N^{-1},$

where the iteration is given by

$${\boldsymbol{J}}_N = {\boldsymbol{J}}_{N-1} + \mathbb{E}_{\boldsymbol{x}}\left[{\boldsymbol{I}}(\boldsymbol{x}; c_N)\right]$$

and the initial condition is ${\boldsymbol {J}}_0 = \mathbb {E}_{\boldsymbol{x}}[\boldsymbol{\nabla }_{\boldsymbol{x}} \log \left (\pi (\boldsymbol{x})\right ) \cdot \boldsymbol{\nabla }_{\boldsymbol{x}}^{{\rm T}} \log \left (\pi (\boldsymbol{x})\right )]$ .
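
As an illustration of how equation (3) can be monitored numerically, the sketch below accumulates $\boldsymbol{J}_N$ for the single-parameter likelihood of equation (9) (see section 7.1), using the fact that for a two-outcome experiment the Fisher information reduces to $I(\omega;t) = (\partial_\omega p)^2/[p(1-p)]$. The prior, time heuristic and sample sizes are assumptions taken from section 8.1; the code itself is our sketch, not the authors' implementation.

```python
import numpy as np

# Sketch (ours, not the paper's code) of the sequential BCRB estimate
# J_N = J_{N-1} + E_x[I(x; c_N)] for the model of equation (9), where a
# two-outcome experiment has I(omega; t) = (dp/domega)^2 / (p (1 - p)).

rng = np.random.default_rng(0)
T2 = 100 * np.pi

def p0(omega, t):
    """Pr(0 | omega; t), the likelihood of equation (9)."""
    return np.exp(-t / T2) * np.cos(omega * t / 2) ** 2 + (1 - np.exp(-t / T2)) / 2

def fisher(omega, t, h=1e-6):
    p = p0(omega, t)
    dp = (p0(omega + h, t) - p0(omega - h, t)) / (2 * h)   # numerical derivative
    return dp ** 2 / (p * (1 - p))

prior_var = 0.01               # prior omega ~ N(0.5, 0.01), as in section 8.1
J = 1 / prior_var              # for a Gaussian prior, J_0 = 1 / variance
for N in range(1, 101):
    t_N = 2 * N * np.pi / 3    # the uniformly spaced times used in section 8.1
    omegas = rng.normal(0.5, np.sqrt(prior_var), size=1000)
    J += np.mean(fisher(omegas, t_N))   # Monte Carlo estimate of E_x[I(x; c_N)]
    if N % 25 == 0:
        print(f"N = {N:3d}, BCRB on the MSE: {1 / J:.3e}")
```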

4. Sequential Monte Carlo algorithm

Monte Carlo methods remedy problems that deterministic numerical integrators encounter in finding the volume of multi-dimensional spaces. Specifically, the errors incurred by Monte Carlo integrators do not depend on the dimension of the space. The variant of Monte Carlo integration that we use dates back to 1993 [38] but has been rediscovered many times in a wide variety of scientific applications under various names: SMC, particle filters, sequential importance sampling, Bayes filters and so on. We will use the term SMC and will now sketch the idea behind the algorithm.

Recall the Bayes update rule for one datum d1,

$$\Pr(\boldsymbol{x}|d_1;c_1) = \frac{\Pr(d_1|\boldsymbol{x};c_1)\,\Pr(\boldsymbol{x})}{\Pr(d_1|c_1)}.$$

The distribution can be processed sequentially as more data arrive:

$$\Pr(\boldsymbol{x}|d_2,d_1;c_2,c_1) = \frac{\Pr(d_2|\boldsymbol{x};c_2)\,\Pr(\boldsymbol{x}|d_1;c_1)}{\Pr(d_2|d_1;c_2,c_1)},$$

and so on for each subsequent datum.

These updates are difficult to perform because the evaluation of the posterior distributions requires evaluating costly multi-dimensional integrals over the parameter space. We address this issue by using SMC methods, which approximate a distribution over parameters with a distribution that has support only over a finite number of points (often referred to as particles). The support of the distribution over these points is called the weight of the particle, and by convention the sum of all weights must be 1. More concretely, we approximate an arbitrary distribution by

$$\Pr(\boldsymbol{x}) \approx \sum_{k=1}^{n} w_k\,\delta(\boldsymbol{x} - \boldsymbol{x}_k),$$

where the weights at each step are iteratively calculated from the previous step via

$$w_k \mapsto \frac{w_k \Pr(d|\boldsymbol{x}_k;c)}{\sum_{j} w_j \Pr(d|\boldsymbol{x}_j;c)}.$$

Using this form of an SMC algorithm also has the advantage of ensuring that the prior and posterior distributions for the update always have support over a finite number of particles, which simplifies the analysis of our algorithm.

The particle approximation can be made arbitrarily accurate by increasing the number of particles and will be a good approximation at every update provided we feed in, at the initial stage, the appropriate weights {wk} and support points {xk}. Since both the weights and support points of the particles carry information about distributions over the model parameters x, we can without loss of generality choose the initial weights to be uniform, wk = 1/n for all k, and the initial support points to be samples from the correct prior Pr(x). Having made the particle approximation, we can compute expectation values and variances according to algorithms 1 and 2, and can perform Bayes updates according to algorithm 3.
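
A minimal sketch of these primitives (algorithms 1–3), written in Python/NumPy with our own naming and structure, might look as follows:

```python
import numpy as np

# Sketch of the SMC primitives (algorithms 1-3); our own naming, not the
# authors' implementation.

def particle_mean(weights, locations):
    """Algorithm 1: E[x] approximated as sum_k w_k x_k."""
    return weights @ locations

def particle_cov(weights, locations):
    """Algorithm 2: Cov(x) under the particle approximation."""
    centered = locations - particle_mean(weights, locations)
    return (weights[:, None] * centered).T @ centered

def bayes_update(weights, locations, datum, control, likelihood):
    """Algorithm 3: w_k <- w_k Pr(d | x_k; c), then renormalize."""
    new_weights = weights * likelihood(datum, locations, control)
    return new_weights / new_weights.sum()
```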

SMC techniques require careful effort to avoid introducing errors due to limited numerical precision. The first problem any SMC algorithm runs into is zero weights. This is doubly painful since we are effectively operating with fewer particles but using the same amount of computational resources. Since the support of our approximate distribution is a measure-zero set according to the correct distribution, all the weights will eventually be zero; we cannot avoid this but it can be postponed by using resampling techniques.

Generally, the idea behind resampling is to adaptively move the particles to the locations that are most likely. The simplest algorithm of this type chooses n particles (the original number), with replacement, according to the distribution of weights, and then resets the weights of all particles to 1/n. Thus, zero-weight particles are 'moved' to higher weight locations. To determine when to resample, we compare the effective sample size $n_{\rm ess} = 1/\sum _i w_i^2$ to a threshold resample_threshold, expressed as a fraction of the original number of particles n. We use resample_threshold = 0.5, as suggested by [39].

The resampling algorithm we use was first proposed in [39] and is given explicitly in algorithm 4. The idea behind the algorithm conforms to the intuition given above, but it incorporates randomness to search larger volumes of the parameter space: during resampling, a random perturbation is applied to the location of each new particle, so that the new particles are randomly spread around the previous locations of the old ones. More formally, we model this by randomly choosing a particle location xi, then perturbing it by a normally distributed vector ${\boldsymbol{\epsilon}} \sim \mathcal{N}(0, \Sigma)$ (we will come back to how to choose the mean and covariance). The new particles are thus samples of the convolved distribution

Equation (4): $\Pr(\boldsymbol{x}) = \sum_{i=1}^{n} \frac{w_i}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\left[-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}_i)^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu}_i)\right],$

where k is the number of model parameters. A distribution of this form is known as a mixture distribution, and can be efficiently sampled by first choosing a particle, then choosing a perturbation vector.

To choose the mean μi of each term in the resampling mixture distribution, we take a convex combination of the original particle location xi and the expected model $\boldsymbol{\mu } = \mathbb {E}[\boldsymbol{x}]$ , so that

$$\boldsymbol{\mu}_i = a\boldsymbol{x}_i + (1 - a)\boldsymbol{\mu},$$

where a is a tunable parameter of the resampling algorithm. We will use a = 0.98, as suggested by Liu and West [39]. The covariance of each perturbation is then given by

$$\boldsymbol{\Sigma} = (1 - a^2)\,{\rm Cov}(\boldsymbol{x}).$$

Our resampling algorithm then involves drawing n new particles from the distribution given by (4) and setting the weight of each new particle to 1/n.
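
Putting the pieces above together, a sketch of the Liu–West resampler (algorithm 4), together with the effective-sample-size test, could read as follows; again, this is our illustration rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def effective_sample_size(weights):
    """n_ess = 1 / sum_i w_i^2."""
    return 1.0 / np.sum(weights ** 2)

def liu_west_resample(weights, locations, a=0.98):
    """Sketch of algorithm 4, the Liu-West resampler [39]: draw n particles
    from the mixture of equation (4) and reset all weights to 1/n."""
    n, d = locations.shape
    mu = weights @ locations                           # posterior mean, E[x]
    centered = locations - mu
    cov = (weights[:, None] * centered).T @ centered   # posterior covariance
    sigma = (1 - a ** 2) * cov                         # perturbation covariance
    # Choose parent particles with probability given by their weights,
    parents = rng.choice(n, size=n, p=weights)
    # shrink them toward the mean, mu_i = a x_i + (1 - a) mu,
    means = a * locations[parents] + (1 - a) * mu
    # and perturb each by epsilon ~ N(0, Sigma).
    new_locations = means + rng.multivariate_normal(np.zeros(d), sigma, size=n)
    return np.full(n, 1.0 / n), new_locations

# Typical use: resample only when n_ess drops below resample_threshold * n.
#   if effective_sample_size(weights) < 0.5 * len(weights):
#       weights, locations = liu_west_resample(weights, locations)
```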

There are a few details to address regarding the efficiency of the SMC algorithm. The first thing to note is that, since the choice of resampling algorithm is usually tailored to the problem at hand, it is hard to say anything in general about its algorithmic complexity. A more pressing issue for us, however, is that quantum simulation is required in order to evaluate the likelihood function. This step will not generally be efficient, because no known classical algorithm can simulate generic quantum dynamics in time polynomial in the number of interacting subsystems.

Thus, since calls to the likelihood function are expensive, we wish to minimize the number of times it is called. To achieve this, we use an approximation that involves using only the highest weighted particles to compute the expectation values appearing in the utility function. Note that this will not reduce the accuracy of the estimation of parameters given experiments—rather, it reduces the accuracy with which we choose optimal experiments. This is the ideal place to make the approximation, since the optimization routine makes the most calls to the likelihood function, and no experiment is 'bad' in the sense that the expected risk (and hence the actual risk, on average) cannot increase. This idea is formalized in the algorithm by setting a parameter approx_ratio, which is the fraction of particles to use for the approximation. We give pseudocode for this algorithm in algorithm 6.
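
For illustration, the particle-reduction step of algorithm 6 might be sketched as below, where approx_ratio is the fraction of highest-weight particles retained when evaluating utilities (the names are our own):

```python
import numpy as np

def reduced_particles(weights, locations, approx_ratio):
    """Sketch of the approximation of algorithm 6: keep only the
    highest-weighted fraction of particles for utility evaluations."""
    keep = max(1, int(np.ceil(approx_ratio * len(weights))))
    idx = np.argsort(weights)[-keep:]         # indices of the heaviest particles
    w = weights[idx] / weights[idx].sum()     # renormalize the retained weights
    return w, locations[idx]
```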

We combine these prior algorithms to obtain algorithm 7, which is our complete algorithm for adaptively designing experiments using the SMC approximation. Note that we have left unspecified here the choice of local optimizer; in practice, this will be chosen depending on what works for a given experimental model. In section 8, we compare the Newton conjugate-gradient (NCG) and nonlinear conjugate-gradient methods to a 'null' optimizer that only evaluates the utility function at the initial guesses. Also in section 8, we consider which choices of n, nguesses and approx_ratio result in a useful estimation algorithm.

5. Region estimation

In addition to providing an accurate estimate of the true model parameters for the system, it is important to be able to quantify the uncertainty in the estimated model parameters. This task can be achieved by finding a region $\hat{X}$ of the space of models such that $\Pr (\boldsymbol{x}_0\in \hat {X})$ is maximized and such that ${\rm Vol}(\hat {X})$ is minimized. This is useful, for instance, if we consider the region estimate $\hat {X}$ as input to an optimal control theory (OCT) algorithm that describes the range of dynamics experienced by a quantum system. Since the OCT algorithm must find a control design that is robust for every point in that range, the cost of that optimization increases with the volume of $\hat {X}$ .

Because we are interested in the probability $\Pr (\boldsymbol{x}_0 \in \hat {X})$ of the true model x0 lying within our region estimate $\hat {X}$ for a given run of the SMC algorithm, we say that $\hat {X}$ is a credible region [40, 41]. This is in contrast to confidence regions which are functions from data records D to regions $\hat {X}_{\rm conf}(D)$ such that the probability of obtaining a data record for which $\boldsymbol{x}_0 \in \hat {X}_{\rm conf}(D)$ is at least some threshold [42, 43]. That is, credible regions give probabilities about a single data record, while confidence regions give probabilities about all possible data. Broadly speaking, credible regions are Bayesian analogues to the frequentist concept of confidence regions. Unless otherwise noted, we concern ourselves here with credible regions as obtained from posterior distributions. The two kinds of region estimates are closely related, as has recently been explored in the context of quantum tomography [44, 45].

We make the problem of finding a credible region estimate amenable to analysis by SMC by turning the problem into that of estimating an expectation value. In particular, the probability of the true model being within a region can be expressed as

$$\Pr(\boldsymbol{x}_0 \in \hat{X}) = \mathbb{E}_{\boldsymbol{x}}\left[1_{\hat{X}}(\boldsymbol{x})\right],$$

where $1_{\hat {X}}$ is the indicator function for $\hat {X}$ , defined by

$$1_{\hat{X}}(\boldsymbol{x}) = \begin{cases} 1, & \boldsymbol{x} \in \hat{X}, \\ 0, & \text{otherwise}. \end{cases}$$

The expectation value $\mathbb {E}[1_{\hat {X}}]$ can then be computed using algorithm 2, giving

$$\mathbb{E}\left[1_{\hat{X}}\right] \approx \sum_{i:\, \boldsymbol{x}_i \in \hat{X}} w_i.$$

Thus, by construction, any region containing particles of total weight at least r will have an approximate probability mass of at least r. We formalize this intuition by introducing a probability mass function m(R) on regions R such that

$$m(R) = \Pr(\boldsymbol{x}_0 \in R) = \int_R \Pr(\boldsymbol{x}|D;C)\,\mathrm{d}\boldsymbol{x}.$$

Similarly, let $\tilde{m}(R) = \sum_{i:\,\boldsymbol{x}_i \in R} w_i$ be the approximation of m(R) given by the SMC algorithm.

We thus seek a region $\hat {X}$ such that ${\rm Vol}(\hat {X})$ is small, $m(\hat {X})$ is large and such that $\hat {X}$ is an efficiently computable property of the current SMC state. We achieve the latter two properties by choosing some appropriate geometric function of a set of particles Xr whose weight is above some threshold weight r; for example, the convex hull or the minimum-volume enclosing ellipse of Xr both satisfy $\tilde {m}(X_r) \geqslant r$ and may be computed using well-known classical algorithms [46, 47].

We improve on these results by supposing that, after collecting a reasonable amount of data, the posterior distribution is approximately normally distributed according to $\mathcal {N}(\boldsymbol{\mu}, {\boldsymbol {\Sigma }})$ for some μ and Σ. This assumption holds when the Fisher information is non-singular, and we find in the case of our benchmarks that it approximately holds if Pr(x|D) is sharply peaked. Under this assumption, it follows from the definition of the multivariate normal distribution that the inverse covariance matrix Σ−1 describes an ellipse containing approximately $0.682^d$ of the probability mass, where d is the number of unknown model parameters, d = dim x. In particular, the covariance matrix transforms a vector z of length d with each component drawn from the standard normal distribution $\mathcal {N}(0, 1)$ into a random variate of a multivariate normal distribution with the given covariance, and so inverting that transformation gives the z-score for each component.

More generally, under the assumption of a normally distributed posterior, the error ellipse of points x satisfying

Equation (5): $(\boldsymbol{x} - \boldsymbol{\mu})^{\mathrm{T}}\,\boldsymbol{\Sigma}^{-1}\,(\boldsymbol{x} - \boldsymbol{\mu}) \leqslant Z^2$

for some Z > 0 will contain a ratio

$$r = \left(2\,{\rm cdf}_{\mathcal{N}}(Z) - 1\right)^d$$

of the particle weight, where ${\rm cdf}_{\mathcal{N}}(Z)$ is the cumulative distribution function for the normal distribution, evaluated at Z. Thus, if the assumption of a normal posterior is a good approximation, then the estimated covariance matrix according to algorithm 1 can be used as a region estimator.

The volume of the covariance ellipse region estimator can then be found by again treating Σ as a transformation of a d-dimensional coordinate system, so that

$${\rm Vol}(\hat{X}) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, Z^d \sqrt{\det \boldsymbol{\Sigma}}.$$

We test that the normal posterior assumption is approximately met by finding the approximate probability mass $\tilde {m}({\rm Cov}(\hat {\boldsymbol{x}})^{-1}/Z^2)$ for a given Z, and comparing to the expected cdf. Moreover, we compare the optimal size of the covariance ellipse to the actual size by appealing to the BCRB, since $\mathbb {E}_{\pi }[{\rm Cov}(\hat {\boldsymbol{x}})] \geqslant {\boldsymbol {J}}(\pi ; C)^{-1}$ . This comparison will be explored further in section 8, and will be used as a performance metric for our algorithm.
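
A sketch of the covariance-ellipse region estimator, computing both the enclosed particle mass $\tilde{m}$ for equation (5) and the ellipse volume, might read as follows (our illustration, assuming SciPy is available for the gamma function):

```python
import numpy as np
from scipy.special import gamma

def ellipse_mass(weights, locations, Z=3.0):
    """Approximate probability mass inside the Z-score covariance ellipse of
    equation (5), under the particle approximation. Illustrative sketch."""
    mu = weights @ locations
    centered = locations - mu
    cov = (weights[:, None] * centered).T @ centered
    cov_inv = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each particle from the mean.
    dist2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)
    return weights[dist2 <= Z ** 2].sum()

def ellipse_volume(cov, Z=3.0):
    """Vol(X) = Z^d V_d sqrt(det Sigma), with V_d the volume of the unit d-ball."""
    d = cov.shape[0]
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return Z ** d * v_d * np.sqrt(np.linalg.det(cov))
```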

6. Hyperparameter estimation

Now that we have discussed the general framework for using Bayesian inference to learn Hamiltonian parameters, we will proceed to discuss an important generalization of the prior work. The generalization that we consider addresses the fact that quantum systems seldom have consistent Hamiltonians from experiment to experiment, due to experimental errors. Hyperparameters allow us to generalize the Hamiltonian learning problem from one involving learning the Hamiltonian parameters to one that involves learning the parameters that describe the distribution of Hamiltonian parameters. In this way, the method of hyperparameter estimation is an alternative to approaches such as quantum smoothing [29, 30], which integrate over the history of a time-varying random parameter, rather than considering the distribution from which each realization is being drawn.

We denote the hyperparameters for a model Hamiltonian as y, to emphasize the subtle conceptual differences between the hyperparameters and the distributions on x that they describe. The probability distribution for x can then be written as Pr(x|y). Despite these interpretational differences, the hyperparameters can be learned using algorithm 7 in exactly the same way that x is learned. The region estimates yielded by the algorithm are region estimates for y and, as we will show shortly, can easily be converted into region estimates for x.

The drawback to this approach is that computations of the likelihood function can become much more expensive. Specifically, in order to compute the likelihood function Pr(D|y) we need to compute the probabilities of data d emerging from an ensemble of randomly sampled model parameters taken from the distribution described by y. A large number of samples, Ns, may be required in some cases, since the sampling error scales as $1/\sqrt {N_{\mathrm {s}}}$ . On the other hand, the approach is straightforward to implement because it requires neither an increase in the number of particles nor changes to the underlying algorithm.
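
For the Larmor precession example of section 7.3 (taking T2 → ∞, so that Pr(0|ω;t) = cos²(ωt/2)), this sampled hyperparameter likelihood might be sketched as follows; the sample count and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def hyper_likelihood(d, y, t, n_samples=1000):
    """Sketch of the sampled hyperparameter likelihood Pr(d | y; t): average
    Pr(d | omega; t) over omega ~ Pr(omega | y). The sampling error
    scales as 1 / sqrt(n_samples)."""
    mu, sigma = y
    omegas = rng.normal(mu, sigma, size=n_samples)
    p0 = np.cos(omegas * t / 2) ** 2      # Pr(0 | omega; t) with T2 -> infinity
    return np.mean(p0) if d == 0 else np.mean(1 - p0)
```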

In some important special cases, this drawback can be avoided by analytically performing the marginalization over x,

$$\Pr(D|\boldsymbol{y}) = \int \Pr(D|\boldsymbol{x})\,\Pr(\boldsymbol{x}|\boldsymbol{y})\,\mathrm{d}\boldsymbol{x}.$$

In section 7.3, we discuss a particular case where the marginalization is analytically tractable.

The resulting means and covariance matrices for y can be readily converted to the corresponding quantities for x by using the chain rule for expectation values,

Equation (6): $\mathbb{E}[\boldsymbol{x}] = \mathbb{E}_{\boldsymbol{y}}\left[\mathbb{E}_{\boldsymbol{x}|\boldsymbol{y}}[\boldsymbol{x}|\boldsymbol{y}]\right].$

This expectation value can be computed using the posterior distribution Pr(y|D) and the intermediate model distribution Pr(x|y), which will typically be easy to compute from the definition of the hyperparameters. The covariance matrix for x is slightly more complicated. It is straightforward to verify that

Equation (7): ${\rm Cov}(\boldsymbol{x}) = \mathbb{E}_{\boldsymbol{y}}\left[{\rm Cov}_{\boldsymbol{x}|\boldsymbol{y}}(\boldsymbol{x}|\boldsymbol{y})\right] + {\rm Cov}_{\boldsymbol{y}}\left(\mathbb{E}_{\boldsymbol{x}|\boldsymbol{y}}[\boldsymbol{x}|\boldsymbol{y}]\right).$

For the special case that x is a single parameter, the covariance can be replaced with the variance to obtain that

Equation (8): ${\rm Var}(x) = \mathbb{E}_{\boldsymbol{y}}\left[{\rm Var}(x|\boldsymbol{y})\right] + {\rm Var}_{\boldsymbol{y}}\left(\mathbb{E}[x|\boldsymbol{y}]\right).$

Using the region estimate given by (5) for the hyperparameters thus translates into a region estimator for the model parameters x, provided that the distribution over hyperparameters y is approximately Gaussian near its peak. In the limit of many experiments, we find that this is a good assumption, as discussed in section 8.

An important consequence of this derivation is that the same SMC algorithm can be used to estimate regions of model parameters, even when those model parameters vary with each experiment. In particular, as the covariance ${\rm Cov}_{\boldsymbol{y}}(\hat {\boldsymbol{x}})$ in our estimate $\hat {\boldsymbol{x}}$ of the mean model parameter vector decreases, our estimate of the model region approaches the 'true' model region given by Covx|y(x|y).
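
A short sketch of this conversion, from a particle distribution over hyperparameters y = (μ, σ) to the mean and variance of ω via equations (6) and (8), might read:

```python
import numpy as np

def model_moments_from_hyper(weights, y_particles):
    """Convert a particle distribution over hyperparameters y = (mu, sigma)
    into the mean and variance of omega via equations (6) and (8). Sketch."""
    mus, sigmas = y_particles[:, 0], y_particles[:, 1]
    mean_omega = weights @ mus                       # E[omega] = E_y[mu]
    # Var(omega) = E_y[Var(omega | y)] + Var_y(E[omega | y])
    var_omega = weights @ sigmas ** 2 + weights @ (mus - mean_omega) ** 2
    return mean_omega, var_omega
```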

7. Test cases

We assess the performance of algorithm 7 through a number of test cases that are designed to examine the performance of the algorithm in a number of different relevant settings. Here we describe the test cases, which we build up in complexity starting from the simple model studied in detail in [48–50]. The first case that we consider is learning a single parameter for an experiment in the presence of a known decoherence time. The second case generalizes this by allowing T2 to be unknown. This problem is particularly important because existing experiments require substantial processing to learn both the unknown parameter and the decoherence time. Finally, we consider a two-hyperparameter model with a single-parameter Hamiltonian and perform region estimation of both the hyperparameters and the parameters that are distributed according to them.

7.1. Single parameter Hamiltonian

We first consider the example with one unknown and one control parameter. Suppose that a qubit evolves under an internal Hamiltonian

$$H = \frac{\omega}{2}\,\sigma_z.$$

Here ω is an unknown parameter whose value we want to estimate. An experiment consists of preparing a single known input state ψin = | + 〉, the +1 eigenstate of σx, evolving under the Hamiltonian H for a controllable time t and performing a measurement in the σx basis. The kth measurement has two outcomes, which we record as dk∈{0,1}. The design specification for each experiment is in this case given by the time tk that the state is allowed to evolve under the unknown Hamiltonian. Protocols that are provably optimal have been obtained by finding analytic expressions, and lower bounds, on the accuracy of generic protocols [50].

We will slightly generalize this model by allowing noise sources which lead to a decay in the information extractable from any measurement. This can manifest from, for example, a T2 dephasing process which leads to the following likelihood function:

Equation (9): $\Pr(0|\omega; t) = \mathrm{e}^{-t/T_2} \cos^2\left(\frac{\omega}{2} t\right) + \frac{1 - \mathrm{e}^{-t/T_2}}{2},$

where ω is the unknown parameter to be estimated, t is the controllable parameter and T2 is a known constant.
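
For concreteness, this likelihood and a simulated binary measurement might be implemented as in the following sketch (the function names and random-number handling are our own choices):

```python
import numpy as np

def pr0(omega, t, T2):
    """The likelihood of equation (9): the probability of outcome 0 after
    evolving |+> for a time t, with T2 dephasing."""
    return np.exp(-t / T2) * np.cos(omega * t / 2) ** 2 + (1 - np.exp(-t / T2)) / 2

def simulate_measurement(omega_true, t, T2, rng):
    """Draw one binary outcome d_k for the experiment design t_k = t."""
    return int(rng.random() >= pr0(omega_true, t, T2))

rng = np.random.default_rng(3)
print(simulate_measurement(0.5, 2 * np.pi / 3, 100 * np.pi, rng))
```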

This model was studied in [48–50], where it could be treated analytically. For the case with no noise (T2 = ∞), given a normal prior with variance σ2, the risk decreases exponentially as $r \sim \sigma^2 (1 - \mathrm{e}^{-1})^N$, whereas for finite T2, the exponential scaling is suppressed once the measurement times reach t ≈ T2.

7.2. Two parameter model with single control

Here, we are going to consider the same model from the previous section,

Equation (10): $\Pr(0|\omega, T_2; t) = \mathrm{e}^{-t/T_2} \cos^2\left(\frac{\omega}{2} t\right) + \frac{1 - \mathrm{e}^{-t/T_2}}{2},$

but where now both ω and T2 are unknown. The time t remains the only experimentally controllable parameter. For numerical convenience, we choose to parameterize this model as $\boldsymbol{x} = (\omega, T_2^{-1})$, so that each unknown parameter has the same dimensions.

Even for such a simple generalization, the methods discussed above are not adequate. In particular, the Fisher matrix of any single measurement is singular, and hence the standard Cramer–Rao bound does not hold—nor is it possible to utilize standard asymptotic approximations to normal distributions. Although we cannot find an analytic expression for the BCRB in the same way we did for the single parameter problem, we use our SMC algorithm to efficiently estimate it numerically.

In general, there is no reason to expect that ω and $T_2^{-1}$ will be expressed in such a way as to admit the same scale, and so in estimating using this model, we must set the positive definite matrix Q for the loss function (2). This is discussed further in section 8.2.

7.3. One parameter model with hyperparameters

We also derive a two-parameter model similar to (10) by again considering a single qubit undergoing Larmor precession as in (9), but where the 'true' precession frequency ω is itself distributed according to a Gaussian distribution of mean μ and variance σ2. In this case, following the discussion of section 6, the probability of data conditioned on the hyperparameters y = (μ, σ) can be found by marginalizing over the intermediate random variable ω, so that

Equation (11): $\Pr(d|\boldsymbol{y}; t) = \int \Pr(d|\omega; t)\,\Pr(\omega|\boldsymbol{y})\,\mathrm{d}\omega.$

For the specific example of the Gaussian distribution,

Equation (12): $\Pr(\omega|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(\omega - \mu)^2/2\sigma^2},$

this marginalization can be performed in closed form, giving

Equation (13): $\Pr(0|\mu, \sigma; t) = \frac{1}{2}\left(1 + \mathrm{e}^{-\sigma^2 t^2/2} \cos(\mu t)\right).$

At this point, we have entirely removed ω from the problem, leaving a two-parameter model, where we wish to estimate the mean and variance of an unknown normal distribution.

As another example, instead of marginalizing against a Gaussian distribution, we consider the case that the intermediate model parameter ω is drawn from a Lorentz distribution. A Lorentz distribution is completely determined by its location and scale parameters ω0 and γ, respectively, and so we use these hyperparameters to derive a new model,

Equation (14): $\Pr(\omega|\omega_0, \gamma) = \frac{1}{\pi}\,\frac{\gamma}{(\omega - \omega_0)^2 + \gamma^2},$

Equation (15): $\Pr(0|\omega_0, \gamma; t) = \frac{1}{2}\left(1 + \mathrm{e}^{-\gamma t} \cos(\omega_0 t)\right).$

Note that if we identify $\gamma = T_2^{-1}$, then the Lorentz hyperparameter model is identical to that of equation (10). This illustrates the relationship between decoherence processes and the lack of knowledge formalized by a hyperparameter model. In a similar fashion, (13) is also a model of decoherence. Due to the t2 dependence of the Gaussian-hyperparameter model, (13) represents a decoherence process that cannot be written in Lindblad form [51], because it cannot be drawn from a quantum dynamical semigroup.
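
The two marginalized likelihoods, and the stated equivalence between the Lorentz hyperparameter model and the unknown-T2 model of equation (10), can be checked numerically with a short sketch:

```python
import numpy as np

def pr0_gaussian(mu, sigma, t):
    """Equation (13): omega ~ N(mu, sigma^2) gives Gaussian decay exp(-sigma^2 t^2 / 2)."""
    return (1 + np.exp(-sigma ** 2 * t ** 2 / 2) * np.cos(mu * t)) / 2

def pr0_lorentz(omega0, gamma, t):
    """Equation (15): omega ~ Lorentz(omega0, gamma) gives exponential decay exp(-gamma t)."""
    return (1 + np.exp(-gamma * t) * np.cos(omega0 * t)) / 2

# Identifying gamma = 1/T2 reproduces the unknown-T2 likelihood of equation (10).
t, omega, T2 = 3.7, 0.5, 1000.0
lhs = pr0_lorentz(omega, 1 / T2, t)
rhs = np.exp(-t / T2) * np.cos(omega * t / 2) ** 2 + (1 - np.exp(-t / T2)) / 2
print(np.isclose(lhs, rhs))   # True
```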

The approach of introducing hyperparameters allows us to estimate unknown distributions using the same SMC algorithm 7 that we use for estimating other model parameters, by making assumptions about the form of the unknown distribution. Using the techniques developed in section 6, we can also extend region estimation to hyperparameter models to obtain regions on the intermediate model (in this case, ω) using regions on y. We discuss the results of these applications in detail in section 8.2.

8. Results and discussion

We will now turn our attention toward assessing the performance of algorithm 7 in practical examples. Our results show that our adaptive Bayesian algorithm is able to learn model parameters using a very small number of experiments, including adverse situations where hyperparameters are needed to describe the fluctuations of model parameters between experiments or where the decoherence time of the system is unknown. This performance is especially noteworthy in the case of an unknown decoherence time T2. Though methods of learning unknown decoherence processes have been developed and are well understood [52–54], these methods require multiple iterations of quantum process tomography implemented using either ensemble measurement or a large amount of data obtained using strong measurement. In the case discussed here, we can adaptively use prior information to obtain accurate estimates of T2 using significantly fewer measurements than the methods currently employed in systems with strong measurement [32].

Figures 2 and 12 build intuition for how algorithm 7 actually learns parameters by describing a trial run for each of the models in sections 7.1 and 7.2. These illustrate the movement of the particles, through resampling, to regions of high likelihood. In particular, figure 2 demonstrates that only 11 experiments are needed in order to achieve a tight, approximately Gaussian posterior distribution over the model parameter ω for the known-T2 model described in section 7.1. Figure 12 shows the performance of the algorithm for the unknown-T2 model described in section 7.2. We see in that case that only 200 experiments are needed to find a distribution that is centered around the true values of ω and T2.

Figure 2. Left to right: the likelihood function for N = 1, 6 and 11 simulated measurements at random times in (0,5π). The model is that given in equation (9) with T2 = 100π. The red dots (and red arrows) are the randomly chosen true parameter ω. The blue dots are the n = 100 SMC 'particles'.

8.1. Results for known T2 model

Our first set of numerical experiments examines the performance of our algorithm for the case where the decoherence time is known with perfect precision. Specifically, we take T2 = 100π, $\omega \sim \mathcal {N}(0.5,0.01)$ and choose the experimental times to be spaced uniformly such that the kth experiment occurs at time tk = 2kπ/3 (chosen arbitrarily) and examine the mean-square error in ω averaged over a minimum of 1625 trials with randomly chosen ω. This data is presented in figure 3. The figure shows that the mean-square error for ω decreases as we vary the number of particles from 100 to 10 000, and in particular gives a relative MSE that is less than 1% for N ⩾ 100. We also see that the data remain close to the BCRB (which is a lower bound on the MSE) for either n = 1000 or 10 000 particles if no optimization is used, whereas n = 10 000 is needed to approach the BCRB in the case where NCG is used. This not only shows that several thousand particles should be sufficient for our purposes, but also that the MSE yielded by our algorithm scales near–optimally if a sufficiently large number of particles are used in the SMC approximation.

Figure 3. Left to right: the performance, as a function of the number of measurements N, of the SMC algorithm for n = 100, 1000 and 10 000 particles. The model is that of equation (9) with T2 = 100π (see also figure 2). The dashed lines indicate data taken without local optimization, while the solid lines indicate trials in which initial guesses were optimized using the NCG method. For each data set, the corresponding thick line indicates the BCRB. Errors in estimating the performance are indicated by red shaded regions around each curve.

Figure 4 provides a more in-depth analysis of the error scaling for the known-T2 model. The previous data set contained only one experimental guess per experiment, whereas in general we could consider making many guesses for the optimal experiment and choosing the one with the highest utility $U(t) = -\mathbb {E}_D[L]$ . We assess the performance of our algorithm as a function of the number of guesses by introducing a new guess heuristic: instead of choosing uniformly spaced guesses, we choose each guessed time randomly from an exponential distribution with mean T2. This guess heuristic has the advantage of using very little intuition about the structure of the Hamiltonian that we are attempting to parameterize, which means that we expect its performance to be a better estimate of the worst-case performance of our algorithm.

Figure 4. This figure compares the mean-square error as a function of the number of experiments used for the known T2 model with T2 = 100, 5000 particles and an approximation ratio of 1, with guessed experimental times chosen randomly from an exponential distribution with mean T2. On the left, data is shown for 1 guess, while on the right, we show data for 30 guesses. The shaded region in each plot indicates a 68% confidence interval for data collected using NCG optimization with approx_ratio = 1.0.

Unsurprisingly, we find that the MSE is reduced if we choose the best of 30 possible experiments rather than a single randomly chosen experiment. We also see that local optimization tends to improve the quality of the approximation if we pick the approximation ratio to be 1. Figure 4 also shows that smaller values of the approximation ratio can cause the mean-square error to saturate at relatively large values. In fact, taking approx_ratio = 0.1 was sufficient to cause the data taken for 30 guesses and NCG optimization to have a larger mean-square error in ω than the case with no optimization and 1 random guess. For this reason, we take approx_ratio = 1 for most of the examples in this section. In section 8.4, we shall see an example where approx_ratio < 1 provides more benefit.

An individual trial may have a mean-square error that differs significantly from the expected loss. The shaded regions in figure 4 give 68% confidence intervals for the actual loss for the case with NCG and approx_ratio = 1 (the confidence intervals for the other data sets were similar). We find the surprising result that the mean-square error frequently lies outside the confidence interval in the cases where NCG optimization is not used. This suggests that the distribution of the utility of experiments varies wildly, and that the distribution is skewed because a significant fraction of the guesses provide virtually no information. NCG minimizes the chances of choosing an uninformative experiment, and hence it forces the guesses toward the more informative experiments. We should also note that there is room for improvement here, since the MSE is closer to the upper limit of the confidence interval. Sophisticated optimization procedures may therefore be of use when searching for informative experiments given an uninformed guess heuristic (such as our random guess heuristic).

These results show that poor guess heuristics can be mitigated using our algorithm by optimizing over more guesses and using local optimization of the experiments. We also note that the median-square error performance of our algorithm tends to be much better than the mean-square error because a small fraction of the trials randomly choose very bad guesses. We see that NCG optimization can be used to cause the mean-square error to approach the median-square error. We therefore find that estimates of the error (such as the posterior variance or region estimates of ω) are needed in order to guarantee that a particular trial is not pathological.

8.2. Region estimation

One of the most substantial contributions of our algorithm is its ability to provide region estimates for the location of the true Hamiltonian, which allow us to quantify our uncertainty in the true model parameters. We compare the probability mass enclosed by the covariance region estimator described in section 5 to that expected for a Gaussian posterior. A simplifying assumption is made in our analysis: we assume that the posterior distribution is approximately Gaussian. Although difficult to justify theoretically, we have yet to find an example for the models considered here where the posterior does not appear Gaussian after a sufficiently large number of experiments. Under the Gaussian model of the posterior distribution, we expect the true model parameters to be within an ellipse described by the covariance matrix, whose volume is then set by the Z-score used. For example, in the one-dimensional case approximately 95% of the probability mass is located within two standard deviations, which corresponds to Z = 2. We choose Z = 3 standard deviations from the mean for these examples, which corresponds to probability masses of $\tilde {m}({\rm Cov}(\hat {\boldsymbol{x}})^{-1}/Z^2) \approx 0.9973$ and $\tilde {m}({\rm Cov}(\hat {\boldsymbol{x}})^{-1}/Z^2) \approx 0.9946$ for the one- and two-parameter cases, respectively.
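
These quoted masses follow from the Gaussian mass formula $(2\,{\rm cdf}_{\mathcal{N}}(Z) - 1)^d$ used in section 5; a quick check:

```python
from scipy.stats import norm

# Probability mass inside a Z-score ellipse of a Gaussian posterior.
for d in (1, 2):
    print(d, (2 * norm.cdf(3) - 1) ** d)   # 0.9973 for d = 1, 0.9946 for d = 2
```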

Figure 5 illustrates that the approximate probability mass $\tilde {m}$ approaches the probability mass we would expect for a normal distribution for the known-T2 model (introduced in section 7.1) in the limit of large N, providing evidence in favor of our use of the covariance ellipse as a region estimator on the posterior. In particular, we note that the value of $\tilde {m}$ approaches 0.9973, such that the quality of the Gaussian approximation improves as we collect data. The transient behavior for small experiment numbers occurs because insufficient experiments have been considered for the posterior to approach a Gaussian. In this specific example, the average differences in enclosed probability mass after each experiment are on the order of 0.01%, and thus may not be of practical significance.

Figure 5. SMC approximated covariance probability mass $\tilde {m}({\rm Cov}(\hat {\boldsymbol{x}})^{-1}/Z^2)$ for the known-T2 single-parameter model as compared to the probability mass mnormal ≈ 0.9973 expected for the normal distribution, as a function of the number N of experiments performed, averaged over 20 119 trials using a guess heuristic that chooses the kth guess to occur at time $(9/8)^k$, using 30 guesses, 1000 particles and NCG optimization. The dashed line shows the probability mass for the corresponding normal distribution.

8.3. Results for unknown T2 model

We now turn our attention to the comparably challenging task of learning Hamiltonian parameters without a precise estimate of T2. These calculations were performed using the true distributions $\omega \sim \mathcal {N}(0.5, 0.0025)$ and $1/T_2 \sim \mathcal {N}(0.001, 0.00025^2)$ , and with the scale matrix ${\boldsymbol{Q}} = {\rm diag}(1, \sigma^2_{\omega}/\sigma^2_{T_2^{-1}}) = {\rm diag}(1, 100)$. The guess heuristic that we focus on chooses times randomly from an exponential distribution with mean 1000, corresponding to the mean value of T2 according to the initial prior.

We examine the variation of the MSE with the number of guesses used in figure 6. The figure shows that, in the absence of local optimization of experiment times, the MSE for both ω and 1/T2 is significantly improved by using an increased number of guesses. In particular, we find that if 30 guesses are used, then only 50 experiments are required on average to learn ω within a 0.9% error, even without a well characterized T2. The improvement is much more substantial for ω than it is for 1/T2 because the contrast on T2 is much less significant.

Figure 6. Benchmarking of the 'unknown-T2' model (section 7.2) using n = 5000 particles and random initial guesses without local optimization. Data indicated by dashed lines correspond to trials where a single initial guess was used for each experiment, while data indicated by solid lines were collected using 30 guesses per experiment. The single-guess data is averaged over 1380 trials while the 30-guess data is averaged over 1109 trials. Errors in estimating performance are indicated by red shaded regions about each curve.

Figure 7 examines the effect of increasing the number of guesses for strategies that use NCG. The most significant qualitative difference between the data collected using NCG and that of figure 6 is that the MSE for ω shows no evidence of saturating and instead continues to shrink as the number of experiments is increased (as seen most clearly in figure 8). This implies that our randomized guess heuristic is unlikely to randomly guess very informative experiments after a fixed number of experiments, but the landscape is sufficiently devoid of local optima that NCG optimization finds informative experiments in the vicinity of our uninformed guesses. We also observe that NCG does not substantially improve the MSE if one guess is used. This suggests that the landscape is not sufficiently convex that local optimization about an individual guess is likely to find experiments that are substantially more informative. We therefore conclude, again, that increasing the number of guesses used and using NCG substantially improves the MSE for ω, while having a much more subtle effect on the knowledge of T2.

Figure 7. Benchmarking of the 'unknown-T2' model (section 7.2) using n = 5000 particles and random initial guesses with local optimization by the NCG method. Data indicated by dashed lines correspond to trials where a single initial guess was used for each experiment, while data indicated by solid lines were collected using 30 guesses per experiment. The single-guess data is averaged over 1023 trials while the 30-guess data is averaged over 930 trials. Errors in estimating performance are indicated by red shaded regions about each curve.

Figure 8. Benchmarking of the 'unknown-T2' model (section 7.2) using n = 5000 particles and 30 random initial guesses. Data indicated by dashed lines correspond to trials where each initial guess was used without local optimization, while data indicated by solid lines were collected using NCG optimization for each guess. The unoptimized data is averaged over 1109 trials while the optimized data is averaged over 930 trials. Errors in estimating performance are indicated by red shaded regions about each curve.

Similarly to the case of known T2, it is useful to benchmark the performance of our algorithm against the BCRB, which gives a lower bound on the MSE. Figure 9 provides a comparison of the MSE, the estimate of the MSE given by the variance of the posterior and the BCRB for ω, $T_2^{-1}$ and Tr(Σ·Q). We see that the expected posterior variance is typically within statistical error of the MSE for all three of these quantities, suggesting that the posterior variance can be used as a very good estimate of the MSE for this model. We also note that the MSE is very close to the BCRB for the $T_2^{-1}$ data and for Tr(Σ·Q). The MSE for ω is within a constant multiple of the BCRB. We do not, in fact, expect that the MSE in ω should approach the BCRB, because the algorithm chooses experiments to optimize Tr(Σ·Q) rather than the error in either ω or $T_2^{-1}$ individually.

Figure 9. The actual and estimated performance, as a function of the number of measurements N, of the SMC algorithm for n = 5000 particles. The model is that of equation (10) with unknown T2 (which is estimated as Γ = 1/T2 for numerical precision considerations). The dotted curve is the posterior variance of the particles; dashed is the actual mean squared error and solid is the numerically calculated Bayesian Cramer–Rao lower bound. In subfigures (a) and (b), the MSE and variances are those of the individual parameters ω and $T_2^{-1}$, respectively, while subfigure (c) shows the actual and estimated quadratic losses scaled using ${\boldsymbol{Q}} = {\rm diag}(1, \sigma^2_{\omega}/\sigma^2_{T_2^{-1}})$, where $\sigma^2_{\omega}$ and $\sigma^2_{T_2^{-1}}$ are the variances in ω and $T_2^{-1}$ according to the initial prior π.

8.4. Hyperparameter region estimation performance

Having demonstrated the effectiveness of our region estimation algorithm, it remains to show that the generalization to hyperparameter regions works as described in section 6. The objective here is to analyze the robustness of our algorithm in the presence of fluctuating 'true' parameters of the Hamiltonian. We do so by using the Gaussian hyperparameter model of equation (12) as discussed in section 7.3, then comparing the model parameter region volume and probability mass for the region estimated from equation (7) to the volume and probability mass of the corresponding 'true' model parameter region. We benchmark this model by choosing 'true' hyperparameters μ and $\sigma^2$ for ω according to the normal distribution

Equation (16): $(\mu, \sigma^2) \sim \mathcal{N}\left((0.5, 0.0025),\ {\rm diag}(0.001^2, 0.0025^2)\right).$

Recall that the unknown frequency is distributed as $\omega \sim \mathcal {N}(\mu ,\sigma ^2)$ . In particular, this true distribution does not admit any correlation between the mean and variance hyperparameters. We then use the true distribution as our prior distribution.

Figure 10(a) provides estimates of the probability mass contained within our estimated region for the Hamiltonian, which uses a Z-score of 3. Since we assume a Gaussian posterior, we anticipate that 99.7% of the probability mass should lie within the region estimation of ${\mathbb {E}}[\hat {\omega }] \pm 3\sqrt {{{\rm Var}}(\hat {\omega })}$ . We find very good agreement with this assumption, and find that at worst 99.4% of the probability mass for the hyperparameters lies within the estimated region. The data also suggests that these small differences vanish for the optimized data sets, which appear to approach the ideal enclosed probability mass of 99.7% in the limit of large N. It is also interesting that the data with approx_ratio = 0.1 yielded the best region estimation for the probability mass (unlike the previous examples). This is likely because the low approximation ratio de-emphasizes the tails of the posterior and non-Gaussian behavior typically is manifested in the tails. This shows that there are, perhaps surprisingly, examples where taking approx_ratio < 1 is useful.

Figure 10. Benchmarking region estimators for the Gaussian hyperparameter model (section 7.3) using n = 2000 particles, $\omega \sim \mathcal {N}(\mu ,\sigma ^2)$ where $\mu \sim \mathcal {N}(0.5,0.001^2)$ and $\sigma ^2\sim \mathcal {N}(0.0025,0.0025^2)$ . (a) Probability mass contained by region estimators. (b) Comparison of estimated and true model variances.

Hyperparameters are not typically a quantity of interest by themselves; they are usually relevant because they parameterize the distribution of the unknown parameter. Following equation (8), we calculate ${\rm Var}(\hat {\omega })$ as

$${\rm Var}(\hat{\omega}) = \mathbb{E}_{\boldsymbol{y}}[\sigma^2] + {\rm Var}_{\boldsymbol{y}}(\mu).$$

Figure 10(b) compares Var(ω) to the variance parameter σ2. We see that the estimate of the variance of the unknown frequency that is inferred from our region estimate of the hyperparameters systematically over-estimates the 'true' variance of ω on average; this bias vanishes as the number of experiments increases. We can therefore conclude that we can use the method of hyperparameters to robustly estimate the distribution of an unknown frequency, even in the presence of noise.

8.5. Computational cost

Another way that we can assess the cost of inferring the Hamiltonian of a system is in terms of the classical computing time needed to learn the Hamiltonian parameters to within a fixed error tolerance (as measured by the number of likelihood calls made). Our previous discussion found that the experimental time (measured by the number of experiments) can be minimized by choosing measurements that minimize the risk, and showed that increasingly sophisticated heuristics for generating these guesses tended to reduce the experimental time. This suggests that a trade-off may be present between the experimental time and the classical processing time needed to learn the parameter. This tradeoff will become increasingly relevant as the size of the quantum system grows, since existing quantum simulation techniques do not scale efficiently with the number of particles in the system and thus the cost of performing a likelihood call may asymptotically become much more expensive than performing an experiment.

If computational time is of primary importance (rather than experimental time), then the relative merits of the experimental design heuristics change. In total, our data sets in figures 4 and 11 required (on average) a number of likelihood calls that fell within the range [1.05 × 107, 1.5 × 109]. A likelihood call required the evaluation of $\exp ({-{t}/{T_2}})\cos ^2\left (\frac \omega 2 t\right )+({1-\exp ({-{t}/{T_2}})})/2$ , which required time on the order of $10^{-7}$ seconds on our computers and led to total computational times that were on the order of a second to a minute. If the rate at which experiments can be performed were much faster than 200 Hz, then the utility of our algorithm as a means to speed up data collection may be lost. If the two rates are approximately comparable, then interesting trade-offs appear between the computational time needed and the total experimental time.

Figure 11. This figure compares the mean-square error as a function of the computational time for the known T2 model with T2 = 100, 5000 particles, approx_ratio = 1 and guessed experimental times chosen randomly from an exponential distribution with mean T2. The expected loss incurred by each optimization strategy is shown in the left panel; the right panel shows the 84th percentile $Q_{0.84}$ of the loss, such that no more than 16% of trials incur loss greater than the shown percentile.

Figure 12. The likelihood function for N = 1, 51, 101, 151 and 201 simulated measurements at random times in $(0,20\pi )$. The model is that given in equation (10). The red dot (and red arrow) is the randomly chosen true parameter x = (ω,T2). The blue dots are the n = 100 SMC 'particles'.

These trade-offs become apparent by plotting the scaling of the MSE as a function of the computational time for the randomized guess heuristic in figure 11. The most obvious feature of the plot is that the strategies which yielded the lowest MSE per experiment tend to yield the highest MSE per likelihood call, although several of these strategies cause the expected loss (mean-square error) to saturate after a finite number of experiments. In particular, this saturation causes the strategy with 30 guesses and no optimization, as well as the strategy with 30 guesses, NCG optimization and approx_ratio = 0.1, to intersect the curve for the cases with NCG optimization and approx_ratio = 1. On the surface, this seems to indicate that the more expensive heuristics may have an advantage when small loss is desired, but this is misleading: to get a complete picture, we need to look at more than just the expected performance of the strategies.

We can get a better understanding of this saturation by looking at the plot of the 84th percentile of the loss in figure 11, which shows that all of these strategies continue to provide improved estimates of ω even into this regime of saturation for at least 84% of the trials considered. This indicates that there were a few trials in which very poor guesses were chosen and the algorithm became stuck at a large MSE. The data also suggest that the use of NCG and a large value of the approximation ratio can mitigate these problems, making the learning algorithm more stable at the price of requiring more computational time.

There are many strategies that can be employed to reduce the computational time required by our algorithm. Firstly, since the calculation of each guess is independent of the other guesses, this task can be trivially parallelized on a cluster with very little communication between the nodes. Additionally, because the simulations do not need to be high precision for the method to be successful, a single-precision implementation of the simulation step could be used to improve performance in circumstances where the quantum simulation used to evaluate the likelihood function is computationally expensive. Such simulations have been demonstrated using graphical processing units [55] and field-programmable gate arrays (FPGAs) [56]; the latter is of particular interest due to the use of FPGA devices in the control of quantum information processing systems.
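
A minimal sketch of the first strategy, using only the Python standard library, is shown below. The risk function here is a crude stand-in (the variance of the outcome probabilities over the particles), not the Bayes risk actually used by our algorithm, and all names are hypothetical:

    import numpy as np
    from multiprocessing import Pool

    # Fixed particle approximation to the posterior over omega (seeded so
    # that all worker processes see the same particles).
    OMEGAS = np.random.default_rng(0).uniform(0.0, 1.0, size=2000)

    def risk_of_guess(t):
        # Stand-in risk: experiments whose outcome probabilities vary
        # little over the particles are unlikely to be informative.
        return -np.var(np.cos(OMEGAS * t / 2.0) ** 2)

    def best_guess(guesses, n_workers=4):
        """Score candidate experiment times independently, in parallel."""
        with Pool(n_workers) as pool:
            risks = pool.map(risk_of_guess, guesses)
        return guesses[int(np.argmin(risks))]

    if __name__ == "__main__":
        candidates = np.random.default_rng(1).exponential(100.0, size=30)
        print(best_guess(list(candidates)))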

9. Conclusions

Our work provides a simple algorithm that applies Bayesian inference to learn a Hamiltonian in an online fashion; that is to say, our algorithm learns the Hamiltonian parameters as the experiment proceeds, rather than collecting data and inferring the Hamiltonian through post-processing. This eliminates the need to store and process the gigabytes of data that are recovered from even relatively short experiments. Our work has several advantages over existing approaches to learning Hamiltonian parameters. Firstly, it can be used to estimate the optimal parameterization of the dynamics of an arbitrary quantum system within a space of model Hamiltonians. Secondly, it can be used to provide a region estimate of the Hamiltonian parameters; this allows us not only to learn the unknown parameters but also to quantify our uncertainty in them. Thirdly, our analysis of the algorithm shows a clear trade-off between the experimental time and the computational time needed to parameterize the Hamiltonian.

We illustrated these advantages by benchmarking our algorithm's performance for a number of computationally tractable Hamiltonian models that involve a qubit precessing with an unknown frequency in the presence of decoherence (finite T2 time). Our results showed that the scaling of the mean-square error of an unknown frequency with the number of experiments approaches the BCRB, which is known to be optimal for a given set of experimental designs, irrespective of whether T2 is known. In contrast, other methods for learning an unknown frequency require a well characterized T2 time as a prerequisite. Our work shows that we can learn the unknown frequency and the unknown T2 simultaneously, which has obvious advantages if data collection is slow.

Perhaps most importantly, our algorithm also provides region estimates of the unknown parameters of the Hamiltonian. That is to say, at the end of the experiments we have an estimate not only of the values of the unknown parameters but also of our uncertainty in those values. We showed that the method is capable of providing region estimates that contain the actual unknown parameters with probability approximately 99.7%. Our algorithm was also used to construct similar confidence intervals for cases where the unknown Hamiltonian parameters were not constant between experiments but instead fluctuated according to an unknown distribution. These tests showed that our algorithm is robust, and that it is valuable even if the cost of performing experiments is minimal.

Finally, we compared the costs of our algorithm in terms of experimental and computational time. We found that the heuristics that reduced the experimental time most significantly often required more computational time to reach a desired level of accuracy. Conversely, we found that many of the computationally inexpensive heuristics failed to reduce the mean-square error after a finite number of experiments. This suggests that the relative merits of different heuristics change as the relative costs of computational and experimental time, and the precision with which the unknown parameters should be estimated, vary.

An obvious extension of our work would be to consider more advanced optimization heuristics than conjugate gradient searches (such as particle swarm optimization algorithms). Similarly, more advanced resampling techniques may lead to substantial reductions in the number of particles, which in turn would reduce the computational cost of the algorithm. Finally, estimates of how the number of experiments required to achieve a specific mean-square error scales with the number of unknown parameters would be an important extension of this work, since it would assess the viability of these techniques for controlling and characterizing larger quantum systems.

Acknowledgments

This work was financially supported by the Canadian government through NSERC and CERC and by the United States government through DARPA. NW would like to acknowledge funding from USARO-DTO.

Footnotes

  • See [11] and references therein.

  • It is interesting to note, however, that online experimental design may be unnecessary (asymptotically requiring only one adaptive step) for the particular state tomography problem considered there [20].

  • This type of algorithm has been given some attention recently for 'state space models' and classical signals with additive noise [37].

  • See, for example, [12] for a recent tutorial on these methods.

  • For the numerical experiments, algorithm 7 was implemented using version 0.9.0 of the SciPy package [57] with Python versions 2.7.1, 2.7.2 and 2.7.3.
