
Quantum model averaging

Published 24 September 2014 © 2014 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft
Citation: Christopher Ferrie 2014 New J. Phys. 16 093035. DOI: 10.1088/1367-2630/16/9/093035


Abstract

Standard tomographic analyses ignore model uncertainty. It is assumed that a given model generated the data and the task is to estimate the quantum state, or a subset of parameters within that model. Here we apply a model averaging technique to mitigate the risk of overconfident estimates of model parameters in two examples: (1) selecting the rank of the state in tomography and (2) selecting the model for the fidelity decay curve in randomized benchmarking.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Parameter estimation is an integral part of physics. Accurate estimates of physical parameters in quantum mechanical models allow for precision quantum control [1, 2], which enables practical goals such as quantum computation [3] and quantum metrology [4], which in turn can provide probes of fundamental physics such as gravitational wave detection [5].

Quantum parameter estimation shares many similarities with its classical counterpart. But there are many subtle and peculiar differences. Even in single parameter estimation, quantum metrology [4, 7] shows that we can obtain advantages from quantum resources such as squeezed states of light [9] and entanglement [10]. With such rich structure, many new subtleties [11], considerations [12–14] and generalities [15–17] can arise, including entirely new approaches to estimation [18–21] and verification [22–24].

At the other end of the parameter spectrum, the quantum state of a physical system is our most complete description of it. Thus estimation of quantum states [6] might be seen as the most ambitious form of estimation. There are many approaches to the general problem [25–28], some of which specialize in computational efficiency [29–32] and others which go beyond point estimates to estimation of regions [33–38].

All of the above mentioned results assume a model. That is, it was taken as given that a particular parametric distribution generated the data. But what if this assumption is not correct? The effect of systematic measurement errors, for example [39], can compromise the security of quantum key distribution protocols [40]. Schemes which guard against measurement errors go by the name of self-consistent tomography [41–46]. A recently investigated alternative is the application of classical approaches to model selection, which have been used in a variety of experimental [51, 52] and theoretical [47–50] works. Here we supplement these results with a new approach to parameter estimation: model averaging. The technique shares many similarities with model selection (indeed, model selection is a crucial component of model averaging) but goes beyond it. Although model selection adds an additional layer of security against overconfident estimation, selecting a single model can itself be a red herring, for the most probable model might be only slightly more probable than the others.

The approach considered here combines Bayesian parameter estimation with Bayesian model selection, such that the final estimate of the parameters is the best value of the parameters within each model, averaged over the probability assigned to each model. We will show that such an approach can reduce the error incurred by first selecting a model—which has some probability of being incorrect—then selecting parameters within that model. In fact, the numerical experiments presented here show that model average estimation always does better than the estimates from incorrect models and in some scenarios can perform better than even estimates from the correct model. This is due to the additional hedging afforded by considering multiple models, all of which carry some information.

The paper is organized as follows. In section 2 we outline the problem and review the model selection techniques used so far. In section 3, we present the full Bayesian approach to model selection and define the model average estimate (MAE) of parameters. Section 4 presents two distinct examples where the MAE provides an advantage for parameter estimation. We conclude with a discussion in section 5.

2. General problem and common methods

In section 2.1 we give an overview of the problem. In sections 2.2 and 2.3 we review the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are by far the most commonly used approaches to model selection (and the only ones used thus far for quantum estimation). These sections are included for reference and completeness.

2.1. Problem setup

Let us begin with the base problem of parameter estimation. A physical model prescribes the probabilities for the outcomes of experiments: ${\rm Pr} (D|{\boldsymbol{x}} ;C,M)$. Here D is some hypothetical or observed data, ${\boldsymbol{x}} $ is a vector of real parameters in ${{\mathbb{R}}^{d}}$, where d is the dimension of the model, C is the experimental context (see footnote 1) and M is our model. For example, we could have the model M be that of a qubit which is parameterized by a Bloch vector ${\boldsymbol{x}} =(x,y,z)$. The experimental context could be a measurement basis, say that of σx. Then, the quantum mechanical model prescribes ${\rm Pr} (\pm |{\boldsymbol{x}} ;C,M)=(1\pm x)/2$. In quantum mechanics this is called the Born rule and in statistics, the likelihood function.
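As a concrete illustration of this single-qubit likelihood, here is a minimal sketch in Python; the function name and the example values are ours, chosen purely for illustration.

```python
def likelihood_sigma_x(outcome, bloch):
    """Born-rule probability Pr(outcome | x; C = sigma_x measurement, M = qubit)."""
    x, _, _ = bloch
    return (1 + outcome * x) / 2

# A state with Bloch x-component 0.4 gives Pr(+1) = 0.7 and Pr(-1) = 0.3.
print(likelihood_sigma_x(+1, (0.4, 0.1, 0.2)))  # 0.7
```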

Going from parameters to the probability of data is a deductive process—the model gives us numerical values of ${\rm Pr} (D|{\boldsymbol{x}} ;C,M)$. The experiment, on the other hand, gives us a particular data set D—but what we really want is ${\boldsymbol{x}} $. This is an example of an inverse problem. What the Bayesian solution provides is ${\rm Pr} ({\boldsymbol{x}} |D;C,M)$, the distribution of ${\boldsymbol{x}} $ given the data. Although we lack certainty about ${\boldsymbol{x}} $, we can accurately (read: quantitatively, mathematically) describe our state of knowledge of ${\boldsymbol{x}} $ given we have seen the data D.

The formal path forward is through Bayes' rule

$${\rm Pr} ({\boldsymbol{x}} |D;C,M)=\frac{{\rm Pr} (D|{\boldsymbol{x}} ;C,M)\,{\rm Pr} ({\boldsymbol{x}} |C,M)}{{\rm Pr} (D|C,M)}.\qquad(1)$$

From a chronological point of view relative to D, we begin with the prior ${\rm Pr} ({\boldsymbol{x}} |C,M)$, which encodes the information we have about ${\boldsymbol{x}} $ prior to learning about D. We weight the prior by the likelihood function and normalize by the marginal likelihood (see footnote 2)

$${\rm Pr} (D|C,M)={{\mathbb{E}}_{{\boldsymbol{x}} |C,M}}\left[{\rm Pr} (D|{\boldsymbol{x}} ;C,M)\right].\qquad(2)$$

The distribution produced as a result of Bayes' rule, ${\rm Pr} ({\boldsymbol{x}} |D;C,M)$, is called the posterior and represents our knowledge of ${\boldsymbol{x}} $ after the data has been observed.
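A minimal numerical sketch of this update for the single-qubit example above, discretizing the parameter x on a grid; the grid size and the short data record are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

xs = np.linspace(-1, 1, 201)                 # grid of candidate Bloch x-components
prior = np.ones_like(xs) / len(xs)           # flat prior Pr(x | C, M)

def bayes_update(dist, outcome):
    """One application of Bayes' rule (1) for a sigma_x outcome of +1 or -1."""
    likelihood = (1 + outcome * xs) / 2      # Pr(D | x; C, M)
    unnormalized = likelihood * dist
    marginal = unnormalized.sum()            # marginal likelihood Pr(D | C, M), cf. (2)
    return unnormalized / marginal

posterior = prior
for outcome in [+1, +1, -1, +1]:             # a short, made-up data record
    posterior = bayes_update(posterior, outcome)
print(xs @ posterior)                        # posterior mean of x
```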

At this point we could call the problem solved. This, however, assumes the model is correct. If the model is suspect, then we have the meta-problem of determining the best model.

2.2. AIC

The most commonly used model selection technique (see footnote 3) is the AIC, which arises as follows. First, we suppose there is some true model, M = T, giving a distribution ${\rm Pr} (D^{\prime} |T;C)$. We quantify the discrepancy between this model and our candidate model via the Kullback–Leibler divergence

$${\rm KL}(T\parallel M)={{\mathbb{E}}_{D^{\prime} |T;C}}\left[{\rm log} \,{\rm Pr} (D^{\prime} |T;C)\right]-{{\mathbb{E}}_{D^{\prime} |T;C}}\left[{\rm log} \,{\rm Pr} (D^{\prime} |{\boldsymbol{x}} ;C,M)\right],\qquad(3)$$

for some appropriate choice of parameters ${\boldsymbol{x}} $. The 'best' model, then, is the one that minimizes ${\rm KL}(T\parallel M)$, which is equivalent to maximizing the second term in (3) since the first term only depends on the true model. However, we do not know the true model and we do not know the best set of parameters within our candidate model. The latter problem is naturally addressed by collecting data D and producing an estimate of the parameters ${\boldsymbol{\hat{x}}} (D)$, then averaging over possible data sets such that the quantity of interest becomes

$${{\mathbb{E}}_{D}}\left[{{\mathbb{E}}_{D^{\prime} |T;C}}\left[{\rm log} \,{\rm Pr} (D^{\prime} |\hat{{\boldsymbol{x}}}(D);C,M)\right]\right].\qquad(4)$$

Akaike showed that, independent of the true model (and under some regularity conditions), an unbiased estimator of this quantity is

$${\rm AIC}(M)={\rm log} \,{\rm Pr} (D|\hat{{\boldsymbol{x}}}(D);C,M)-d,\qquad(5)$$

where d is the number of dimensions of the model (the number of free parameters). The preferred model is the one with the largest value of AIC $(M)$. The simple linear penalization with dimension makes it clear how models with more parameters are penalized.
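As a sketch, the criterion can be computed directly from each model's maximized log-likelihood; the numbers below are hypothetical and the sign convention follows the text (larger AIC is preferred).

```python
def aic(max_log_likelihood, d):
    """AIC as in equation (5): maximized log-likelihood penalized by the dimension d."""
    return max_log_likelihood - d

# Hypothetical fits: the larger model fits slightly better but pays a larger penalty.
models = {"M_small": (-120.0, 3), "M_large": (-119.5, 5)}
scores = {name: aic(ll, d) for name, (ll, d) in models.items()}
print(max(scores, key=scores.get))  # 'M_small'
```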

2.3. BIC

The Bayesian approach is more general and, being so, it is less obvious how it penalizes complex models. Here we show how an asymptotic approximation leads to a form similar to the AIC. First, we write the marginal likelihood (2) as the integral expectation

$${\rm Pr} (D|C,M)=\int {\rm Pr} (D|{\boldsymbol{x}} ;C,M)\,{\rm Pr} ({\boldsymbol{x}} |C,M)\,{\rm d}{\boldsymbol{x}}.\qquad(6)$$

We will approximate this integral using Laplace's method. To this end, consider the Taylor expansion of the log of the likelihood function about its peak (note that for brevity, we have dropped the context C and model M from the conditionals)

$${\rm log} \,{\rm Pr} (D|{\boldsymbol{x}} )\approx {\rm log} \,{\rm Pr} (D|\hat{{\boldsymbol{x}}})+\frac{1}{2}({\boldsymbol{x}} -\hat{{\boldsymbol{x}}})^{{\rm T}}\left[\nabla {{\nabla }^{{\rm T}}}{\rm log} \,{\rm Pr} (D|\hat{{\boldsymbol{x}}})\right]({\boldsymbol{x}} -\hat{{\boldsymbol{x}}}).\qquad(7)$$

If we assume the number of measurements $N\to \infty $, the law of large numbers gives us

Equation (8)

Equation (9)

Equation (10)

Equation (11)

where Ij is the Fisher information of the jth measurement and I is the arithmetic average of these values. Then, the integral (6) becomes

Equation (12)

Equation (13)

Now we take the logarithm to obtain

Equation (14)

If we ignore the terms not changing with N, we have a new quantity

$${\rm BIC}(M)={\rm log} \,{\rm Pr} (D|\hat{{\boldsymbol{x}}};C,M)-\frac{d}{2}\,{\rm log} \,N,\qquad(15)$$

which is the well-known BIC. Notice the striking similarity to the AIC (5). Since the two are nearly equivalent in form, the BIC is often considered alongside the AIC. Next, we will consider the full Bayesian solution, which will allow us to obtain more accurate estimates of parameters averaged over the competing models.
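For comparison, a minimal sketch of the BIC under the same convention; note how the penalty grows with the number of measurements N (the numbers are again hypothetical).

```python
import numpy as np

def bic(max_log_likelihood, d, N):
    """BIC as in equation (15): the dimension penalty scales as (d/2) log N."""
    return max_log_likelihood - 0.5 * d * np.log(N)

print(bic(-119.5, d=5, N=1000))  # approx -136.8
print(bic(-120.0, d=3, N=1000))  # approx -130.4: the smaller model is preferred
```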

Recent proposals have used the AIC/BIC on both simulated [47–50] and experimental data [51, 52]. In [50], however, the authors caution against its unattended use. The argument against the AIC for quantum states, for example, is simple. The AIC is derived from a metric which measures the closeness of models in their predictive probability—certainly a well-motivated measure. However, such a measure is only useful if all future measurements will be the same as those used to perform the data analysis. In quantum theory, by contrast, one can measure copies of a quantum system in some fixed set of bases, estimate the state, then use that estimate to predict the outcome of a measurement in an entirely new basis. Thus, one ought to consider a measure on predictive distributions maximized over all possible future measurements. As suggested in [50], such a measure might well be the quantum relative entropy, for example. Here we avoid these problems by considering the full Bayesian solution, which we describe next.

3. Bayesian model selection and averaging

3.1. MAE

Within the Bayesian framework, the model selection approach is no different from that for parameter estimation. Rather than focus on the distribution ${\rm Pr} ({\boldsymbol{x}} |D;C,M)$, we first consider ${\rm Pr} (M|D;C)$. Using Bayes' rule we have

$${\rm Pr} (M|D;C)=\frac{{\rm Pr} (D|M;C)\,{\rm Pr} (M|C)}{{\rm Pr} (D|C)}.\qquad(16)$$

Often, practitioners of Bayesian methods go one step further and compare two models—say, M1 and M2—by taking the ratio of these posteriors

$$\frac{{\rm Pr} ({{M}_{1}}|D;C)}{{\rm Pr} ({{M}_{2}}|D;C)}=\frac{{\rm Pr} (D|{{M}_{1}};C)}{{\rm Pr} (D|{{M}_{2}};C)}\,\frac{{\rm Pr} ({{M}_{1}}|C)}{{\rm Pr} ({{M}_{2}}|C)},\qquad(17)$$

noticing the normalization factor cancels. This quantity is called the posterior odds ratio and was first considered by Jeffreys [55]. Clearly, if the posterior odds ratio is larger than 1, we favor M1. The last fraction is called the prior odds ratio, and the unbiased choice favoring neither model is to set this term equal to 1. This leaves us with

$$\frac{{\rm Pr} (D|{{M}_{1}};C)}{{\rm Pr} (D|{{M}_{2}};C)},\qquad(18)$$

which is called the Bayes factor [56]. Each quantity in the ratio is the marginal likelihood (2) of its respective model.

For a discrete number of hypothetical models $\{{{M}_{k}}\}$, and assuming one model must be chosen, the optimal strategy is to compute the marginal likelihood of each model and select the one with the highest value. Model selection of this type has been used in quantum mechanical problems of Hamiltonian finding [57] and estimating error channels from syndrome measurements [58].

Here we will investigate the idea of using a meta-model, which is an average over those in $\{{{M}_{k}}\}$. Assume that we are interested in some subset of parameters ${\boldsymbol{y}} $ common to all models. Given we have taken data D and computed each model posterior ${\rm Pr} ({{M}_{k}}|D;C)$, we define the MAE as

$${{\hat{{\boldsymbol{y}}}}_{{\rm MAE}}}(D)=\sum\limits_{k}{\rm Pr} ({{M}_{k}}|D;C)\,{{\mathbb{E}}_{{\boldsymbol{y}} |D;C,{{M}_{k}}}}\left[{\boldsymbol{y}} \right].\qquad(19)$$

In words, this is the average (over models) of the average (over parameters within models). Variants of this approach are referred to as Bayesian model averaging [53].
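A minimal sketch of equation (19) for a single shared parameter y: given each model's marginal likelihood and within-model posterior mean, the MAE is the model-probability-weighted average. All numbers below are hypothetical, and a flat prior over models is assumed.

```python
import numpy as np

marginal_likelihoods = np.array([2.0e-40, 5.0e-40, 1.0e-41])  # Pr(D | M_k; C)
posterior_means_y    = np.array([0.952, 0.948, 0.960])        # E[y | D; C, M_k]

# Equation (16) with a flat prior over models: normalize the marginal likelihoods.
model_probs = marginal_likelihoods / marginal_likelihoods.sum()

# Equation (19): the average (over models) of the average (over parameters within models).
y_mae = model_probs @ posterior_means_y
print(model_probs, y_mae)
```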

Before moving to our examples, a few comments are in order. First, it is not necessary that the models are 'nested' in the sense that we can order them into supersets. The only requirement is that the parameters of interest are included in each model. The other parameters in each model we might call nuisance parameters. Part of the appeal of the Bayesian approach is that these parameters are automatically dealt with and we can focus on those parameters which are of immediate interest. Of course, the nuisance parameters can be inferred as well.

Let us wax philosophical for a moment. What is being proposed is to select a meta-model, an average over many different physical models. This may seem awkward for physical theories—after all, there is only one true model, right? Not in our view. Models are human constructs, those Platonic ideals which describe another world, a world we were clever enough to find through our mastery of mathematics and abstraction. Here, we have dropped the idea that the point is to find the capital-T-truth. Rather, we measure our understanding of nature through our ability to predict and control its behavior. By averaging physical models, we can show that this idea has merit.

3.2. Sequential Monte Carlo (SMC)

In practice, the Bayesian update rule and the expectations required in the equations above are analytically and computationally intractable since they involve complicated integrals over multidimensional parameter spaces which may include solutions to equations of motion which are themselves intractable. To perform the calculations we turn to Monte Carlo techniques. Our numerical algorithm fits within the subclass of Monte Carlo methods called SMC or particle filtering [59].

The SMC procedure prescribes that we approximate the probability distribution by a weighted sum of Dirac delta-functions

$${\rm Pr} ({\boldsymbol{x}} |D;C,M)\approx \sum\limits_{j=1}^{n}{{w}_{j}}\,\delta ({\boldsymbol{x}} -{{{\boldsymbol{x}} }_{j}}),\qquad(20)$$

where the weights at each step are iteratively calculated from the previous step via

$${{w}_{j}}\mapsto {{w}_{j}}\,{\rm Pr} (D|{{{\boldsymbol{x}} }_{j}};C,M),\qquad(21)$$

followed by a normalization step. The elements of the set $\{{{{\boldsymbol{x}} }_{j}}\}_{j=1}^{n}$ are called particles. Here, $n=|\{{{{\boldsymbol{x}} }_{i}}\}|$ is the number of particles and controls the accuracy of the approximation. Like all Monte Carlo algorithms, the SMC algorithm approximates expectation values, such that

$${{\mathbb{E}}_{{\boldsymbol{x}} |D;C,M}}\left[f({\boldsymbol{x}} )\right]\approx \sum\limits_{j=1}^{n}{{w}_{j}}\,f({{{\boldsymbol{x}} }_{j}}).\qquad(22)$$

In other words, SMC allows us to efficiently compute multidimensional integrals with respect to the measure defined by the probability distribution.

The resultant posterior probability provides a full specification of our knowledge. However, in most applications, it is sufficient—and certainly more efficient—to summarize this distribution. In our context, the optimal single parameter vector to report is the mean of the posterior distribution

$$\hat{{\boldsymbol{x}}}={{\mathbb{E}}_{{\boldsymbol{x}} |D;C,M}}\left[{\boldsymbol{x}} \right]\approx \sum\limits_{j=1}^{n}{{w}_{j}}\,{{{\boldsymbol{x}} }_{j}}.\qquad(23)$$

The SMC approximation can also provide efficient calculation and description of regions [38, 60]. For our purpose, we also require the SMC approximation to give an accurate and efficient estimate of the marginal likelihood, equation (6), which we need to calculate Bayes' rule at the level of models, equation (16). Via the SMC approximation, the integral expectation in the definition of the marginal likelihood, equation (6), is

$${\rm Pr} (D|C,M)\approx \sum\limits_{j=1}^{n}{{w}_{j}}\,{\rm Pr} (D|{{{\boldsymbol{x}} }_{j}};C,M).\qquad(24)$$

It is not immediately obvious, but is easy to see in hindsight, that this is exactly the normalization that must already be computed in the SMC algorithm after the weight update, equation (21), is applied. By storing this value, we can apply Bayes' rule at the meta-level of models, equation (16).
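The following is a stripped-down sketch of the weight update (21), the per-datum normalization that accumulates the marginal likelihood (24), and the posterior mean (23), applied to the single-qubit σx example from section 2.1. Resampling and the other stability machinery of [60, 61] and of the packaged implementation [64, 65] are deliberately omitted, and the data are simulated from an assumed true value of x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
particles = rng.uniform(-1, 1, size=n)       # particle positions for the parameter x
weights = np.full(n, 1 / n)                  # uniform prior weights

log_marginal = 0.0                           # accumulates log Pr(D | C, M)
data = rng.choice([+1, -1], size=50, p=[0.7, 0.3])   # simulated outcomes for x = 0.4
for outcome in data:
    likelihood = (1 + outcome * particles) / 2
    weights = weights * likelihood           # weight update, equation (21)
    norm = weights.sum()                     # marginal likelihood of this datum, cf. (24)
    log_marginal += np.log(norm)
    weights = weights / norm                 # normalization step

x_hat = weights @ particles                  # posterior mean, equation (23)
print(x_hat, log_marginal)
```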

An iterative numerical algorithm such as SMC requires care to ensure stability. Conditions for stability of the algorithm and the specifications of an implementation have been detailed elsewhere [60, 61]. The SMC algorithm has now been used in many quantum mechanical parameter estimation problems [38, 57, 60–63] and a software implementation (the one used here) is available as a Python package [64, 65].

4. Examples

4.1. Rank selection

We consider first an example similar to Guta, Kypraios and Dryden [52]: n qubits subjected to random Pauli measurements where the models under consideration are those of differing rank. Each model will be denoted Mr, where r is the rank of the unknown quantum state so that the dimension of model Mr is $d={{2}^{n+1}}r$. We generate unknown quantum states with fixed rank r as follows [66]. Begin with a matrix $X\in {{\mathbb{C}}^{{{2}^{n}}\times r}}$ where each component xij is chosen independently according to $\Re ({{x}_{ij}})\sim \mathcal{N}(0,1)$ and $\Im ({{x}_{ij}})\sim \mathcal{N}(0,1)$—standard Normal distributions. Then define the rank r density operator

$$\rho =\frac{X{{X}^{\dagger }}}{{\rm Tr}\left(X{{X}^{\dagger }}\right)}.\qquad(25)$$

If $r={{2}^{n}}$, this construction is equivalent to a Hilbert-Schmidt random density matrix. The model parameters will be the vectorization of the matrix X: ${\boldsymbol{x}} ={\rm vec}(X)$.
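A sketch of this construction, together with the Pauli measurement probability used below; the helper names and the particular measured Pauli are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rank_r_state(n_qubits, r):
    """Equation (25): X has i.i.d. standard-normal real and imaginary parts."""
    dim = 2 ** n_qubits
    X = rng.standard_normal((dim, r)) + 1j * rng.standard_normal((dim, r))
    rho = X @ X.conj().T
    return rho / np.trace(rho)

def prob_plus(rho, pauli):
    """Pr(+1 | sigma_k) = (1 + <sigma_k>) / 2 for a Pauli observable sigma_k."""
    return (1 + np.real(np.trace(pauli @ rho))) / 2

sigma_x = np.array([[0, 1], [1, 0]])
sigma_z = np.array([[1, 0], [0, -1]])
rho = random_rank_r_state(n_qubits=2, r=2)
print(prob_plus(rho, np.kron(sigma_x, sigma_z)))
```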

We label the single qubit Pauli operators $\{{{\sigma }_{0}},{{\sigma }_{1}},{{\sigma }_{2}},{{\sigma }_{3}}\}$ and the multi-qubit Paulis by

$${{\sigma }_{k}}={{\sigma }_{{{k}_{1}}}}\otimes {{\sigma }_{{{k}_{2}}}}\otimes \cdots \otimes {{\sigma }_{{{k}_{n}}}},\qquad(26)$$

where $k={{k}_{1}}+4{{k}_{2}}+{{4}^{2}}{{k}_{3}}+\cdots +{{4}^{n-1}}{{k}_{n}}$. Since each Pauli squares to the identity, $\sigma _{k}^{2}=\mathbb{1}$, each individual measurement has two possible outcomes, which we label $d\in \{+1,-1\}$ for the $+1$ and $-1$ eigenvalues. Then the likelihood function of a single measurement can be related to the expectation value via ${\rm Pr} (\pm 1|{{\sigma }_{k}})=(1\pm \langle {{\sigma }_{k}}\rangle )/2$. Using the properties of the vec operation, we can write this as an explicit function of ${\boldsymbol{x}} $

Equation (27)

We label the parameter vector within the rank r model Mr by ${{{\boldsymbol{x}} }_{r}}$ and the associated density matrix given by equation (25) by ρr. Within each model, there is not a one-to-one correspondence between ${{{\boldsymbol{x}} }_{r}}$ and ρr—different vectors will yield the same density matrix. Since we will be interested in obtaining accurate estimates of density matrices as quantified by some norm on the space of density matrices, we will average the models over their ρr's rather than their ${{{\boldsymbol{x}} }_{r}}$'s.

Within each model, then, we have the mean density matrix (recall D is data, C is the context—that is, which Paulis were measured)

$${{\hat{\rho }}_{r}}={{\mathbb{E}}_{{{{\boldsymbol{x}} }_{r}}|D;C,{{M}_{r}}}}\left[{{\rho }_{r}}\right].\qquad(28)$$

Explicitly, the MAE in equation (19) is

$${{\hat{\rho }}_{{\rm MAE}}}=\sum\limits_{r}{\rm Pr} ({{M}_{r}}|D;C)\,{{\hat{\rho }}_{r}}.\qquad(29)$$

We assume there is a true density matrix ${{\rho }_{{\rm t}}}$ and we judge each estimate of the true state $\hat{\rho }$ by its spectral distance to ${{\rho }_{{\rm t}}}$

$$\Delta (\hat{\rho },{{\rho }_{{\rm t}}})={{\sigma }_{{\rm max} }}\left(\hat{\rho }-{{\rho }_{{\rm t}}}\right),\qquad(30)$$

where ${{\sigma }_{{\rm max} }}$ denotes the largest singular value. This is the norm induced by the usual Euclidean norm on vectors and, without giving an operational meaning to the states identified in our toy example, it seems the most convenient.
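A one-line sketch of this distance in NumPy; the example states are arbitrary.

```python
import numpy as np

def spectral_distance(rho_hat, rho_true):
    """Equation (30): largest singular value of the difference."""
    return np.linalg.norm(rho_hat - rho_true, ord=2)

rho_a = np.diag([1.0, 0.0])   # |0><0|
rho_b = np.diag([0.5, 0.5])   # maximally mixed qubit
print(spectral_distance(rho_a, rho_b))  # 0.5
```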

The data for two qubits are presented in figure 1. The important take-away is that the MAE does as well as, or slightly better than, the true model in every case. Also, we see that it is quite easy to identify rank 1 models (pure states) as well as rank 4 models (full rank states), while it seems difficult to identify states of non-extreme rank. Notice that it is extremely difficult to distinguish the rank 3 model from the full rank model when the former defines the underlying truth. However, both models perform well with respect to the error in the estimated parameters and the MAE does best, on average.

Figure 1.

Figure 1. The performance of Bayesian model selection and the model average estimate for two qubits. Each box represents the data for its labeled rank as the 'true' model. The lines represent the median of the data and, where present, the shaded areas are the interquartile ranges. Each 'measurement' on the horizontal axis corresponds to 100 experiments of a randomly chosen Pauli measurement. In the SMC algorithm, $10^4$ particles were used in each model. For each true rank, 100 simulated states were generated and measured. The most important things to note here are that the model average estimator does as well as the true model and much better than models whose rank is further away from the true model. Also, we notice that rank one and full rank states are much easier to distinguish than states with ranks two and three. This is explored in more detail for three qubits in figure 2.


To further illustrate the difficulty in differentiating high rank states, the probabilities assigned to the three qubit models are shown in figure 2. Again, we see that pure states and full rank states are correctly identified, yet it is difficult to correctly distinguish between rank 7 and rank 8 states, in the same way as it was difficult to distinguish rank 3 and rank 4 states for two qubits. We conclude that for the models considered here, it is easiest to correctly identify low rank and full rank states, while it is difficult to correctly identify states of nearly full rank.

Figure 2.

Figure 2. The performance of Bayesian model selection and the model average estimate for three qubits. Each box represents the data for its labeled rank as the 'true' model. The lines represent the median of the data and, where present, the shaded areas are the interquartile ranges. Each 'measurement' on the horizontal axis corresponds to 100 experiments of a randomly chosen Pauli measurement. In the SMC algorithm, $10^5$ particles were used in each model. For each true rank, ten simulated states were generated and measured. Note again that rank one states are easily distinguished. In contrast, it appears that—on average—much more data will be necessary to rule out full rank (that is, rank eight) states when the true rank is seven. Some speculation on why this might be is given in the discussion, section 5.


Notice also that models which are far away—in the sense that the ranks differ by relatively large amounts—are quickly ruled out. In these cases, since the SMC algorithm can be run online (in parallel with the experiment), the simulation of such models can be stopped to mitigate the computational difficulty of simultaneously simulating many quantum models. Still, tomography is at one extreme in the spectrum of methods for estimating quantum mechanical parameters—it is the most complete description of the physical system. At the other extreme is summarizing information from experiments into a single number, such as fidelity.

4.2. Randomized benchmarking

In the second example, we consider the experimental protocol of randomized benchmarking [67], which has been demonstrated in a variety of experimental settings [69–72] to efficiently characterize noise and quantum channels. The protocol consists of stringing together potentially long sequences of gates, each of which is then undone to determine whether the initial state has survived. In [67] the approach was shown to give, in expectation, an exponential decay $A{{p}^{m}}+B$, where A and B encode the errors in preparation and measurement, p is the bare survival probability and m is the length of the sequence. In the models we consider, p can be related to the average fidelity over a group of gates, but more specialized protocols exist [24, 73].

Typically, p is the only parameter of interest since it is directly related to the average fidelity of the device, which is then compared to some threshold. The other parameters are often considered nuisance parameters. In [74] it was shown that the decay can be interpreted as a probabilistic model where the binary outcome of each measurement sequence of length m has probability of survival (labeled 0)

$${\rm Pr} (0|m;p,{{A}_{0}},{{B}_{0}},{{M}_{0}})={{A}_{0}}{{p}^{m}}+{{B}_{0}},\qquad(31)$$

where the subscripts label this as the zeroth order model. In [67], a hierarchy of models was introduced because the zeroth order model assumes the errors in the gates within the sequence are independent. By dropping this assumption, a richer set of noise models can be studied [68]. Here we will study the zeroth order model and the first order model

$${\rm Pr} (0|m;p,{{A}_{1}},{{B}_{1}},{{C}_{1}},{{q}_{1}},{{M}_{1}})={{A}_{1}}{{p}^{m}}+{{B}_{1}}+{{C}_{1}}(m-1)({{q}_{1}}-{{p}^{2}}){{p}^{m-2}},\qquad(32)$$

where A1 and B1 again encode the preparation and measurement errors, C1 encodes the error on the final gate in the sequence and ${{q}_{1}}-{{p}^{2}}$ is a measure of the gate dependence in the errors.

For the zeroth order model, we take the prior to be a normal distribution with a mean vector $(p,{{A}_{0}},{{B}_{0}})=(0.95,0.3,0.5)$ and equal diagonal covariances given by a deviation of $\sigma =0.01$. Note that the first order model is equal to the zeroth order model when either ${{C}_{1}}=0$ or ${{q}_{1}}={{p}^{2}}$. In order not to make the models so different that it would be trivial to distinguish them, we look at two priors for the first order model which are close to the zeroth order model. The first is a normal distribution with a mean vector $(p,{{A}_{1}},{{B}_{1}},{{C}_{1}},{{q}_{1}})=(0.95,0.3,0.5,0.03,0.95)\;=:\;{{\mu }_{{\rm I}}}$ and equal diagonal covariances given by a deviation of $\sigma =0.01$; the second is slightly closer, with the same covariance matrix but mean vector $(p,{{A}_{1}},{{B}_{1}},{{C}_{1}},{{q}_{1}})=(0.95,0.3,0.5,0.02,0.92)\;=:\;{{\mu }_{{\rm II}}}$. Note that the difference between these two distributions in the relative entropy divergence is only 0.050, and so we might expect them to behave the same. Since both models are close to the zeroth order model, we expect it to be difficult to distinguish them.
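A minimal sketch of the two survival-probability models, equations (31) and (32), evaluated at the prior means quoted above; the sequence lengths and helper names are illustrative choices of ours.

```python
import numpy as np

def survival_zeroth(m, p, A0, B0):
    """Equation (31): zeroth order randomized benchmarking decay."""
    return A0 * p**m + B0

def survival_first(m, p, A1, B1, C1, q1):
    """Equation (32): first order model; reduces to (31) when C1 = 0 or q1 = p**2."""
    return A1 * p**m + B1 + C1 * (m - 1) * (q1 - p**2) * p**(m - 2)

m = np.array([10, 30, 50, 100, 200])
print(survival_zeroth(m, p=0.95, A0=0.3, B0=0.5))
print(survival_first(m, p=0.95, A1=0.3, B1=0.5, C1=0.03, q1=0.95))  # prior mean mu_I
```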

In figure 3 we simulate the models, noting again that, via the priors, they are very close. This intuition is quantified by the fact that the models are hard to distinguish, regardless of which is true. In the top row of figure 3, we see that the parameters in the first order model are so close to those in the zeroth order model that it is irrelevant which is chosen for the purpose of estimating p. However, what is 'close' can be deceiving, as we see in the bottom row of figure 3. Recall that the relative entropy from ${{\mu }_{{\rm I}}}$ to ${{\mu }_{{\rm II}}}$ is only 0.050 (which explains why they are so difficult to distinguish). In this case, the accuracy of the estimates of the parameter p depends crucially on which model is actually correct. In such cases, the MAE can be seen as providing a more conservative estimate of the average gate fidelity by hedging what is at best a 50/50 guess on which model is correct.

Figure 3.

Figure 3. The performance of Bayesian model selection and the model average estimate for the survival probability in randomized benchmarking experiments. Each box represents the data for its label as the 'true' model. The lines represent the median of the data and, where present, the shaded areas are the interquartile ranges. Each 'measurement' on the horizontal axis corresponds to a randomized benchmarking experiment with sequence lengths $\{10,30,50,\ldots ,200\}$ and 1000 repetitions per sequence length. In the SMC algorithm, $10^3$ particles were used for each model. The top row is labeled ${{\mu }_{{\rm I}}}$ and the bottom row ${{\mu }_{{\rm II}}}$, which refer to different prior probabilities over the first-order model parameters, as described in the main text. The prior over the zeroth order model parameters remains fixed. Both ${{\mu }_{{\rm I}}}$ and ${{\mu }_{{\rm II}}}$ represent priors which are 'close' to the prior over the zeroth order model parameters, which is why the models are hard to distinguish. Moreover, the priors ${{\mu }_{{\rm I}}}$ and ${{\mu }_{{\rm II}}}$ are themselves extremely close. The plots then illustrate one subtle and important fact: our intuitive notion of closeness of models does not translate to closeness in performance of estimates drawn from these models. The model average estimate can hedge the risk of selecting the wrong model.


5. Conclusion and discussion

We have introduced a Bayesian model averaging approach to estimating parameters in quantum mechanical models describing data. In the examples considered, the MAE performs as well as the unknown true model in most cases. In situations where models are difficult to distinguish, the MAE can slightly outperform the true model.

For the quantum state estimation example (section 4.1) we considered models of differing rank of the density matrix. Ranks which differ by large amounts from the true rank are rapidly ruled out—that is, the probability assigned to them quickly approaches zero. Thus, they contribute nothing to the MAE. On the other hand, ranks which are close to the true rank—especially when the true rank is high—are not so easily distinguished. This means that first selecting a rank and then performing estimation within that model can lead to overconfident estimates of the state. By averaging, we can allow those estimates to only contribute with the relative probability that we deem them to be true.

The mechanism for why higher rank states are hard to distinguish is as yet unclear. Although the SMC algorithm and the implementation used here have been extensively studied for a wide range of problems, it is possible that state tomography is an outlier. That is, some major modification to the modeling or to the algorithm itself may be required to obtain the best performance. This resolution would be less interesting than a physical explanation, such as that Pauli measurement tomography (which we have used in the example here) is not the optimal scheme for distinguishing rank. These questions are left for future work.

In the example of randomized benchmarking (section 4.2), we explored a situation where the presence of higher order perturbations was difficult to detect. The Bayesian model selection approach accurately predicts this by assigning roughly 50/50 probabilities to the models. Surprisingly, the 'closeness' of the models as measured by our ability to distinguish them does not translate into our ability to accurately estimate parameters common to each. In some cases we can do no more than guess which is correct, yet guessing the wrong model may have disastrous consequences for our ability to accurately infer the parameters of interest. Again, the MAE mitigates the risk of improperly 'guessing' which model is correct.

We have noted in the introduction the numerous approaches to estimation within quantum theory. This should urge one to ask, are all of these distinct approaches necessary—is there not some unified approach? Yes! The Bayesian framework outlined here is remarkably powerful in its generality. We note that Bayesian ideas have already been put to good use in quantum information theoretic and foundational problems [77–85] as well as for tomographic and parameter estimation problems [26, 86–93] and experimental design [94, 95]. Importantly, the Bayesian algorithm can provide solutions to these problems online, while the experiment is running, with the same software tools [64].

Acknowledgments

The author thanks C Granade and R Blume-Kohout for helpful discussions. This work was supported in part by National Science Foundation Grant No. PHY-1212445 and by the Canadian Government through the NSERC PDF program.

Footnotes

  1. Experimental context, C, here means any additional information necessary to provide a well defined function ${\rm Pr} (D|{\boldsymbol{x}} ;C,M)$.

  2. The notation ${{\mathbb{E}}_{x}}[f(x)]$ means take the expectation of f(x) with respect to the distribution of x (the distribution itself is implicit).

  3. See [54] for an overview of model selection and the various techniques mentioned here.
