
Quantum variational autoencoder


Published 12 September 2018 © 2018 IOP Publishing Ltd
Citation: Amir Khoshaman et al 2019 Quantum Sci. Technol. 4 014001. DOI: 10.1088/2058-9565/aada1f


Abstract

Variational autoencoders (VAEs) are powerful generative models with the salient ability to perform inference. Here, we introduce a quantum variational autoencoder (QVAE): a VAE whose latent generative process is implemented as a quantum Boltzmann machine (QBM). We show that our model can be trained end-to-end by maximizing a well-defined loss-function: a 'quantum' lower-bound to a variational approximation of the log-likelihood. We use quantum Monte Carlo (QMC) simulations to train and evaluate the performance of QVAEs. To achieve the best performance, we first create a VAE platform with discrete latent space generated by a restricted Boltzmann machine. Our model achieves state-of-the-art performance on the MNIST dataset when compared against similar approaches that only involve discrete variables in the generative process. We consider QVAEs with a smaller number of latent units to be able to perform QMC simulations, which are computationally expensive. We show that QVAEs can be trained effectively in regimes where quantum effects are relevant despite training via the quantum bound. Our findings open the way to the use of quantum computers to train QVAEs to achieve competitive performance for generative models. Placing a QBM in the latent space of a VAE leverages the full potential of current and next-generation quantum computers as sampling devices.


1. Introduction

While rooted in fundamental ideas that date back decades [1, 2], deep-learning algorithms [3, 4] have only recently started to revolutionize the way information is collected, analyzed, and interpreted in almost every intellectual endeavor [5]. This is made possible by the computational power of modern dedicated processing units (such as GPUs). The most remarkable progress has been made in the field of supervised learning [6], which requires a labeled dataset. There has also been a surge of interest in the development of unsupervised learning with unlabeled data [4, 7, 8]. One notable challenge in unsupervised learning is the computational complexity of training most models [9].

It is reasonable to hope that some of the computational tasks required to perform both supervised and unsupervised learning could be significantly accelerated by the use of quantum processing units (QPUs). Indeed, there are already quantum algorithms that can accelerate machine-learning tasks [10–13]. Interestingly, machine-learning algorithms have also been used in quantum-control techniques to improve fidelity and coherence [14–16]. This natural interplay between machine learning and quantum computation is stimulating the rapid growth of a new research field known as quantum machine learning [17–21].

A full implementation of quantum machine-learning algorithms requires the construction of fault-tolerant QPUs, which is still challenging [22–24]. However, the remarkable recent development of gate-model processors with a few dozen qubits [25, 26] and quantum annealers with a few thousand qubits [27, 28] has triggered an interest in developing quantum machine-learning algorithms that can be practically tested on current and near-future quantum devices. Early attempts to use small gate-model devices for machine learning use techniques similar to those developed in the context of quantum approximate optimization algorithms [29] and variational quantum algorithms [26, 30] to perform quantum heuristic optimization as a subroutine for small unsupervised tasks such as clustering [31]. The use of quantum annealing devices for machine-learning tasks is perhaps more established and relies on the ability of quantum annealers to perform both optimization [32–34] and sampling [35–38].

As optimizers, quantum annealers have been used to perform supervised tasks such as classification [39–42]. As samplers, they have been used to train restricted Boltzmann machines (RBMs), and are thus well-suited to perform unsupervised tasks such as training deep probabilistic models [43–46]. In [47], a D-Wave quantum annealer was used to train a deep network of stacked RBMs to classify a coarse-grained version of the MNIST dataset [48]. Quantum annealers have also been used to train fully visible Boltzmann machines on small synthetic datasets [49]. While mostly used in conjunction with traditional RBMs, quantum annealing should find a more natural application in the training of quantum Boltzmann machines (QBMs) [51].

A clear disadvantage of such early approaches is the need to consider datasets with a small number of input units, which prevents a clear route towards practical applications of quantum annealing with current and next-generation devices. A first attempt to overcome this limitation was presented in [52], with the introduction of the quantum-assisted Helmholtz machine (QAHM). However, the training of the QAHM is based on the wake-sleep algorithm [8], which does not have a well-defined loss function in the wake and sleep phases of training. Moreover, the gradients do not correctly propagate between the networks of the two phases. Because of these shortcomings, the QAHM generates blurry images and its training does not scale to standard machine-learning datasets such as MNIST.

Our approach is to use variational autoencoders (VAEs), a class of generative models that provide an efficient inference mechanism [53, 54]. We show how to implement a quantum VAE (QVAE), i.e., a VAE with discrete variables (DVAE) [55] whose generative process is realized by a QBM. QBMs were introduced in [51], and can be trained by minimizing a quantum lower bound to the true log-likelihood. We show that QVAEs can be effectively trained by sampling from the QBM with continuous-time quantum Monte Carlo (CT-QMC). We demonstrate that QVAEs have performance on par with conventional DVAEs equipped with traditional RBMs, despite being trained via an additional bound to the likelihood.

QVAEs share some similarities with QAHMs, such as the presence of both an inference (encoder) and a generation (decoder) network. However, they have the advantage of a well-defined loss function with fully propagating gradients, and can be efficiently trained via back-propagation. This allows us to achieve state-of-the-art performance (for models with only discrete units) on standard datasets such as MNIST by training (classical) DVAEs with large RBMs. Training QVAEs with a large number of latent units is impractical with CT-QMC, but can be accelerated with quantum annealers. Our work thus opens a path to practical machine-learning applications with current and next-generation quantum annealers.

The QVAEs we introduce in this work are generative models with a classical autoencoding structure and a quantum generative process. This is in contrast with the quantum autoencoders (QAEs) introduced in [56, 57]. QAEs have a quantum autoencoding structure (realized via quantum circuits), and can be used for quantum and classical data compression, but lack a generative structure.

The structure of the paper is as follows. In section 2 we provide a general discussion of generative models with latent variables, which include VAEs as a special case. We then introduce the basics of VAEs with continuous latent variables in section 3. Section 4 discusses the generalization of VAEs to discrete latent variables and presents our experimental results with RBMs implemented in the latent space. In section 5 we introduce QVAEs and present our results. We conclude in section 6 and give further technical and methodological details in the appendices.

2. Generative models with latent variables

Let $X={\{{{\bf{x}}}^{d}\}}_{d=1}^{N}$ represent a training set of N independent and identically distributed samples coming from an unknown data distribution, ${p}_{\mathrm{data}}(X)$ (for instance, the distribution of the pixels of a set of images). Generative models are probabilistic models that minimize the 'distance' between the model distribution, ${p}_{{\boldsymbol{\theta }}}(X)$, and the data distribution, ${p}_{\mathrm{data}}(X)$, where ${\boldsymbol{\theta }}$ denotes the parameters of the model. Generative models can be categorized in several ways, but for the purpose of this paper we focus on the distinction between models with latent (unobserved) variables and fully visible models with no latent variables. Examples of the former include generative adversarial networks [58], VAEs [53], and RBMs, whereas some important examples of the latter include NADE [59], MADE [60], pixelRNNs, and pixelCNNs [61].

The conditional relationships among the visible units, ${\bf{x}}$, and the latent units, ${\boldsymbol{\zeta }}$, determine the joint probability distribution, ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})$, of a generative model, and can be represented in terms of either undirected (figure 1(a)) or directed (figure 1(b)) graphs. Unlike fully visible models, generative models with latent variables can potentially learn and encode useful representations of the data in the latent space. This is an appealing property that can be exploited to improve other tasks such as supervised and semi-supervised learning (i.e., when only a fraction of the input data is labeled) [62], with substantial practical value in image search [63], speech analysis [64], genomics [65], drug design [66], and so on.


Figure 1. Generative models with latent variables can be represented as graphical models that describe conditional relationships among variables. (a) Undirected generative models are defined in terms of a joint probability distribution, ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})$. Boltzmann machines belong to this group of generative models. (b) In a directed generative model, the joint probability distribution ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})$ is decomposed as ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})={p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }}){p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})$: a prior distribution over the latent variables ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})$ and a decoder distribution ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$. The prior and the decoder are 'hard-coded' or explicitly determined by the model; however, the posterior, ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ (dotted red arrow), is intractable. In VAEs, an approximating posterior, ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$, is proposed to replace the intractable true posterior. (c) Structure of the generative and the inference (red dotted arrows) models of a DVAE and a QVAE. Here, ${p}_{{\boldsymbol{\theta }}}({\bf{z}})$ represents the prior over discrete variables, ${\bf{z}}$, and is characterized by an RBM or a QBM in DVAEs or QVAEs, respectively. The continuous variables ${\boldsymbol{\zeta }}$ are introduced to allow for a smooth propagation of the gradients.


Training a generative model is commonly done via the maximum-likelihood approach, in which the optimal model parameters ${{\boldsymbol{\theta }}}^{* }$ are obtained by maximizing the likelihood of the dataset:

Equation (1): ${{\boldsymbol{\theta }}}^{* }=\mathrm{argmax}_{{\boldsymbol{\theta }}}\,{\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}})],$

where ${p}_{{\boldsymbol{\theta }}}({\bf{x}})={\sum }_{{\boldsymbol{\zeta }}}{p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})$ is the marginal probability distribution of the visible units and ${{\mathbb{E}}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}[...]$ denotes the expectation value over ${\bf{x}}$ sampled from ${p}_{\mathrm{data}}({\bf{x}})$.

To better understand the behavior of generative models with latent variables, we now write ${{\mathbb{E}}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}[\mathrm{log}{p}_{{\boldsymbol{\theta }}}({\bf{x}})]$ in a more insightful form. First, note that $\mathrm{log}{p}_{{\boldsymbol{\theta }}}({\bf{x}})={{\mathbb{E}}}_{{\boldsymbol{\zeta }}\sim {p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})}[\mathrm{log}{p}_{{\boldsymbol{\theta }}}({\bf{x}})]$, since ${p}_{{\boldsymbol{\theta }}}({\bf{x}})$ is independent of ${\boldsymbol{\zeta }}$. The quantity ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ is called the posterior distribution, since it represents the probability of the latent variables after an observation ${\bf{x}}$ has been made (see figure 1(b)). Also, since we have ${p}_{{\boldsymbol{\theta }}}({\bf{x}})={p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})/{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$, we can write:

Equation (2): $\log {p}_{{\boldsymbol{\theta }}}({\bf{x}})={\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}|{\bf{x}})}\left[\log \frac{{p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})}{{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}|{\bf{x}})}\right].$

By noticing that ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }},{\bf{x}})={p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}){p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ and rearranging equation (2), we have:

Equation (3): ${\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}})]={\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}\big[{\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}|{\bf{x}})}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}}|{\boldsymbol{\zeta }})]-{D}_{\mathrm{KL}}({p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}|{\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}))\big],$

where ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})={\sum }_{{\bf{x}}}{p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})$ is the prior distribution. The term ${D}_{{\rm{KL}}}(p| | q)\equiv {{\mathbb{E}}}_{p}\mathrm{log}[p/q]$ represents the Kullback–Leibler (KL) divergence, which is a measure of 'distance' between the two distributions p and q [67].

Maximizing the first term maximizes ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ when ${\boldsymbol{\zeta }}$ is sampled from ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ for a given input from the dataset. This is called reconstruction, because it implies that samples from ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ have maximum similarity to the input ${\bf{x}}$. This is an 'autoencoding' process, and hence the first term in equation (3) is the autoencoding term. Conversely, maximizing the second term corresponds to minimizing the expected KL divergence under pdata. For a given input ${\bf{x}}$, this amounts to minimizing the distance between the posterior ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ and the prior ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})$. In the limiting case, this leads to ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})={p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})$, which is only possible if ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})={p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}){p}_{{\boldsymbol{\theta }}}({\bf{x}})$, so that the mutual information [68] between ${\bf{x}}$ and ${\boldsymbol{\zeta }}$ is zero. In other words, while the autoencoding term strives to maximize the mutual information, the KL term tries to minimize it.

Ultimately, the amount of information condensed in the latent space depends on the intricate balance between the two terms in equation (3), which in turn depends on the type of generative model chosen and on the training method used. For example, directed models such as those depicted in figure 1(b) are characterized by explicitly defining the prior ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})$ and the decoder ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ distributions. If the decoder distribution has high representational power, it can easily decouple ${\bf{x}}$ and ${\boldsymbol{\zeta }}$ to avoid paying the KL penalty; this leads to poor reconstruction quality. On the other hand, if the decoder is less expressive (as is the case for a neural net yielding the parameters of a factorial Bernoulli distribution), a larger amount of information is stored in the latent space and the model autoencodes to a good degree.

3. Variational autoencoders

A common problem of generative models with latent variables is the intractability of inference, i.e., calculating the posterior distribution ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})={p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }}){p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})/{p}_{{\boldsymbol{\theta }}}({\bf{x}})$. This involves the evaluation of

Equation (4): ${p}_{{\boldsymbol{\theta }}}({\bf{x}})=\int \mathrm{d}{\boldsymbol{\zeta }}\;{p}_{{\boldsymbol{\theta }}}({\bf{x}}|{\boldsymbol{\zeta }})\,{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}).$

The first crucial element of the VAE setup is variational inference; i.e., introducing a tractable variational approximation ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$ (figure 1(b)) to the true posterior ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ [69], with variational parameters ${\boldsymbol{\phi }}$. Both decoder ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ and encoder ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$ are commonly implemented by neural networks, known as generative and recognition (inference) networks, respectively.

To define an objective function for optimizing parameters ${\boldsymbol{\theta }}$ and ${\boldsymbol{\phi }}$, we can replace ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ with ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$ in equation (3):

Equation (5): ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})={\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}\big[{\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}}|{\boldsymbol{\zeta }})]-{D}_{\mathrm{KL}}({q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}))\big].$

Although ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})$ is not equal to the log-likelihood, it provides a lower bound:

Equation (6): ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})\leqslant {\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}})],$

as we show below. Because of this important property, ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})$ is called the evidence (variational) lower bound (ELBO). To prove equation (6), we note from equation (5) that

Equation (7): ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})={\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}\left[{\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})}\left[\log \frac{{p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})}{{q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})}\right]\right],$

where we have used ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})={p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}){p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$. Equation (7) is a compact way of expressing the ELBO, which will be used later. One may further use ${p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})={p}_{{\boldsymbol{\theta }}}({\bf{x}}){p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}| {\bf{x}})$ to obtain yet another way of writing the ELBO:

Equation (8): ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})={\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}\big[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}})-{D}_{\mathrm{KL}}({q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}|{\bf{x}}))\big].$

Since the KL divergence is always non-negative, we obtain

Equation (9): ${D}_{\mathrm{KL}}({q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}|{\bf{x}}))\geqslant 0,$

which immediately gives equation (6).

It is evident from equation (8) that the difference between the ELBO and the true log-likelihood, i.e., the tightness of the bound, depends on the distance between the approximate and true posteriors. Maximizing the ELBO therefore increases the log-likelihood and, at the same time, decreases the distance between the two posterior distributions. Success in closing the gap between the log-likelihood and the ELBO depends on the flexibility and representational power of ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$. However, increasing the representational power of ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$ does not guarantee success in encoding the information in the latent space. In other words, the widespread problem [70–74] of 'ignoring the latent code' in VAEs is not merely an artifact of choosing a family of approximating posterior distributions with limited representational power. As discussed above, it is rather an intrinsic feature of generative models with latent variables, due to the clash of the two terms in the objective function defined in equation (3).

3.1. The reparameterization trick

The objective function in equation (7) contains expectation values of functions of the latent variables ${\boldsymbol{\zeta }}$ under the posterior distribution ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$. To train the model, we need to calculate the derivatives of these terms with respect to ${\boldsymbol{\theta }}$ and ${\boldsymbol{\phi }}$. However, evaluating the derivatives with respect to ${\boldsymbol{\phi }}$ is problematic because the expectations of equation (7) are estimated using samples that are generated according to a probability distribution that depends on ${\boldsymbol{\phi }}$. A naive solution to the problem of calculating ${\partial }_{{\boldsymbol{\phi }}}$ of the expected value of an arbitrary function, ${{\mathbb{E}}}_{{\boldsymbol{\zeta }}\sim {q}_{\phi }}[f({\boldsymbol{\zeta }})]$, is to use the identity ${\partial }_{{\boldsymbol{\phi }}}{q}_{\phi }={q}_{\phi }{\partial }_{{\boldsymbol{\phi }}}\mathrm{log}{q}_{\phi }$ to write

Equation (10): ${\partial }_{{\boldsymbol{\phi }}}{\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {q}_{{\boldsymbol{\phi }}}}[f({\boldsymbol{\zeta }})]={\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {q}_{{\boldsymbol{\phi }}}}[f({\boldsymbol{\zeta }})\,{\partial }_{{\boldsymbol{\phi }}}\log {q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})].$

Here, for simplicity, we assumed that f does not depend on ${\boldsymbol{\phi }}$. This approach is known as REINFORCE. However, the estimator in equation (10) has high variance and requires intricate variance-reduction mechanisms to be of practical use [75].
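To make the variance problem concrete, the following minimal sketch (ours, not from the paper) compares the REINFORCE estimate of equation (10) with the exact gradient for a single Bernoulli latent variable with $q=\sigma (\phi )$; the function f and all numerical values are illustrative placeholders.

```python
# REINFORCE (score-function) gradient estimate for one Bernoulli unit:
# d/dphi E_{z~q}[f(z)] estimated as E_{z~q}[f(z) * d/dphi log q(z)].
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def f(z):
    return (z - 0.3) ** 2          # arbitrary f, independent of phi

phi = 0.5
q = sigmoid(phi)                   # q = q_phi(z = 1)

z = rng.binomial(1, q, size=100_000)
score = z - q                      # d/dphi log Bernoulli(z; sigmoid(phi))
reinforce = f(z) * score           # per-sample REINFORCE estimates

exact = q * (1 - q) * (f(1) - f(0))    # d/dphi [q f(1) + (1 - q) f(0)]
print(reinforce.mean(), exact)         # unbiased on average...
print(reinforce.std())                 # ...but with a large per-sample spread
```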

A better approach is to write the random variable ${\boldsymbol{\zeta }}$ as a deterministic function of the distribution parameters ${\boldsymbol{\phi }}$ and of an additional auxiliary random variable ${\boldsymbol{\rho }}$. The latter is given by a probability distribution $p({\boldsymbol{\rho }})$ that does not depend on ${\boldsymbol{\phi }}$. This reparameterization, ${\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }})$, can be used to write ${{\mathbb{E}}}_{{\boldsymbol{\zeta }}\sim {q}_{\phi }}[f({\boldsymbol{\zeta }})]={{\mathbb{E}}}_{{\boldsymbol{\rho }}\sim p({\boldsymbol{\rho }})}[f({\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }}))]$. Therefore, we can move the derivative inside the expectation with no difficulty:

Equation (11): ${\partial }_{{\boldsymbol{\phi }}}{\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {q}_{{\boldsymbol{\phi }}}}[f({\boldsymbol{\zeta }})]={\mathbb{E}}_{{\boldsymbol{\rho }}\sim p({\boldsymbol{\rho }})}[{\partial }_{{\boldsymbol{\phi }}}f({\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }}))].$

This is called the reparameterization trick [53] and is mostly responsible for the recent success and proliferation of VAEs. When applied to equation (7), we have:

Equation (12): ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})={\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}\left[{\mathbb{E}}_{{\boldsymbol{\rho }}\sim p({\boldsymbol{\rho }})}\left[\log \frac{{p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }}))}{{q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }})|{\bf{x}})}\right]\right],$

where we have suppressed the inclusion of ${\bf{x}}$ in the arguments of the reparameterized ${\boldsymbol{\zeta }}$ to keep the notation uncluttered.

It is now important to find a function ${\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }})$ such that ${\boldsymbol{\rho }}$ becomes ${\boldsymbol{\phi }}$-independent. Let us define a function ${\bf{F}}$

Equation (13): ${\boldsymbol{\rho }}={\bf{F}}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}).$

The probability distributions $p({\boldsymbol{\rho }})$ and ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$ should satisfy $p({\boldsymbol{\rho }}){\rm{d}}{\boldsymbol{\rho }}={q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}}){\rm{d}}{\boldsymbol{\zeta }}$, therefore

Equation (14): $p({\boldsymbol{\rho }})={q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})\left(\frac{\mathrm{d}{\bf{F}}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }})}{\mathrm{d}{\boldsymbol{\zeta }}}\right)^{-1}.$

To have $p({\boldsymbol{\rho }})$ independent of ${\boldsymbol{\phi }}$ we need

Equation (15): $\frac{\mathrm{d}{\bf{F}}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }})}{\mathrm{d}{\boldsymbol{\zeta }}}\propto {q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}}).$

Now, by choosing ${{\bf{F}}}_{{\boldsymbol{\phi }}}$ to be the cumulative distribution function (CDF) of ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$, $p({\boldsymbol{\rho }})$ becomes a uniform distribution ${ \mathcal U }(0,1)$ for ${\boldsymbol{\rho }}\in [0,1]$. We can thus write

Equation (16): ${\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }})={\bf{F}}_{{\boldsymbol{\phi }}}^{-1}({\boldsymbol{\rho }}),\qquad {\boldsymbol{\rho }}\sim { \mathcal U }(0,1).$

To derive equation (16), we have implicitly assumed that the latent variables are continuous and that the posterior factorizes: ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})={\prod }_{l}{q}_{{\boldsymbol{\phi }}}({\zeta }_{l}| {\bf{x}})$. It is possible to extend the reparameterization trick to include discrete latent variables (see next section) and more complicated approximate posteriors (see appendix C).
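As a concrete toy illustration of equation (16) (our example, not the paper's), the sketch below reparameterizes an exponential posterior through its inverse CDF; the rate parameter plays the role of ${\boldsymbol{\phi }}$ and the noise $\rho$ is uniform and independent of it.

```python
# Inverse-CDF reparameterization: for q_phi(zeta) = phi * exp(-phi * zeta),
# F_phi(zeta) = 1 - exp(-phi * zeta), so zeta(phi, rho) = -log(1 - rho) / phi.
import numpy as np

rng = np.random.default_rng(1)
phi = 2.0
rho = rng.uniform(0.0, 1.0, size=100_000)   # phi-independent noise, rho ~ U(0, 1)
zeta = -np.log1p(-rho) / phi                # deterministic in (phi, rho)

print(zeta.mean(), 1.0 / phi)  # sample mean approaches E[zeta] = 1 / phi
```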

4. VAE with discrete latent space

Most of the VAEs studied so far have continuous latent spaces, due to the difficulty of propagating derivatives through discrete variables. Nonetheless, discrete stochastic units are indispensable for representing distributions in supervised and unsupervised learning, attention models, language modeling, and reinforcement learning [76]. Noteworthy examples include the application of discrete units in learning distinct semantic classes [62] and in semi-supervised generation [77] to learn more meaningful hierarchical VAEs. In [78], it was shown that when the latent space is composed of discrete variables, the learned representations disentangle the content and style information of images in an unsupervised fashion.

Due to the non-differentiability of discrete stochastic units, several methods that involve variational inference use the REINFORCE method, equation (10), from the reinforcement-learning literature [75, 79, 80]. However, these methods yield noisy estimates of the gradients, which must be mitigated using variance-reduction techniques such as finding appropriate control variates. Another approach involves using biased derivatives for the Bernoulli variables [81]. There are also two approaches that extend the reparameterization trick to discrete variables. References [76, 82] independently proposed a relaxation of categorical discrete units into continuous variables by adding Gumbel noise to the logits inside a softmax function with a temperature hyper-parameter. In the limit of zero temperature, the softmax becomes a non-differentiable argmax and the samples become unbiased; however, training then stops, since the variables become truly discrete. Therefore, an annealing schedule is used for the temperature throughout the training to obtain less noisy, yet biased, estimates of the gradients [76].

Here we follow the approach proposed in [55], which yields reparameterizable and unbiased estimates of gradients. As discussed in the previous section, the generative process in a VAE involves sampling a set of continuous variables ${\boldsymbol{\zeta }}\sim {p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }})$. To implement a DVAE, we assume the prior distribution is now defined on a set of discrete variables ${\bf{z}}\sim {p}_{{\boldsymbol{\theta }}}({\bf{z}})$, with ${\bf{z}}\in \{0,1\}{}^{L}$. Once again we use ${\boldsymbol{\theta }}$ to denote collective parameters of the generative side of the model. To propagate the gradients through the discrete variables, we keep the variables ${\boldsymbol{\zeta }}$ as an auxiliary set of continuous variables [55]. The full prior is chosen as follows (figure 1(c)):

Equation (17): ${p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }},{\bf{z}})=r({\boldsymbol{\zeta }}|{\bf{z}})\,{p}_{{\boldsymbol{\theta }}}({\bf{z}}),\qquad r({\boldsymbol{\zeta }}|{\bf{z}})=\prod_l r({\zeta }_{l}|{z}_{l}).$

The newly introduced term $r({\boldsymbol{\zeta }}| {\bf{z}})$ acts as a smoothing probability distribution that enables the implementation of the reparameterization trick. The structure of the DVAE is completed by considering a particular form for the approximating posterior and marginal distributions (figure 1(c)):

Equation (18): ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }},{\bf{z}}|{\bf{x}})=r({\boldsymbol{\zeta }}|{\bf{z}})\,{q}_{{\boldsymbol{\phi }}}({\bf{z}}|{\bf{x}}),\qquad {p}_{{\boldsymbol{\theta }}}({\bf{x}},{\boldsymbol{\zeta }})={p}_{{\boldsymbol{\theta }}}({\bf{x}}|{\boldsymbol{\zeta }})\sum_{\bf{z}}r({\boldsymbol{\zeta }}|{\bf{z}})\,{p}_{{\boldsymbol{\theta }}}({\bf{z}}),$

where for now we assume ${q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}})={\prod }_{l}{q}_{{\boldsymbol{\phi }}}({z}_{l}| {\bf{x}})$ is a product of Bernoulli probabilities for the discrete variable zl (see again appendix C for the case where hierarchies are present in the posterior). With the above choice, the ELBO bound can be written as

Equation (19): ${ \mathcal L }({\boldsymbol{\theta }},{\boldsymbol{\phi }})={\mathbb{E}}_{{\bf{x}}\sim {p}_{\mathrm{data}}}\big[{\mathbb{E}}_{{\boldsymbol{\zeta }}\sim {q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}}|{\boldsymbol{\zeta }})]-{D}_{\mathrm{KL}}({q}_{{\boldsymbol{\phi }}}({\bf{z}}|{\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\bf{z}}))\big],$

where ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$ is the approximate posterior marginalized over the discrete variables. In the equation above we have used the fact that the KL term does not explicitly depend on ${\boldsymbol{\zeta }}$ while the autoencoding term does not explicitly depend on ${\bf{z}}$.

4.1. The reparameterization trick for DVAE

We can apply the inverse CDF reparameterization trick, equation (16), to the autoencoding term in equation (19) if we choose the function $r({\boldsymbol{\zeta }}| {\bf{z}})$ such that the CDF of the approximating posterior marginalized over the discrete variables

Equation (20): ${{\rm{F}}}_{l}({\zeta }_{l})=\sum_{{z}_{l}}{q}_{{\boldsymbol{\phi }}}({z}_{l}|{\bf{x}})\int_{-\infty }^{{\zeta }_{l}}\mathrm{d}\zeta '\,r(\zeta '|{z}_{l})$

can be inverted:

Equation (21): ${\zeta }_{l}={{\rm{F}}}_{l}^{-1}({\rho }_{l}),\qquad {\rho }_{l}\sim { \mathcal U }(0,1).$

An appropriate choice for $r({\zeta }_{l}| {z}_{l})$ is, for example, the spike-and-exponential transformation:

Equation (22): $r({\zeta }_{l}|{z}_{l}=0)=\delta ({\zeta }_{l}),\qquad r({\zeta }_{l}|{z}_{l}=1)=\frac{\beta \,{{\rm{e}}}^{\beta {\zeta }_{l}}}{{{\rm{e}}}^{\beta }-1},\qquad {\zeta }_{l}\in [0,1].$

For this distribution we can write:

Equation (23): ${{\rm{F}}}_{l}({\zeta }_{l}|{z}_{l}=0)={\rm{\Theta }}({\zeta }_{l}),\qquad {{\rm{F}}}_{l}({\zeta }_{l}|{z}_{l}=1)=\frac{{{\rm{e}}}^{\beta {\zeta }_{l}}-1}{{{\rm{e}}}^{\beta }-1}.$

Using equation (22) with Bernoulli distribution ${q}_{{\boldsymbol{\phi }}}({z}_{l}=1| {\bf{x}})={q}_{l}$ and ${q}_{{\boldsymbol{\phi }}}({z}_{l}=0| {\bf{x}})=1-{q}_{l}$, we find

Equation (24): ${\rho }_{l}={{\rm{F}}}_{l}({\zeta }_{l})=(1-{q}_{l})\,{\rm{\Theta }}({\zeta }_{l})+{q}_{l}\,\frac{{{\rm{e}}}^{\beta {\zeta }_{l}}-1}{{{\rm{e}}}^{\beta }-1},$

which can be easily inverted to obtain ${\zeta }_{l}$

Equation (25): ${\zeta }_{l}({\rho }_{l},{q}_{l})=\frac{1}{\beta }\log \left[\max \left(\frac{{\rho }_{l}+{q}_{l}-1}{{q}_{l}},0\right)\left({{\rm{e}}}^{\beta }-1\right)+1\right].$

The virtue of the spike-and-exponential smoothing distribution is that zl can be deterministically obtained from ${\zeta }_{l}$ and thus ${\rho }_{l}$:

Equation (26): ${z}_{l}={\rm{\Theta }}({\rho }_{l}+{q}_{l}-1),$ with ${\rm{\Theta }}$ the Heaviside step function,

which follows from equations (22) and (25). This property is crucial to apply the reparameterization trick to the KL term, as shown below, and to evaluate its derivatives, as shown in appendix D.
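The sketch below (ours; the values of β and of the Bernoulli probabilities are placeholders) implements the reparameterization of equations (25) and (26) with numpy, producing both the continuous ζ and the discrete z from the same uniform noise ρ.

```python
# Spike-and-exponential reparameterization: zeta from equation (25),
# z from equation (26); both are deterministic functions of (q, rho).
import numpy as np

def spike_exp_reparam(q, rho, beta=5.0):
    t = np.maximum((rho + q - 1.0) / q, 0.0)    # 0 whenever rho <= 1 - q
    zeta = np.log1p(t * np.expm1(beta)) / beta  # equation (25)
    z = np.heaviside(rho + q - 1.0, 0.0)        # equation (26)
    return zeta, z

rng = np.random.default_rng(2)
q = np.full(16, 0.7)               # encoder outputs q_l = q_phi(z_l = 1 | x)
rho = rng.uniform(size=16)
zeta, z = spike_exp_reparam(q, rho)
assert np.all((zeta > 0) == (z == 1))  # zeta collapses onto the spike when z = 0
```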

For later convenience, we note that the KL term can be written as the difference between an entropy term, $H({q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}}))$, and a cross-entropy term, $H({q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}}),{p}_{{\boldsymbol{\theta }}}({\bf{z}}))$: ${D}_{\mathrm{KL}}({q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\bf{z}}))=-H({q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}}))+H({q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}}),{p}_{{\boldsymbol{\theta }}}({\bf{z}})).$

Herein, for simplicity, we use ${q}_{{\boldsymbol{\phi }}}$ and ${p}_{{\boldsymbol{\theta }}}$ in place of ${q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}})$ and ${p}_{{\boldsymbol{\theta }}}({\bf{z}})$, respectively, in unambiguous cases. Using equation (26), the reparameterization trick can be applied to the entropy term:

Equation (27): $H({q}_{{\boldsymbol{\phi }}})=-{\mathbb{E}}_{{\boldsymbol{\rho }}}[\log {q}_{{\boldsymbol{\phi }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }})|{\bf{x}})],$

where we have explicitly shown the dependence of ${\bf{z}}$ on ${\boldsymbol{\rho }}$ and ${\boldsymbol{\phi }}$. Note that in the simple case of a factorial Bernoulli distribution, we do not need to use the reparameterization trick and can use the analytic form of the entropy; i.e., $H({q}_{{\boldsymbol{\phi }}})=-{\sum }_{l=1}^{L}\left({q}_{l}\mathrm{log}{q}_{l}+(1-{q}_{l})\mathrm{log}(1-{q}_{l})\right)$ (see appendix B for more details). Similarly, applying the reparameterization trick to the cross-entropy leads to:

Equation (28): $H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})=-{\mathbb{E}}_{{\boldsymbol{\rho }}}[\log {p}_{{\boldsymbol{\theta }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }}))].$

It is a common practice to use hierarchical distributions to achieve more powerful approximating posteriors. Briefly, the latent variables are compartmentalized into several groups, and the probability density function of each group depends on the values of the latent variables in the preceding groups; i.e., ${q}_{{\boldsymbol{\phi }}}({z}_{l}| {\zeta }_{m\lt l},{\bf{x}})$. This creates a more powerful approximating posterior able to represent more complex correlations between latent variables, as compared to a simple factorial distribution. See appendix C for more details.
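A minimal sketch of such a hierarchical posterior follows; the random linear maps are hypothetical stand-ins for the feed-forward encoder networks of section 4.3, and the smoothing transformation is the spike-and-exponential one sketched above.

```python
# Hierarchical approximating posterior: the Bernoulli probabilities of each
# group of latent units depend on x and on the continuous variables of all
# preceding groups, q_phi(z_l | zeta_{m<l}, x).
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def spike_exp_reparam(q, rho, beta=5.0):       # as in the previous sketch
    t = np.maximum((rho + q - 1.0) / q, 0.0)
    return np.log1p(t * np.expm1(beta)) / beta

x = rng.uniform(size=784)                      # one flattened input image
n_groups, units = 8, 16                        # 8 hierarchy levels
zetas = []
for l in range(n_groups):
    context = np.concatenate([x] + zetas)      # (x, zeta_{m < l})
    W = rng.normal(0.0, 0.01, size=(units, context.size))  # placeholder weights
    q_l = sigmoid(W @ context)                 # q_phi(z_l = 1 | zeta_{m<l}, x)
    zetas.append(spike_exp_reparam(q_l, rng.uniform(size=units)))

zeta = np.concatenate(zetas)                   # 128 continuous latent variables
```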

4.2. DVAE with Boltzmann machines

Boltzmann machines are probabilistic models able to represent complex multi-modal probability distributions [83], and are thus attractive candidates for the latent space of a VAE. This approach is also appealing with regard to machine-learning applications of quantum computers. The probability distribution realized by an RBM is

Equation (29): ${p}_{{\boldsymbol{\theta }}}({\bf{z}})=\frac{{{\rm{e}}}^{-{E}_{{\boldsymbol{\theta }}}({\bf{z}})}}{{Z}_{{\boldsymbol{\theta }}}},\qquad {E}_{{\boldsymbol{\theta }}}({\bf{z}})=\sum_l {h}_{l}{z}_{l}+\sum_{l<m}{W}_{{lm}}{z}_{l}{z}_{m},\qquad {Z}_{{\boldsymbol{\theta }}}=\sum_{\bf{z}}{{\rm{e}}}^{-{E}_{{\boldsymbol{\theta }}}({\bf{z}})}.$

The negative cross-entropy term $-H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})={{\mathbb{E}}}_{{\bf{z}}\sim {q}_{{\boldsymbol{\phi }}}}[\mathrm{log}{p}_{{\boldsymbol{\theta }}}]$ is the log-likelihood of ${\bf{z}}$ sampled from the approximating posterior ${\bf{z}}\sim {q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}})$ under the model ${p}_{{\boldsymbol{\theta }}}$. After reparameterization, we have

Equation (30): $-H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})=-{\mathbb{E}}_{{\boldsymbol{\rho }}}[{E}_{{\boldsymbol{\theta }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }}))]-\log {Z}_{{\boldsymbol{\theta }}}.$

Gradients can thus be computed, as usual, as the difference between a positive and a negative phase, where the latter is computed via Boltzmann sampling from the RBM:

Equation (31): ${\partial }_{{\boldsymbol{\theta }}}\big[-H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})\big]=-{\mathbb{E}}_{{\boldsymbol{\rho }}}[{\partial }_{{\boldsymbol{\theta }}}{E}_{{\boldsymbol{\theta }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }}))]+{\mathbb{E}}_{{\bf{z}}\sim {p}_{{\boldsymbol{\theta }}}}[{\partial }_{{\boldsymbol{\theta }}}{E}_{{\boldsymbol{\theta }}}({\bf{z}})].$

Notice that the positive phase (the first term above) involves the expectation over the approximating posterior, but it is explicitly written in terms of the discrete variables ${\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }})$. We thus need to calculate the derivatives through these variables. We discuss the computation of the positive phase in the most general case in appendix D.
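As an illustration, the sketch below (ours) assembles the gradient of equation (31) from positive-phase samples (the reparameterized discrete units coming from the encoder) and negative-phase samples (drawn from the RBM itself); here both sets of samples are random placeholders rather than outputs of trained networks.

```python
# Gradient of -H(q, p) for an RBM energy E(z) = h_l.z_left + h_r.z_right
# + z_left W z_right: (negative phase) - (positive phase), cf. equation (31).
import numpy as np

def energy_grads(z_l, z_r):
    """Mean gradients of E w.r.t. (h_l, h_r, W) over a batch of samples."""
    return z_l.mean(0), z_r.mean(0), (z_l.T @ z_r) / len(z_l)

rng = np.random.default_rng(4)
# Positive phase: z(rho, phi) from the approximating posterior (placeholder).
zp_l = rng.binomial(1, 0.5, (200, 64)).astype(float)
zp_r = rng.binomial(1, 0.5, (200, 64)).astype(float)
# Negative phase: Boltzmann samples from the RBM (placeholder).
zn_l = rng.binomial(1, 0.5, (1000, 64)).astype(float)
zn_r = rng.binomial(1, 0.5, (1000, 64)).astype(float)

pos, neg = energy_grads(zp_l, zp_r), energy_grads(zn_l, zn_r)
grad_h_l = neg[0] - pos[0]
grad_h_r = neg[1] - pos[1]
grad_W = neg[2] - pos[2]
```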

4.3. Experimental results with DVAE

In this section, we show that the DVAE model introduced in section 4 achieves state-of-the-art performance, for variational inference models with only discrete latent variables, on the MNIST dataset [48]. We perform experiments with restricted Boltzmann machines, in which the hidden and visible units are placed at the two sides of a bipartite graph. Notice that in a DVAE setup, all the units of the (classical) RBM are latent variables (there is technically no distinction between visible and hidden units, as there is for standalone RBMs). We still use an RBM to exploit its bipartite structure, which enables efficient block Gibbs sampling (see the sketch below). This allows us to train DVAEs with RBMs with up to 256 units per layer.
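A minimal sketch of the persistent block Gibbs sampler implied here (our simplification: random placeholder couplings, energy convention as in equation (29), and the chain counts quoted in the next paragraph):

```python
# Persistent block Gibbs sampling for the RBM negative phase: the bipartite
# structure lets each layer be resampled jointly given the other layer.
import numpy as np

rng = np.random.default_rng(5)
n = 256                                   # units per layer
W = rng.normal(0.0, 0.01, (n, n))         # couplings between the two layers
h_l, h_r = np.zeros(n), np.zeros(n)       # biases

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

left = rng.binomial(1, 0.5, (1000, n)).astype(float)   # 1000 persistent chains
for _ in range(200):                      # block-Gibbs updates per gradient step
    right = rng.binomial(1, sigmoid(-(left @ W + h_r))).astype(float)
    left = rng.binomial(1, sigmoid(-(right @ W.T + h_l))).astype(float)
# (left, right) now approximate Boltzmann samples for the negative phase
```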

Figure 2 shows generated and reconstructed MNIST digits for a DVAE with RBMs with 32 and 256 units per layer. In table 1, we report the best results for the ELBO and log-likelihood (LL) we obtained with RBMs of 32, 64, 128, and 256 units per layer. For 256 units, we obtained an LL of −83.5 ± 0.2, with the reported error being a conservative estimate of our statistical uncertainty. In all cases, the negative phase of the RBMs was estimated using persistent contrastive divergence, with 1000 chains and 200 block-Gibbs updates per gradient evaluation. We have chosen an approximating posterior with 8 levels of hierarchies (the number of units that each level of hierarchy represents is the total number of latent units divided by 8); each Bernoulli probability ${q}_{{\boldsymbol{\phi }}}({z}_{l}| {\zeta }_{m\lt l},{\bf{x}})$ is a sigmoidal output of a feed-forward neural network with two hidden rectified linear unit (ReLU) layers containing 2000 deterministic units.


Figure 2. Generated and reconstructed MNIST digits with a DVAE.


The model is prone to overfitting when the decoder distribution ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ is represented with deep networks. We considered ${p}_{{\boldsymbol{\theta }}}({\bf{x}}| {\boldsymbol{\zeta }})$ to be the sigmoidal outputs of a ReLU network with one hidden layer whose number of deterministic units varied between 250 and 2000. Typically, a larger RBM required a smaller number of hidden units in the decoder network to avoid overfitting. Our implementation included annealing schedules for both the learning rate (exponential decay) and the β parameter (linear increase) in equation (22). Batch normalization [84] was used to expedite the training process. The value of β was annealed throughout the training from 1.0 to 10 during 2000 epochs with a batch size of 200. We used the ADAM stochastic optimization method with a learning rate of ${10}^{-3}$ and the default extra parameters [85]. To calculate the LL in table 1, we used importance weighting to compute a multi-sample ELBO, as delineated in [54], with 30 000 samples in the latent space for each input image in the test set. It can be shown that the value of the multi-sample ELBO asymptotically reaches the true LL when the number of samples approaches infinity [54]. The log-partition function $\mathrm{log}Z$ was computed using population annealing [86, 87] (see also appendix E for the quantum partition function). In all our experiments we have verified that the statistical error on the evaluation of $\mathrm{log}Z$ is negligible.
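A sketch of the multi-sample (importance-weighted) ELBO estimator used for this LL evaluation is given below; encoder_sample, log_p_joint, and log_q are hypothetical callables standing in for the trained encoder, the joint generative density, and the posterior density.

```python
# Importance-weighted LL estimate: log(1/K sum_k w_k), with importance
# weights w_k = p(x, zeta_k) / q(zeta_k | x) and zeta_k ~ q(zeta | x);
# the estimate approaches log p(x) as K grows [54].
import numpy as np

def iw_log_likelihood(x, encoder_sample, log_p_joint, log_q, K=30_000):
    log_w = np.empty(K)
    for k in range(K):
        zeta = encoder_sample(x)                    # zeta ~ q_phi(zeta | x)
        log_w[k] = log_p_joint(x, zeta) - log_q(zeta, x)
    m = log_w.max()                                 # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))
```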

In table 1, we also report the results of some other algorithms that use discrete variables in variational inference. NVIL [75] and its importance-weighted analog, VIMCO [88], use the REINFORCE trick, equation (10), along with carefully designed control variates to reduce the variance of the estimation. CONCRETE [82] and Gumbel-Softmax [76] are two concurrently developed methods that are based on applying the reparameterization trick to discrete latent variables. RWS [89] is a multi-sampled and improved version of the wake-sleep algorithm [8], which can be considered a variational approximation (since an encoder or 'inference network' is present) with different loss functions in the wake and sleep phases of training. REBAR [90] is the application of the CONCRETE method to create control variates for the REINFORCE approach. All algorithms reported in table 1, excluding DVAE, implement a latent space with independent discrete units distributed according to a set of independent Bernoulli distributions. The result reported for CONCRETE, for example, uses 200 independent latent units. The presence of a well-trained RBM in the latent space of the DVAE is critical to achieve the results quoted in table 1. In particular, our implementation of DVAE is able to match the result obtained with the CONCRETE method using only 64 + 64 latent units rather than 200. A direct demonstration of the necessity of a well-trained RBM to achieve state-of-the-art performance with DVAE is also given in table 2 of [91].

Table 1.  Comparison of variational generative models with stochastic discrete variables on the validation set of the MNIST dataset. The best results in each column are those of the DVAE with RBM256×256. GS stands for Gumbel-Softmax. The statistical error on the DVAE results is smaller than ±0.2 in all cases.

MNIST (static binarization)

Model                    ELBO           LL
DVAE  RBM32×32           −99.3 ± 0.2    −90.8 ± 0.2
      RBM64×64           −92.4          −85.5
      RBM128×128         −90.4          −84.7
      RBM256×256         −89.2          −83.5
VIMCO [88]               —              −91.9
NVIL [75]                —              −93.5
CONCRETE [82]            —              −85.7
GS [76]                  −101.5         —
RWS [89]                 —              −88.9
REBAR [90]               −98.8          —

5. Quantum variational autoencoders

We now introduce the QVAE by implementing the prior distribution in the latent space of a VAE as a QBM. Similar to a classical BM, a QBM is an energy model defined as follows [51]:

Equation (32): ${p}_{{\boldsymbol{\theta }}}({\bf{z}})=\frac{\mathrm{Tr}[{{\rm{\Lambda }}}_{{\bf{z}}}\,{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}]}{\mathrm{Tr}[{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}]},\qquad {{ \mathcal H }}_{{\boldsymbol{\theta }}}=-\sum_l {\Gamma }_{l}{\sigma }_{l}^{x}+\sum_l {h}_{l}{\sigma }_{l}^{z}+\sum_{l<m}{W}_{{lm}}{\sigma }_{l}^{z}{\sigma }_{m}^{z},$

where ${{\rm{\Lambda }}}_{{\bf{z}}}\equiv | {\bf{z}}\rangle \langle {\bf{z}}| $ is the projector on the classical state ${\bf{z}}$ and ${\sigma }_{l}^{x,z}$ are Pauli operators. States ${\bf{z}}$ are distributed according to ${p}_{{\boldsymbol{\theta }}}({\bf{z}});$ e.g., a quantum Boltzmann distribution for the quantum system given by ${{ \mathcal H }}_{{\boldsymbol{\theta }}}$. Similar to the classical case, the ELBO includes the following cross-entropy term:

Equation (33): $H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})=-{\mathbb{E}}_{{\bf{z}}\sim {q}_{{\boldsymbol{\phi }}}}\big[\log \mathrm{Tr}[{{\rm{\Lambda }}}_{{\bf{z}}}\,{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}]\big]+\log {Z}_{{\boldsymbol{\theta }}},\qquad {Z}_{{\boldsymbol{\theta }}}=\mathrm{Tr}[{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}].$

Unfortunately, the gradients of the first term in the equation above are intractable. A QBM can still be trained using a lower-bound to the cross-entropy that can be obtained via the Golden–Thompson inequality [51]:

Equation (34): $\mathrm{Tr}[{{\rm{e}}}^{A+B}]\leqslant \mathrm{Tr}[{{\rm{e}}}^{A}{{\rm{e}}}^{B}],$

which holds for any two Hermitian matrices. The equality is satisfied if and only if the two matrices commute. Using this inequality we can write for the cross-entropy:

Equation (35): $H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})\leqslant \tilde{H}({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})\equiv -{\mathbb{E}}_{{\bf{z}}\sim {q}_{{\boldsymbol{\phi }}}}\big[\log \mathrm{Tr}[{{\rm{e}}}^{\mathrm{log}{{\rm{\Lambda }}}_{{\bf{z}}}-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}]\big]+\log {Z}_{{\boldsymbol{\theta }}}={\mathbb{E}}_{{\boldsymbol{\rho }}}[{{ \mathcal H }}_{{\boldsymbol{\theta }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }}))]+\log {Z}_{{\boldsymbol{\theta }}},$

where in the last equality we have used the reparameterization trick and the fact that the contribution to the trace of all states different than ${\bf{z}}$ is infinitely suppressed. In the equation above we have defined ${{ \mathcal H }}_{{\boldsymbol{\theta }}}({\bf{z}})\equiv \langle {\bf{z}}| {{ \mathcal H }}_{{\boldsymbol{\theta }}}| {\bf{z}}\rangle $. Using the lower bound $\tilde{H}({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})$, we obtain a tractable quantum bound (Q-ELBO) to the true ELBO, and the QVAE can be trained by estimating the gradients via sampling from the QBM [51]:

Equation (36): ${\partial }_{{\boldsymbol{\theta }}}\tilde{H}({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})={\mathbb{E}}_{{\boldsymbol{\rho }}}[{\partial }_{{\boldsymbol{\theta }}}{{ \mathcal H }}_{{\boldsymbol{\theta }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }}))]-\frac{\mathrm{Tr}[{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}\,{\partial }_{{\boldsymbol{\theta }}}{{ \mathcal H }}_{{\boldsymbol{\theta }}}]}{{Z}_{{\boldsymbol{\theta }}}}.$

The use of the Q-ELBO and its gradients precludes the training of the transverse fields ${\boldsymbol{\Gamma }}$ [51], which are treated as constants (hyper-parameters) throughout the training.
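For intuition at toy scale, the sketch below (ours; random placeholder couplings) builds a transverse-field Hamiltonian of the form assumed in equation (32) for a few qubits, computes ${p}_{{\boldsymbol{\theta }}}({\bf{z}})$ by exact diagonalization, and numerically checks the state-by-state inequality $\mathrm{Tr}[{{\rm{\Lambda }}}_{{\bf{z}}}\,{{\rm{e}}}^{-{ \mathcal H }}]\geqslant {{\rm{e}}}^{-{ \mathcal H }({\bf{z}})}$ that underlies the Q-ELBO.

```python
# Exact diagonalization of a tiny transverse-field model: p(z) and the
# classical-energy lower bound on the diagonal of exp(-H).
import numpy as np
from functools import reduce

n, Gamma = 4, 1.0
rng = np.random.default_rng(6)
I2 = np.eye(2)
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])

def op(single, site):                     # embed a one-qubit operator at `site`
    return reduce(np.kron, [single if i == site else I2 for i in range(n)])

h = rng.normal(0.0, 1.0, n)
W = np.triu(rng.normal(0.0, 1.0, (n, n)), 1)
H = -Gamma * sum(op(sx, l) for l in range(n))
H += sum(h[l] * op(sz, l) for l in range(n))
H += sum(W[l, m] * op(sz, l) @ op(sz, m)
         for l in range(n) for m in range(l + 1, n))

evals, evecs = np.linalg.eigh(H)
expH = evecs @ np.diag(np.exp(-evals)) @ evecs.T     # matrix exponential e^{-H}
Z = np.trace(expH)
p = np.diag(expH) / Z        # p(z) = Tr[Lambda_z e^{-H}] / Z over basis states

# <z|e^{-H}|z> >= e^{-<z|H|z>} holds by convexity; this is the bound behind
# equation (35), and it becomes looser (with the Q-ELBO) as Gamma grows.
assert np.all(np.diag(expH) >= np.exp(-np.diag(H)) - 1e-9)
```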

5.1. Experimental results with QVAE

In this section, we show that QVAEs can be effectively trained via the looser quantum bound (Q-ELBO). While it is computationally infeasible to train a QVAE that has a large QBM in the latent space with QMC, sampling from large QBMs is possible with the use of quantum annealing devices. Given the results of the previous and present sections, we thus expect that quantum annealing devices could be used to sample from large QBMs in the latent space of QVAEs to achieve competitive results on datasets such as MNIST.

To perform experiments with QVAEs, we have considered exactly the same models used in the case of DVAEs, exchanging the RBMs with QBMs. As expressed by equations (35) and (36), we train the VAE by maximizing the Q-ELBO, a lower bound to the true ELBO. To compute the negative phase in equation (36), we have used population annealing (PA) for CT-QMC (see appendix E for more details). We have considered a population of 1000 samples and 5 sweeps per gradient evaluation. Despite being one of the most effective sampling methods we considered, population-annealed CT-QMC is still numerically expensive and prevented us from fully training QVAEs with large QBMs. We thus considered (restricted) QBMs with 16, 32, and 64 units per layer and estimated the ELBO obtained with different values of the transverse field Γ. Table 2 shows the ELBO and Q-ELBO for different sizes of the QBMs and values of Γ. To estimate the Q-ELBO, we use the classical energy function in the positive phase (with no transverse field) and a quantum partition function (see appendix E). The ELBO is calculated using the true quantum probability of a state in the positive phase. The difference between the ELBO and the Q-ELBO is due to equation (34). We emphasize that the results of table 2 correspond to ELBOs obtained at different stages during training (800, 250, and 50 epochs for the cases with 16, 32, and 64 units per layer, respectively). These numbers simply correspond to the largest number of epochs we were able to train each model during the preparation of this work. Also, note that we have chosen not to report the LL in this table, since doing so requires importance sampling, which is computationally expensive: the quantum probabilities must be calculated for a large number of latent samples per input image.

Table 2.  Evaluation on the validation set: RBM16×16 at 800 epochs, RBM32×32 at 250 epochs, RBM64×64 at 50 epochs. The statistical error on the numerical results is smaller than ±0.2 in all cases.

MNIST (static binarization)

Model                         ELBO           Q-ELBO
QVAE  Γ = 0   RBM16×16        −109.3 ± 0.2   −109.3 ± 0.2
      Γ = 1   QBM16×16        −110.5         −120.6
      Γ = 2   QBM16×16        −115.3         −135.8
QVAE  Γ = 0   RBM32×32        −101.8         −101.8
      Γ = 1   QBM32×32        −103.6         −117.9
      Γ = 2   QBM32×32        −112.1         −139.7
QVAE  Γ = 0   RBM64×64        −105.7         −105.7
      Γ = 1   QBM64×64        −108.7         −133.9
      Γ = 2   QBM64×64        −120.0         −165.2
Our results show, as expected, that the Q-ELBO becomes looser as the transverse field is increased. Still, we observe that the corresponding ELBO obtained during training is much tighter and closer to the classical case. This explains why we are able to effectively train QVAEs with values of the transverse field as large as 2. We consider this value of the transverse field to be relatively large, since the typical scale of the trained couplings in the classical part of the Hamiltonian is of order 1. Figure 3 shows MNIST images generated with a QVAE and CT-QMC sampling. We see that the quality of the generated samples is satisfactory for the considered values of the transverse field (up to Γ = 2).


Figure 3. Comparison of generated MNIST digits with different values of the transverse field Γ at the same stage of training (90 epochs).


It is important to stress that the deterioration of performance we observe in table 2 is mostly due to the fact that the Q-ELBO used for training becomes looser as the transverse field increases. This is not necessarily an intrinsic limitation of QVAEs. Indeed, in [51] it was shown that small quantum Boltzmann machines perform better than their classical counterparts if training is performed via direct maximization of the LL.

6. Conclusions

We proposed a variational inference model, the QVAE, that uses QBMs to implement the generative process in its latent space. We showed that this infrastructure can be powerful: at larger latent-space dimensions it gives state-of-the-art results (for variational inference-based models with stochastic discrete units) on the MNIST dataset. We used CT-QMC to sample from the QBMs in the latent space and were limited to smaller dimensions (up to a 64 × 64 QBM) due to computational cost. Introducing QBMs in the latent space of our model imposes an additional quantum bound (Q-ELBO) on the ELBO objective function of VAEs. However, we demonstrated empirically that QVAEs have performance generally similar to their classical limit, in which the transverse field is absent. One important open question for future work is whether it is possible to improve the performance of QVAEs by using bounds to the LL that are tighter than the Q-ELBO used in this work.

During training, both RBMs and QBMs develop well-defined modes that make sampling via Markov chain Monte Carlo methods very inefficient. Quantum annealers could provide a computational advantage by exploiting quantum tunneling to accelerate mixing between different modes. A computational advantage of this type was observed, with respect to quantum Monte Carlo methods, in [92]. The successful use of quantum annealers will likely require tailored implementations that mitigate physical limitations of actual devices, such as control errors and limited coupling range and connectivity. This is a promising line of research which we are exploring for upcoming works.

This work is an attempt to add quantum algorithms to powerful existing classical frameworks to obtain new competitive generative models. It can be considered a bedrock on which the next generation of quantum annealers can be efficiently used to solve realistic problems in machine learning.

Acknowledgments

The authors would like to thank Hossein Sadeghi, Arash Vahdat, and Jason T Rolfe for useful discussions during the preparation of this work.

Appendix A.: VAE with Gaussian variables

In its simplest version, a VAE's prior and approximate posterior are products of normal distributions (both with diagonal covariance matrices) chosen as follows: $p({\boldsymbol{\zeta }})={\prod }_{l}{ \mathcal N }({\zeta }_{l};0,1)$ and ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})={\prod }_{l}{ \mathcal N }({\zeta }_{l};{\mu }_{l},{\sigma }_{l}^{2})$,

where the prior is independent of the parameters ${\boldsymbol{\theta }}$ and the means and variances ${\boldsymbol{\mu }}$ and ${{\boldsymbol{\sigma }}}^{2}$ are functions of the inputs ${\bf{x}}$ and of the parameters ${\boldsymbol{\phi }};$ the dependence on ${\boldsymbol{\phi }}$ is sometimes left implicit when the variable indices are shown. The mean and variance are usually the outputs of a deep neural network. The diagonal Gaussians allow for an easy implementation of the reparameterization trick: ${\zeta }_{l}={\mu }_{l}+{\sigma }_{l}{\rho }_{l}$, with ${\rho }_{l}\sim { \mathcal N }(0,1)$.

The KL divergence is the sum of two simple Gaussian integrals ${D}_{{\rm{KL}}}({q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})| | {p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}))$ = $-H({q}_{{\boldsymbol{\phi }}})+H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})$:

Equation (A1): ${D}_{{\rm{KL}}}({q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}|{\bf{x}})\,||\,{p}_{{\boldsymbol{\theta }}}({\boldsymbol{\zeta }}))=\frac{1}{2}\sum_l \left({\mu }_{l}^{2}+{\sigma }_{l}^{2}-\log {\sigma }_{l}^{2}-1\right).$

The only term that requires the reparameterization trick to obtain a low-variance estimate of the gradient is then the autoencoding term:

Equation (A2): ${\mathbb{E}}_{{\boldsymbol{\rho }}\sim { \mathcal N }(0,1)}[\log {p}_{{\boldsymbol{\theta }}}({\bf{x}}|{\boldsymbol{\zeta }}({\boldsymbol{\phi }},{\boldsymbol{\rho }}))].$
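A minimal sketch of this Gaussian reparameterization together with the analytic KL term of equation (A1); the encoder outputs are random placeholders.

```python
# Diagonal-Gaussian VAE pieces: reparameterized sample and closed-form KL.
import numpy as np

rng = np.random.default_rng(7)
mu = rng.normal(size=32)                  # encoder means for one input x
log_var = rng.normal(scale=0.1, size=32)  # encoder log-variances
sigma = np.exp(0.5 * log_var)

rho = rng.normal(size=32)                 # rho ~ N(0, 1), phi-independent
zeta = mu + sigma * rho                   # reparameterized latent sample

kl = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1.0)   # equation (A1)
```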

Appendix B.: DVAE with Bernoulli variables

The simplest DVAE can be implemented by assuming that the prior and the approximating posterior are both products of Bernoulli distributions: ${p}_{{\boldsymbol{\theta }}}({\bf{z}})={\prod }_{l}{p}_{l}^{{z}_{l}}{(1-{p}_{l})}^{1-{z}_{l}}$ and ${q}_{{\boldsymbol{\phi }}}({\bf{z}}| {\bf{x}})={\prod }_{l}{q}_{l}^{{z}_{l}}{(1-{q}_{l})}^{1-{z}_{l}}$,

where the Bernoulli probabilities ql are functions of the inputs ${\bf{x}}$ and of the parameters ${\boldsymbol{\phi }}$ and are the outputs of a deep feed-forward network. We have already presented the following expression for the entropy in section 4.1:

Equation (B1): $H({q}_{{\boldsymbol{\phi }}})=-\sum_{l=1}^{L}\left({q}_{l}\log {q}_{l}+(1-{q}_{l})\log (1-{q}_{l})\right).$

The cross-entropy can be derived similarly:

Equation (B2): $H({q}_{{\boldsymbol{\phi }}},{p}_{{\boldsymbol{\theta }}})=-\sum_{l=1}^{L}\left({q}_{l}\log {p}_{l}+(1-{q}_{l})\log (1-{p}_{l})\right).$

Similar to the fully Gaussian case of the previous section, the only term that requires the reparameterization trick to obtain a low-variance estimate of the gradient is the autoencoding term, as in equation (21) of the main text.
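For completeness, a short sketch of the analytic entropy and cross-entropy of equations (B1) and (B2) for factorial Bernoulli distributions; the probabilities are placeholders.

```python
# Entropy, cross-entropy, and the resulting KL for factorial Bernoullis.
import numpy as np

def bernoulli_entropy(q):
    return -np.sum(q * np.log(q) + (1 - q) * np.log(1 - q))        # (B1)

def bernoulli_cross_entropy(q, p):
    return -np.sum(q * np.log(p) + (1 - q) * np.log(1 - p))        # (B2)

q = np.full(16, 0.7)   # approximating posterior probabilities
p = np.full(16, 0.5)   # prior probabilities
kl = -bernoulli_entropy(q) + bernoulli_cross_entropy(q, p)  # D_KL(q || p) >= 0
```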

Appendix C.: Hierarchical approximating posterior

Explaining-away effects [6] introduce complicated dependencies in the approximating posterior ${q}_{{\boldsymbol{\phi }}}({\boldsymbol{\zeta }}| {\bf{x}})$, which cannot be fully captured by the products of independent distributions we have considered so far. More powerful variational approximations of the posterior can be considered by including hierarchical structures. In the case of DVAEs, a hierarchical approximating posterior may be chosen as follows:

Equation (C1): ${q}_{{\boldsymbol{\phi }}}({\bf{z}},{\boldsymbol{\zeta }}|{\bf{x}})=\prod_l r({\zeta }_{l}|{z}_{l})\,{q}_{{\boldsymbol{\phi }}}({z}_{l}|{{\boldsymbol{\zeta }}}_{m<l},{\bf{x}}).$

A multivariate generalization of the reparameterization trick can be introduced by considering the conditional-marginal CDF defined as follows:

Equation (C2): ${{\rm{F}}}_{l}({\zeta }_{m\leqslant l})=\int_{-\infty }^{{\zeta }_{l}}\mathrm{d}\zeta '\,{q}_{{\boldsymbol{\phi }}}(\zeta '|{{\boldsymbol{\zeta }}}_{m<l},{\bf{x}}),$

where in the expression above we assume the ${\zeta }_{m\ne l}$ are kept fixed. Thanks to the hierarchical structure of the approximating posterior, the ${{\rm{F}}}_{l}({\zeta }_{m\leqslant l})$ functions are formally the same functions of ζl and ql as in the case without hierarchies. The dependence of the functions ${{\rm{F}}}_{l}({\zeta }_{m\leqslant l})$ on the continuous variables ζm<l is encoded in the functions ${q}_{{\boldsymbol{\phi }}}({\bf{z}},{\boldsymbol{\zeta }}| {\bf{x}})$:

Equation (C3): ${q}_{l}={q}_{{\boldsymbol{\phi }}}({z}_{l}=1|{{\boldsymbol{\zeta }}}_{m<l},{\bf{x}}).$

The reparameterization trick is again applied thanks to ${\zeta }_{l}={{\rm{F}}}_{l}^{-1}({\rho }_{l})$, with ${\rho }_{l}\sim { \mathcal U }(0,1)$.

The KL divergence is ${D}_{{\rm{KL}}}={\mathbb{E}}_{{\boldsymbol{\rho }}}\big[\log {q}_{{\boldsymbol{\phi }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }})|{\bf{x}})-\log {p}_{{\boldsymbol{\theta }}}({\bf{z}}({\boldsymbol{\rho }},{\boldsymbol{\phi }}))\big].$

Notice that, due to the hierarchical structure of the approximating posterior, the expectations above cannot be performed analytically and must be statistically estimated with the use of the reparameterization trick.

Appendix D.: Computing the derivatives

As shown in the previous section, the KL divergence generally includes a term that depends explicitly on the discrete variables zl. When computing the gradients for back-propagation, we must account for the dependence of the discrete variables on the ${\boldsymbol{\phi }}$ parameters through the various hierarchical terms of the approximating posterior. Remembering that zl = Θ(ρl + ql − 1) and using the chain rule, we have:

Equation (D1): ${\partial }_{{\boldsymbol{\phi }}}{z}_{l}=\delta ({\rho }_{l}+{q}_{l}-1)\,{\partial }_{{\boldsymbol{\phi }}}{q}_{l}.$

The gradient of the expectation over ρ of a generic function of z can then be calculated as follows:

where, to go from the second to the third row, we have reinstated the expectation over ρl by noticing that ql does not depend on ρl and that the condition zl = 0 may be automatically enforced with the factor 1 − zl. The term 1 − ql accounts for the fact that zl = 0 with probability 1 − ql. This term is necessary to account for the statistical dependence of zl, and thus of $f({\bf{z}})$, on variables zm<l that come before in the hierarchy. The equation derived above is useful to compute the derivatives of the positive phase in the case of a hierarchical posterior and an RBM or a QRBM as priors:

Equation (D2)

with

Equation (D3)

Appendix E.: Population-annealed CT-QMC

To sample from the quantum distribution of equation (32) we use a CT-QMC algorithm [93] together with a population-annealing sampling heuristic [86, 87].

The CT-QMC algorithm is based on the representation of the quantum system with the Hamiltonian of equation (32) in terms of a classical system with an additional dimension of size M, called imaginary time. The classical configuration ${\bf{z}}$ is replaced with M configurations ${{\bf{z}}}^{a}$, a = 1, ..., M, that are coupled to each other in a periodic manner. The quantum partition function ${Z}_{{\boldsymbol{\theta }}}=\mathrm{Tr}[{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}]$ can be written as:

Equation (E1)

Here ${H}_{0}$ is the classical energy, ${H}_{0}({\bf{z}})={\sum }_{l}{h}_{l}{z}_{l}+{\sum }_{l\lt m}{W}_{{lm}}{z}_{l}{z}_{m}$, and periodicity along imaginary time implies ${{\bf{z}}}^{M+1}\equiv {{\bf{z}}}^{1}$.

CT-QMC defines a Metropolis-type transition operator acting on extended configurations ${T}_{{\boldsymbol{\theta }}}:{{\bf{z}}}^{a}\to {{\bf{z}}}^{a}{\prime} $. We use cluster updates [93] where clusters may grow only along the imaginary time direction. These updates satisfy detailed balance conditions for the distribution

Equation (E2)

Equilibrium samples from equation (E2) allow us to compute the gradient of the bound on the log-likelihood in equation (36) as

Equation (E3)

To obtain approximate samples from equation (E2), we use PA, which also gives an estimate of the quantum partition function [86, 87]. We choose a linear schedule in the space of parameters ${{\boldsymbol{\theta }}}_{t}=t{\boldsymbol{\theta }},t\in [0,1]$ and anneal an ensemble of N particles ${{\bf{z}}}_{n}^{a},n=1,\,\ldots ,\,N$ with periodic resampling.
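The sketch below (ours) illustrates the population-annealing bookkeeping in the classical limit Γ = 0, where each particle is a single spin configuration rather than an imaginary-time trajectory; it follows the linear schedule ${{\boldsymbol{\theta }}}_{t}=t{\boldsymbol{\theta }}$ and accumulates an estimate of $\mathrm{log}Z$ from the resampling weights. All couplings are random placeholders.

```python
# Classical population annealing for a bipartite (RBM-like) energy:
# anneal N particles from t = 0 to t = 1, resampling by Boltzmann weights
# and accumulating log Z from the weight averages.
import numpy as np

rng = np.random.default_rng(8)
n, N, T = 16, 1000, 50                    # units per layer, particles, steps
W = rng.normal(0.0, 0.1, (n, n))          # placeholder couplings
h_l, h_r = np.zeros(n), np.zeros(n)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def energy(zl, zr, t):                    # t * H_0(z), linear schedule
    return t * (zl @ h_l + zr @ h_r + np.sum((zl @ W) * zr, axis=1))

zl = rng.binomial(1, 0.5, (N, n)).astype(float)
zr = rng.binomial(1, 0.5, (N, n)).astype(float)
log_Z = 2 * n * np.log(2.0)               # Z at t = 0 is 2^(2n)
for step in range(1, T + 1):
    t_new, t_old = step / T, (step - 1) / T
    log_w = -(energy(zl, zr, t_new) - energy(zl, zr, t_old))
    m = log_w.max()
    log_Z += m + np.log(np.mean(np.exp(log_w - m)))  # log of Z_{t_new}/Z_{t_old}
    prob = np.exp(log_w - m); prob /= prob.sum()
    idx = rng.choice(N, size=N, p=prob)   # multinomial resampling
    zl, zr = zl[idx], zr[idx]
    # one block-Gibbs sweep at the new schedule point
    zr = rng.binomial(1, sigmoid(-t_new * (zl @ W + h_r))).astype(float)
    zl = rng.binomial(1, sigmoid(-t_new * (zr @ W.T + h_l))).astype(float)
```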

Finally, we must evaluate the quantum cross-entropy of equation (33), which involves computing the probability of a classical configuration $\bar{{\bf{z}}}$ under the quantum distribution, ${p}_{{\boldsymbol{\theta }}}(\bar{{\bf{z}}})\equiv \mathrm{Tr}\left[{{\rm{\Lambda }}}_{\bar{{\bf{z}}}}{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}\right]$. This is done by noticing that

Equation (E4): $\mathrm{Tr}\big[{{\rm{\Lambda }}}_{\bar{{\bf{z}}}}\,{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}\big]=\langle \bar{{\bf{z}}}|{{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}|\bar{{\bf{z}}}\rangle .$

Thus, to obtain ${p}_{{\boldsymbol{\theta }}}(\bar{{\bf{z}}})$, we must compute the partition function $\langle \bar{{\bf{z}}}| {{\rm{e}}}^{-{{ \mathcal H }}_{{\boldsymbol{\theta }}}}| \bar{{\bf{z}}}\rangle $ of a 'clamped' system, where the first slice of imaginary time is fixed ${{\bf{z}}}^{1}\equiv \bar{{\bf{z}}}$ and we integrate out the rest of the slices taking into account the external field acting on slices 2 and M.
