In this post we’ll seek to understand some of the principles underlying SteinGAN, a method for amortizing sampling by training a model (often called the generator) that leverages Stein variational gradient descent. Specifically, let us denote by $\xi$ a random variable drawn from a prior noise distribution; the objective of SteinGAN is then to produce a model $G\left(\xi; \nu\right)$ such that the output of $G$ is distributed according to a target distribution $p\left(x;\theta\right)$. In this setup, both $\nu$, the parameters of the generative model, and $\theta$, the parameters of the target distribution, are unknown and need to be estimated.

In this note, we’ll denote by $p_{\theta}\left(x\right) = p\left(x\vert\theta\right)$ a distribution within the exponential family as follows, \begin{align} p_{\theta}\left(x\right) = \frac{1}{Z} \exp\left(f\left(x;\theta\right)\right), \end{align} where $Z$ is a normalizing constant and $f\left(x;\theta\right)$ is a negative energy function. In the case of a normal distribution, we simply have that $Z=\sqrt{2\pi\sigma^2}$ and $f\left(x;\theta\right) = -\frac{\left(x-\mu\right)^2}{2\sigma^2}$ where $\theta=\left(\mu,\sigma\right)$.
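As a quick sanity check on this parameterization, the Gaussian case can be written out in code; here `f` and `Z` are exactly the expressions above, and we verify that $\exp\left(f\right)/Z$ reproduces the standard normal density:

```python
import math

def f(x, mu, sigma):
    # Negative energy function of the Gaussian: -(x - mu)^2 / (2 sigma^2).
    return -(x - mu) ** 2 / (2 * sigma ** 2)

def p(x, mu, sigma):
    # Exponential-family form p(x) = exp(f(x; theta)) / Z with Z = sqrt(2 pi sigma^2).
    Z = math.sqrt(2 * math.pi * sigma ** 2)
    return math.exp(f(x, mu, sigma)) / Z

# Compare against the standard normal pdf written out directly.
for x in [-1.0, 0.0, 2.5]:
    ref = math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)
    assert abs(p(x, 0.0, 1.0) - ref) < 1e-12
```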

Learning to sample from an unknown probability distribution is a difficult problem; nonetheless, it has recently been approached from the perspective of generative adversarial networks, which seek to balance fidelity of the generator $G\left(\cdot; \nu\right)$ with discriminative power of the probability model $p_\theta$. The essential idea behind SteinGAN specifically is that if $x_i = G\left(\xi_i;\nu\right)$ are particles seeking to approximate a random draw from a probability distribution $p_\theta$, then the “corrected” particles $x_i + \epsilon \hat{\phi}\left(x_i\right)$ represent an even better configuration, where,
\begin{align}
\hat{\phi}\left(y\right) = \frac{1}{n} \sum_{j=1}^{n} \left[k\left(x_j, y\right)\nabla_{x_j}\log p_\theta\left(x_j\right)+\nabla_{x_j} k\left(x_j, y\right)\right],
\end{align}
and $k\left(\cdot,\cdot\right)$ is a kernel function, such as the squared exponential. Therefore, the parameters of the generative model, $\nu$, should be updated so that $G$ better reproduces $x_i + \epsilon\hat{\phi}\left(x_i\right)$. In other words, we are leveraging the fact that, among functions in the unit ball of an RKHS, $\hat{\phi}$ is the optimal perturbation direction for decreasing the KL-divergence between the particle distribution and the target. On this principle, it was proposed to update the parameters of the generative model as follows (where $x_i$ are regarded as fixed and *not* a function of the generative parameters $\nu$),
\begin{align}
\nu \leftarrow \text{argmin}_{\nu} \sum_{i=1}^n \left|\left|G\left(\xi_i;\nu\right) - x_i - \epsilon\hat{\phi}\left(x_i\right)\right|\right|_2^2.
\end{align}
Differentiating with respect to $\nu$ and taking a single gradient step yields the following update formula, where we use the fact that at the current parameters $G\left(\xi_i;\nu\right) = x_i$, so the residual inside the norm reduces to $-\epsilon\hat{\phi}\left(x_i\right)$,
\begin{align}
\nu \leftarrow \nu + \epsilon \sum_{i=1}^n \left(\nabla_{\nu} G\left(\xi_i;\nu\right)\right) \hat{\phi}\left(x_i\right).
\end{align}
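To make the moving parts concrete, here is a minimal numerical sketch (not the paper's implementation) of the correction $\hat{\phi}$ and the generator update for a one-dimensional standard normal target. The affine generator $G\left(\xi;\nu\right) = \nu_0 + \nu_1\xi$, the fixed kernel bandwidth, and the use of an average in place of the sum (which merely rescales the step size) are all assumptions made for illustration:

```python
import numpy as np

def grad_log_p(x):
    # For a standard normal target, d/dx log p(x) = -x.
    return -x

def k(x, y, h=1.0):
    # Squared-exponential kernel with fixed bandwidth h (an assumption).
    return np.exp(-(x - y) ** 2 / (2 * h ** 2))

def grad_k_x(x, y, h=1.0):
    # Derivative of the kernel with respect to its first argument.
    return -(x - y) / h ** 2 * k(x, y, h)

def phi_hat(y, particles, h=1.0):
    # phi_hat(y) = (1/n) sum_j [k(x_j, y) grad log p(x_j) + grad_{x_j} k(x_j, y)].
    return np.mean(k(particles, y, h) * grad_log_p(particles)
                   + grad_k_x(particles, y, h))

rng = np.random.default_rng(0)
nu = np.array([2.0, 1.0])            # generator starts biased away from the target
eps = 0.05
for _ in range(200):
    xi = rng.standard_normal(100)    # prior noise xi_i
    x = nu[0] + nu[1] * xi           # particles x_i = G(xi_i; nu)
    phis = np.array([phi_hat(y, x) for y in x])
    # grad_nu G(xi; nu) = (1, xi), so the chain-rule update averages phi and xi*phi.
    nu = nu + eps * np.array([np.mean(phis), np.mean(xi * phis)])
```

After these updates the generator's offset $\nu_0$ should be driven toward zero, since the correction $\hat{\phi}$ consistently points particles back toward the mode of the target.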

But this is only part of the problem in learning to sample. The generative model gives a mechanism to efficiently sample from the probability distribution $p_\theta$, but it does not enable us to determine what the model parameters $\theta$ should be in the first place. In my previous post, *A Likelihood Identity for Exponential Families*, I demonstrated that the gradient of the log-likelihood of a distribution in the exponential family can be written as,
\begin{align}
\frac{\nabla_{\theta} \log L}{n} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta} f\left(x_i;\theta\right) - \mathbb{E}_{x\sim p}\left[\nabla_{\theta} f\left(x;\theta\right)\right].
\end{align}
The difficulty in computing this gradient for arbitrary exponential families lies in the second term involving a theoretical expectation of the gradient of the negative energy function. This is because computing the expectation involves the partition function, which is generally intractable to compute for arbitrary distributions in the exponential family. However, if we had a means of efficiently sampling from $p_{\theta}$, then we could efficiently compute a stochastic estimate of the gradient by simply considering the empirical average of $\nabla_{\theta} f\left(x;\theta\right)$ over samples. Therefore, the proposed update of the model parameters is,
\begin{align}
\theta\leftarrow \theta + \frac{\kappa}{n} \sum_{i=1}^n \left[\nabla_{\theta} f\left(x_i^{+};\theta\right) - \nabla_{\theta} f\left(x_i^{-};\theta\right)\right],
\end{align}
where $\kappa>0$ is a learning rate for updating the model parameters, $x_{i}^{+}$ is a *true* observation from the target distribution $p_{\theta}$ and $x_{i}^{-} = G\left(\xi_i;\nu\right)$ is an *approximate* draw from $p_{\theta}$ using amortized sampling from the generative model. SteinGAN proposes to estimate the parameters of the generative model and the probability model by alternating between optimizing $\nu$ and optimizing $\theta$. In a future post, I’ll implement SteinGAN to sample from a simple distribution and explore the practical considerations necessary for designing neural networks that learn to sample.
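As a small illustration of the $\theta$ update in isolation, the sketch below fits the mean of a Gaussian with known unit variance, so $f\left(x;\theta\right) = -\left(x-\theta\right)^2/2$ and $\nabla_\theta f\left(x;\theta\right) = x - \theta$. An oracle sampler stands in for the trained generator $G$ here, which is an assumption for illustration (in the full algorithm these negative samples would come from the current generator):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = 3.0
data = rng.normal(mu_true, 1.0, size=5000)    # "true" observations x_i^+

theta, kappa, n = 0.0, 0.5, 256
for _ in range(100):
    x_plus = rng.choice(data, size=n)         # minibatch of true samples
    x_minus = rng.normal(theta, 1.0, size=n)  # oracle stand-in for G(xi_i; nu)
    # grad_theta f(x; theta) = x - theta for this Gaussian energy, so the
    # update reduces to theta <- theta + kappa * (mean(x_plus) - mean(x_minus)).
    grad_plus = x_plus - theta
    grad_minus = x_minus - theta
    theta += kappa / n * np.sum(grad_plus - grad_minus)
```

The positive samples pull $\theta$ toward the data while the negative samples push it away from the model's own draws; the two forces balance when the model matches the data, so $\theta$ settles near the true mean.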