(WIP) Generative modeling using Diffusion map particle system
Introduction
This review aims to provide a self-contained, pedagogical overview of deep generative modeling using score-based diffusion, culminating in a discussion of the recent paper from Youssef Marzouk.
Generative models are manifestations of a widely held philosophical belief: that every observed system is the outcome of some underlying generative process.
Clearly, there are infinitely many possibilities that may ensue from such a generative process. Assuming that we can enumerate all of these outcomes, say with the set \(\Omega\), a generative process \(\mathcal{G}\) underlying a system \(\mathcal{S}\) can be effectively described using probability theory. The problem of generative modeling is therefore, in effect, the problem of probability density (distribution) estimation.
In practice, there are several obstacles to effectively estimating the probability density \(p_{\mathcal{G}}(z) \: \forall z \in \Omega\) from a finite set of observations \(\Omega_f \subset \Omega\). Regardless, any method \(m\) that claims to do so must satisfy the following conditions.
1. For a suitably chosen functional norm \(\|\cdot\|_f\), \(\lim_{\Omega_f \to \Omega} \|p_{\mathcal{G}}^{m}(z) - p_{\mathcal{G}}(z)\|_f = 0\).
2. It must be computationally efficient to sample from \(p_{\mathcal{G}}^{m}(z)\).
3. It must be computationally efficient to evaluate \(p_{\mathcal{G}}^{m}(z)\).
Traditional function approximation methods can satisfy (1) to an arbitrary level of accuracy, yet fail with respect to (2) and (3) due to the curse of dimensionality. Deep learning models, on the other hand, are adept at circumventing this problem. Popular deep generative models (in no particular order) include
1. Generative Adversarial Networks
2. Variational Autoencoders
3. Normalizing Flows
4. Score-based Diffusion Models
5. Autoregressive Models
6. Energy-based Models
to name a few.
In what follows, a tutorial-style introduction to score-based diffusion algorithms is presented and compared with the diffusion map particle system method.
Score-based diffusion
Diffusion models are built on the following assumptions and results, which we will take as axiomatic.
- Let \(p_f\) be the probability distribution defined over a certain function space \(\mathcal{F}\) such that \(\forall f \in \mathcal{F}, f:x \mapsto y\) where \(x,y \in \mathbb{R}^d\).
- Let there exist a time-dependent stochastic differential operator \(\mathcal{L}_t\) which transforms \(f(x) \sim p_f\) such that \(\lim_{t \to \infty} \mathcal{L}_t f(x) = g(x)\), where \(g(x) \sim \mathcal{GP}(\bar{x},k(x,x'))\).
- For every such forward operator, there exists an equivalent reverse-time stochastic differential operator \(\tilde{\mathcal{L}}_t\) such that \(\lim_{t \to \infty} \tilde{\mathcal{L}}_t g(x) = f(x)\).
Assumptions 2 and 3, taken together, yield a differentiable and invertible map between a distribution over a trivial function space (noisy functions, in this case) and a distribution over functions with a certain structure. This forms the crux of generative modeling using score-based diffusion. The key idea is to sample from, or evaluate densities under, the trivial distribution and transform the samples back and forth by solving stochastic differential equations. This leads to the following questions.
- For a given \(\mathcal{F}\), what is the most appropriate choice of \(\mathcal{L}\) in terms of computational tractability? That is,
- Which stochastic differential equation reaches its steady state relatively quickly?
- Which numerical integration scheme preserves the structure of any \(f \sim p_f\)?
- What is the dual operator \(\tilde{\mathcal{L}}\) for a given \(\mathcal{L}\)?
For better or worse, there is no clear resolution on the “best SDE” and the “best numerical integrator” for a given function class. Most of today’s state of the art, in this regard, is based on heuristics.
Computational model
Let \(p_0(x)\) be the target distribution. Given samples \(x_i^0 \sim p_0(x)\) with \(x_i^0 \in \mathbb{R}^d\), we proceed as follows.
- Numerically solve \(dx^t = f(x^t,t)\:dt + g(t)\:dW(t)\) with initial condition \(x_i^0\) for \(t \in [0,T]\).
- Here, \(f(x,t)\) and \(g(t)\) are known as the drift and diffusivity terms, respectively.
- \(W(t)\) is a Wiener process.
- Parameterize a neural network \(\mathcal{S}_{\Theta}:\mathbb{R}^d \times [0,T] \mapsto \mathbb{R}^d\) to approximate the score \(\nabla_x \log p_t(x)\), where \(p_t\) denotes the marginal density of \(x^t\).
- Numerically solve \(dx^t = [f(x^t,t) - g(t)^2\, \mathcal{S}_{\Theta}(x^t,t)]\:dt + g(t)\:d\bar{W}(t)\) from \(t=T\) back to \(t=0\), where \(\bar{W}(t)\) is a reverse-time Wiener process. A minimal sketch of both solves follows this list.
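As a concrete illustration, here is a minimal Euler–Maruyama sketch of the two solves, assuming user-supplied callables `f`, `g`, and a score model `score`; the names, step counts, and the explicit integrator are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def forward_sde(x0, f, g, T=1.0, n_steps=1000, seed=0):
    """Euler--Maruyama solve of dx = f(x,t) dt + g(t) dW over [0, T]."""
    rng = np.random.default_rng(seed)
    x, dt = np.array(x0, dtype=float), T / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

def reverse_sde(xT, f, g, score, T=1.0, n_steps=1000, seed=0):
    """Euler--Maruyama solve of dx = [f(x,t) - g(t)^2 score(x,t)] dt + g(t) dW,
    integrated backwards from t = T to t = 0."""
    rng = np.random.default_rng(seed)
    x, dt = np.array(xT, dtype=float), T / n_steps
    for k in range(n_steps, 0, -1):
        t = k * dt
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```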
Use this computational routine to solve
\[\begin{equation} \arg \min_{\Theta}\, \mathbb{E}_{x \sim p_0(x)}\left[\|m^{0 \to T \to 0}(x) - x\|\right] \end{equation}\]
where \(m^{0 \to T \to 0}\) denotes the composition of the forward and reverse solves.
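As a sketch, this round-trip objective can be estimated with the two routines above (note that much of the score-based diffusion literature instead trains \(\mathcal{S}_{\Theta}\) by denoising score matching rather than by differentiating through the solvers):

```python
def round_trip_loss(x_batch, f, g, score):
    """Monte Carlo estimate of E[ ||m^{0->T->0}(x) - x|| ] over a batch."""
    xT = forward_sde(x_batch, f, g)        # diffuse the samples: 0 -> T
    x0_hat = reverse_sde(xT, f, g, score)  # transport them back: T -> 0
    return np.mean(np.linalg.norm(x0_hat - x_batch, axis=-1))
```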
Common choices for \(f, g\) in the literature include:
DDPM (variance preserving)
\[\begin{equation} f(x,t) = -\frac{\beta(t)\,x}{2}, \qquad g(t) = \sqrt{\beta(t)} \end{equation}\]
SMLD (variance exploding)
\[\begin{equation} f(x,t) = 0, \qquad g(t) = \sqrt{\frac{d \sigma^2(t)}{dt}} \end{equation}\]
for some suitably chosen functions \(\beta, \sigma\).
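For concreteness, these pairs might be coded as follows; the linear \(\beta\) schedule, the geometric \(\sigma\) schedule, and the endpoint constants are assumptions loosely following common settings in the literature.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0     # assumed endpoints of a linear beta schedule
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0  # assumed endpoints of a geometric sigma schedule

# DDPM / variance-preserving choice: f(x,t) = -beta(t) x / 2, g(t) = sqrt(beta(t))
def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def f_vp(x, t):
    return -0.5 * beta(t) * x

def g_vp(t):
    return np.sqrt(beta(t))

# SMLD / variance-exploding choice: f = 0 and g(t) = sqrt(d sigma^2(t)/dt),
# with the geometric schedule sigma(t) = SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN)^t
def sigma(t):
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def f_ve(x, t):
    return np.zeros_like(x)

def g_ve(t):
    # d sigma^2 / dt = 2 log(SIGMA_MAX / SIGMA_MIN) * sigma(t)^2
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))
```

With either pair in hand, `forward_sde(x0, f_vp, g_vp)` diffuses samples toward (approximately) Gaussian noise, and `reverse_sde` transports noise back toward \(p_0\).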