---
## Exponential family distributions
We will consider exponential family variational posteriors with
natural parameters $\psi$,
dual (moment) parameters $\rho$,
sufficient statistics $T(\theta)$,
and log-partition function $\Phi(\psi)$:
$$
\begin{aligned}
q_{\psi}(\theta) &= \exp(\psi^\intercal T(\theta) - \Phi(\psi)) \\
\rho &= E_{\theta \sim q_{\psi}}[T(\theta)]
= \nabla_{\psi} \Phi(\psi)
\end{aligned}
$$
Example: Gaussian distribution $q_{\psi_t}(\theta) = N(\theta \mid \mu_t, \Sigma_t)$:
$$
\begin{aligned}
\psi_t^{(1)} &= \Sigma_t^{-1} \mu_t \\
\psi_t^{(2)} &= -\frac{1}{2} \Sigma_t^{-1} \\
\rho_t^{(1)} &= \mu_t \\
\rho_t^{(2)} &= \mu_t \mu_t^\intercal + \Sigma_t
\end{aligned}
$$
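As a sanity check, the identity $\rho = \nabla_{\psi} \Phi(\psi)$ can be verified numerically with autodiff. Below is a minimal JAX sketch (the function and variable names are ours, not from any library), using the Gaussian log-partition $\Phi(\psi) = -\frac{1}{4}\psi^{(1)\intercal}(\psi^{(2)})^{-1}\psi^{(1)} - \frac{1}{2}\log\det(-2\psi^{(2)})$ with constants dropped:

```python
import jax
import jax.numpy as jnp

def gaussian_log_partition(psi1, psi2):
    # Phi(psi) for a Gaussian, with psi1 = Sigma^{-1} mu and psi2 = -(1/2) Sigma^{-1};
    # additive constants independent of psi are dropped.
    return (-0.25 * psi1 @ jnp.linalg.solve(psi2, psi1)
            - 0.5 * jnp.linalg.slogdet(-2.0 * psi2)[1])

mu = jnp.array([1.0, -2.0])
Sigma = jnp.array([[2.0, 0.5], [0.5, 1.0]])
psi1 = jnp.linalg.solve(Sigma, mu)     # psi^{(1)} = Sigma^{-1} mu
psi2 = -0.5 * jnp.linalg.inv(Sigma)    # psi^{(2)} = -(1/2) Sigma^{-1}

# rho = grad_psi Phi(psi) should recover the moment parameters.
rho1, rho2 = jax.grad(gaussian_log_partition, argnums=(0, 1))(psi1, psi2)
print(jnp.allclose(rho1, mu, atol=1e-5))                         # rho^{(1)} = mu
print(jnp.allclose(rho2, jnp.outer(mu, mu) + Sigma, atol=1e-5))  # rho^{(2)} = mu mu^T + Sigma
```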
---
## NGD = preconditioned gradient descent
$$
\begin{aligned}
\psi &:=
\psi + \alpha F_{\psi}^{-1} \nabla_{\psi} L(\psi)
\end{aligned}
$$
where $F_{\psi}$ is the Fisher information matrix of $q_{\psi}$.
For exponential families, we have
$$
\begin{aligned}
F_{\psi} &= \frac{\partial \rho}{\partial \psi} \\
F_{\psi}^{-1} \nabla_{\psi} L(\psi) &= \nabla_{\rho} L(\rho)
\end{aligned}
$$
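The second identity is just the chain rule: $F_{\psi} = \partial \rho / \partial \psi = \nabla^2_{\psi} \Phi(\psi)$ is symmetric, so
$$
\nabla_{\psi} L(\psi)
= \left(\frac{\partial \rho}{\partial \psi}\right)^{\intercal} \nabla_{\rho} L(\rho)
= F_{\psi} \nabla_{\rho} L(\rho)
\quad\Longrightarrow\quad
F_{\psi}^{-1} \nabla_{\psi} L(\psi) = \nabla_{\rho} L(\rho).
$$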
The Bayesian Learning Rule (BLR; Khan and Rue, 2023) uses multiple iterations
of natural gradient descent (NGD) on the VI objective
(the Evidence Lower Bound, ELBO).
In the online setting, we get the following iterative update at each
step $t$, with inner iterations indexed by $i$ (for NGD, the gradient
below is preconditioned by $F_{\psi}^{-1}$, i.e., replaced by
$\nabla_{\rho} L_t$):
$$
\begin{aligned}
\psi_{t,i} &= \psi_{t,i-1} + \alpha \nabla_{\psi_{t,i-1}} L_t(\psi_{t,i-1})
\end{aligned}
$$
Bayes By Backprop (Blundell et al., 2015) is similar to BLR but uses GD instead of NGD.
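A minimal sketch of this inner loop in JAX (the names and the toy objective are ours, not from the papers); this is the plain-gradient GD/BBB variant, and the NGD/BLR variant would additionally precondition the gradient with $F_{\psi}^{-1}$:

```python
import jax
import jax.numpy as jnp

def online_step(psi, L_t, alpha=0.1, num_inner=10):
    # Inner loop: psi_{t,i} = psi_{t,i-1} + alpha * grad L_t(psi_{t,i-1}).
    grad_L = jax.grad(L_t)
    for _ in range(num_inner):
        psi = psi + alpha * grad_L(psi)
    return psi

# Toy quadratic standing in for the step-t objective L_t(psi).
L_t = lambda psi: -0.5 * jnp.sum((psi - 1.0) ** 2)
psi = online_step(jnp.zeros(3), L_t)
print(psi)  # moves toward the optimum at 1.0
```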
::right::

**Definitions: Switching State Space Model (SSM)**

*(plot omitted)*
Linear Gaussian model
$$
p_t(y_t|\theta_t) = N(y_t|H_t \theta_t, R_t)
$$
where $H_t$ is the observation matrix and $R_t$ is the observation noise covariance.
Special case: linear regression ($H_t = x_t^\intercal$).
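For concreteness, a minimal sketch of this likelihood in JAX (the function and variable names are ours):

```python
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal

def log_likelihood(theta, H, R, y):
    # log N(y | H theta, R)
    return multivariate_normal.logpdf(y, mean=H @ theta, cov=R)

theta = jnp.array([0.5, -1.0])
H = jnp.array([[1.0, 0.0], [1.0, 1.0]])  # observation matrix H_t
R = 0.1 * jnp.eye(2)                     # observation noise covariance R_t
print(log_likelihood(theta, H, R, jnp.array([0.4, -0.6])))
```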
Binary logistic regression
$$
p(y_t|\theta_t, x_t) = {\rm Bern}(y_t|\sigma(x_t^\intercal \theta_t))
$$
Multinomial logistic regression
$$
p(y_t|\theta_t, x_t) = {\rm Cat}(y_t|{\cal S}(\theta_t x_t))
$$
MLP classifier
$$
p(y_t|\theta_t, x_t) = {\rm Cat}(y_t|{\cal S}(\theta_t^{(2)} \, \text{relu}(\theta_t^{(1)} x_t))) = {\rm Cat}(y_t|h(\theta_t, x_t))
$$
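A minimal sketch of the MLP observation model $h(\theta_t, x_t)$ in JAX (shapes and names are ours):

```python
import jax
import jax.numpy as jnp

def log_likelihood(theta, x, y):
    # log Cat(y | S(theta2 @ relu(theta1 @ x))), with S the softmax.
    theta1, theta2 = theta
    logits = theta2 @ jax.nn.relu(theta1 @ x)
    return jax.nn.log_softmax(logits)[y]

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
theta = (jax.random.normal(key1, (8, 4)),   # theta^{(1)}: hidden layer
         jax.random.normal(key2, (3, 8)))   # theta^{(2)}: output layer
print(log_likelihood(theta, jnp.ones(4), 1))
```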
- 4 update rules: (NGD or GD) x (implicit reg. or KL reg.)
- 4 gradient computations: (Monte Carlo (MC) or linearization (Lin)) x (Hessian (Hess) or empirical Fisher (EF))