PRML Probability Distributions


1. Bernoulli Distribution

1.1 Definition and Maximum Likelihood

Suppose we are tossing a coin. Let
$$
p(x=1|\mu )=\mu \\
p(x=0|\mu )=1-\mu \tag{1}
$$
where $0 \leq \mu \leq 1$. It follows immediately that
$$
E[x]= \mu \\
var[x]=\mu (1-\mu) \tag{2}
$$
Let $D$ = {$x_1$ ,…,$x_N$} be a set of observed values of $x$. The log-likelihood function is
$$
ln(p(D|\mu))= \sum_{n=1}^Nln(p(x_n|\mu))= \sum_{n=1}^N{x_nln\mu+(1-x_n)ln(1-\mu)}\tag{3}
$$
Setting the derivative of $ln(p(D|\mu))$ with respect to $\mu$ to zero, we obtain the maximum likelihood estimator
$$
\mu_{ML}=\frac{1}{N}\sum_{n=1}^Nx_n \tag{4}
$$
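As a quick numerical sanity check of (4), here is a minimal NumPy sketch (the true $\mu$ and the sample size are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 0.7                              # assumed true head probability (illustrative)
x = rng.binomial(1, mu_true, size=1000)    # N = 1000 simulated coin flips

mu_ml = x.mean()                           # equation (4): mu_ML = (1/N) * sum(x_n)
print(mu_ml)                               # close to 0.7 for large N
```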
In order to obtain the normalization coefficient, we note that out of $N$ coin flips, we have to add up all of the possible ways of obtaining $m$ heads, so that the binomial distribution can be written
$$
Bin(m|N,\mu)={N\choose m}\mu ^m(1-\mu)^{N-m}\tag{5}
$$

1.2 Mean and Variance

Thus, we have
$$
E[m]=\sum_{m=0}^N{mBin(m|N,\mu)}=N\mu \tag{6}
$$
$$
var[m]=\sum_{m=0}^N{(m-E[m])^2}Bin(m|N,\mu)=N\mu(1-\mu) \tag{7}
$$
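A short check of (6) and (7) by summing over the binomial pmf directly (assuming SciPy; $N$ and $\mu$ are illustrative):

```python
import numpy as np
from scipy.stats import binom

N, mu = 10, 0.3
m = np.arange(N + 1)
pmf = binom.pmf(m, N, mu)

E_m = np.sum(m * pmf)                  # equation (6): N*mu = 3.0
var_m = np.sum((m - E_m)**2 * pmf)     # equation (7): N*mu*(1-mu) = 2.1
print(E_m, var_m)
```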

2. Beta Distribution

2.1 Beta Prior

As we saw with binary variables, a small dataset can lead to serious overfitting of the maximum likelihood estimate, so we should introduce a prior distribution over $\mu$. A natural choice is the beta distribution, which can be viewed as a probability over a probability: a density over the success rate of Bernoulli trials.
Let $a$ play the role of a number of prior successes and $b$ of prior failures (we are still tossing coins). We choose a prior, called the beta distribution, given by

$$
Beta( \mu | a , b ) = \frac{ \Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}=\frac{1}{B(a,b) } \mu ^ { a-1} (1- \mu )^{b-1} \tag{8}
$$

$B(a,b)$, also called the Beta function, ensures that the beta distribution is normalized, so that
$$
\int_0^1Beta(\mu|a,b)d\mu=1 \tag{9}
$$
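A small numerical check of the normalization (8)–(9), assuming SciPy (the values of $a$ and $b$ are illustrative):

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn   # B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)

a, b = 2.0, 5.0
# integral of the unnormalized density mu^(a-1) (1-mu)^(b-1) over [0, 1]
area, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0.0, 1.0)
print(area, beta_fn(a, b))   # the two agree, so dividing by B(a, b) normalizes the density
```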

2.2 Mean and Variance

Then we can get the mean and the variance

$$
E[\mu]=\int_0^1 x\,Beta(x|a,b)dx=\int_0^1x\frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}dx\\
=\frac{B(a+1,b)}{B(a,b)}\\
=\frac{\Gamma(a+1)\Gamma(b)\Gamma(a+b)}{\Gamma(a+b+1)\Gamma(a)\Gamma(b)}\\
=\frac{a}{a+b}
\tag{10}
$$

To calculate $Var[\mu]$, we should first calculate $E[\mu^2]$, because $Var[\mu]=E[\mu^2]-E[\mu]^2$.

$$
E[\mu^2]=\int_0^1{x^2}\,Beta(x|a,b)dx\\
=\int_0^1{x^2}\frac{1}{B(a,b)}{x^{a-1}}{(1-x)^{b-1}}dx\\
=\frac{B(a+2,b)}{B(a,b)}\\
=\frac{\Gamma(a+2)\Gamma(b)}{\Gamma(a+b+2)}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\\
=\frac{(a+1)a}{(a+b+1)(a+b)}\tag{11}
$$

Then we can get $Var[\mu]$ readily
$$
Var[\mu]=\frac{(a+1)a}{(a+b+1)(a+b)}-(\frac{a}{a+b})^2\\
=\frac{ab}{(a+b+1)(a+b)^2} \tag{12}
$$
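A sampling-based check of (10) and (12), assuming SciPy ($a$ and $b$ are illustrative):

```python
from scipy.stats import beta

a, b = 2.0, 5.0
samples = beta.rvs(a, b, size=200_000, random_state=0)

print(samples.mean(), a / (a + b))                         # equation (10)
print(samples.var(), a * b / ((a + b + 1) * (a + b)**2))   # equation (12)
```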

If our goal is to predict the next outcome of $x$ as well as possible, this is equivalent to evaluating
$$
p(x=1|D)=\int_0^1p(x=1|\mu)p(\mu|D)d\mu\\
=\int_0^1\mu p(\mu|D)d\mu\\
=E[\mu|D] \tag{13}
$$
The posterior over $\mu$ is again a beta distribution, $Beta(\mu|m+a,\ l+b)$, where $m$ and $l$ denote the observed numbers of heads and tails, so we can get
$$
p(x=1|D)=\frac{m+a}{m+a+l+b} \tag{14}
$$
This tells us that when the number of observations ($m$, $l$) is large enough, the result approaches the maximum likelihood estimate.
For a finite dataset, the estimate of $\mu$ lies between the prior mean and the maximum likelihood estimate.
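A small sketch of this Bayesian update, assuming NumPy (the prior hyperparameters and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                               # prior hyperparameters (illustrative)
x = rng.binomial(1, 0.6, size=20)             # a small set of coin flips
m, l = int(x.sum()), int(len(x) - x.sum())    # observed heads and tails

p_mle = m / (m + l)                   # maximum likelihood estimate
p_bayes = (m + a) / (m + a + l + b)   # equation (14): posterior predictive p(x=1|D)
print(p_mle, p_bayes)                 # the Bayesian estimate is pulled toward the prior mean 0.5
```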

3. The Gaussian Distribution

3.1 The Gaussian Distribution

As we all know, the Gaussian distribution can be written in the form of
$$
N( x | \mu , \sigma^2 )=\frac{1} {(2\pi\sigma^2)^ {1/2} }exp{ {-\frac{1}{2\sigma^2}(x-\mu)^2} } \tag{15}
$$
where $\mu$ is the mean and $\sigma^2$ is the variance.
If $x$ is a $D$-dimensional vector, the Gaussian takes the form
$$
N ( x | \mu , \Sigma ) = \frac{1}{(2 \pi)^ {D/2} } \frac {1} { |\Sigma| ^ {1/2} }exp{ {-\frac{1}{2}(x-\mu)^T\Sigma^{-1} (x-\mu) } } \tag {16}
$$
Let
$$
\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu) \tag{17}
$$
and the quantity $\Delta$ is called the Mahalanobis distance
from µ to x and reduces to the Euclidean distance when Σ is the identity matrix.

The covariance matrix $\Sigma$ can be taken to be symmetric, so it admits an eigendecomposition $\Sigma=\sum_{i=1}^{D}\lambda_i u_i u_i^T$ with real eigenvalues $\lambda_i$ and orthonormal eigenvectors $u_i$.
Geometric interpretation of the Gaussian distribution:
Imagine an elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = ($x_1,x_2$) on which the density is exp(−1/2) of its value at x = $\mu$. The major axes of the ellipse are defined by the eigenvectors $u_i$ of the covariance matrix, with corresponding eigenvalues $\lambda_i$ .
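A short NumPy sketch of the Mahalanobis distance (17) and of the eigendecomposition behind this geometric picture (the numbers are illustrative):

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # symmetric, positive-definite covariance
x = np.array([2.5, 1.0])

# Mahalanobis distance, equation (17)
d = x - mu
delta = np.sqrt(d @ np.linalg.inv(Sigma) @ d)

# eigendecomposition: the eigenvectors u_i give the axes of the constant-density ellipse,
# the eigenvalues lambda_i give the squared semi-axis lengths
lam, U = np.linalg.eigh(Sigma)
print(delta, lam, U)
```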

3.2 Conditional Gaussian Distribution

An important property of the multivariate Gaussian distribution is that if two
sets of variables are jointly Gaussian, then the conditional distribution of one set
conditioned on the other is again Gaussian.
Partition $x$ into two disjoint subsets $x_a$ and $x_b$:

$$
x={x_a \choose x_b}
$$

$$
\mu= {\mu_{a }\choose\mu_{b } }
$$

$$
\Sigma = {\Sigma_{aa } \quad \Sigma_{ab } \choose \Sigma_{ba } \quad \Sigma_{bb } }
$$
where $\mu$ is the mean and $\Sigma$ is the covariance matrix.
Let
$$
\Lambda = \Sigma ^{-1 }
$$
which is known as the precision matrix. Partitioning it in the same way gives
$$
\Lambda = {\Lambda_{aa } \quad \Lambda_{ab } \choose \Lambda_{ba } \quad \Lambda_{bb } }
$$

We only need to remember the mean and covariance of the conditional Gaussian distribution:
$$
\mu_{a|b } = \mu _a + \Sigma _{ab } \Sigma _{bb }^{-1} (x_b-\mu _b ) \tag{18}
$$

$$
\Sigma_{a|b } = \Sigma_{aa}- \Sigma_{ab } \Sigma_{bb } ^{-1} \Sigma_{ba } \tag{19}
$$
Proof:
Instead of the approach used in PRML, we use a constructive method due to shuhuai008.
Let
$$
x_{b·a}=x_b-\Sigma_{ba}\Sigma_{aa}^{-1}x_a \tag{20}
$$
$$
\mu_{b·a}=\mu_b-\Sigma_{ba}\Sigma_{aa}^{-1}\mu_a \tag{21}
$$
$$
\Sigma_{bb·a } = \Sigma_{bb } - \Sigma_{ba } \Sigma_{aa } ^ {-1 } \Sigma_{ab } \tag{22}
$$
We call $\Sigma_{bb·a }$ the Schur complement of $\Sigma_{aa }$.
Rewrite (20)
$$
x_{b·a } = (-\Sigma_{ba }\Sigma_{aa }^{-1 } \quad I_m){x_a \choose x_b} \tag{23}
$$
As we all know, if $x \sim N(\mu,\Sigma)$ and $y=Ax+b$, then $y \sim N(A\mu +b,\ A\Sigma A^{T})$.
Here $A=(-\Sigma_{ba}\Sigma_{aa}^{-1 } \quad I_m)$ is the linear map we need.
Then we can compute
$$
E[x_{b·a }] = (-\Sigma_{ba }\Sigma_{aa }^{-1 } \quad I_m){\mu_a \choose \mu_b } = \mu_{b·a} \tag{24}
$$
$$
Var[x_{b·a } ]=(-\Sigma_{ba }\Sigma_{aa }^{-1 } \quad I_m){\Sigma_{aa } \quad \Sigma_{ab } \choose \Sigma_{ba } \quad \Sigma_{bb } } { {-\Sigma_{aa }^{-1 } \Sigma_{ab } } \choose {I_m } } = \Sigma_{bb·a} \tag{25}
$$
Since $x_{b·a}$ and $x_a$ are jointly Gaussian and uncorrelated ($Cov[x_{b·a},x_a]=\Sigma_{ba}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{aa}=0$), they are independent. Writing $x_b = x_{b·a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a$ and conditioning on $x_a$ then gives $p(x_b|x_a)=N(x_b|\mu_{b·a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a,\ \Sigma_{bb·a})$, which is exactly (18) and (19) with $a$ and $b$ interchanged.
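A quick numerical sanity check of (18) and (19), assuming NumPy (dimensions and values are made up for illustration); it also verifies the identity $\Sigma_{a|b}=\Lambda_{aa}^{-1}$ and the precision-matrix form of the conditional mean:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
Sigma = M @ M.T + 4 * np.eye(4)       # random symmetric positive-definite covariance
mu = rng.standard_normal(4)
ia, ib = slice(0, 2), slice(2, 4)     # x_a = first two components, x_b = last two

Saa, Sab = Sigma[ia, ia], Sigma[ia, ib]
Sba, Sbb = Sigma[ib, ia], Sigma[ib, ib]
x_b = rng.standard_normal(2)          # an arbitrary observed value of x_b

mu_cond = mu[ia] + Sab @ np.linalg.inv(Sbb) @ (x_b - mu[ib])   # equation (18)
Sigma_cond = Saa - Sab @ np.linalg.inv(Sbb) @ Sba              # equation (19)

# cross-check against the precision matrix Lambda = Sigma^{-1}
Lam = np.linalg.inv(Sigma)
print(np.allclose(Sigma_cond, np.linalg.inv(Lam[ia, ia])))     # Sigma_{a|b} = Lambda_aa^{-1}
print(np.allclose(mu_cond,
                  mu[ia] - np.linalg.inv(Lam[ia, ia]) @ Lam[ia, ib] @ (x_b - mu[ib])))
```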

3.3 Marginal Gaussian Distribution

We have seen that if a joint distribution $p(x_a,x_b)$ is Gaussian,then the conditional distribution $p(x_a|x_b)$ will again be Gaussian. Now we turn to a discussion of the marginal distribution given by

$$
p(x_a) = \int p(x_a,x_b)dx_b
$$
we can get
$$
E[x_a] = \mu_a \tag{26}
$$

$$
cov[x_a]=\Sigma_{aa} \tag{27}
$$

Proof:
Goal: find $p(y)$ and $p(x|y)$, given $p(x)=N(x|\mu,\Lambda^{-1})$.
For $p(y)$, let
$$
y=Ax+b+\epsilon
$$
where $\epsilon \sim N(0,L^{-1})$ is independent of $x$, so
$$
E[y]=E[Ax+b+\epsilon ]=E[Ax+b]+E[\epsilon]=A\mu+b
$$
$$
Var[y]=Var[Ax+b+\epsilon]=Var[Ax+b]+Var[\epsilon]
$$
$$
=A·\Lambda^{-1 } ·A^{T } +L^{-1 } \tag{28}
$$
so $y \sim N(A\mu+b,\ A\Lambda^{-1}A^{T}+L^{-1})$.
For $p(x|y)$, consider the joint distribution
$$
z={x \choose y } \sim N({\mu \choose A\mu +b},{\Lambda^{-1 } \quad \Lambda^{-1 }A^{T } \choose A\Lambda^{-1 } \quad A\Lambda^{-1 }A^{T } + L^{-1 }} )
$$
Applying the conditional Gaussian formulas (18) and (19) to this joint distribution, the result is
$$
E[x|y] = \mu + \Lambda^{-1 }A^{T }(A·\Lambda^{-1 } ·A^{T } + L^{-1 } )^{-1 } (y-A\mu-b)
$$
$$
Var[x|y]=\Lambda^{-1 } - \Lambda^{-1 }A^{T }(A·\Lambda^{-1 } ·A^{T } + L^{-1 } )^{-1 } A\Lambda^{-1 } \tag{29}
$$
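A Monte Carlo check of the marginal moments of $y$ in (28), assuming NumPy ($A$, $b$, $\Lambda^{-1}$ and $L^{-1}$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Lam_inv = np.array([[1.0, 0.3],
                    [0.3, 2.0]])            # Var[x] = Lambda^{-1}
A = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [0.0, 1.0]])
b = np.array([0.1, 0.2, 0.3])
L_inv = 0.5 * np.eye(3)                     # Var[eps] = L^{-1}

x = rng.multivariate_normal(mu, Lam_inv, size=200_000)
eps = rng.multivariate_normal(np.zeros(3), L_inv, size=200_000)
y = x @ A.T + b + eps

print(np.abs(y.mean(axis=0) - (A @ mu + b)).max())              # ~ 0 up to Monte Carlo error
print(np.abs(np.cov(y.T) - (A @ Lam_inv @ A.T + L_inv)).max())  # ~ 0, matching equation (28)
```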

3.4 Student’s t-distribution

The Student's t-distribution can be obtained by mixing (adding up) an infinite number of Gaussian distributions having the same mean but different variances: integrating a Gaussian over a gamma-distributed precision yields the t-distribution.
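A sketch of this mixture construction, assuming SciPy and the PRML parametrization where the precision $\eta$ follows $Gam(\eta|\nu/2,\nu/2)$; numerically integrating the Gaussian over $\eta$ should reproduce the Student's t density:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

nu, x = 4.0, 1.3     # degrees of freedom and evaluation point (illustrative)

# mix N(x | 0, 1/eta) over a Gam(eta | nu/2, nu/2) distribution on the precision eta
def integrand(eta):
    return stats.norm.pdf(x, loc=0.0, scale=1.0 / np.sqrt(eta)) * \
           stats.gamma.pdf(eta, a=nu / 2, scale=2.0 / nu)

mixture, _ = quad(integrand, 0.0, np.inf)
print(mixture, stats.t.pdf(x, df=nu))     # the two values agree
```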

4. The Exponential Family

The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family.
The exponential family of distributions over $x$, given parameters $\eta$, is defined to be the set of distributions of the form
$$
p(x|\eta) = h(x)g(\eta)exp(\eta^Tu(x)) \tag{30}
$$
$u(x)$ is called the sufficient statistic; it summarizes all of the information in the data that is relevant for the parameters.
$\eta$ is called the natural parameter of the distribution; the set of values of $\eta$ for which $p(x|\eta)$ is normalizable is called the natural parameter space.
$g(\eta)$ is the normalization coefficient ($-\ln g(\eta)$ is the log-partition function); it ensures that the distribution is normalized:
$$
g( \eta ) \int h(x) exp[ \eta^T u(x) ]dx = 1
$$
The goal below is to rewrite several familiar distributions in this standard exponential-family form.

4.1 Bernoulli distribution

$$
p(x|\mu) = Bern(x|\mu) = \mu^x(1-\mu)^{1-x }
$$
$$
=(1-\mu)exp[ln(\frac{\mu } {1-\mu })x]
$$
let
$$
\eta = ln(\frac{\mu }{1-\mu } )
$$
Solving for $\mu$ gives $\mu = \sigma(\eta)$, where
$$
\sigma(\eta)=\frac{1 } {1+exp(-\eta) }
$$
which is called the logistic sigmoid function.
Thus we can write the Bernoulli distribution using the standard representation in the form
$$
p(x|\eta)=\sigma(-\eta)exp(\eta x)
$$
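A tiny check that $\sigma(-\eta)exp(\eta x)$ matches $Bern(x|\mu)$, assuming NumPy ($\mu$ is illustrative):

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1 - mu))                       # natural parameter

for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)          # Bern(x | mu)
    exp_family = sigmoid(-eta) * np.exp(eta * x)  # p(x | eta) = sigma(-eta) exp(eta x)
    print(x, standard, exp_family)                # the two forms agree
```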

4.2 Gaussian distribution

$$
p( x | \mu , \sigma^2 )=\frac{1} {(2\pi\sigma^2)^ {1/2} }exp{ {-\frac{1}{2\sigma^2}(x-\mu)^2} }
$$
$$
=\frac{1}{(2\pi \sigma^2)^{1/2}}exp(-\frac{1}{2\sigma^2}(-2\mu \quad 1){x \choose x^2} -\frac{\mu^2}{2\sigma^2})
$$
Comparing with (30), we read off
$$
\eta = {\frac{\mu }{ \sigma^2 } \choose -\frac{1 } {2\sigma^2 } } = {\eta_1 \choose \eta_2 }
$$
$$
u(x)={x \choose x^2}
$$
$$
h(x)=(2\pi)^{-1/2}
$$
$$
g(\eta)=(-2\eta_2)^{1/2}exp(\frac{\eta_1^2}{4\eta_2})
$$
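A quick numerical check that these choices of $\eta$, $u(x)$, $h(x)$ and $g(\eta)$ reproduce the Gaussian density, assuming SciPy ($\mu$, $\sigma$ and $x$ are illustrative):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
eta1, eta2 = mu / sigma**2, -1.0 / (2 * sigma**2)          # natural parameters

def p_exp_family(x):
    h = (2 * np.pi)**-0.5                                  # h(x)
    g = (-2 * eta2)**0.5 * np.exp(eta1**2 / (4 * eta2))    # g(eta)
    return h * g * np.exp(eta1 * x + eta2 * x**2)          # h(x) g(eta) exp(eta^T u(x))

x = 0.7
print(p_exp_family(x), norm.pdf(x, loc=mu, scale=sigma))   # the two densities agree
```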

