Bayesian

Likelihood

Consider the linear regression problem $$y = \theta^T x + \epsilon$$.

Assume $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$, that is, $$p(\epsilon) = \frac{1}{\sqrt{2\pi} \sigma} \exp \big( - \frac{\epsilon^2}{2\sigma^2} \big)$$.

This implies that $$p(y|x;\theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y - \theta^T x)^2}{2\sigma^2} \big)$$, which is the distribution of $$y$$ given $$x$$ and parameterized by $$\theta$$ (not conditioned on $$\theta$$, since $$\theta$$ is not a random variable here).

Or we can write $$y | x; \theta \sim \mathcal{N}(\theta^T x, \sigma^2)$$.
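
To make the model concrete, here is a minimal sketch that samples data from it; `theta_true`, `sigma`, and the sizes below are made-up illustrative values, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

m, p = 100, 3                          # number of samples, number of features
theta_true = np.array([1.0, -2.0, 0.5])  # assumed ground-truth parameters
sigma = 0.3                            # noise standard deviation

X = rng.normal(size=(m, p))            # design matrix, one row per sample
eps = rng.normal(0.0, sigma, size=m)   # epsilon ~ N(0, sigma^2)
y = X @ theta_true + eps               # y = theta^T x + epsilon, per sample
```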

Therefore, given $$\theta$$ and $$x$$, we have a probability distribution over $$y$$. This quantity can be viewed as a function of $$y$$ for a fixed $$\theta$$. If we change perspective and view it as a function of $$\theta$$ instead, we call it the likelihood function.

$$
L(\theta) = L(\theta; X, y) = p(y | X; \theta)
$$

Suppose we have $$m$$ i.i.d. data points; it is then more convenient to maximize the log-likelihood $$\ell(\theta)$$:

$$
\begin{aligned}
\ell(\theta) & = \log L(\theta) \\
& = \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \\
& = \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \\
& = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - \theta^T x_i)^2
\end{aligned}
$$
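
A quick numerical check of this decomposition, continuing the Python session above (`X`, `y`, `theta_true`, `sigma` from the first snippet): the expanded form matches the summed Gaussian log-density.

```python
from scipy.stats import norm

# Direct sum of log N(y_i; theta^T x_i, sigma^2) terms
ll_direct = norm.logpdf(y, loc=X @ theta_true, scale=sigma).sum()
# Expanded form: m * log(1/(sqrt(2*pi)*sigma)) - sum of squared residuals / (2*sigma^2)
ll_expanded = (m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
               - np.sum((y - X @ theta_true) ** 2) / (2.0 * sigma ** 2))
print(np.allclose(ll_direct, ll_expanded))  # True
```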

Therefore, maximizing $$\ell(\theta)$$ is equivalent to minimizing $$\frac{1}{2} \sum_{i=1}^m (y_i - \theta^T x_i)^2$$.

Through MLE, the linear regression problem is thus converted into a least squares problem.
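
A sketch of this equivalence, continuing the session above: numerically maximizing the Gaussian log-likelihood recovers the same $$\theta$$ as the least squares solver.

```python
from scipy.optimize import minimize

def neg_log_likelihood(theta):
    # -ell(theta), dropping the additive constant m * log(1/(sqrt(2*pi)*sigma))
    return np.sum((y - X @ theta) ** 2) / (2.0 * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
theta_lsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
print(np.allclose(theta_mle, theta_lsq, atol=1e-4))  # True
```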

Regularization

This section interprets regularization from a Bayesian (probabilistic) perspective.

Previously we treated $$\theta$$ as constant-valued but unknown, which is the frequentist way of thinking. From a Bayesian perspective, we instead treat $$\theta$$ as a random variable whose value is unknown. In this case, we place a prior distribution on $$\theta$$ and then update it through training. (We will see shortly that this prior knowledge shows up as regularization in the optimization objective.)

Note 1: since we now treat $$\theta$$ as a random variable, we consider $$p(y|x,\theta)$$ rather than $$p(y|x;\theta)$$.

Note 2: the derivation below is not limited to linear regression; it holds for any $$f$$ or $$p(y|x,\theta)$$.

Given the whole dataset $$S = \{(x_1, y_1), \dots, (x_m, y_m)\}$$, we have

$$
\begin{aligned}
p(\theta | S) & = \frac{p(S|\theta) \cdot p(\theta)}{p(S)} \\
& = \frac{p(S|\theta) \cdot p(\theta)}{\int_\theta p(S|\theta) \, p(\theta) \, d\theta} \\
& = \frac{\prod_{i=1}^m p(y_i | x_i, \theta) \cdot p(\theta)}{\int_\theta \big( \prod_{i=1}^m p(y_i | x_i, \theta) \cdot p(\theta) \big) d\theta}
\end{aligned}
$$

Therefore, given a new data point $$(x, y)$$, we can compute the predictive distribution $$p(y|x,S) = \int_\theta p(y|x,\theta) p(\theta|S) d\theta$$ from the posterior distribution of $$\theta$$, where $$p(\theta|S)$$ is the posterior computed above.
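
For the special case where this integral is tractable, here is a hedged sketch, continuing the session above: a Gaussian prior $$\theta \sim \mathcal{N}(0, \lambda I)$$ is conjugate to the Gaussian likelihood, so both $$p(\theta|S)$$ and $$p(y|x,S)$$ have well-known closed forms (`lam` and `x_new` below are assumed illustrative values).

```python
lam = 1.0   # assumed prior variance of each component of theta

# Posterior p(theta | S) = N(mu_post, Sigma_post), the standard conjugate result
Sigma_post = np.linalg.inv(X.T @ X / sigma ** 2 + np.eye(p) / lam)
mu_post = Sigma_post @ (X.T @ y) / sigma ** 2

# Posterior predictive p(y | x, S) = N(x^T mu_post, x^T Sigma_post x + sigma^2):
# the integral over theta averages the model over the posterior.
x_new = np.array([0.5, -1.0, 2.0])
pred_mean = x_new @ mu_post
pred_var = x_new @ Sigma_post @ x_new + sigma ** 2
```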

The above is the "fully Bayesian" procedure. One problem is that the posterior of $$\theta$$ is generally very hard to compute, so in practice a single point estimate is often used in place of the full posterior distribution. Since the denominator $$p(S)$$ does not depend on $$\theta$$, it suffices to maximize the numerator. This gives the MAP (maximum a posteriori) estimate of $$\theta$$:

$$
\theta_{MAP} = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i | x_i, \theta) \, p(\theta)
$$
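
As a sketch (continuing the session above), $$\theta_{MAP}$$ is just an optimization problem: minimize the negative log of the numerator. The `log_prior` argument is a placeholder to be supplied per prior, as in the two cases below.

```python
def neg_log_posterior(theta, log_prior):
    # -sum_i log p(y_i | x_i, theta) - log p(theta), up to additive constants
    log_lik = -np.sum((y - X @ theta) ** 2) / (2.0 * sigma ** 2)
    return -(log_lik + log_prior(theta))
```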

Returning to the linear regression example $$y = \theta^T x + \epsilon$$, where $$\epsilon \sim \mathcal{N}(0, \sigma^2)$$, we now discuss two cases.

1. Suppose $$\theta \sim \mathcal{N}(0, \lambda I)$$

Suppose $$\theta \in \mathbb{R}^p$$ with i.i.d. components, so that $$p(\theta) = \prod_{j=1}^p \frac{1}{\sqrt{2\pi\lambda}} \exp\big( -\frac{\theta_j^2}{2\lambda} \big)$$.

$$
\begin{aligned}
\theta_{MAP} & = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta) \, p(\theta) \\
& = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \frac{1}{\sqrt{2\pi\lambda}} \exp\big( -\frac{\theta_j^2}{2\lambda} \big) \Big) \\
& = \underset{\theta}{\arg\max} \,\, m \log \frac{1}{\sqrt{2\pi}\sigma} + p \log \frac{1}{\sqrt{2\pi\lambda}} - \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} - \sum_{j=1}^p \frac{\theta_j^2}{2\lambda} \\
& = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} + \sum_{j=1}^p \frac{\theta_j^2}{2\lambda} \\
& = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{1}{2} ( y_i - \theta^T x_i)^2 + \frac{\sigma^2}{2\lambda} \|\theta\|_2^2
\end{aligned}
$$

Therefore, a Gaussian (normal) prior on $$\theta$$ introduces an L2-norm penalty in Bayesian regression; this is exactly ridge regression.
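
A numeric check of this result, continuing the session above: the Gaussian-prior MAP estimate coincides with the ridge regression closed form $$(X^T X + \frac{\sigma^2}{\lambda} I)^{-1} X^T y$$ (`lam` reused from the conjugate-posterior snippet).

```python
gauss_log_prior = lambda th: -np.sum(th ** 2) / (2.0 * lam)  # log N(0, lam*I), up to a constant
theta_map = minimize(lambda th: neg_log_posterior(th, gauss_log_prior),
                     x0=np.zeros(p)).x

alpha = sigma ** 2 / lam                                     # the derived L2 penalty weight
theta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
print(np.allclose(theta_map, theta_ridge, atol=1e-4))        # True
```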

2. Suppose $$\theta \sim \text{Lap}(0, \lambda)$$

Similarly for the Laplace distribution, where each component has density $$p(\theta_j) = \frac{1}{2\lambda} \exp \big( -\frac{|\theta_j|}{\lambda} \big)$$.

$$
\begin{aligned}
\theta_{MAP} & = \underset{\theta}{\arg\max} \,\, \prod_{i=1}^m p(y_i|x_i, \theta) \, p(\theta) \\
& = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \frac{1}{2\lambda} \exp \big( -\frac{|\theta_j|}{\lambda} \big) \Big) \\
& = \underset{\theta}{\arg\max} \,\, \log \Big( \prod_{i=1}^m \exp \big( - \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} \big) \cdot \prod_{j=1}^p \exp \big( -\frac{|\theta_j|}{\lambda} \big) \Big) \\
& = \underset{\theta}{\arg\max} \,\, - \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} - \sum_{j=1}^p \frac{|\theta_j|}{\lambda} \\
& = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{( y_i - \theta^T x_i)^2}{2\sigma^2} + \sum_{j=1}^p \frac{|\theta_j|}{\lambda} \\
& = \underset{\theta}{\arg\min} \,\, \sum_{i=1}^m \frac{1}{2} ( y_i - \theta^T x_i)^2 + \frac{\sigma^2}{\lambda} \|\theta\|_1
\end{aligned}
$$

Therefore, a Laplace prior on $$\theta$$ introduces an L1-norm penalty in Bayesian regression; this is exactly lasso regression.
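
Unlike the Gaussian case, the Laplace-prior MAP has no closed form. One standard way to solve the equivalent L1-penalized least squares problem is proximal gradient descent (ISTA), whose proximal step for the L1 norm is soft-thresholding; here is a minimal sketch, continuing the session above (the penalty weight $$\sigma^2/\lambda$$ follows the derivation).

```python
def soft_threshold(v, t):
    # proximal operator of t * ||.||_1: shrinks entries toward exactly zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=5000):
    theta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X.T @ X, 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)          # gradient of 1/2 * ||y - X theta||^2
        theta = soft_threshold(theta - step * grad, step * alpha)
    return theta

theta_lasso = lasso_ista(X, y, alpha=sigma ** 2 / lam)
print(theta_lasso)   # small coefficients are shrunk to exactly zero: L1 induces sparsity
```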

