Gradient Descent and Gradient Boosting

Hand-coding gradient descent and gradient boosting...

Many problems in ML can be reduced to an optimization problem: find the parameters that minimize the loss function, \hat{\theta}=\underset{\theta}{\arg \min } L(\theta).

For simple models the minimizer of L can be solved for directly, but once the model has many parameters a closed-form solution becomes intractable. Because practical loss functions are usually continuous and differentiable, gradient descent offers a numerical way to find the minimum.

Step 1: Take the derivative (gradient) of the loss function with respect to each parameter.
Step 2: Pick random initial values for the parameters.
Step 3: Plug the parameter values into the derivatives (gradient).
Step 4: Calculate the step sizes: step size = slope * learning_rate.
Step 5: Update each parameter: new parameter = old parameter - step size.
Go back to Step 3 until the step size is small enough, or the maximum number of steps is reached.

\begin{array}{l}{\text { 1: } \theta \leftarrow \theta_{0}} \\ {\text { 2: while } \|\nabla L(\theta)\| > \epsilon \text { do }} \\ {\text { 3: } \quad \theta \leftarrow \theta-\alpha \cdot \nabla L(\theta)} \\ {\text { 4: return } \theta}\end{array}

\epsilon is the tolerance, a very small value (e.g. 1e-6); \alpha is the learning rate.

The idea is to use the gradient to descend to the lowest point of the loss function (e.g. the sum of squared residuals). Gradient descent can be very sensitive to the learning rate.

Derivative-based optimization (not only gradient descent) requires the features to be standardized in advance.
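
A minimal sketch of what that means in practice (the matrix X below is made up for illustration): center each feature column and scale it to unit variance before handing it to any gradient-based optimizer.

import numpy as np

# Hypothetical feature matrix: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# Standardize column-wise: zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately [0, 0] and [1, 1]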

Implementing Gradient Descent

import numpy as np

# Gradient descent: L is the loss function, D its gradient,
# alpha the learning rate, epsilon the tolerance on the gradient norm
def gd(theta0, L, D, alpha, epsilon):
    theta = np.array(theta0, dtype=float)  # copy so the caller's theta0 is not modified in place
    it = 0
    while True:
        it += 1
        l = L(theta)
        d = D(theta)
        if np.linalg.norm(d) < epsilon:  # stop once the Euclidean norm of the gradient is tiny
            return theta
        theta -= alpha * d               # step: new theta = old theta - learning_rate * gradient
        if (it % 100) == 0:
            print(f'{it}, theta: {theta}, l: {l}')

# Test case: L(x, y) = (x-3)**2 + (y-9)**2, minimum at (3, 9)
ret = gd(
    theta0=np.array([10.0, 20.0]),
    L=lambda a: (a[0] - 3)**2 + (a[1] - 9)**2,
    D=lambda a: np.array([2 * (a[0] - 3), 2 * (a[1] - 9)]),
    alpha=1e-2,
    epsilon=1e-7,
)

print(ret)
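
To illustrate the learning-rate sensitivity mentioned above, here is a small hypothetical comparison on the same quadratic test case (the alpha values are arbitrary): a tiny step size converges slowly, and any alpha above 1 makes the iterates diverge.

import numpy as np

# Same quadratic as above: the gradient is 2*(theta - optimum), so the update is
# (theta - optimum) <- (1 - 2*alpha) * (theta - optimum), which diverges once alpha > 1
optimum = np.array([3.0, 9.0])
for alpha in (1e-2, 0.5, 1.1):
    theta = np.array([10.0, 20.0])
    for _ in range(50):
        theta = theta - alpha * 2 * (theta - optimum)
    print(alpha, theta)  # 0.01 is still converging, 0.5 hits (3, 9) in one step, 1.1 blows up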

Stochastic Gradient Descent

The procedure above shows that one full gradient descent update requires loading all of the data x. If instead we load only part of the dataset and compute the derivative on a different batch each time, we get the idea behind stochastic (mini-batch) gradient descent; when the batches are streamed because the data does not fit in memory, this is called out-of-core learning, which in sklearn corresponds to partial_fit.
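
A minimal sketch of that idea with sklearn's SGDRegressor (the synthetic data and the batch size are made up; in a true out-of-core setting each batch would be streamed from disk):

import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic regression data; the true coefficients are [1.0, -2.0, 0.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

model = SGDRegressor(learning_rate='constant', eta0=0.01)
# Feed the model one mini-batch at a time, as if the data were streamed
for start in range(0, len(X), 100):
    model.partial_fit(X[start:start + 100], y[start:start + 100])

print(model.coef_)  # should move toward [1.0, -2.0, 0.5]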

Gradient Boosting

Let's start with boosting vs. bagging in general.

Bagging: independent classifiers trained in parallel. A collection of high-variance, low-bias models is combined with an averaging trick, such as a weighted average, majority vote, or plain average, to produce the final result (because each model has low bias, they make different mistakes, so averaging should land close to the true value). The net effect is reduced variance: the variance becomes roughly 1/N of a single model's.
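
The 1/N claim is just the variance of an average: assuming the N models' predictions are independent with equal variance \sigma^{2} (independence is only approximate in practice),

\operatorname{Var}\left(\frac{1}{N}\sum_{i=1}^{N} f_{i}(x)\right)=\frac{1}{N^{2}}\sum_{i=1}^{N}\operatorname{Var}\left(f_{i}(x)\right)=\frac{\sigma^{2}}{N}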

Boosting: sequential classifiers trained one after another. Think of it as elite education: the same model is trained over and over so that its error keeps shrinking, and each new model learns from the mistakes of all the models before it. The drawback is that we need to set a stopping time or iteration count, because boosting is prone to overfitting.

Gradient Boosting

Up to now the framing has been: given x, predict y, where y has both a predicted value and a true value. Now shift the perspective and introduce another quantity, the residual r_{j, i}, the residual of f_{j} on data point i.

r_{0, i} \leftarrow y_{i}-f_{0}\left(x_{i}\right)

Given x_{i}, predict each r_{j, i}: fit f_{1} : x_{i} \rightarrow r_{0, i}, and use the predictions to improve f_{0}.

\begin{array}{l}{\text { 1: } f_{0} : x \rightarrow 0 } \\ {\text { 2: } f \leftarrow f_{0}} \\ {\text { 3: } j \leftarrow 1} \\ {\text { 4: while } j<T \text { do }} \\ {\text { 5: } \quad e_{i} \leftarrow y_{i}-f\left(x_{i}\right)} \\ {\text { 6: } \quad \text { fit } f_{j} \text { using }\left\{\langle x_{i}, e_{i}\rangle\right\}} \\ {\text { 7: } \quad f(x) \leftarrow f(x)+f_{j}(x)} \\ {\text { 8: } \quad j \leftarrow j+1} \\ {\text { 9: return } f}\end{array}

In fact, MSE is L=\sum_{i}\left(y_{i}-\hat{y_{i}}\right)^{2}, and \frac{d L}{d \hat{y_{i}}}=-2\left(y_{i}-\hat{y_{i}}\right) = -2\times \text{residual}, so the residual is (up to a constant factor) the negative gradient of the loss: fitting residuals means fitting the negative gradient.

\begin{array}{l}{\text { 1: } f \leftarrow f_{0}} \\ {\text { 2: } j \leftarrow 1} \\ {\text { 3: while } j<T \text { do }} \\ {\text { 4: } \quad e_{i} \leftarrow \nabla L(f)} \\ {\text { 5: } \quad \text { fit } f_{j} \text { using }\left\{\langle x_{i}, e_{i}\rangle\right\}} \\ {\text { 6: } \quad f(x) \leftarrow f(x)-\alpha_{j} \cdot f_{j}(x)} \\ {\text { 7: } \quad j \leftarrow j+1} \\ {\text { 8: return } f}\end{array}
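
A minimal sketch combining the two loops above: fit each new tree to the current residuals (the negative gradient of MSE) and add it with a shrinkage factor alpha. Shallow sklearn regression trees serve as the weak learner; the function names, the constant initial model, and the fixed alpha are illustrative choices, not a reference implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, alpha=0.1, max_depth=2):
    # f0 predicts the mean of y; each later f_j is fit to the current residuals
    f0 = np.mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(T):
        residual = y - pred                              # negative gradient of MSE (up to a factor of 2)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred = pred + alpha * tree.predict(X)            # f(x) <- f(x) + alpha * f_j(x)
    return f0, trees

def gb_predict(f0, trees, X, alpha=0.1):
    return f0 + alpha * sum(t.predict(X) for t in trees)

# Toy usage: learn y = x^2 from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
f0, trees = gradient_boost(X, y)
print(np.mean((gb_predict(f0, trees, X) - y) ** 2))      # training MSE, should be small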

So boosting has three ingredients:

A loss function to be optimized

A weak learner to make predictions, e.g. a tree model

An additive model: multiple weak learners are added together to form a strong learner, driving the target loss function toward its minimum.

It is called "gradient" boosting because gradient descent is used to minimize the loss when each new model is added.

Adaboost

Gradient boosting with an exponential loss. The basic steps (a code sketch follows the list below):

  1. Start with all sample weights equal, e.g. 1/N or 1.

  2. Encode the dependent variable as Y = {-1, 1}, because later formulas use the sign to decide whether a data point is classified as positive or negative.

  3. Increase the weights of the misclassified samples.
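
A minimal sketch of those three steps, using decision stumps from sklearn as the weak learner (the learner-weight formula is the standard one derived from the exponential loss; the variable names and toy data are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                          # step 1: all weights equal (1/N)
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # learner weight from the exponential loss
        w = w * np.exp(-alpha * y * pred)            # step 3: misclassified points get larger weights
        w = w / np.sum(w)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # step 2: the sign of the weighted vote is the predicted class in {-1, +1}
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # labels encoded as {-1, +1}
stumps, alphas = adaboost(X, y)
print(np.mean(adaboost_predict(stumps, alphas, X) == y))  # training accuracy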

Other
