
Gradient Descent and Gradient Boosting

Hand-coding gradient descent and gradient boosting...

Many problems in ML can be reduced to an optimization problem: find the parameters that minimize the loss function, $\hat{\theta} = \underset{\theta}{\arg\min}\, L(\theta)$.

For simple models, L can be minimized analytically, but when L has many parameters a closed-form solution becomes difficult. Since real loss functions are usually continuous and differentiable, gradient descent offers a numerical way to solve the problem.

Step 1: Take the derivative (gradient) of the loss function with respect to each parameter in it.
Step 2: Pick random starting values for the parameters.
Step 3: Plug the current parameter values into the derivatives (gradients).
Step 4: Calculate the step sizes: step size = slope × learning rate.
Step 5: Calculate the new parameters: new parameter = old parameter − step size.
Go back to Step 3 until the step size is very small or the maximum number of steps is reached.

$$
\begin{array}{ll}
1: & \theta \leftarrow \theta_0 \\
2: & \textbf{while } \|\nabla L(\theta)\| > \epsilon \textbf{ do} \\
3: & \quad \theta \leftarrow \theta - \alpha \cdot \nabla L(\theta) \\
4: & \textbf{return } \theta
\end{array}
$$

$\epsilon$ is the tolerance, a very small number (e.g. 1e-6); $\alpha$ is the learning rate.

Use the gradient to descend to the lowest point of the loss function (e.g. the sum of squared residuals); gradient descent can be very sensitive to the learning rate.

Derivative-based optimization methods (not just gradient descent) require the features to be standardized beforehand.
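A minimal illustration of that step (the feature matrix below is made up purely for the example), standardizing each feature to zero mean and unit variance before running a gradient-based optimizer:

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])                      # two features on very different scales
X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # each column now has mean 0, std 1
print(X_std.mean(axis=0), X_std.std(axis=0))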

Implementing Gradient Descent

import numpy as np

# Gradient descent: repeatedly step in the direction of the negative gradient.
# L: loss function, D: its gradient, alpha: learning rate, epsilon: tolerance
def gd(theta0, L, D, alpha, epsilon):
    theta = np.array(theta0, dtype=float)  # copy so the caller's array is not modified in place
    it = 0
    while True:
        it += 1
        l = L(theta)
        d = D(theta)
        if np.linalg.norm(d) < epsilon:  # Euclidean norm of the gradient below tolerance: converged
            return theta
        theta -= alpha * d
        if it % 100 == 0:
            print(f'{it}, theta: {theta}, l: {l}')

# Test case: L(x, y) = (x-3)**2 + (y-9)**2, minimum at (3, 9)
ret = gd(
    theta0=np.array([10.0, 20.0]),
    L=lambda a: (a[0] - 3) ** 2 + (a[1] - 9) ** 2,
    D=lambda a: np.array([2 * (a[0] - 3), 2 * (a[1] - 9)]),
    alpha=1e-2,
    epsilon=1e-7,
)

print(ret)

Stochastic Gradient Descent

From the procedure above, one full gradient descent step requires loading all of the data x. If we only load part of the dataset and compute the derivative on a different batch each time, this is called out-of-core learning; in sklearn it is done with partial_fit.
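A rough sketch of this style of training with sklearn's SGDRegressor and partial_fit (the synthetic data, the iter_batches helper, and the hyperparameters are assumptions for illustration, not part of the notes):

import numpy as np
from sklearn.linear_model import SGDRegressor

def iter_batches(X, y, batch_size=256):
    # in a real out-of-core setting these batches would be streamed from disk
    for start in range(0, len(y), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=10_000)

model = SGDRegressor(learning_rate='constant', eta0=0.01)
for epoch in range(5):
    for X_batch, y_batch in iter_batches(X, y):
        model.partial_fit(X_batch, y_batch)   # update the model on one mini-batch at a time

print(model.coef_)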

Gradient Boosting

Starting from boosting vs. bagging in general

Bagging: independent classifiers trained in parallel. Take a group of high-variance, low-bias models and combine their results with some averaging trick, e.g. a weighted average, majority vote, or plain average (because the models have low bias, they make different errors, and averaging should land close to the true value). The end result is reduced variance; ideally the variance shrinks to 1/N of a single model's.
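A quick numerical check of that 1/N claim (the Gaussian toy estimates and N = 10 below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
N = 10                                                          # number of independent models
estimates = rng.normal(loc=5.0, scale=2.0, size=(100_000, N))   # each column: one model's output

print(estimates[:, 0].var())          # variance of a single model, about 4
print(estimates.mean(axis=1).var())   # variance of the average, about 4 / N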

Boosting: sequential classifiers trained one after another. Think of it as "elite education": the same model is cultivated over and over so that its error keeps shrinking, with each new model learning from the mistakes of all the models before it. The problem with boosting is that we need to set a stopping time or iteration count, because it can overfit.

Gradient Boosting

So far the mindset has been: given x, predict y, where y has a predicted value and a true value. Now shift the viewpoint: introduce a quantity called the residual, $r_{j,i}$, the residual of $f_j$ on data point $i$.

$$r_{0,i} \leftarrow y_i - f_0(x_i)$$

Given $x_i$, predict each $r_{j,i}$: fit $f_1 : x_i \rightarrow r_{0,i}$, and use the predictions to improve $f_0$.

$$
\begin{array}{ll}
1: & f_0 : x \rightarrow 0 \\
2: & f \leftarrow f_0 \\
3: & i \leftarrow 1 \\
4: & \textbf{while } i < T \textbf{ do} \\
5: & \quad e_i \leftarrow y_i - f(x_i) \\
6: & \quad \text{fit } f_i \text{ using } \{\langle x_i, e_i \rangle\} \\
7: & \quad f(x) \leftarrow f(x) + f_i(x) \\
8: & \quad i \leftarrow i + 1 \\
9: & \textbf{return } f
\end{array}
$$

In fact, the MSE loss is $L = \sum_i (y_i - \hat{y}_i)^2$, and $\frac{\partial L}{\partial \hat{y}_i} = -2(y_i - \hat{y}_i) = -2 \times \text{residual}$, so fitting the residuals is (up to a constant factor) the same as fitting the negative gradient of the loss.

$$
\begin{array}{ll}
1: & f \leftarrow f_0 \\
2: & i \leftarrow 1 \\
3: & \textbf{while } i < T \textbf{ do} \\
4: & \quad e_i \leftarrow \nabla L(f) \\
5: & \quad \text{fit } f_i \text{ using } \{\langle x_i, e_i \rangle\} \\
6: & \quad f(x) \leftarrow f(x) - \alpha_i \cdot f_i(x) \\
7: & \quad i \leftarrow i + 1 \\
8: & \textbf{return } f
\end{array}
$$
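A minimal from-scratch sketch of the residual-fitting loop above, assuming squared-error loss and using sklearn's DecisionTreeRegressor as the weak learner (the function names and hyperparameters are illustrative, not a definitive implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    f0 = float(np.mean(y))                 # initial model f0: a constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                # e_i = y_i - f(x_i); for MSE this is the negative gradient (up to a constant)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)              # fit f_i using {<x_i, e_i>}
        pred = pred + learning_rate * tree.predict(X)   # f(x) <- f(x) + alpha * f_i(x)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)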

So boosting has three elements:

A loss function to be optimized

A weak learner to make predictions, e.g. a tree model

An additive model: accumulate many weak learners into a strong learner so that the target loss function is driven to a minimum.

It is called "gradient" boosting because gradient descent is used to minimize the loss when new models are added.
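For reference, sklearn's GradientBoostingRegressor exposes roughly these three elements as parameters (the values below are arbitrary, 'squared_error' is the loss name used in recent sklearn versions, and the mapping in the comments is my reading, not from the notes):

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    loss='squared_error',   # the loss function to be optimized
    max_depth=3,            # the weak learner: shallow regression trees
    n_estimators=200,       # how many terms the additive model accumulates
    learning_rate=0.05,     # shrinkage applied to each added term
)
# gbr.fit(X_train, y_train); gbr.predict(X_test)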

AdaBoost

Gradient boosting with an exponential loss; the main steps are below (a rough sketch follows the list).

  1. Initialize all sample weights to the same value, either 1/N or 1.

  2. Encode the dependent variable as Y = {-1, 1}, because later formulas use the sign to decide whether a data point is classified as positive or negative.

  3. Increase the weights of the misclassified points.
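A rough from-scratch sketch of these three steps, using decision stumps as the weak learners (the function names, number of rounds, and stump choice are assumptions for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # y is expected in {-1, +1} (step 2 above)
    n = len(y)
    w = np.full(n, 1.0 / n)                  # step 1: all weights equal
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # say of this weak learner in the final vote
        w = w * np.exp(-alpha * y * pred)       # step 3: up-weight the misclassified points
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # the sign of the weighted vote decides positive vs. negative
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))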

Miscellaneous


Optimization Algorithms in Deep Learning (Zhihu column)