Notes by Louisa

XGBoost and LightGBM

Boosting applied to decision trees.


Last updated 5 years ago

Recap supervised learning

Kernels: the kernel trick can be used in SVM, KNN, and linear regression. It amounts to mapping the data into another feature space.
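To make the "space mapping" idea concrete, here is a minimal sketch using a degree-2 polynomial kernel on 2-D points. The kernel choice and the names `poly_kernel` and `explicit_map` are assumptions for illustration, not something taken from these notes. It shows that evaluating the kernel in the original space gives the same number as an inner product in the mapped space.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def explicit_map(x):
    """Explicit 3-D feature map for a 2-D input.

    Chosen so that dot(explicit_map(x), explicit_map(z)) == poly_kernel(x, z).
    """
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)

# Computed directly in the original 2-D space.
kernel_value = poly_kernel(x, z)

# Computed after mapping both points into the 3-D feature space.
mapped_value = dot(explicit_map(x), explicit_map(z))
```

The point of the trick is that the model only ever needs `kernel_value`, so the mapped space never has to be built explicitly.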

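Mixed features: NN, SVM, and KNN do not handle a mix of numerical and categorical features well, so the categorical features have to be converted first. Decision trees are not affected by this.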
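One common form of the conversion mentioned above is one-hot encoding. The sketch below is a hand-rolled illustration: the function name `one_hot_encode` and the toy rows are hypothetical, and in practice a library encoder would normally be used.

```python
def one_hot_encode(rows, categorical_index):
    """Replace one categorical column with 0/1 indicator columns.

    Returns the sorted category list and the encoded rows. The numeric
    columns come first, followed by one indicator per category.
    """
    categories = sorted({row[categorical_index] for row in rows})
    encoded = []
    for row in rows:
        numeric = [v for i, v in enumerate(row) if i != categorical_index]
        indicators = [1.0 if row[categorical_index] == c else 0.0
                      for c in categories]
        encoded.append(numeric + indicators)
    return categories, encoded


# Toy data: one numeric feature and one categorical feature.
rows = [[25.0, "red"], [32.0, "blue"], [47.0, "red"]]
categories, encoded = one_hot_encode(rows, categorical_index=1)
```

After encoding, every column is numeric, so distance-based or dot-product-based models such as KNN, SVM, and NN can consume the rows directly.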

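Missing values: NN and SVM are sensitive to missing values, especially SVM when the missing values occur near the kernel. KNN, on the other hand, is not sensitive to missing values as long as k is chosen well.

Computational cost: NN, SVM, and KNN are all computationally expensive, whereas the cost of tree models depends on the number of data points.

Linear relationships: NN and SVM can capture linear relationships among features, because at their core they apply linear transformations. Trees examine attributes one at a time and do not know how different attributes relate to each other.

Interpretability: tree models additionally provide feature importance, whereas the other models offer little interpretability.

Judging from the comparison table, trees do well on everything apart from a few points.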

XGBoost

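eXtreme Gradient Boosting (XGBoost) is an implementation of the Gradient Boosting covered in the previous section. Plain Gradient Boosting is relatively slow, because every step computes gradients over all of the data and builds one new tree that is appended to the existing model sequence. XGBoost is faster by comparison for the following reasons: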
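The plain Gradient Boosting loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not XGBoost's algorithm: it uses squared-error loss (so the negative gradient is just the residual), a single 1-D feature, depth-1 trees (stumps), and hypothetical names `fit_stump` and `gradient_boost`.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on a 1-D feature.

    Tries every threshold between distinct values and keeps the one
    with the lowest squared error against the residuals.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Boost stumps on squared-error loss; returns (predict, losses)."""
    base = sum(ys) / len(ys)          # initial constant model
    preds = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f,
        # computed over ALL data points at every round.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)          # one new tree joins the sequence
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
predict, losses = gradient_boost(xs, ys)
```

Each round depends on the predictions of the previous one, which is why the rounds themselves cannot run in parallel and why the speedups have to come from inside the tree construction.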

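Parallelization: training can use all CPU cores to parallelize tree construction. Trees are still added one after another; the parallelism is inside the construction of each tree.

Distributed Computing: very large models can be trained with distributed computing.

Out-of-Core Computing: very large datasets that do not fit in memory can be processed out of core.

Cache Optimization of data structures and algorithms: makes better use of the hardware.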
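As a rough illustration of the parallelization point, the toy sketch below evaluates the best split for each feature as an independent task and then picks the overall winner. It is not XGBoost's implementation: the function name `best_split_for_feature`, the variance-reduction gain, and the toy data are assumptions, and Python threads are used only to show how the work divides (they do not give a real CPU speedup for pure-Python code).

```python
from concurrent.futures import ThreadPoolExecutor


def best_split_for_feature(task):
    """Scan one feature column for the threshold with the largest
    squared-error reduction. Returns (feature_index, threshold, gain)."""
    feature_index, column, targets = task
    pairs = sorted(zip(column, targets))
    n = len(pairs)
    total_sum = sum(targets)
    best_gain, best_threshold = 0.0, None
    left_sum = 0.0
    for i in range(n - 1):
        value, target = pairs[i]
        left_sum += target
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # cannot split between two equal feature values
        n_left, n_right = i + 1, n - (i + 1)
        right_sum = total_sum - left_sum
        # Reduction in sum of squared errors from splitting here.
        gain = (left_sum ** 2 / n_left
                + right_sum ** 2 / n_right
                - total_sum ** 2 / n)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (value + next_value) / 2
    return feature_index, best_threshold, best_gain


# Toy data: feature 0 separates the targets cleanly, feature 1 does not.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 1.0, 4.0, 2.0],
]
targets = [0.0, 0.0, 10.0, 10.0]
tasks = [(j, column, targets) for j, column in enumerate(columns)]

# Each feature is scanned independently, so the scans can run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(best_split_for_feature, tasks))

best = max(results, key=lambda r: r[2])
```

Because each feature scan reads only its own column, the tasks share no mutable state, which is what makes this step easy to spread across cores.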

Ref Links


Tianqi Chen's original paper
YouTube 2016.6.2 Talk
Slides
Code
Detailed explanation & parameters
Parameter tuning
Parameter tuning
Parameter tuning
Guolin Ke vs Tianqi Chen discussion
Source code reading
XGBoost Python Feature Walkthrough
LightGBM source code walkthrough and written introduction
http://wepon.me/files/gbdt.pdf