
Logistic Regression

An intuitive introduction, a mathematical motivation, and the differences from Linear Regression


Logistic Regression, an intuitive introduction: a graphical perspective

1. Intuitively, what is the difference between Linear Regression and Logistic Regression? Put differently, why do we need the logistic model at all?

Suppose we are predicting whether a tumor is malignant based on tumor size. On dataset A we use the only model we know so far, linear regression, and fit a line that seems reasonable. Now a new data point arrives. If we keep using the linear model, the fitted line must flatten (its slope must shrink) to stay sensible, and the loss function (say, least squares) becomes much worse. This is because linear regression is sensitive to extreme observations, whether or not they are outliers.

2. The example in point 1 shows that linear regression treats every data point with equal weight, which is not what we want. "Treating points differently" means the curve should change steeply near the decision boundary in the middle and stay nearly flat on both sides, so that different values of x contribute differently to P.

In the example above, the y-axis is a probability P, so the relationship between x and y becomes a relationship between x and P, while P and y are linked by a Bernoulli distribution: P(y=1)=p, P(y=0)=1-p. We then arrive at

$$
p = g(z) = \frac{e^{z}}{e^{z}+1} = \frac{1}{1+e^{-z}}
$$
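As a quick numeric sketch of this sigmoid (assuming NumPy, which the Python chapter already uses):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(sigmoid(z), 4))   # [0.0067 0.2689 0.5    0.7311 0.9933]
# Steep near z = 0 (the decision boundary), nearly flat far from it --
# exactly the "treat points differently" behavior described above.
```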

Logistic Regression, a mathematical motivation: a range-of-values perspective

We want to relate x and p. From the perspective of ranges, a probability lives in [0, 1], while the linear predictor ranges over the whole real line. The odds still carry a probabilistic meaning: taking the odds stretches [0, 1] to half the real line [0, +∞), and taking the log of the odds then stretches it to the full (-∞, +∞).

$$
\begin{aligned}
p &\in [0,1], \quad 1-p \in [0,1] \\
\text{odds} &= \frac{p}{1-p} \in [0, +\infty) \\
\log(\text{odds}) &\in (-\infty, +\infty)
\end{aligned}
$$

In detail, set the log-odds equal to a linear combination of the features:

$$
\begin{aligned}
\log\frac{p}{1-p} &= z = ax+b \;\Leftrightarrow\; p = \frac{1}{1+e^{-(ax+b)}} \\
\log\frac{p}{1-p} &= ax+b \\
\log\frac{1-p}{p} &= -(ax+b) \\
\frac{1-p}{p} &= e^{-(ax+b)} \\
\frac{1}{p} &= 1+e^{-(ax+b)} \\
p &= \frac{1}{1+e^{-(ax+b)}}
\end{aligned}
$$

This argument works backwards from the desired behavior to check that it can be explained; it is a sanity check, not a derivation of Logistic Regression.

In summary, $p = F(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}$.
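A quick check that the log-odds and the sigmoid are inverses of each other; this sketch assumes scipy.special is available (expit is the sigmoid, logit is log(p/(1-p))):

```python
import numpy as np
from scipy.special import expit, logit   # expit(z) = 1/(1+exp(-z)), logit(p) = log(p/(1-p))

a, b = 2.0, -1.0                 # hypothetical coefficients
x = np.linspace(-3.0, 3.0, 7)
z = a * x + b                    # linear predictor, ranges over (-inf, +inf)
p = expit(z)                     # squashed into (0, 1)

print(np.round(p, 3))
assert np.allclose(logit(p), z)  # taking log-odds recovers the linear predictor
```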

(Good to know) Does Logistic Regression count as a linear model? It belongs to the family of Generalized Linear Models.

Ordinary Linear: any model satisfying $E(Y \mid X) = X\beta$ is a linear model. Generalized Linear: if a transformation $g$ exists such that $E(Y \mid X) = g^{-1}(X\beta)$, then $g$ is called the link function. In theory this makes almost any model "linear", but link functions are mathematically hard to find, so only a handful are known. It is like SVMs: in theory you can always lift the data to a higher dimension, but in practice it is not so easy.
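For illustration, here is a sketch (assuming statsmodels is installed) that fits logistic regression explicitly as a GLM with the Binomial family, whose default link is the logit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))      # true model: logit(p) = 0.5 + 2x
y = rng.binomial(1, p)

X = sm.add_constant(x)                           # intercept column + x
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(model.params)                              # roughly [0.5, 2.0]
```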

From the Bernoulli Distribution to Logistic Regression

Coin flipping

Consider a coin-flipping problem from the perspective of data, model, and induction principle:

$$
\begin{aligned}
&\text{Data: } \langle x_1, x_2, \cdots, x_n \rangle, \quad x_i \in \{0, 1\} \\
&\text{Prior Knowledge: } P(x_i = 1) = \theta, \quad P(x_i = 0) = 1 - \theta \\
&\text{Induction Principle: Maximum Likelihood } \hat{\theta} = \underset{\theta}{\arg\max} \prod_{i=1}^{n} P(x = x_i)
\end{aligned}
$$

Intuitively, $\hat{\theta}$ should come out to $\operatorname{avg}(x) = \frac{\sum_i x_i}{n}$. Let's derive it to check:

$$
P(x = x_i) = \begin{cases} \theta, & x_i = 1 \\ 1-\theta, & x_i = 0 \end{cases} = \theta^{x_i} \cdot (1-\theta)^{(1-x_i)}
$$

$$
\begin{aligned}
\hat{\theta} &= \underset{\theta}{\arg\max} \prod_{i=1}^{n} P(x = x_i) \\
&= \underset{\theta}{\arg\max} \log\left(\prod_{i=1}^{n} P(x = x_i)\right) \\
&= \underset{\theta}{\arg\max} \sum_{i=1}^{n} \log\left(P(x = x_i)\right) \\
&= \underset{\theta}{\arg\max} \sum_{i=1}^{n} \log\left(\theta^{x_i} \cdot (1-\theta)^{1-x_i}\right) \\
&= \underset{\theta}{\arg\max} \sum_{i=1}^{n} x_i \log\theta + (1-x_i)\log(1-\theta)
\end{aligned}
$$

$$
\begin{aligned}
\frac{dL}{d\theta} &= \sum_i \left(\frac{x_i}{\theta} - \frac{1-x_i}{1-\theta}\right) \\
&= \sum_i \frac{x_i(1-\theta) - \theta(1-x_i)}{\theta(1-\theta)} \\
&= \frac{\sum_i x_i - x_i\theta - \theta + x_i\theta}{\theta(1-\theta)} \\
&= \frac{\sum_i (x_i - \theta)}{\theta(1-\theta)} \\
&= \frac{\sum_i x_i - n\cdot\theta}{\theta(1-\theta)}
\end{aligned}
$$

Setting the derivative above to zero gives $\sum_i x_i = n\cdot\theta$, so $\theta = \frac{\sum_i x_i}{n}$.
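A small numeric check of this result (a NumPy sketch; the true θ = 0.3 below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.binomial(1, 0.3, size=10_000)           # coin flips with true theta = 0.3

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]

print(theta_hat, x.mean())                      # both close to 0.3: the MLE is the sample mean
```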

Conditional coin flipping

$$
\begin{aligned}
\text{Data: } & \left\langle (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \right\rangle, \quad y \in \{0,1\}, \quad x_i \in \mathbb{R}^m \\
\text{Model: } & P(Y = y \mid x;\, \alpha, \beta) = \frac{1}{1 + e^{-\left(\alpha + \sum_{j=1}^{m} \beta_j x_j\right)}}, \quad \alpha \in \mathbb{R},\; \beta \in \mathbb{R}^m \\
\text{Induction Principle: } & \langle \hat{\alpha}, \hat{\beta} \rangle = \underset{\alpha, \beta}{\arg\max} \prod_i P(y_i \mid x_i;\, \alpha, \beta) \\
& \phantom{\langle \hat{\alpha}, \hat{\beta} \rangle} = \underset{\alpha, \beta}{\arg\max} \log\left(\prod_i P(y_i \mid x_i;\, \alpha, \beta)\right) \\
& \phantom{\langle \hat{\alpha}, \hat{\beta} \rangle} = \underset{\alpha, \beta}{\arg\max} \sum_i \log\left(P(y_i \mid x_i;\, \alpha, \beta)\right)
\end{aligned}
$$

Write the log-likelihood as $L = \sum_i L_i$ with $L_i = \log\left(P(y_i \mid x_i;\, \alpha, \beta)\right)$.

So each term becomes
$$
L_i = \log\left(\begin{cases} P_i, & y_i = 1 \\ 1-P_i, & y_i = 0 \end{cases}\right) = \log\left(P_i^{y_i}\,(1-P_i)^{1-y_i}\right),
$$
where we introduce $P_i$ and $z_i$:
$$
P_i = \frac{1}{1+e^{-z_i}}, \qquad z_i = \alpha + \sum_k \beta_k x_{ik}.
$$

Applying the chain rule, $\frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial L_i}{\partial P_i} \cdot \frac{\partial P_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial \beta_j}$, and evaluating each factor separately:

$$
\begin{aligned}
\frac{\partial L_i}{\partial P_i} &= \frac{y_i}{P_i} - \frac{1-y_i}{1-P_i} = \frac{y_i - y_i P_i - P_i + y_i P_i}{P_i(1-P_i)} = \frac{y_i - P_i}{P_i(1-P_i)} \\[4pt]
\frac{\partial P_i}{\partial z_i} &= -\frac{1}{(1+e^{-z_i})^2}\cdot(-1)\cdot e^{-z_i} = \frac{1}{1+e^{-z_i}}\cdot\frac{e^{-z_i}}{1+e^{-z_i}} = P_i\,(1-P_i) \\[4pt]
\frac{\partial z_i}{\partial \beta_j} &= x_{ij} \qquad \left(z_i = \alpha + \sum_{k=1}^{m}\beta_k x_{ik}\right)
\end{aligned}
$$

最后,

$$
\begin{aligned}
\frac{\partial L}{\partial \beta_j} &= \sum_{i=1}^{n} \frac{\partial L_i}{\partial P_i}\cdot\frac{\partial P_i}{\partial z_i}\cdot\frac{\partial z_i}{\partial \beta_j} \\
&= \sum_{i=1}^{n} \frac{y_i - P_i}{P_i(1-P_i)}\cdot P_i(1-P_i)\cdot x_{ij} \\
&= \sum_{i=1}^{n} (y_i - P_i)\cdot x_{ij} \\[4pt]
\frac{\partial L}{\partial \alpha} &= \sum_{i=1}^{n} (y_i - P_i)
\end{aligned}
$$

Setting $\frac{\partial L}{\partial \alpha} = 0$ gives $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} P_i$. In this sense, logistic regression is well calibrated.
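A minimal gradient-ascent sketch that uses exactly these two gradients (the learning rate, iteration count, and true parameters below are arbitrary assumptions) and then verifies the calibration property $\sum_i y_i = \sum_i P_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 2
X = rng.normal(size=(n, m))
true_alpha, true_beta = -0.5, np.array([1.5, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(true_alpha + X @ true_beta))))

alpha, beta, lr = 0.0, np.zeros(m), 1.0
for _ in range(3000):
    P = 1 / (1 + np.exp(-(alpha + X @ beta)))
    beta += lr * X.T @ (y - P) / n    # dL/d(beta_j) = sum_i (y_i - P_i) * x_ij
    alpha += lr * np.sum(y - P) / n   # dL/d(alpha)  = sum_i (y_i - P_i)

P = 1 / (1 + np.exp(-(alpha + X @ beta)))
print(alpha, beta)        # should approach the true parameters
print(y.sum(), P.sum())   # nearly equal at the optimum: well calibrated
```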

(Interview) Deriving the Loss Function of Logistic Regression

The loss can also be defined via information entropy (cross-entropy), but the more intuitive route is maximum likelihood.

Step 1: the Bernoulli probability P

$$
\begin{aligned}
P(Y = y_i \mid x_j) &= p^{y_i}(1-p)^{1-y_i}, \quad 0 < p < 1,\; y_i \in \{0, 1\} \\
p &= h_\beta(x_j) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}
\end{aligned}
$$

$$
\text{i.e. } P(Y = y_i \mid x_j) = h_\beta(x_j)^{y_i}\left(1 - h_\beta(x_j)\right)^{1-y_i}
$$

Step 2: derive the Loss Function via MLE

$$
\begin{aligned}
L(\hat{\beta}_0, \hat{\beta}_1) &= P(Y_1, Y_2, \ldots, Y_n \mid X) = P(Y_1 \mid X_1) \cdot P(Y_2 \mid X_2) \cdots P(Y_n \mid X_n) \\
&= \prod_{i=1}^{n} h_\beta(x_i)^{y_i}\left(1 - h_\beta(x_i)\right)^{1-y_i}
\end{aligned}
$$

$$
\log\left(L(\hat{\beta}_0, \hat{\beta}_1)\right) = \sum_{i=1}^{n}\left[y_i \log\left(h_\beta(x_i)\right) + (1-y_i)\log\left(1 - h_\beta(x_i)\right)\right]
$$

$$
\underset{\beta}{\operatorname{argmax}} \sum_{i=1}^{n}\left[y_i \log\left(h_\beta(x_i)\right) + (1-y_i)\log\left(1 - h_\beta(x_i)\right)\right]
\;\Longleftrightarrow\;
\underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n}\left[-y_i \log\left(h_\beta(x_i)\right) - (1-y_i)\log\left(1 - h_\beta(x_i)\right)\right]
$$
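This negative log-likelihood is the familiar binary cross-entropy (log loss). A small sketch comparing a manual computation against sklearn.metrics.log_loss (the sklearn call is an assumption; any manual check works):

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical h_beta(x_i) = P(y=1 | x_i)

manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(round(manual, 4))                    # ~0.2603
print(round(log_loss(y, p), 4))            # same value (log_loss averages over samples)
```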

Logistic Regression can have more than one x, and higher-order terms

$$
F(x) = g\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2\right)
$$

To fit it, just take the partial derivative with respect to each beta in turn.

Fun fact: how is the blue decision boundary in such plots actually drawn?

It is simply the set of points where $P(Y=y \mid X) = \frac{1}{1+e^{-(\cdots)}} = \frac{1}{2}$, i.e. where the linear (or polynomial) score inside the sigmoid equals zero.
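A sketch with scikit-learn (an assumption, not part of the original notes) that fits the quadratic model above and reads off where the predicted probability crosses 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)      # circular boundary of radius 1

poly = PolynomialFeatures(degree=2, include_bias=False)   # x1, x2, x1^2, x1*x2, x2^2
clf = LogisticRegression(max_iter=1000).fit(poly.fit_transform(X), y)

# Probabilities along the x1-axis (x2 = 0): they cross 0.5 near |x1| = 1,
# which is exactly where the decision boundary ("the blue line") gets drawn.
line = np.column_stack([np.linspace(0.0, 2.0, 9), np.zeros(9)])
print(np.round(clf.predict_proba(poly.transform(line))[:, 1], 2))
```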

Logistic Regression with more than two labels

  1. One Vs. All

  2. Softmax

(In practice the two are quite similar.)

Distinguishing the assumptions of Ordinary Linear and Logistic Regression

Linear Regression: $p(y \mid x)$ is a Gaussian with mean $\mu = \beta_0 + \beta_1 x$ and constant variance $\sigma^2$ (a value that does not depend on x).

Logistic Regression: y follows a Bernoulli distribution with success probability $P(Y=1 \mid X=x) = \frac{1}{1+e^{-(ax+b)}}$.
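A short simulation contrasting the two generative assumptions (the coefficients below are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=5)
b0, b1, sigma = 1.0, 2.0, 0.5                  # hypothetical parameters

# Linear regression assumption: y | x ~ Normal(b0 + b1*x, sigma^2), constant variance
y_linear = rng.normal(loc=b0 + b1 * x, scale=sigma)

# Logistic regression assumption: y | x ~ Bernoulli(p), p = sigmoid(b0 + b1*x)
p = 1 / (1 + np.exp(-(b0 + b1 * x)))
y_logistic = rng.binomial(1, p)

print(np.round(y_linear, 2))    # continuous responses
print(y_logistic)               # 0/1 responses
```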

Interview question: does logistic regression use 0.5 as the default classification threshold?

By default, yes, but the threshold is tunable and should be adjusted for the specific use case. The interview question is really asking "how do you raise the precision of a logistic regressor?", and the answer is to change the classification threshold.
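A sketch (scikit-learn assumed) showing how raising the threshold trades recall for precision on a synthetic imbalanced dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.5, 0.6, 0.8):
    pred = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y, pred), 3),
          round(recall_score(y, pred), 3))
# Raising the threshold typically raises precision and lowers recall.
```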

Interview question: what is the difference between logistic regression and linear regression?

Answer step by step:

(1) They address different problems: in logistic regression, Y is discrete.

(2) This leads to the setup: x is continuous while y is discrete, so we model y as following a Bernoulli distribution with a parameter p.

(3) Both p and x are continuous, so a mapping between them can be established.

(4) The difference is that linear regression assumes a linear relationship, whereas logistic regression assumes an exponential relationship through the logistic function.

If pressed further: what exactly is this exponential relationship, and why exponential?

Fundamentally, x and p have a nonlinear relationship, so we need to dampen the influence that points far from the decision boundary have on the classification decision; in other words, different values of x should contribute differently to the loss.

Practical tips

(1) Logistic Regression works well for datasets with many features; datasets with few features are better suited to a random forest, because the useful information would hardly get exploited inside sparse trees.

(2) Logistic Regression can handle cases where features have strong correlation.

(3) The independence assumption of Logistic Regression is weaker than that of Naive Bayes. Naive Bayes needs strong (conditional) independence, while Logistic Regression only needs "given x, y is i.i.d.".

(4) Logistic Regression is resilient to imbalanced labels (a 1:100 ratio is fine; beyond that, down-sampling is needed).

All logs above are natural logarithms rather than base-10 logarithms, because this construction keeps the model simple to work with. The Exponential Family is the core of GLMs: it is the family of distributions that makes the fewest assumptions under the given constraints. Reference:

https://www.cs.ubc.ca/~murphyk/MLbook/