Notes by Louisa

Unsupervised Learning

Clustering

K means

  1. K-means is not a stable algorithm: a run is only considered finished once the centroids move less than some threshold, so the result it returns may be poor. It is not guaranteed to converge quickly, and in some cases it may not converge at all.

  2. So besides the plain distance, we also compute the sum of distances from all points to their centroid and divide it by the cluster size, and we want this value to be as small as possible (this effectively penalizes very large clusters).

Choosing the number of clusters k: we want the between-cluster distance to be large and the within-cluster distance to be small; the k that achieves the smallest such objective value is the best one (see the elbow sketch below).

Drawbacks: it does not handle outliers well, because an outlier pulls the centroid away and can force a new cluster; it is also slow, since its time complexity is O(k * n * iterations). The advantage is that it is simple and can easily split the data into different clusters.
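
A minimal sketch of the elbow heuristic for picking k, assuming sklearn and a toy blob dataset (both illustrative, not part of the original notes): fit KMeans for several values of k and watch where the within-cluster sum of squared distances (inertia_) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy data: three well-separated 2-D blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# within-cluster sum of squared distances (inertia_) for several k;
# the "elbow" where it stops dropping sharply is a common choice of k
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```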

Understanding KMeans in the sklearn library (interview)

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)

n_init: the number of times K-means is run (each run may start from different initial centroids); max_iter=300 is the maximum number of centroid updates within a single run; tol=0.0001 means a run stops once the centroids move less than 1e-4. Be especially clear about the meaning of the first two parameters.
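
A minimal sketch of these parameters in use, assuming sklearn and a toy dataset from make_blobs (both illustrative): n_init controls the number of restarts, while max_iter and tol control when a single restart stops.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

km = KMeans(
    n_clusters=4,
    init="k-means++",
    n_init=10,        # 10 independent restarts from different initial centroids
    max_iter=300,     # at most 300 centroid updates per restart
    tol=1e-4,         # a restart stops once the centroid shift falls below this
    random_state=0,
).fit(X)

print(km.n_iter_)     # iterations actually used by the best restart
print(km.inertia_)    # its within-cluster sum of squared distances
```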

init='k-means++': an improvement over plain K-means that avoids the occasional very bad clustering result. The first centroid is picked at random from the data points; the second centroid is chosen from the remaining points with probability proportional to the squared distance to the first centroid; for the third centroid, each data point uses the minimum of its squared distances to the first two centroids (and is again sampled proportionally to that value), and so on.
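
A minimal sketch of that seeding rule (a hypothetical helper, not sklearn's internal code): each new centroid is sampled with probability proportional to the squared distance to the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids with k-means++ style seeding (illustrative)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]                 # 1st centroid: uniform at random
    for _ in range(1, k):
        # squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(1)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])  # sample proportional to D(x)^2
    return np.array(centroids)

# usage on random 2-D points
X = np.random.default_rng(1).normal(size=(200, 2))
print(kmeans_pp_init(X, k=3))
```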

Interview question: the difference between the EM algorithm and K-means

Interview question: distributed K-means

Put the data on all nodes and run K-means on each node separately, i.e., pick random centroids and update them to get new centroids; send these back to the master and run K-means over them again (a form of averaging), then broadcast the resulting centroids back to each node as the new reference, and repeat.
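
A minimal sketch of one round of this scheme, assuming sklearn and simulated shards (a real system such as Spark MLlib handles weighting, communication and convergence differently): each "node" runs K-means locally, and the master re-clusters the local centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

def distributed_kmeans_round(shards, k, seed=0):
    """One master/worker round: local K-means per shard, then K-means over
    all local centroids on the master (illustrative sketch)."""
    local_centroids = [
        KMeans(n_clusters=k, n_init=1, random_state=seed).fit(shard).cluster_centers_
        for shard in shards                           # "each node" clusters its own data
    ]
    merged = np.vstack(local_centroids)               # master gathers all local centroids
    master = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(merged)
    return master.cluster_centers_                    # broadcast these back to the nodes

# usage: pretend the data lives on 4 nodes
rng = np.random.default_rng(0)
shards = [rng.normal(size=(200, 2)) + off for off in (0, 4, 8, 12)]
print(distributed_kmeans_round(shards, k=4))
```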

Clustering

  • Distance based: K-means

  • Density based: DBSCAN (contrasted with K-means in the sketch below)
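
A minimal sketch contrasting the two, assuming sklearn with illustrative eps/min_samples values: K-means always assigns the far-away outlier a regular cluster label, while DBSCAN marks it as noise (label -1).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3)])
X = np.vstack([X, [[50.0, 50.0]]])                    # one far-away outlier

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-means label of the outlier:", km_labels[-1])  # always a regular cluster label
print("DBSCAN  label of the outlier:", db_labels[-1])  # -1 means noise
```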

