Notes by Louisa
Notes by Louisa
Notes by Louisa
  • Introduction
  • Chapter1 Python Cheatsheet
    • Reference, Deep Copy and Shallow Copy
    • Iterators
    • List Comprehensions
    • Numpy
    • Pandas
    • Data Visualization
    • DateTime
    • Python Good to knows
  • Chapter2 Java Cheatsheet
    • Fundamentals to Java
    • Interface, Abstract Class, Access Modifier, Exceptions
    • Linked List and Java List
    • Java Queue, Stack and Deque
    • Binary Tree
    • Heap in Java
    • Map/Set/Hash
    • OOD
  • Chapter3 Algorithm
    • Fundamental Knowledge
    • Binary Search
    • Basic Sorting
    • Advanced Sorting
    • Linked List
    • Recursion 1
    • HashTable
    • Queue
    • Sliding Window
    • Stack
    • Binary Tree
    • Binary Search Tree
    • Heap
    • String
    • Graph Search DFS1 (Back Tracking)
    • Recursion II and Memoization
    • Dynamic Programming
    • Complete Binary Tree, Segment Tree, Trie Tree
    • Graph Search BFS
    • Graph Search BFS 2
    • Graph Search DFS2
    • Problems from 'JianZhi Offer'
    • Problems Categorized
    • Bit Operations
  • Chapter4 Model
    • Linear Regression
    • Logistic Regression
    • Regularization and Feature Selection
    • Model Evaluation
    • Nonlinear Models
    • PCA
    • Unsupervised Learning
    • Gradient Descent and Gradient Boosting
    • XG Boost and Light GBD
    • Deep Learning
    • Tensorflow/Keras
    • RNN
  • Chapter5 Statistics and A/B Testing
    • Inference about independence
    • Probability, Sampling and Randomization with Python
    • A/B Testing
    • Stats Interview Review
    • Statistics Glossary
  • Chapter6 SQL
    • Student Scores Query
    • Order Query
    • Movie Rating Query
    • Social-Network Query
    • LeetCode SQL题目总结
    • Spark SQL
  • Chapter7 Big Data and Spark
    • Introduction to Pyspark
    • Data Cleaning with Apache Spark
    • Feature Engineering with Pyspark
    • Building Recommendation Engines with Pyspark
    • Building Data Engineering Pipelines in Python
    • Hadoop MapReduce
    • Big Data Related Paper
  • Chapter8 Code Walk-Throughs
    • Python
    • R
    • Shell
  • Chapter9 Special Topics
    • Anomaly Detection
    • E Commerce
    • Supply Chain
    • Social Network Analysis
    • NLP intro
    • Time Series
    • Challenge Prophet with LSTM models
  • Project: The Winning Recipes to an Oscar Award
  • Project: A Crime Analysis of the Last Decade NYC
  • Project: Predict User Type Based on Citibike Data
  • GeoSpark/GeoSparkVis for Geospatial Big Data
  • Single Scattering Albedo
  • Sea Ice Albedo Retrievals
  • Lidar Project
Powered by GitBook
On this page
  • 基本解题思路
  • 面试
  • Cheatsheet

Chapter6 SQL

Sequential Query Language

基本解题思路

  • Metrics, what does the result look like

  • 需要什么数据

  • 怎么获取数据

  • 有了数据怎么计算(avg,count)

面试

  1. 电话面试:“2张表格” 借助草稿纸,口头说

  2. share doc,copy paste问题,规定时间内写SELECT FROM 实时看写语句的顺序、停顿的位置...

  3. On-Site: 白板

强推postgre SQL (相比较Oracle SQL/MySQL... )

非关系型数据库 HIVE Query也和SQL差不多

Cheatsheet

执行顺序 ROM-->WHERE-->GROUP BY-->HAVING-->SELECT-->ORDER BY

  1. Group By 1 可以当做一种简写,指的group by select语句中的第一列 同理,ORDER BY 2 也是一种简写

  2. HAVING是在select之后才操作的,所以Having里有Select里用到的aggregation function

  3. UNION,连接的是两个SELECT语句,需要SELECT的column名字一样,数量也一样,默认的是unique values,所以假如有两个row一模一样,它会合并掉;如果想保留,用UNION ALL

SELECT FROM 
    CASE WHEN THEN
        WHEN THEN
        ELSE END
        AS 
FROM AS
INNER JOIN AS
ON
WHERE IN       / WHERE BETWEEN AND
GROUP BY
ORDER BY

UNION, INTERSECT, EXCEPT

SELECT AS
FROM
UNION
SELECT
FROM
ORDER BY

subqueries:

SELECT
FROM
WHERE LIKE
    (SELECT
    FROM
    WHERE )
SELECT
FROM
WHERE 
    AND (SELECT AVG()
    FROM )
SELECT DISTINCT
    (SELECT COUNT (*)
    FROM
    WHERE) AS 
FROM

Window Function, row-number() over (partition by X order by Y) 这里容易犯的错误是partition by后面写很多column

SELECT category, product, sales
FROM (
    SELECT category, product, 
    row_number() OVER (PARTITION BY category, order by sales DESC) as row_num
    )
WHERE row_num<=2

Window Function, rank() over (partition by X oder by Y) row_number, rank, dense_rank的意义都是排序,但是区别是row_number给出单纯的行号,两种rank都给出排序后的号,两个相同数值给一样的rank;区别是rank的5、6名是同一个数字,则第七名标7,可是dense rank的第七名会标6,顺着之前的数字来。

Window Function, sum() over () 得到的是cumulative sum

Case When除了最上面的基本用法之外,更常用的是结合aggregation function使用。如果多个Case是同样的条件,那么只执行第一个条件。 +aggregation_function 比如:

SUM(CASE WHEN ... THEN ... ELSE ... END) AS ...
COUNT(DISTINCT(CASE WHEN ... THEN... ELSE ... END)) AS ...

# Eg. count total impressions and how many 'top' clicks for all campaigns, all impressions

SELECT ads_campaign, 
        COUNT(impression_time) as impressions, 
        COUNT(CASE WHEN click_time is not null then click_time END) as top_clicks
FROM table
Group by 1

另外window function还可以用来算一堆数据中的median,因为median具有一个特性:它的row_number的编号处在正中间,那么就可以通过从小到大排、从大到小排。如果是奇数,那么这两种排序后相等的那个就是中间的了,如果是偶数,会有两个index,分别是对方的+-1;不管如何,都可以对这部分匹配的。

row_number over (partition by group by) as row_num WHERE row_num=INT(COUNT(*)/2)

最后SQL也可以用来generate distribution或者bucketization。

t_pages

page_id

product_id

t_users

user_id

page_id

timestamp

# Q1: count unique users for product view for a given time
SELECT a.page_id, COUNT(DISTINCT b.user_id) as unique_users
FROM t_pages a
LEFT JOIN t_users b 
ON a.page_id=b.page_id
WHERE cast(timestamp as date) between AND 
#cast用来做数值类型的转换

# Q2:Product view distribution. eg. for product A, 10000 viewers saw it once, 22 twice

PreviousStatistics GlossaryNextStudent Scores Query

Last updated 5 years ago