Feature Engineering with PySpark
EDA
PySpark evolves quickly, so remember to check your version every time; otherwise you won't know where a documentation mismatch is tripping you up!
When defining the problem up front, know your data's limitations. For example, when predicting SALESCLOSEPRICE: if you only have residential prices from a single district, you cannot predict over a larger area; if you only have data from 2017, you cannot infer how prices evolve over time. And although a parquet file carries its own schema, if we are not the ones who defined that schema, it is best to verify it ourselves.
Verify Data Types
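A quick sketch of verifying types, assuming a loaded DataFrame `df`:

```python
# Print the schema tree, or inspect (column name, type) pairs directly
df.printSchema()
print(df.dtypes)
```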
Visually Inspect Data
Use df.describe(['LISTPRICE']).show() to inspect the distribution of a single column. Beyond that, Spark has built-in aggregate functions such as mean, skewness, minimum, covariance, and correlation.
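A minimal sketch of those built-in statistics, assuming numeric columns 'LISTPRICE' and 'SALESCLOSEPRICE' (column names here are illustrative):

```python
import pyspark.sql.functions as F

# Aggregate statistics on a single column
df.agg(F.mean('LISTPRICE'), F.skewness('LISTPRICE'), F.min('LISTPRICE')).show()

# Covariance and correlation between two columns return plain Python floats
print(df.cov('LISTPRICE', 'SALESCLOSEPRICE'))
print(df.corr('LISTPRICE', 'SALESCLOSEPRICE'))
```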
If you want to plot with packages like seaborn, you must first convert to a pandas DataFrame, so be sure to sample the data down beforehand (sample with replacement).
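A sketch of the sampling step (the fraction and seed are arbitrary):

```python
# sample(withReplacement, fraction, seed), then convert for plotting
sample_df = df.select(['LISTPRICE', 'SALESCLOSEPRICE']).sample(True, 0.5, 42)
pandas_df = sample_df.toPandas()
```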
You can also use lmplot() to draw a linear model plot between two columns. If they show a linear relationship, they are good candidates to include in our analysis. If they don't, it doesn't mean we should throw them out; it means we may have to process or wrangle them before they can be used.
You can also use a joint plot to display R², to understand the relationship between two variables.
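A sketch of both plots, using the sampled pandas DataFrame from above (column names illustrative; whether the R² statistic is annotated depends on your seaborn version):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Linear model plot between a candidate predictor and the target
sns.lmplot(x='LISTPRICE', y='SALESCLOSEPRICE', data=pandas_df)

# Joint plot with a regression fit
sns.jointplot(x='LISTPRICE', y='SALESCLOSEPRICE', data=pandas_df, kind='reg')
plt.show()
```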


The figure above shows another kind of plot, the countplot, which displays the number of records for each value of a categorical column.
Dropping Data
Where can data go bad?
Recorded wrong
Unique events
Formatted incorrectly
Duplications
Missing
Not relevant
For this dataset: 'NO' is an auto-generated record number, 'UNITNUMBER' is irrelevant data, and 'CLASS' is constant for every record, so all three can go. Spark's drop() is much like the pandas version, except for the *, which unpacks the list and applies the drop to every element.
You can also drop columns one at a time.
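A sketch of both variants, using the column names from the notes above:

```python
# Unpack the list so each name is passed as a separate argument
df = df.drop(*['NO', 'UNITNUMBER', 'CLASS'])

# Or drop a single column at a time
df = df.drop('NO')
```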
Text Filtering
isin() is similar to like() but allows us to pass a list of values to use as a filter rather than a single one.
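A sketch of text filtering with like() and isin() (column names and values are illustrative):

```python
# like() takes a single SQL-style pattern
garage_df = df.where(df['GARAGEDESCRIPTION'].like('%Attached Garage%'))

# isin() takes a list of values; negate with ~ to filter them out
df = df.where(~df['MLSID'].isin(['4100', '4430']))
```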
Outlier Filtering

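A common approach is to drop values more than three standard deviations from the mean; a minimal sketch, assuming a numeric column 'LISTPRICE':

```python
import pyspark.sql.functions as F

# Compute mean and standard deviation as plain Python floats
stats = df.agg(F.mean('LISTPRICE').alias('mu'),
               F.stddev('LISTPRICE').alias('sigma')).collect()[0]
low_cut = stats['mu'] - 3 * stats['sigma']
high_cut = stats['mu'] + 3 * stats['sigma']

# Keep only records inside the three-sigma band
df = df.where((df['LISTPRICE'] > low_cut) & (df['LISTPRICE'] < high_cut))
```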
Drop NA / NULL
DataFrame.dropna()
how: 'any' or 'all'. If 'any', drop a record if it contains any nulls. If 'all', drop a record only if all its values are null.
thresh: int, default None. If specified, drop records that have fewer than thresh non-null values. This overwrites the how parameter.
subset: optional list of column names to consider.
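A sketch of the call (the thresh value and subset columns are arbitrary):

```python
# Drop any record containing a null in the listed columns
df = df.dropna(how='any', subset=['ROOF', 'LISTPRICE'])

# Keep only records with at least 2 non-null values
df = df.dropna(thresh=2)
```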
Drop Duplicates
Because Spark stores data in a distributed fashion, this operation does not necessarily drop the first occurrence of a duplicate.
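A sketch (the subset columns are illustrative):

```python
# Drop exact duplicate rows
df = df.dropDuplicates()

# Or consider only a subset of columns when deciding what counts as a duplicate
df = df.dropDuplicates(['STREETADDRESS', 'LISTDATE'])
```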
Adjusting Data
"Data does not give up its secrets easily, it must be tortured to confess."——Jeff Hooper, Bell Labs.
Min-Max Scaling

Write a wrapper:
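Min-max scaling maps each value to [0, 1] via (x - min) / (max - min). A minimal wrapper sketch, assuming `df` and a list of numeric column names:

```python
def min_max_scaler(df, cols_to_scale):
    """Rescale each column to the [0, 1] range."""
    for col in cols_to_scale:
        # Pull the extremes back to the driver as plain floats
        max_val = df.agg({col: 'max'}).collect()[0][0]
        min_val = df.agg({col: 'min'}).collect()[0][0]
        df = df.withColumn('scaled_' + col,
                           (df[col] - min_val) / (max_val - min_val))
    return df

df = min_max_scaler(df, cols_to_scale=['DAYSONMARKET'])
```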
Standardization - Z transform

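Standardization subtracts the mean and divides by the standard deviation, z = (x - mean) / stddev. A sketch for one column (the name is illustrative):

```python
mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]
stddev_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]

# Z-transform: how many standard deviations each value sits from the mean
df = df.withColumn('ztrans_SALESCLOSEPRICE',
                   (df['SALESCLOSEPRICE'] - mean_val) / stddev_val)
```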
Log Scaling
The figure below shows a typical positively skewed distribution.

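A sketch of compressing a positively skewed column with a log transform:

```python
import pyspark.sql.functions as F

# Natural log; add 1 first if the column can contain zeros
df = df.withColumn('log_SalesClosePrice', F.log(df['SALESCLOSEPRICE']))
```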
User Defined Scaling
For example, with house prices you can also express a home's time on the market as a percentage (its percentile rank across the market).
You can likewise use a log-scaled value to reflect how long ago a house was built.
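A sketch of both ideas, using percent_rank over a window and a log of the house's age ('YEARBUILT' and the 2017 reference year come from this dataset; names are illustrative):

```python
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Percentile of DAYSONMARKET across the whole market
w = Window.orderBy(df['DAYSONMARKET'])
df = df.withColumn('days_on_market_percentile', F.percent_rank().over(w))

# Log-scaled age of the house; +1 guards against log(0) for new builds
df = df.withColumn('log_years_since_built',
                   F.log(2017 - df['YEARBUILT'] + 1))
```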
Missing Values
How does data go missing in the digital age?
Data Collection - Broken Sensors
Data Storage Rules - 2017-01-01 vs January 1st, 2017
Joining Disparate Data - Monthly to Weekly
Intentionally Missing - Privacy Concerns
Types of Missing
Missing completely at random: the missing data is just a completely random subset
Missing at random: missing conditionally at random, based on another observation
Missing not at random: data is missing because of how it is collected
When to drop rows with missing data?
Missing values are rare
Missing Completely at Random
isNull(): True if the current expression is null. For example: df.where(df['ROOF'].isNull()).count()
Plotting Missing Values

For example, in the figure above you can see that the second column is entirely missing.
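A sketch of how such a plot can be produced, by converting a (sampled) DataFrame to pandas and heatmapping its null mask:

```python
import seaborn as sns

# True/False mask of missing values, drawn as a heatmap
sns.heatmap(data=df.toPandas().isnull())
```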
Imputation of Missing Values
Process of replacing missing values
Rule Based: Value based on business logic
Statistics Based: Using mean, median, etc
Model Based: Use model to predict value
fillna(value, subset=None)
value: the value to replace missing values with
subset: the list of column names in which to replace missing values
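A sketch of a statistics-based imputation, filling a column with its own mean (the column name is illustrative):

```python
# Impute missing values with the column mean
col_mean = df.agg({'PDOM': 'mean'}).collect()[0][0]
df = df.fillna(col_mean, subset=['PDOM'])
```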
Getting More Data
External Data Sources
Thoughts on External Data Sets
| Pros | Cons |
| --- | --- |
| Add important predictors | May 'bog' analysis down |
| Supplement/replace values | Easy to induce data leakage |
| Cheap or easy to obtain | Become data set subject matter expert |
Join - PySpark DataFrame
Join - SparkSQL Join
Note that when joining you must pay attention to numeric precision: if you join on longitude and latitude, mismatched precision between the two tables can make the join fail to match anything.
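A sketch of both join styles, rounding the coordinates to a common precision first (assumes an active SparkSession `spark`; `walk_df` and its columns are illustrative):

```python
import pyspark.sql.functions as F

# Round both sides to the same precision before joining
df = df.withColumn('longitude', F.round(df['longitude'], 5)) \
       .withColumn('latitude', F.round(df['latitude'], 5))

# DataFrame API join
joined_df = df.join(walk_df, on=['longitude', 'latitude'], how='left')

# SparkSQL join
df.createOrReplaceTempView('df')
walk_df.createOrReplaceTempView('walk_df')
joined_df = spark.sql("""
    SELECT df.*, walk_df.walkscore
    FROM df
    LEFT JOIN walk_df
    ON df.longitude = walk_df.longitude AND df.latitude = walk_df.latitude
""")
```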
Generate More Features
You can create new features by adding, subtracting, multiplying, or dividing existing ones, taking ratios, averaging, and so on. Be aware, though, that features manufactured this way will not always improve results.
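A sketch of a couple of ratio features (column names illustrative):

```python
# Price per square foot, and square feet per room
df = df.withColumn('price_per_sqft', df['SALESCLOSEPRICE'] / df['SQFT_TOTAL'])
df = df.withColumn('sqft_per_room', df['SQFT_TOTAL'] / df['ROOMS'])
```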
Automation of Features: FeatureTools & TSFresh
Date Time
Date
Please keep in mind that PySpark's week starts on Sunday, with a value of 1 and ends on Saturday, a value of 7.
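A sketch of extracting date parts (assuming 'LISTDATE' arrives as a string column):

```python
import pyspark.sql.functions as F

# Ensure a date type, then pull out the parts we want
df = df.withColumn('list_date', F.to_date('LISTDATE'))
df = df.withColumn('list_year', F.year('list_date')) \
       .withColumn('list_month', F.month('list_date')) \
       .withColumn('list_dayofmonth', F.dayofmonth('list_date')) \
       .withColumn('list_weekday', F.dayofweek('list_date'))  # 1 = Sunday ... 7 = Saturday
```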
Lagging Features
window(): returns a record based off a group of records
lag(col, count=1): returns the value that is offset by count rows before the current row
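A sketch of a one-period lag, assuming a weekly mortgage-rate DataFrame `mort_df` with columns 'DATE' and 'MORTGAGE' (names illustrative):

```python
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Order the records by date so "previous row" means "previous week"
w = Window.orderBy(mort_df['DATE'])
mort_df = mort_df.withColumn('MORTGAGE_lag1', lag('MORTGAGE', 1).over(w))
```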
Extract Features
Extract with Text Match

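A sketch of turning a text match into a binary feature with like() (the column and pattern mimic a roof-description field and are illustrative):

```python
# 1 if the description mentions an older roof, 0 otherwise
df = df.withColumn('old_roof',
                   df['ROOF'].like('%Age Over 8 Years%').cast('integer'))
```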
Split Columns

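A sketch with split(), which turns a delimited string into an array column (names illustrative):

```python
import pyspark.sql.functions as F

# 'ROOF' holds comma-separated values; take the first element as its own column
split_col = F.split(df['ROOF'], ',')
df = df.withColumn('roof_material', split_col.getItem(0))
```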
Explode

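A sketch with explode(), which emits one record per element of an array column (the array column here is illustrative):

```python
import pyspark.sql.functions as F

# One row per garage feature, instead of one array per row
ex_df = df.withColumn('ex_garage_list', F.explode(df['garage_list']))
```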
Pivot

Next, join the pivoted result back to the original DataFrame.
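A sketch of the pivot and the join back, assuming a record-id column 'NO' and the exploded garage-feature column from above:

```python
from pyspark.sql.functions import coalesce, first, lit

# One column per garage feature, one row per record id
piv_df = ex_df.withColumn('constant_val', lit(1)) \
              .groupBy('NO') \
              .pivot('ex_garage_list') \
              .agg(coalesce(first('constant_val')))

# Join the pivoted features back onto the original DataFrame
df = df.join(piv_df, on='NO', how='left')
```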
Binarizing, Bucketing & Encoding
Binarizing
Values above the threshold become 1, values less than or equal to it become 0. Note that the column being binarized must have datatype double.
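A sketch with pyspark.ml's Binarizer (threshold and columns illustrative):

```python
from pyspark.ml.feature import Binarizer

# Binarizer requires a double input column
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))

binarizer = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
df = binarizer.transform(df)
```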
Bucketing
Define a set of interval boundaries and bin the data into them. For example, when a distribution plot shows only a few records in the higher bins, consider merging those values into a single bucket.

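A sketch with Bucketizer (the splits and columns are illustrative):

```python
from pyspark.ml.feature import Bucketizer

# Everything from 4 upward lands in the final, open-ended bucket
splits = [0, 1, 2, 3, 4, float('inf')]
bucketizer = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths_bucketed')
df = bucketizer.transform(df)
```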
One Hot Encoding
Spark's one-hot encoding takes two steps: first use StringIndexer to map the strings to numeric indices, then use OneHotEncoder to transform them.
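A sketch of the two steps (column names illustrative; note that in older PySpark versions OneHotEncoder transformed directly without a fit() step, one more reason to check your version):

```python
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Step 1: map each distinct string to a numeric index
indexer = StringIndexer(inputCol='CITY', outputCol='City_Index')
df = indexer.fit(df).transform(df)

# Step 2: expand the index into a sparse one-hot vector
encoder = OneHotEncoder(inputCols=['City_Index'], outputCols=['City_Vec'])
df = encoder.fit(df).transform(df)
```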
Choose Model

The pyspark.ml.regression module offers: DecisionTreeRegressor, GBTRegressor, RandomForestRegressor, GeneralizedLinearRegression, IsotonicRegression, LinearRegression.
Time Series Test and Train Splits

Write a wrapper:
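A sketch of a date-based split wrapper, assuming date columns 'OFFMKTDATE' and 'LISTDATE' from this dataset (the 45-day test horizon is arbitrary):

```python
from datetime import timedelta

def train_test_split_date(df, split_col, test_days=45):
    """Find the date that splits the data into train and test sets."""
    max_date = df.agg({split_col: 'max'}).collect()[0][0]
    # Hold out the last test_days days as the test window
    return max_date - timedelta(days=test_days)

split_date = train_test_split_date(df, 'OFFMKTDATE')
train_df = df.where(df['OFFMKTDATE'] < split_date)
# A test house must also have been listed by the split date
test_df = df.where(df['OFFMKTDATE'] >= split_date) \
            .where(df['LISTDATE'] <= split_date)
```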
Time Series Data Leakage
Data leakage will cause your model to have very optimistic accuracy metrics, but once real data is run through it, the results are often very disappointing.
DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.
NOTE: This example will use the lit() function. This function is used to allow single values where an entire column is expected in a function call.
Thinking critically about what information would be available at the time of prediction is crucial in having accurate model metrics and saves a lot of embarrassment down the road if decisions are being made based off your results!
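A sketch of that adjustment, recomputing DAYSONMARKET relative to the split date with lit() and datediff():

```python
from pyspark.sql.functions import datediff, lit, to_date

split_date = to_date(lit('2017-12-10'))
# As of the split date, we only know how long each listing has been on the market so far
test_df = test_df.withColumn('DAYSONMARKET', datediff(split_date, 'LISTDATE'))
```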
DataFrame Columns to Feature Vectors
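Spark ML models expect all predictors packed into a single vector column; a sketch with VectorAssembler (the feature list is illustrative):

```python
from pyspark.ml.feature import VectorAssembler

feature_cols = ['LISTPRICE', 'SQFT_TOTAL', 'baths_bucketed']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
df = assembler.transform(df)
```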
Drop Columns with Low Observations
As a rule of thumb, drop features with fewer than 30 observations, since they carry no statistical significance. Removing low-observation features is helpful in many ways: it can improve the processing speed of model training, prevent overfitting by coincidence, and help interpretability by reducing the number of things to consider.
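A sketch, assuming `binary_cols` is a list of 0/1 column names:

```python
obs_threshold = 30
cols_to_remove = []
for col in binary_cols:
    # For a 0/1 column, the sum is the count of positive observations
    obs_count = df.agg({col: 'sum'}).collect()[0][0]
    if obs_count <= obs_threshold:
        cols_to_remove.append(col)

df = df.drop(*cols_to_remove)
```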
Random Forest, Naively Handling Missing and Categorical Values
Random forests are friendly to missing values, need no min-max scaling, are insensitive to skewness, and need no one-hot encoding. Missing values are handled by random forests internally, where they partition on missing values. As long as you replace them with something outside the range of normal values, they will be handled correctly.
Likewise, categorical features only need to be mapped to numbers; they are fine staying all in one column by using a StringIndexer, as we saw in chapter 3. One-hot encoding, which converts each possible value to its own boolean feature, is not needed.
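A sketch of both tricks, an out-of-range fill value for missings and a StringIndexer instead of one-hot encoding (column names illustrative):

```python
from pyspark.ml.feature import StringIndexer

# An impossible value lets the trees partition missings into their own branch
df = df.fillna(-1, subset=['WALKSCORE', 'BIKESCORE'])

# For tree models, a single indexed column is enough; no one-hot needed
indexer = StringIndexer(inputCol='CITY', outputCol='City_Index', handleInvalid='keep')
df = indexer.fit(df).transform(df)
```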
Building Model
Training / Predicting with a Random Forest
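A sketch (the label and feature columns follow the assembler step above; the seed is arbitrary):

```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol='features',
                           labelCol='SALESCLOSEPRICE',
                           predictionCol='Prediction_Price',
                           seed=42)
model = rf.fit(train_df)
predictions = model.transform(test_df)
```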
Evaluate a Model
Write a wrapper:
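A sketch of an evaluation wrapper built on RegressionEvaluator:

```python
from pyspark.ml.evaluation import RegressionEvaluator

def evaluate_predictions(preds, label_col='SALESCLOSEPRICE', pred_col='Prediction_Price'):
    """Return (RMSE, R^2) for a DataFrame of predictions."""
    evaluator = RegressionEvaluator(labelCol=label_col, predictionCol=pred_col)
    rmse = evaluator.evaluate(preds, {evaluator.metricName: 'rmse'})
    r2 = evaluator.evaluate(preds, {evaluator.metricName: 'r2'})
    return rmse, r2

rmse, r2 = evaluate_predictions(predictions)
```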
R^2 is comparable across predictions regardless of dependent variable.
RMSE is comparable across predictions looking at the same dependent variable.
RMSE is a measure of unexplained variance in the dependent variable.
Interpret a Model
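For a random forest, interpretation usually starts from featureImportances; a sketch pairing the scores with the assembler's input columns from earlier:

```python
import pandas as pd

# featureImportances lines up with the VectorAssembler's inputCols order
fi_df = pd.DataFrame({'feature': feature_cols,
                      'importance': model.featureImportances.toArray()})
print(fi_df.sort_values('importance', ascending=False))
```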
Save and Load the Model
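A sketch of persisting and reloading the fitted model (the path is illustrative):

```python
from pyspark.ml.regression import RandomForestRegressionModel

model.save('rfr_real_estate_model')
loaded_model = RandomForestRegressionModel.load('rfr_real_estate_model')
```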