Feature Engineering with PySpark

EDA

PySpark changes fast, so always check which version you are running first; otherwise you won't even know where the documentation stopped matching the behavior!

When framing the problem up front, know your data's limitations. For example, when predicting SALESCLOSEPRICE: if you only have prices for residential buildings in a single district, you can't generalize to a wider area; if you only have data from 2017, you can't infer how prices evolve over time. And although a parquet file carries its own schema, if we are not the ones who defined that schema, it is best to verify it ourselves.

Verify Data Types
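A quick way to confirm what Spark actually inferred; a minimal sketch (the column name follows the course dataset, adjust to yours):

```python
# Print every column with its inferred type
print(df.dtypes)

# Or assert that a specific column came through as expected
assert df.schema['LISTPRICE'].dataType.typeName() == 'integer', \
    'LISTPRICE was not read as an integer'
```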

Visually Inspect Data

Use df.describe(['LISTPRICE']).show() to look at the distribution of a single column. Beyond that, Spark also has built-in aggregate functions for mean, skewness, minimum, covariance, and correlation.
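A sketch of those built-ins via agg() (column names follow the course dataset):

```python
from pyspark.sql.functions import mean, skewness, min as spark_min, covar_samp, corr

df.agg(
    mean('LISTPRICE').alias('mean'),
    skewness('LISTPRICE').alias('skewness'),
    spark_min('LISTPRICE').alias('min'),
    covar_samp('LISTPRICE', 'SALESCLOSEPRICE').alias('covariance'),
    corr('LISTPRICE', 'SALESCLOSEPRICE').alias('correlation'),
).show()
```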

To plot with packages like seaborn, you have to convert to a pandas DataFrame first, so remember to sample the data down beforehand (sample with replacement), since toPandas() pulls everything onto the driver.
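A sketch:

```python
# Keep roughly half the rows so the pandas DataFrame fits in driver memory
sample_pdf = df.sample(withReplacement=True, fraction=0.5, seed=42).toPandas()
```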

lmplot() can also be used to draw a linear model plot of two columns. If they show a clear relationship, they are good candidates to include in our analysis. If they don't, it doesn't mean that we should throw them out; it means we may have to process or wrangle them before they can be used.

A joint plot can also be drawn, annotated with R^2, to quantify the relationship between two variables.
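Both run on the sampled pandas DataFrame from above; a sketch (note that newer seaborn versions no longer annotate R^2 automatically):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Linear model plot of two columns
sns.lmplot(x='LISTPRICE', y='SALESCLOSEPRICE', data=sample_pdf)

# Joint plot with a regression fit
sns.jointplot(x='LISTPRICE', y='SALESCLOSEPRICE', data=sample_pdf, kind='reg')
plt.show()
```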

There is also the countplot, shown as a figure in the original notes, which plots the number of records per category.

Dropping Data

Where can data go bad?

  • Recorded wrong

  • Unique events

  • Formatted incorrectly

  • Duplications

  • Missing

  • Not relevant

For this dataset: 'NO' is an auto-generated record number, 'UNITNUMBER' is irrelevant data, and 'CLASS' is constant for every record, so all three can be dropped. Spark's drop() is a lot like the pandas version, except for the *, which unpacks the list so drop() is applied to every name in it.
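A sketch with those three columns:

```python
cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']

# The * unpacks the list so each name is passed as a separate argument
df = df.drop(*cols_to_drop)
```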

Alternatively:
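A sketch of one alternative, dropping the columns one at a time:

```python
for col_name in ['NO', 'UNITNUMBER', 'CLASS']:
    df = df.drop(col_name)
```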

Text Filtering

isin() is similar to like() but allows us to pass a list of values to use as a filter rather than a single one.
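A sketch of both (the column name and values are hypothetical):

```python
from pyspark.sql.functions import col

# like() takes a single SQL pattern
df = df.where(~col('ASSUMABLEMORTGAGE').like('%Yes%'))

# isin() takes a whole list of values
yes_values = ['Yes w/ Qualifying', 'Yes w/No Qualifying']
df = df.where(~col('ASSUMABLEMORTGAGE').isin(yes_values))
```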

Outlier Filtering
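One common approach is to keep only values within three standard deviations of the mean; a minimal sketch with a hypothetical column:

```python
from pyspark.sql.functions import mean, stddev

# Compute the column's mean and standard deviation
stats = df.agg(mean('SALESCLOSEPRICE').alias('mu'),
               stddev('SALESCLOSEPRICE').alias('sigma')).collect()[0]
low = stats['mu'] - 3 * stats['sigma']
high = stats['mu'] + 3 * stats['sigma']

# Keep only records inside the three-sigma band
df = df.where((df['SALESCLOSEPRICE'] > low) & (df['SALESCLOSEPRICE'] < high))
```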

Drop NA / NULL

DataFrame.dropna() parameters:

  • how: 'any' or 'all'. If 'any', drop a record if it contains any nulls; if 'all', drop a record only if all its values are null.

  • thresh: int, default None. If specified, drop records that have fewer than thresh non-null values. This overwrites the how parameter.

  • subset: optional list of column names to consider.
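A sketch (column names hypothetical):

```python
# Drop any record with a null in either column
df = df.dropna(how='any', subset=['ROOF', 'LOTSIZEDIMENSIONS'])

# Or keep only records with at least two non-null values overall
df = df.dropna(thresh=2)
```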

Drop Duplicates

Because Spark stores data distributed across partitions, this operation does not necessarily drop the first occurrence of a duplicate.
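A sketch:

```python
# Drop rows that are exact duplicates across all columns
df = df.dropDuplicates()

# Or deduplicate on a subset of columns (hypothetical name)
df = df.dropDuplicates(['STREETADDRESS'])
```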

Adjusting Data

"Data does not give up its secrets easily, it must be tortured to confess."——Jeff Hooper, Bell Labs.

Min-Max Scaling

$$x_{i,j}^{*}=\frac{x_{i,j}-x_{j}^{\min}}{x_{j}^{\max}-x_{j}^{\min}}$$

Write a wrapper:
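A minimal sketch, appending a '<col>_scaled' column per input column:

```python
from pyspark.sql.functions import col, min as spark_min, max as spark_max

def min_max_scaler(df, cols_to_scale):
    # Rescale each column to [0, 1] using its observed min and max
    for c in cols_to_scale:
        row = df.agg(spark_min(c).alias('lo'), spark_max(c).alias('hi')).collect()[0]
        df = df.withColumn(c + '_scaled', (col(c) - row['lo']) / (row['hi'] - row['lo']))
    return df

df = min_max_scaler(df, ['LISTPRICE'])
```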

Standardization - Z transform
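The z-transform follows the same pattern:

$$x_{i,j}^{*}=\frac{x_{i,j}-\bar{x}_{j}}{\sigma_{j}}$$

A sketch of a wrapper:

```python
from pyspark.sql.functions import col, mean, stddev

def z_scaler(df, cols_to_scale):
    # Center each column at 0 with unit variance, appended as '<col>_zscaled'
    for c in cols_to_scale:
        row = df.agg(mean(c).alias('mu'), stddev(c).alias('sigma')).collect()[0]
        df = df.withColumn(c + '_zscaled', (col(c) - row['mu']) / row['sigma'])
    return df
```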

Log Scaling

The figure in the original notes shows a typical positively skewed distribution; log scaling pulls in the long right tail.
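A sketch:

```python
from pyspark.sql.functions import log

# Natural log transform of a right-skewed column
df = df.withColumn('log_SalesClosePrice', log(df['SALESCLOSEPRICE']))
```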

User Defined Scaling

For example, for housing data, a listing's time on market can also be expressed as a percentage of the longest time observed on the market.

A log-scaled value can be used to reflect how long ago the house was built.
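A sketch of both (column names are hypothetical; the data is assumed to end in 2017, per the notes above):

```python
from pyspark.sql.functions import col, lit, log, max as spark_max

# Days on market as a percentage of the longest observed time
max_days = df.agg(spark_max('DAYSONMARKET')).collect()[0][0]
df = df.withColumn('percent_days_on_market', col('DAYSONMARKET') / max_days)

# Log-scaled age of the house (+1 avoids log(0) for brand-new houses)
df = df.withColumn('log_age', log(lit(2017) - col('YEARBUILT') + 1))
```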

Missing Value

How does data go missing in the digital age?

  • Data Collection - Broken Sensors

  • Data Storage Rules - 2017-01-01 vs January 1st, 2017

  • Joining Disparate Data - Monthly to Weekly

  • Intentionally Missing - Privacy Concerns

Types of Missing

Missing completely at random

  • Missing Data is just a completely random subset

Missing at random

  • Missing conditionally at random based on another observation

Missing not at random

  • Data is missing because of how it is collected

When to drop rows with missing data?

  • Missing values are rare

  • Missing Completely at Random

isNull(): True if the current expression is null. For example: df.where(df['ROOF'].isNull()).count()

Plotting Missing Values

For example, in the figure from the original notes, you can see that the second column is missing entirely.
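One way to produce such a plot: sample down to pandas and heatmap the null mask. A sketch:

```python
import seaborn as sns

# Sample a subset, convert, and plot which cells are null
sample_pdf = df.sample(withReplacement=False, fraction=0.1, seed=42).toPandas()
sns.heatmap(data=sample_pdf.isnull(), cbar=False)
```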

Imputation of Missing Values

Process of replacing missing values

Rule Based: Value based on business logic

Statistics Based: Using mean, median, etc

Model Based: Use model to predict value


fillna(value, subset=None): value is the value to replace missing values with; subset is the list of column names in which to replace them.
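A sketch of statistics-based imputation (hypothetical column):

```python
from pyspark.sql.functions import mean

# Fill nulls in BEDROOMS with the column mean
col_mean = df.agg(mean('BEDROOMS')).collect()[0][0]
df = df.fillna(col_mean, subset=['BEDROOMS'])
```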

Getting more data

External Data Sources

Thoughts on External Data Sets

Pros:

  • Add important predictors

  • Supplement/replace values

  • Cheap or easy to obtain

Cons:

  • May 'bog' analysis down

  • Easy to induce data leakage

  • Become data set subject matter expert

Join - PySpark DataFrame

Join - SparkSQL Join

Note that when joining you need to watch numeric precision: if the join keys are longitude and latitude, different precision on the two sides can make the join match nothing at all.
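A sketch: round the coordinates to the same precision on both sides before joining. The walk_df and walkscore names are hypothetical, and a SparkSession named spark is assumed for the SQL version:

```python
from pyspark.sql.functions import round as spark_round

# walk_df: hypothetical external dataset with a 'walkscore' column
# Round both sides to 5 decimal places so the keys can actually match
df = df.withColumn('longitude', spark_round('longitude', 5)) \
       .withColumn('latitude', spark_round('latitude', 5))
walk_df = walk_df.withColumn('longitude', spark_round('longitude', 5)) \
                 .withColumn('latitude', spark_round('latitude', 5))

# PySpark DataFrame join
joined_df = df.join(walk_df, on=['longitude', 'latitude'], how='left')

# SparkSQL join
df.createOrReplaceTempView('df')
walk_df.createOrReplaceTempView('walk_df')
joined_df = spark.sql("""
    SELECT df.*, walk_df.walkscore
    FROM df
    LEFT JOIN walk_df
      ON df.longitude = walk_df.longitude
     AND df.latitude = walk_df.latitude
""")
```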

Generate More Features

New features can be created from existing ones by adding, subtracting, multiplying, dividing, taking ratios, averaging, and so on; but sometimes the features manufactured this way won't lead to anything useful either.
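A sketch (hypothetical column names):

```python
# Ratio and combination features from existing columns
df = df.withColumn('price_per_sqft', df['SALESCLOSEPRICE'] / df['SQFTABOVEGROUND'])
df = df.withColumn('total_rooms', df['BEDROOMS'] + df['BATHSTOTAL'])
```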

Automation of Features: FeatureTools & TSFresh

Date Time

Date

Please keep in mind that PySpark's week starts on Sunday, with a value of 1 and ends on Saturday, a value of 7.
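A sketch of the common extractions (dayofweek() is available from Spark 2.3 on; the column name is hypothetical):

```python
from pyspark.sql.functions import to_date, year, month, dayofweek

# Make sure the column is an actual date type first
df = df.withColumn('LISTDATE', to_date('LISTDATE'))

df = df.withColumn('list_year', year('LISTDATE')) \
       .withColumn('list_month', month('LISTDATE')) \
       .withColumn('list_dayofweek', dayofweek('LISTDATE'))  # 1 = Sunday ... 7 = Saturday
```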

Lagging Features

window(): returns a record based off a group of records.

lag(col, count=1): returns the value that is offset by count rows before the current row.
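A sketch of a one-step lag over time-ordered records (column names are hypothetical):

```python
from pyspark.sql import Window
from pyspark.sql.functions import lag

# Order records by date so lag() knows what 'previous row' means
w = Window.orderBy('DATE')

# Put the previous record's value next to the current one
df = df.withColumn('median_price_lag1', lag('median_price', 1).over(w))
```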

Extract Features

Extract with Text Match
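A sketch of creating a boolean feature from a text match (column and pattern are hypothetical):

```python
from pyspark.sql.functions import when, col

# 1 if the description mentions an attached garage, else 0
df = df.withColumn(
    'has_attached_garage',
    when(col('GARAGEDESCRIPTION').like('%Attached Garage%'), 1).otherwise(0)
)
```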

Split Columns
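A sketch with split() and getItem(), assuming a 'City, State' style column:

```python
from pyspark.sql.functions import split

# Split e.g. 'Minneapolis, MN' into two new columns
parts = split(df['CITY'], ', ')
df = df.withColumn('city_name', parts.getItem(0)) \
       .withColumn('state', parts.getItem(1))
```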

Explode
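A sketch: split a comma-separated list column into an array, then explode it into one record per item:

```python
from pyspark.sql.functions import split, explode

df = df.withColumn('garage_list', split(df['GARAGEDESCRIPTION'], ', '))
exploded_df = df.withColumn('garage_item', explode(df['garage_list']))
```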

Pivot

Afterwards, join the pivoted result back onto the original df.
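A sketch continuing the exploded example; 'NO' is assumed here to be a unique record id:

```python
from pyspark.sql.functions import first, lit

# Constant marker so the pivot fills matched cells with 1
exploded_df = exploded_df.withColumn('constant_val', lit(1))

# One dummy column per garage item, keyed by record id
piv_df = exploded_df.groupBy('NO').pivot('garage_item').agg(first('constant_val'))

# Join the dummy columns back onto the original DataFrame
df = df.join(piv_df, on='NO', how='left')
```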

Binarizing, Bucketing & Encoding

Binarizing

Values above the threshold become 1, values at or below it become 0. Note that the column being binarized must be of type double.
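A sketch with pyspark.ml.feature.Binarizer (hypothetical column):

```python
from pyspark.ml.feature import Binarizer

# Binarizer requires a DoubleType input column
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))

binarizer = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
df = binarizer.transform(df)
```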

Bucketing

Define a set of split points and bin the values together. For example, when the tail of a distribution has only a few observations, consider merging them into a single bucket.
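A sketch with Bucketizer; everything from 4 up lands in one open-ended bucket (hypothetical column):

```python
from pyspark.ml.feature import Bucketizer

splits = [0, 1, 2, 3, 4, float('inf')]
df = df.withColumn('BATHSTOTAL', df['BATHSTOTAL'].cast('double'))
bucketizer = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')
df = bucketizer.transform(df)
```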

One Hot Encoding

Spark's one-hot encoding takes two steps: first use StringIndexer to map the strings to numeric indices, then use the encoder to transform the indices into sparse dummy vectors.
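A sketch; note the class name depends on the version (OneHotEncoder in Spark 3.x, OneHotEncoderEstimator in Spark 2.3):

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Step 1: map each string category to a numeric index
indexer = StringIndexer(inputCol='CITY', outputCol='city_idx')
df = indexer.fit(df).transform(df)

# Step 2: expand the index into a sparse one-hot vector
encoder = OneHotEncoder(inputCols=['city_idx'], outputCols=['city_vec'])
df = encoder.fit(df).transform(df)
```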

Choose Model

ml.regression

  • DecisionTreeRegressor

  • GBTRegressor

  • RandomForestRegressor

  • GeneralizedLinearRegression

  • IsotonicRegression

  • LinearRegression

Time Series Test and Train Splits

Write a wrapper:
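A minimal sketch of a date-based split; the split column and the 45-day test window are assumptions:

```python
from datetime import timedelta

def train_test_split_date(df, split_col, test_days=45):
    # Everything before the cutoff date is train, the rest is test
    max_date = df.agg({split_col: 'max'}).collect()[0][0]
    split_date = max_date - timedelta(days=test_days)
    train_df = df.where(df[split_col] < split_date)
    test_df = df.where(df[split_col] >= split_date)
    return train_df, test_df, split_date

train_df, test_df, split_date = train_test_split_date(df, 'OFFMKTDATE')
```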

Time Series Data Leakage

Data leakage will cause your model to have very optimistic accuracy metrics, but once real data is run through it, the results are often very disappointing.

DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.

NOTE: This example will use the lit() function. This function is used to allow single values where an entire column is expected in a function call.
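A sketch of that adjustment; the date and column names follow the narrative above:

```python
from pyspark.sql.functions import datediff, lit, to_date

# As of 2017-12-10 we only know how long each listing has been on the market so far
split_date = to_date(lit('2017-12-10'))
test_df = test_df.withColumn('DAYSONMARKET', datediff(split_date, 'LISTDATE'))
```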

Thinking critically about what information would be available at the time of prediction is crucial in having accurate model metrics and saves a lot of embarrassment down the road if decisions are being made based off your results!

Dataframe Columns to Feature Vectors
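Spark ML models expect all predictors assembled into a single vector column; a sketch with VectorAssembler (the column list is hypothetical):

```python
from pyspark.ml.feature import VectorAssembler

feature_cols = ['BEDROOMS', 'BATHSTOTAL', 'SQFTABOVEGROUND', 'DAYSONMARKET']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
df = assembler.transform(df)
```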

Drop Columns with Low Observations

A common rule of thumb is to drop features with fewer than about 30 observations, since below that they carry no statistical significance. Removing low observation features is helpful in many ways. It can improve processing speed of model training, prevent overfitting by coincidence and help interpretability by reducing the number of things to consider.
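A sketch for binary (0/1) dummy columns; binary_cols is a hypothetical list of such column names:

```python
obs_threshold = 30
cols_to_remove = []

# Count positive observations in each binary dummy column
for col_name in binary_cols:  # binary_cols: hypothetical list of 0/1 columns
    obs_count = df.agg({col_name: 'sum'}).collect()[0][0]
    if obs_count < obs_threshold:
        cols_to_remove.append(col_name)

df = df.drop(*cols_to_remove)
```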

Random Forest, Naively Handling Missing and Categorical Values

Random forests are friendly to missing values, don't need min-max scaling, are insensitive to skewness, and don't need one-hot encoding. Missing values are handled by Random Forests internally where they partition on missing values. As long as you replace them with something outside of the range of normal values, they will be handled correctly.

Likewise, categorical features only need to be mapped to numbers; they are fine to stay all in one column by using a StringIndexer as we saw in chapter 3. One-hot encoding, which converts each possible value to its own boolean feature, is not needed.

Building Model

Training / Predicting with a Random Forest
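A sketch, assuming the 'features' vector from VectorAssembler and the time-based train/test split from above:

```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol='features',
                           labelCol='SALESCLOSEPRICE',
                           predictionCol='Prediction_Price',
                           seed=42)

# Fit on the training window, predict on the held-out window
model = rf.fit(train_df)
predictions = model.transform(test_df)
```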

Evaluate a Model

Write a wrapper:
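A minimal sketch wrapping RegressionEvaluator to report both metrics at once:

```python
from pyspark.ml.evaluation import RegressionEvaluator

def evaluate_predictions(predictions, label_col='SALESCLOSEPRICE',
                         pred_col='Prediction_Price'):
    # Evaluate the same predictions under both metrics
    evaluator = RegressionEvaluator(labelCol=label_col, predictionCol=pred_col)
    r2 = evaluator.evaluate(predictions, {evaluator.metricName: 'r2'})
    rmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
    return r2, rmse

r2, rmse = evaluate_predictions(predictions)
```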

R^2 is comparable across predictions regardless of dependent variable.

RMSE is comparable across predictions looking at the same dependent variable.

RMSE is a measure of unexplained variance in the dependent variable.

Interpret a Model
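For a random forest, interpretation usually means feature importances; a sketch mapping them back to the assembled column names:

```python
import pandas as pd

# Pair each feature fed to VectorAssembler with its importance score
importances = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.featureImportances.toArray(),
}).sort_values('importance', ascending=False)
print(importances.head(10))
```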

Save and Load the Model
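A sketch; the path is arbitrary:

```python
from pyspark.ml.regression import RandomForestRegressionModel

# Persist the fitted model, then load it back later
model.save('rfr_real_estate_model')
loaded_model = RandomForestRegressionModel.load('rfr_real_estate_model')
```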
