Building Recommendation Engines with PySpark
ALS
Collaborative Filtering vs. Content-based Filtering
Content-Based Filtering
Based on features of items
Genre: Comedy, Action, Drama
Animation: Animated, Not-animated
Language: English, Spanish, Korean
Decade Produced: 1950s, 1980s
Actors: Meryl Streep, Tom Hanks
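As a rough illustration of content-based filtering, items can be compared purely by the features listed above. The sketch below is plain Python, not PySpark; the movie titles and the choice of Jaccard similarity are assumptions made for the example.

```python
# Hypothetical movies described only by their features (genre, animation,
# language, decade). Titles and feature assignments are made up.
movies = {
    "movie_a": {"Comedy", "Animated", "English", "1980s"},
    "movie_b": {"Comedy", "Animated", "Spanish", "1980s"},
    "movie_c": {"Drama", "Not-animated", "Korean", "1950s"},
}


def jaccard(a, b):
    """Share of features two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(title):
    """Return the other movie whose features overlap most with `title`."""
    others = [t for t in movies if t != title]
    return max(others, key=lambda t: jaccard(movies[title], movies[t]))


recommended = most_similar("movie_a")
```

No user ratings are involved here, which is the key contrast with collaborative filtering.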
Collaborative Filtering
Based on similar user preferences
Implicit vs. explicit ratings
Data Preparation
Get Integer IDs
Extract unique userIds and movieIds
Assign unique integers to each id
Rejoin unique integer ids back to the ratings data
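The three steps above can be sketched in plain Python to show the logic. The sample rows and the helper name `build_index` are hypothetical, and in PySpark the same steps would be done with DataFrame operations rather than dictionaries.

```python
# Hypothetical ratings rows: (userId, movieId, rating) with string ids.
ratings = [
    ("u_ann", "m_up", 4.0),
    ("u_bob", "m_up", 3.5),
    ("u_ann", "m_jaws", 2.0),
    ("u_cat", "m_heat", 5.0),
]


def build_index(ids):
    """Map each distinct id to an integer (sorted so the result is stable)."""
    return {raw_id: idx for idx, raw_id in enumerate(sorted(set(ids)))}


# Steps 1 and 2: extract the unique ids and assign an integer to each.
user_index = build_index(row[0] for row in ratings)
movie_index = build_index(row[1] for row in ratings)

# Step 3: rejoin the integer ids onto the ratings rows.
ratings_int = [(user_index[u], movie_index[m], r) for u, m, r in ratings]
```

Keeping `user_index` and `movie_index` around makes it possible to translate recommendations back to the original ids later.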
ALS Parameters and Hyperparameters
Arguments
userCol: Name of column that contains user ids
itemCol: Name of column that contains item ids
ratingCol: Name of column that contains ratings
Hyperparameters
rank, k: number of latent features
maxIter: number of iterations
regParam: Lambda, the regularization parameter; a penalty term added to the error metric to avoid overfitting the training data
alpha: Only used with implicit ratings. Sets how much each additional observed interaction (for example, one more play of a song) adds to the model's confidence that the user actually likes the movie or song.
nonnegative = True: Constrains the latent factors to be non-negative, so predictions are never negative
coldStartStrategy = "drop": Addresses a side effect of the train/test split. Users or items that appear only in the test set have no learned factors, so their predictions are NaN. "drop" removes those rows, so RMSE is calculated only on users and items that also appear in the training set.
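The effect of coldStartStrategy = "drop" can be illustrated without Spark. The sketch below is plain Python with made-up rows; it mimics the outcome (test rows with an unseen user or item are excluded from evaluation), not Spark's internal implementation.

```python
# Hypothetical (user, item, rating) rows after a train/test split.
train = [(0, 10, 4.0), (1, 10, 3.0), (1, 11, 5.0)]
test = [(0, 11, 4.5), (2, 10, 2.0), (1, 12, 3.0)]

# Only users and items seen in training have learned latent factors.
train_users = {u for u, _, _ in train}
train_items = {i for _, i, _ in train}

# Keep test rows the model can actually score; the others would be NaN.
test_kept = [
    row for row in test
    if row[0] in train_users and row[1] in train_items
]
test_dropped = [row for row in test if row not in test_kept]
```

Here user 2 and item 12 never appear in training, so both of their test rows are dropped before RMSE is computed.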
Build RMSE Evaluator
An RMSE of 0.633 means that the model's predictions are, on average, roughly 0.633 rating points above or below the actual values in the original ratings matrix.
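For reference, RMSE itself is a short calculation. The sketch below is plain Python with made-up ratings; in PySpark the evaluator computes the same quantity from the prediction and rating columns. The mean absolute error is included only to show why "on average" is an approximation: RMSE weights large errors more heavily.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error, shown for comparison with RMSE."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true ratings and model predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.5, 4.0, 2.0]

score = rmse(actual, predicted)
average_miss = mae(actual, predicted)
```

With these numbers the RMSE is slightly larger than the mean absolute error, because the single 1.0-point miss is squared before averaging.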
Dataset
MovieLens - Explicit
Sparsity
Explore with aggregation function
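Sparsity is the share of the user-by-item matrix that has no rating. A minimal plain-Python sketch with made-up rows follows; in PySpark the same counts would come from distinct counts and groupBy aggregations.

```python
from collections import Counter

# Hypothetical (userId, movieId, rating) rows.
ratings = [
    (0, 0, 4.0),
    (0, 1, 3.0),
    (1, 1, 5.0),
    (2, 2, 2.0),
    (2, 0, 1.0),
]

n_users = len({u for u, _, _ in ratings})
n_items = len({m for _, m, _ in ratings})

# Fraction of possible user/item pairs that have no rating.
sparsity = 1 - len(ratings) / (n_users * n_items)

# Simple aggregations: number of ratings per user, then min / max / average.
ratings_per_user = Counter(u for u, _, _ in ratings)
min_per_user = min(ratings_per_user.values())
max_per_user = max(ratings_per_user.values())
avg_per_user = sum(ratings_per_user.values()) / n_users
```

Real ratings data is usually far sparser than this toy matrix, which is one reason to check sparsity before modelling.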
ALS model buildout on MovieLens Data
ParamGridBuilder
CrossValidator
RandomSplit
In-order
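To get a feel for what a parameter grid and cross-validation cost, the sketch below enumerates a grid and a random split in plain Python. The hyperparameter values, the 80/20 split, the fold count, and the seed are all assumptions chosen for illustration; in PySpark, ParamGridBuilder, CrossValidator, and randomSplit do this work, and randomSplit proportions are only approximate.

```python
import itertools
import random

# Hypothetical candidate values for the ALS hyperparameters.
grid_values = {
    "rank": [10, 50],
    "maxIter": [5, 10],
    "regParam": [0.01, 0.1],
}

# Every combination of values, as a grid builder would enumerate them.
names = list(grid_values)
param_grid = [
    dict(zip(names, combo))
    for combo in itertools.product(*grid_values.values())
]

# One fit per combination per fold (ignoring the final refit on the best one).
num_folds = 5
total_fits = len(param_grid) * num_folds

# Seeded 80/20 split of row indices, for reproducibility.
rows = list(range(100))
shuffled = rows[:]
random.Random(42).shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train_rows, test_rows = shuffled[:cut], shuffled[cut:]
```

Even this small grid leads to 40 model fits, so it is worth keeping the grid narrow on large datasets.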
Model Performance Evaluation and Cleanup
Dataset
Million Song Dataset (Taste Profile subset) - Implicit
http://millionsongdataset.com/tasteprofile/
Data Exploration
Because this dataset is implicit, filter for the non-zero play counts first, then aggregate to get the summary statistics.
Rank Ordering Error Metrics (ROEM)
RMSE can no longer be used here, because with implicit data there is no true rating to compare against. All we have is the number of times a certain song was played and a confidence level (how confident the model is that the user likes that song). Instead, we check on the test set whether the predictions make sense given what the user actually played, for example whether they played a song more than once. ROEM measures whether songs with a higher number of plays receive higher predictions.
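A minimal sketch of ROEM in plain Python, assuming the common definition: the sum of plays times percentile rank, divided by the sum of plays, where each user's songs are ranked by prediction (0.0 for the highest prediction, 1.0 for the lowest). The function name `roem` and the sample rows are hypothetical. Lower is better, and a value near 0.5 is about what random predictions would give.

```python
from collections import defaultdict


def roem(rows):
    """rows: iterable of (user, song, plays, prediction) from the test set."""
    by_user = defaultdict(list)
    for user, _song, plays, prediction in rows:
        by_user[user].append((plays, prediction))

    numerator = 0.0
    denominator = 0.0
    for entries in by_user.values():
        # Highest prediction gets percentile rank 0.0, lowest gets 1.0.
        ordered = sorted(entries, key=lambda e: e[1], reverse=True)
        n = len(ordered)
        for position, (plays, _) in enumerate(ordered):
            pct_rank = position / (n - 1) if n > 1 else 0.0
            numerator += plays * pct_rank
            denominator += plays

    if denominator == 0:
        raise ValueError("no plays in the data, ROEM is undefined")
    return numerator / denominator


# Predictions that agree with play counts, and the same data reversed.
good = roem([("a", "s1", 10, 0.9), ("a", "s2", 1, 0.5), ("a", "s3", 0, 0.1)])
bad = roem([("a", "s1", 10, 0.1), ("a", "s2", 1, 0.5), ("a", "s3", 0, 0.9)])
```

The first call scores close to 0 because the most-played song has the highest prediction; the reversed predictions score close to 1.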
Binary Implicit Ratings
implicit rating
With binary ratings, where the model only predicts 1 or 0, more of the work can go into weighting.
Item Weighting: Movies with more user views = higher weight
User Weighting: Users that have seen more movies will have lower weights applied to unseen movies
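One hypothetical way to turn the two weighting ideas above into numbers is sketched below in plain Python. The formulas (an item's share of all views, and one minus the fraction of the catalogue a user has seen) are assumptions made for illustration, not a method given in these notes or a PySpark API.

```python
from collections import Counter

# Hypothetical binary data: each pair means "this user has seen this movie".
views = {
    ("u1", "m1"), ("u1", "m2"), ("u1", "m3"),
    ("u2", "m1"),
    ("u3", "m1"), ("u3", "m2"),
}

items = sorted({m for _, m in views})
item_views = Counter(m for _, m in views)
user_views = Counter(u for u, _ in views)

# Item weighting: movies with more views get a higher weight.
item_weight = {m: item_views[m] / len(views) for m in items}

# User weighting: the more movies a user has seen, the lower the weight
# applied to the movies that user has not seen.
user_weight = {u: 1 - user_views[u] / len(items) for u in user_views}
```

In this toy data the most-viewed movie m1 gets the largest item weight, and the light viewer u2 gets a larger weight on unseen movies than the heavier viewer u3.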