Building Recommendation Engines with PySpark

ALS

Collaborative Filtering vs. Content-based Filtering

Content-Based Filtering

Based on features of items

  • Genre: Comedy, Action, Drama

  • Animation: Animated, Not-animated

  • Language: English, Spanish, Korean

  • Decade Produced: 1950s, 1980s

  • Actors: Meryl Streep, Tom Hanks

Collaborative Filtering

Based on similar user preferences

Implicit vs. explicit ratings

Data Preparation

Get Integer IDs

  1. Extract unique userIds and movieIds

  2. Assign unique integers to each id

  3. Rejoin unique integer ids back to the ratings data
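The three steps above can be sketched in plain Python on a small in-memory sample. The column names (`userId`, `movieId`, `rating`), the sample values, and the helper `assign_integer_ids` are assumptions made for illustration. On a Spark DataFrame the same idea would typically be expressed with a distinct selection, a generated id column, and a join.

```python
def assign_integer_ids(rows, key):
    """Map each distinct value found under `key` to a unique integer."""
    # Step 1: extract the unique ids (sorted so the mapping is reproducible).
    unique_ids = sorted({row[key] for row in rows})
    # Step 2: assign one integer per unique id.
    return {original: number for number, original in enumerate(unique_ids)}


# Hypothetical ratings with string ids; Spark's ALS expects numeric ids.
ratings = [
    {"userId": "u-ann", "movieId": "m-heat", "rating": 4.0},
    {"userId": "u-bob", "movieId": "m-heat", "rating": 3.5},
    {"userId": "u-ann", "movieId": "m-coco", "rating": 5.0},
]

user_map = assign_integer_ids(ratings, "userId")
movie_map = assign_integer_ids(ratings, "movieId")

# Step 3: rejoin the integer ids back onto the ratings data.
ratings_int = [
    {
        "userIntId": user_map[row["userId"]],
        "movieIntId": movie_map[row["movieId"]],
        "rating": row["rating"],
    }
    for row in ratings
]
```

Keeping the two mappings around makes it possible to translate recommendations back to the original ids later.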

ALS Parameters and Hyperparameters

Arguments

  • userCol: Name of column that contains user ids

  • itemCol: Name of column that contains item ids

  • ratingCol: Name of column that contains ratings

Hyperparameters

  • rank, k: number of latent features

  • maxIter: number of iterations

  • regParam: Lambda, the regularization parameter; a penalty term added to the error metric to avoid overfitting the training data

  • alpha: Only used with implicit ratings. Controls how much each additional interaction (e.g., a song play) adds to the model's confidence that a user actually likes the movie/song.

  • nonnegative = True: Constrains the learned latent factors to be non-negative, so predictions cannot be negative

  • coldStartStrategy = "drop": Addresses an issue with the train/test split. When computing RMSE, only users that have ratings in both the training and test sets are used; users that appear only in the test set are dropped.
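As a minimal sketch, the arguments and hyperparameters listed above can be collected in one place and passed to `pyspark.ml.recommendation.ALS`. The column names and every numeric value below are illustrative assumptions, not tuned settings, and `build_als` is a hypothetical helper. The import is done inside the function so the settings can be inspected without a Spark installation.

```python
# Illustrative settings for explicit ratings; all values are assumptions.
ALS_PARAMS = {
    "userCol": "userId",          # column holding integer user ids
    "itemCol": "movieId",         # column holding integer item ids
    "ratingCol": "rating",        # column holding the ratings
    "rank": 10,                   # number of latent features (k)
    "maxIter": 10,                # number of ALS iterations
    "regParam": 0.1,              # lambda, the regularization strength
    "nonnegative": True,          # keep latent factors non-negative
    "coldStartStrategy": "drop",  # drop rows with NaN predictions
    "implicitPrefs": False,       # explicit ratings
}

# Assumed overrides for implicit data, where alpha becomes relevant.
IMPLICIT_OVERRIDES = {"implicitPrefs": True, "alpha": 40.0}


def build_als(params=None):
    """Create an ALS estimator; needs pyspark and an active SparkSession."""
    from pyspark.ml.recommendation import ALS

    return ALS(**(params or ALS_PARAMS))
```

For an implicit dataset, one option is `build_als({**ALS_PARAMS, **IMPLICIT_OVERRIDES})`, with `ratingCol` pointed at the interaction count column.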

Build RMSE Evaluator

An RMSE of 0.633 means that the model's predictions are typically about 0.633 rating points above or below the corresponding values in the original ratings matrix.
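To make the metric concrete, here is a small plain-Python calculation of RMSE. The rating values and the function name `rmse` are made up for illustration; in a Spark pipeline the same number would normally come from an evaluator object rather than hand-written code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true ratings and model predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.5, 4.0, 2.0]

score = rmse(actual, predicted)
mean_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

Because the errors are squared before averaging, large misses count more heavily, so RMSE is never smaller than the mean absolute error on the same data.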

Dataset

MovieLens - Explicit

Sparsity

$$\text{Sparsity} = 1 - \frac{\text{Number of Ratings in Matrix}}{(\text{Number of Users}) \times (\text{Number of Movies})}$$
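A short worked example may help here. The sketch below computes the ratio of observed ratings to all possible user-movie pairs, together with its complement, the share of empty cells (which is what "sparse" usually refers to). The counts and the function name `matrix_fill` are assumptions for illustration.

```python
def matrix_fill(num_ratings, num_users, num_movies):
    """Return (filled fraction, empty fraction) of the ratings matrix."""
    cells = num_users * num_movies
    if cells == 0:
        raise ValueError("the matrix needs at least one user and one movie")
    filled = num_ratings / cells
    return filled, 1.0 - filled


# Hypothetical counts: 1,000 ratings from 100 users on 200 movies.
filled, empty = matrix_fill(num_ratings=1_000, num_users=100, num_movies=200)
```

With these counts only 5% of the cells hold a rating, and the remaining 95% are empty.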

Explore with aggregation function

ALS model buildout on MovieLens Data

ParamGridBuilder

CrossValidator

RandomSplit

In-order
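The pieces named above might fit together roughly as follows. This is a sketch under assumptions: the candidate values in `PARAM_GRID`, the label column name `rating`, and the helper `build_cross_validator` are all illustrative. The pyspark imports sit inside the function so that the grid itself can be inspected without a Spark installation.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommended settings.
PARAM_GRID = {
    "rank": [10, 50],
    "maxIter": [5, 10],
    "regParam": [0.05, 0.1],
}

# Each combination is one model configuration that cross-validation will fit.
combinations = list(product(*PARAM_GRID.values()))


def build_cross_validator(als, num_folds=5):
    """Wrap an ALS estimator in a CrossValidator scored by RMSE.

    Needs pyspark and an active SparkSession.
    """
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    builder = ParamGridBuilder()
    for name, values in PARAM_GRID.items():
        builder = builder.addGrid(als.getParam(name), values)

    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    return CrossValidator(
        estimator=als,
        estimatorParamMaps=builder.build(),
        evaluator=evaluator,
        numFolds=num_folds,
    )
```

A typical order of operations would be to split first, for example `train, test = ratings.randomSplit([0.8, 0.2], seed=42)`, then call `build_cross_validator(als).fit(train)` and read `bestModel` from the result. Note that the number of fits grows with the grid size times the number of folds.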

Model Performance Evaluation and Cleanup

Dataset

Million Song Dataset - Implicit

http://millionsongdataset.com/tasteprofile/

Data Exploration

Because this dataset is implicit, filter for the non-zero entries before computing aggregate statistics.
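The effect of that filter can be seen on a tiny plain-Python sample. The triples below are hypothetical; on a Spark DataFrame the equivalent would be a filter on the play-count column followed by a group-by aggregation.

```python
from collections import Counter

# Hypothetical (user, song, play count) triples; 0 means "never played".
plays = [
    ("u1", "s1", 3),
    ("u1", "s2", 0),
    ("u2", "s1", 1),
    ("u2", "s2", 7),
    ("u2", "s3", 0),
]

# Keep only the songs that were actually played.
listened = [row for row in plays if row[2] > 0]

# Aggregate on the filtered rows only.
songs_per_user = Counter(user for user, _song, _count in listened)
average_plays = sum(count for _user, _song, count in listened) / len(listened)
```

Without the filter, the zero rows would inflate each user's song count and pull the average play count down.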

Rank Ordering Error Metrics (ROEM)

$$\mathrm{ROEM} = \frac{\sum_{u, i} r_{u, i}^{t} \, \operatorname{rank}_{u, i}}{\sum_{u, i} r_{u, i}^{t}}$$

RMSE can no longer be used here, because with implicit data there is no true value to compare against. All we have is the number of times a certain song was played and a confidence level (how confident the model is that the user likes that song). Instead, we look at the test set and judge whether the predictions make sense, e.g., whether the user played the song more than once. ROEM measures whether songs with a higher number of plays receive higher predictions.
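A plain-Python sketch of the metric follows. It assumes that rank_{u,i} is the percent rank of item i among user u's predictions sorted from highest to lowest (0 for the top prediction, 1 for the bottom), and that r^t_{u,i} is the play count in the test set. Under that assumption a lower ROEM is better. The function name `roem` and the sample rows are illustrative.

```python
from collections import defaultdict


def roem(rows):
    """Rank ordering error metric.

    rows: iterable of (user, item, plays, prediction) tuples from a test set.
    """
    by_user = defaultdict(list)
    for user, _item, plays, prediction in rows:
        by_user[user].append((plays, prediction))

    weighted_rank_sum = 0.0
    total_plays = 0.0
    for entries in by_user.values():
        n = len(entries)
        for plays, prediction in entries:
            # Count this user's items that were predicted strictly higher.
            higher = sum(1 for _, other in entries if other > prediction)
            percent_rank = higher / (n - 1) if n > 1 else 0.0
            weighted_rank_sum += plays * percent_rank
            total_plays += plays

    if total_plays == 0:
        raise ValueError("ROEM is undefined when there are no plays")
    return weighted_rank_sum / total_plays


# Predictions that agree with the play counts...
good = [("u1", "a", 10, 0.9), ("u1", "b", 1, 0.5), ("u1", "c", 0, 0.1)]
# ...and predictions that rank the most played song last.
bad = [("u1", "a", 10, 0.1), ("u1", "b", 1, 0.5), ("u1", "c", 0, 0.9)]

good_score = roem(good)
bad_score = roem(bad)
```

The model whose highest predictions land on the most played songs gets a score near 0, while the reversed ordering scores near 1.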

Binary Implicit Ratings

implicit rating

With binary ratings, where the model only predicts 1 or 0, more of the work can go into weighting.

Item Weighting: Movies with more user views = higher weight

User Weighting: Users that have seen more movies will have lower weights applied to unseen movies
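The notes do not give formulas for these weights, so the sketch below uses two assumed ones purely to show the direction of each effect: an item's weight is its share of all recorded views, and a user's weight shrinks as 1 / (1 + number of movies seen). The view log is hypothetical as well.

```python
from collections import Counter

# Hypothetical binary view log: each pair means "this user watched this movie".
views = [
    ("u1", "m1"), ("u1", "m2"), ("u1", "m3"),
    ("u2", "m1"),
    ("u3", "m1"), ("u3", "m2"),
]

views_per_movie = Counter(movie for _user, movie in views)
views_per_user = Counter(user for user, _movie in views)

# Item weighting (assumed formula): a movie's share of all recorded views.
item_weight = {m: n / len(views) for m, n in views_per_movie.items()}

# User weighting (assumed formula): lower for users who have seen more movies.
user_weight = {u: 1 / (1 + n) for u, n in views_per_user.items()}
```

Here the most viewed movie gets the largest item weight, and the user who has seen the most movies gets the smallest weight on unseen ones.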
