Recommender systems
Building a recommender system is easy with GraphLab Create: simply load data, create a recommender model, and start making recommendations. The data we will use for an example is sitting on an AWS S3 bucket as a csv file. We can load it into a GraphLab Create SFrame with read_csv()
, specifying the "rating" column to be loaded as integers. For other ways of creating an SFrame and doing data munging, see the SFrame chapter.
# Download data if you haven't already
import os
if os.path.exists('movie_ratings'):
data = graphlab.SFrame('movie_ratings')
else:
data = graphlab.SFrame.read_csv("http://s3.amazonaws.com/dato-datasets/movie_ratings/training_data.csv", column_type_hints={"rating":int})
data .save('movie_ratings')
data.head()
user | movie | rating |
---|---|---|
Jacob Smith | Flirting with Disaster | 4 |
Jacob Smith | Indecent Proposal | 3 |
Jacob Smith | Runaway Bride | 2 |
Jacob Smith | Swiss Family Robinson | 1 |
Jacob Smith | The Mexican | 2 |
Jacob Smith | Maid in Manhattan | 4 |
Jacob Smith | A Charlie Brown Thanksgiving / The ... |
3 |
Jacob Smith | Brazil | 1 |
Jacob Smith | Forrest Gump | 3 |
Jacob Smith | It Happened One Night | 4 |
We have the data. It's time to build a model. There are many good models for making recommendations, but sometimes even knowing the right names can be a challenge, much less typing them time after time.
This is why GraphLab Create provides a default recommender called ... recommender
. You can build a default recommender with recommender.create()
.
It requires a dataset to use for training the model, as well as the names of the columns that contain the user IDs, the item IDs, and the ratings (if present).
# The data needs to contain at least three columns: user, movie, and rating.
model = graphlab.recommender.create(data,
user_id="user",
item_id="movie",
target="rating")
Under the hood, the type of recommender is chosen based on the
provided data and whether the desired task is ranking (default) or
rating prediction. The default recommender for this type of data and
the default ranking task is a matrix factorization model, implemented
on top of the disk-backed SFrame data structure. The default solver
is stochastic gradient descent, and the recommender model used is the
RankingFactorizationModel, which balances rating prediction with
a ranking objective. The default create()
function does not allow
changes to the default parameters of a specific model, but it is just
as easy to build a recommender with your own parameters using
the appropriate model-specific create()
function.
The trained model can now make recommendations of new items for users.
To do so, call recommend()
with an SArray of user ids. If users
is set to None, then recommend()
will make recommendations for all
the users seen during training, automatically excluding the items that
are observed for each user. In other words, if data
contains a row
"Alice, The Great Gatsby", then recommend()
will not recommend "The
Great Gatsby" for user "Alice". It will return at most k
new items
for each user, sorted by their rank. It will return fewer than k
items if there are not enough items that the user has not already
rated or seen.
The score
column of the output contains the unnormalized
prediction scores for each user-item pair. The semantic meanings of
these scores may differ between models. For the linear regression
model, for instance, a higher average score for a user means that the
model thinks that this user is generally more enthusiastic than
others.
# You can now make recommendations for all the users you've just trained on
results = model.recommend()
The model can be saved for later use, either on the local machine or in an AWS S3 bucket. The saved model sits in its own directory, and can be loaded back in later to make more predictions.
# Save the model for later use
model.save("my_model")
Et voilà! You've just built your first recommender with GraphLab Create.
Implicit vs. Explicit data
The above example included ratings given to items by users. In situations where users do not provide ratings, a dataset would instead have just two columns -- user ID and item ID. We can still use collaborative filtering techniques to make recommendations. In this case we are leveraging "implicit" data about items that users watched, liked, etc., in contrast to the "explicit" ratings data in the previous example.
Training a model and making recommendations with such data is straightforward.
m = graphlab.recommender.create(data,
user_id='user',
item_id='movie')
recs = m.recommend()
When no target
is available, as above, then by default this returns an ItemSimilarityRecommender which computes the similarity between each pair of items and recommends items to each user that are closest to items she has already used or liked. It measures item similarity with either Jaccard or cosine distance, which can be set
manually using a keyword argument called similarity_type
when creating that
recommender directly:
m = graphlab.item_similarity_recommender.create(data,
user_id='user',
item_id='movie',
similarity_type='jaccard')
When a target
is provided, the default GraphLab Create recommender is a
matrix factorization model. The matrix factorization model can also be
called explicitly with
factorization_recommender.create.
When using the model-specific create function, other arguments can be
provided to better tune the model, such as num_factors
or
regularization
. See the documentation on
FactorizationRecommender for more information.
m = graphlab.factorization_recommender.create(data,
user_id='user',
item_id='movie',
target='rating',
regularization=0.05,
num_factors=16)
All recommender objects in the graphlab.recommender module expose a common set of methods, such as recommend and evaluate. These will be covered in the next few sections.
Side information for users, items, and observations
In many cases, additional information about the users or items can improve the quality of the recommendations. For example, including information about the genre and year of a movie can be useful information in recommending movies. We call this type of information user side data or item side data depending on whether it goes with the user or the item.
Including side data is easy with the user_data
or item_data
parameters to the recommender.create()
function. These arguments
are SFrames and must have a user or item column that corresponds to
the user_id
and item_id
columns in the observation data. Internally,
the data is joined to the particular user or item when training the
model, the data is saved with the model and also used to make
recommendations.
In particular, the
FactorizationRecommender and the RankingFactorizationRecommender both incorporate the side data into the prediction through additional interaction terms between the user, the item, and the side feature. For the actual formula, see the API docs for the FactorizationRecommender. Both of these models also allow you to obtain the parameters that have been learned for each of the side features via the m['coefficients']
argument.
Side data may also be provided for each observation. For example, it might be useful to have recommendations change based on the time at which the query is being made. To do so, you could create a model using an SFrame that contains a time column, in addition to a user and item column. For example, a "time" column could include a string indicating the hour; this will be treated as a categorical variable and the model will learn a latent factor for each unique hour.
# sf has columns: user_id, item_id, time
m = gl.ranking_factorization_recommender.create(sf)
In order to include this information when requesting observations, you may include the desired additional data as columns in an SFrame for the users
argument to m.recommend()
. In our example above, when querying for recommendations, you would include the time that you want to use for each set of recommendations.
users_query = gl.SFrame({'user_id': [1, 2, 3], 'time': ['10pm', '10pm', '11pm']})
m.recommend(users=user_query)
In this case, recommendations for user 1 and 2 would use the parameters learned from observations that occurred at 10pm, whereas the recommendations for user 3 would incorporate parameters corresponding to 11pm. For more details, check out recommend in the API docs.
You may check the number of columns used as side information by querying m['observation_column_names']
, m['user_side_data_column_names']
, and m['item_side_data_column_names']
. By printing the model, you can also see this information. In the following model, we had four columns in the observation data (two of which were user_id
and item_id
) and four columns in the SFrame passed to item_side_data
(one of which was item_id
):
Class : RankingFactorizationRecommender
Schema
------
User ID : user_id
Item ID : item_id
Target : None
Additional observation features : 2
Number of user side features : 0
Number of item side features : 3
If new side data exists when recommendations are desired, this can be passed in via the new_observation_data
, new_user_data
, and new_item_data
arguments. Any data provided there will take precedence over the user and item side data stored with the model.
Not all of the models make use of side data: the popularity_recommender
and item_similarity_recommender
create methods currently do not use it.