Recommendations with Simple Correlation Metrics on Implied Preference Data


When you sit down to write a recommendations system, there are quite a few well-practiced techniques you can use, and it's difficult to know in advance how well they are going to work out when applied to your data. Thanks to the Netflix Prize, which was initiated in 2006 and awarded in 2009, a lot has been written on recommender systems for the Netflix data set. If you happen to have a product catalogue similar to Netflix's (those movies from the 60s are still being viewed and rated), and your users happen to have scored it with a 5-point explicit ratings system, there are some awesome advanced techniques and frameworks that you can take for a spin. Does that sound like you? Show of hands? I didn't think so. Our data is certainly nothing like that.

What to do? I decided to start with something simple, before our inevitable trek into the forests of matrix factorization, stochastic gradient descent, Markov clusters and other impressive-sounding stuff: more on all that in subsequent posts.

So George and I began with the most obvious available literature, the O'Reilly book Programming Collective Intelligence by Toby Segaran (if there's an O'Reilly book on it, you must be able to just do it, right?), and with the simplest data set we could imagine: a set of relations between users and items, which we interpret as the user's preference for the item. This might be an item-view, an item-purchase (unless closely followed by a return), or any other event we think might come in handy. We need a general term for this activity, for discussion purposes, not 'view' or 'purchase': let's call it 'flagging'. This data is most like the book's del.icio.us example: people either linked to something or did not. We're also going to limit ourselves to the simplest possible tool, a SQL interface to something that is more or less a bunch of tables. We have tried the following, or at least parts of it, on MS SQL Server, Netezza and Hive.

It's impractical for us to load our entire data set into memory, or even to represent the relationships of all users and all items as explicit data, so we look for a sparse durable representation: a relational table in which a record means that a user flagged an item within a certain context. The context depends on the data source: a user viewed an item in a particular month/week/day/session, a user purchased an item in a particular month/week/day/order, etc.
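To make the shape of that table concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the engines mentioned above, and the table name `flag`, its column names, and the composite primary key are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flag (
    user_id INTEGER NOT NULL,
    item_id TEXT    NOT NULL,
    context TEXT    NOT NULL,   -- e.g. a session id, an order id, or a month
    PRIMARY KEY (user_id, item_id, context)   -- hypothetical key choice
);
INSERT INTO flag VALUES
    (1, 'A', 'session-1'),
    (1, 'A', 'session-2'),   -- same user and item, different context
    (1, 'B', 'session-1'),
    (2, 'A', 'session-3'),
    (3, 'C', 'session-4');
""")

# Only flagged combinations are stored; unflagged user-item pairs have no row.
stored = conn.execute("SELECT COUNT(*) FROM flag").fetchone()[0]
pairs = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT user_id, item_id FROM flag)"
).fetchone()[0]
users = conn.execute("SELECT COUNT(DISTINCT user_id) FROM flag").fetchone()[0]
items = conn.execute("SELECT COUNT(DISTINCT item_id) FROM flag").fetchone()[0]

print(stored, "rows,", pairs, "distinct user-item pairs,",
      users * items, "cells in the equivalent dense matrix")
```

The point of the toy data is only that the table grows with the number of flag events, not with users times items.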

Now let's compute the following:

For each user-item pair: how many times did the user flag the item?
For each user: how many items were flagged, and how many flag events in total?
For each pair of items flagged by at least one user: how many users flagged both items? We call this 'overlap'. We exclude outliers at this point: users with too few items flagged, or too many flag events.
For each item: how many users flagged it? We call this 'popularity'.
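The four computations above can be sketched as a handful of GROUP BY queries. This is an illustration rather than the queries we actually ran: it uses SQLite through Python's sqlite3 module, the table and column names are made up, and the outlier thresholds `MIN_ITEMS` and `MAX_FLAGS` are hypothetical values. It also assumes popularity is counted over all users, while only overlap applies the outlier filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flag (user_id INTEGER, item_id TEXT, context TEXT);
INSERT INTO flag VALUES
    (1, 'A', 's1'), (1, 'A', 's2'), (1, 'B', 's1'),
    (2, 'A', 's3'), (2, 'B', 's3'), (2, 'C', 's3'),
    (3, 'B', 's4'), (3, 'C', 's4'),
    (4, 'A', 's5');

-- Per user-item pair: how many times did the user flag the item?
CREATE TABLE user_item_count AS
SELECT user_id, item_id, COUNT(*) AS flag_count
FROM flag GROUP BY user_id, item_id;

-- Per user: distinct items flagged and total flag events.
CREATE TABLE user_summary AS
SELECT user_id, COUNT(*) AS item_count, SUM(flag_count) AS total_flags
FROM user_item_count GROUP BY user_id;

-- Per item: how many users flagged it ('popularity').
CREATE TABLE item_popularity AS
SELECT item_id, COUNT(*) AS popularity
FROM user_item_count GROUP BY item_id;
""")

# Hypothetical outlier thresholds: drop users with too few items flagged
# or too many flag events.
MIN_ITEMS, MAX_FLAGS = 2, 100

# Per item pair: how many retained users flagged both ('overlap').
# a.item_id < b.item_id keeps each unordered pair exactly once.
overlap = conn.execute("""
SELECT a.item_id, b.item_id, COUNT(*) AS overlap
FROM user_item_count a
JOIN user_item_count b
  ON b.user_id = a.user_id AND a.item_id < b.item_id
JOIN user_summary u ON u.user_id = a.user_id
WHERE u.item_count >= ? AND u.total_flags <= ?
GROUP BY a.item_id, b.item_id
ORDER BY a.item_id, b.item_id
""", (MIN_ITEMS, MAX_FLAGS)).fetchall()

popularity = conn.execute(
    "SELECT item_id, popularity FROM item_popularity ORDER BY item_id"
).fetchall()

print(overlap)
print(popularity)
```

On this toy data, items A and B share two users, A and C share one, and B and C share two. The self-join on `user_item_count` is where the cost lives: a user who flagged n items contributes n(n-1)/2 pairs, which is one reason to drop the heaviest users before this step.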

Now we're ready to compute any of the correlation metrics in the book. Which ones make sense? Not Pearson correlation or Euclidean distance. You can compute them well enough, but try to imagine what they mean in the context of this data. What kind of straight line, or triangle's hypotenuse, are you fitting these data points to? None that I can picture. The data is too much of a degenerate case of anything to which those concepts might usefully apply: you get a lot of scores of exactly '1' or '0' or $\sqrt{2}$. Raw frequency makes sense, but it's a bit of a blunt instrument. Prior to our setting up this system, there was something on the site that essentially used frequency along the lines we're talking about here. It overvalued very popular things, to be sure, but in the end people clicked and bought things off those recommendations, so it wasn't terrible. But what about the Jaccard coefficient (sometimes called the Tanimoto coefficient)? $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$, where A and B stand for the sets of users who flagged each item. Sounds plausible. Since $|A \cup B| = |A| + |B| - |A \cap B|$, we'll interpret the Jaccard coefficient of our items A and B as (overlap of A and B)/(popularity of A + popularity of B - overlap of A and B). Makes sense to me, and it's straightforward in SQL! Our final table (let's call it 'item_affinity_jaccard') will have at least 3 columns: the id of A, the id of B, and the coefficient.
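As a rough sketch of that last step, the query below builds 'item_affinity_jaccard' from an overlap table and a popularity table. The input table names, column names, and toy numbers are assumptions for illustration, and SQLite (via Python's sqlite3 module) stands in for whatever engine you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hand-entered toy inputs (hypothetical values).
CREATE TABLE item_popularity (item_id TEXT PRIMARY KEY, popularity INTEGER);
INSERT INTO item_popularity VALUES ('A', 3), ('B', 3), ('C', 2);

CREATE TABLE item_overlap (item_a TEXT, item_b TEXT, overlap INTEGER);
INSERT INTO item_overlap VALUES ('A', 'B', 2), ('A', 'C', 1), ('B', 'C', 2);

-- Jaccard = overlap / (popularity of A + popularity of B - overlap).
-- Multiplying by 1.0 forces floating-point division.
CREATE TABLE item_affinity_jaccard AS
SELECT o.item_a,
       o.item_b,
       1.0 * o.overlap
           / (pa.popularity + pb.popularity - o.overlap) AS jaccard
FROM item_overlap o
JOIN item_popularity pa ON pa.item_id = o.item_a
JOIN item_popularity pb ON pb.item_id = o.item_b;
""")

affinity = conn.execute(
    "SELECT item_a, item_b, jaccard FROM item_affinity_jaccard "
    "ORDER BY jaccard DESC"
).fetchall()

for item_a, item_b, jaccard in affinity:
    print(item_a, item_b, round(jaccard, 3))
```

With these numbers, B and C score 2/3, A and B score 1/2, and A and C score 1/4. Note that A and B share as many users as B and C do, but rank lower because A is the more popular item, which is exactly the correction raw frequency lacks. On engines that do integer division by default, the cast to a floating-point type matters; without it every coefficient below 1 comes out as 0.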

We placed those results in a test harness, and the results were visibly, obviously better than the frequency-based thing that was there before. But could we trust our eyes? Hard to know without trying it. We replaced it on the site, and clickthrough rose 18%. That we can trust! If you're starting out with recommendations, I'd say give that a try.

For extra points, let's move on to a less degenerate case. We'll add a new product C to our A and B, and observe that if A is connected to B, and B is connected to C, then A is connected, in a way, to C. This will be quick, dirty, and not scalable at all (scalable in the sense that, if you wanted to add a D, E or F, you would quickly be out of luck). But if you've got this far, and you've never gone through an exercise to convince yourself that graph processing gets ugly fast when your only tools are a bunch of relational tables, try the following:

Summarize previous results in a table: for each item, compute the count of users who flagged it, total flags, and the number of other items for which you can compute the Jaccard coefficient (let's call this the 'recommendation count', and these items 'recommendable items').
Make an item_relationship_step2 table that contains all the connected pairs. Avoid combinatorial explosion by only including items where recommendation count is greater than 0 and less than something that excludes items for which you already have so many direct pairs that you don't really need the farther-away things.
Join item_affinity_jaccard to itself and then to item_relationship_step2, and compute the two-hop distance in whatever way you think best.
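Here is one way those three steps might look, as a sketch under several assumptions. It starts from a hand-entered 'item_affinity_jaccard', so the summary table carries only the recommendation count (user and flag counts would come from the flag data). The `edge` view, the cutoff of 3, the decision to skip pairs that already have a direct coefficient, and above all the two-hop score (the best product of the two coefficients along any path) are choices made for this example, not something the steps above prescribe. SQLite via Python's sqlite3 module is again a stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_affinity_jaccard (item_a TEXT, item_b TEXT, jaccard REAL);
INSERT INTO item_affinity_jaccard VALUES
    ('A', 'B', 0.5), ('B', 'C', 0.4), ('B', 'D', 0.2), ('C', 'D', 0.8);

-- Helper view (an assumption): each pair visible from both sides.
CREATE VIEW edge AS
SELECT item_a AS src, item_b AS dst, jaccard FROM item_affinity_jaccard
UNION ALL
SELECT item_b, item_a, jaccard FROM item_affinity_jaccard;

-- Step 1: recommendation count = number of direct partners per item.
CREATE TABLE item_summary AS
SELECT src AS item_id, COUNT(*) AS recommendation_count
FROM edge GROUP BY src;

-- Step 2: pairs connected through one intermediate item. Only source items
-- with 0 < recommendation_count < 3 are expanded (3 is a hypothetical
-- cutoff), and pairs that already have a direct coefficient are skipped.
CREATE TABLE item_relationship_step2 AS
SELECT DISTINCT e1.src AS item_a, e2.dst AS item_c
FROM edge e1
JOIN edge e2 ON e2.src = e1.dst AND e2.dst <> e1.src
JOIN item_summary s ON s.item_id = e1.src
WHERE s.recommendation_count > 0
  AND s.recommendation_count < 3
  AND NOT EXISTS (SELECT 1 FROM edge d
                  WHERE d.src = e1.src AND d.dst = e2.dst);
""")

# Step 3: join the affinity edges to themselves and to the step-2 pairs.
# The score here is the largest product of coefficients over any
# intermediate item; this is one arbitrary choice among many.
two_hop = conn.execute("""
SELECT r.item_a, r.item_c, MAX(e1.jaccard * e2.jaccard) AS two_hop_score
FROM item_relationship_step2 r
JOIN edge e1 ON e1.src = r.item_a
JOIN edge e2 ON e2.src = e1.dst AND e2.dst = r.item_c
GROUP BY r.item_a, r.item_c
ORDER BY r.item_a, r.item_c
""").fetchall()

for item_a, item_c, score in two_hop:
    print(item_a, item_c, round(score, 3))
```

On this toy graph, A reaches C and D through B, and C and D each reach A through B; B itself is skipped because it already has three direct partners. Every additional hop means another self-join of `edge`, which is where the "not scalable at all" warning starts to bite.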