At Wayfair, we want to predict the performance of new sorting algorithms before we launch them. Why? Sorting algorithms make money!
Think about it: what’s the main difference between scrolling randomly through our ~18 million products versus actually finding the thing you want and buying it? A sort order that actually puts it in front of you.
E-commerce companies have a tried-and-true method of improvement called A/B testing. You just put your new algorithm (or website, or whatever) up against your old one, allow 50% of the traffic to flow to each, and see which one does better. So why can’t we just do that here?
Well, we have 18MP18M = [18M! / (18M - 18M)!] = 18M! = practically infinite permutations of our products. So that would take (practically) infinite time. And you know how your boss always says you have infinite time to get this project done? No? Same here!
So if we’re not going to actually show these sorts to customers, we run into another big problem: sparse (or entirely missing) data. That is, how do you evaluate how a product will perform in a particular position on a particular page if you’ve never seen it there before?
At this point we have some data to reason from (lots of customers are seeing the current sort), but relative to the total sort space it’s very sparse. If we want to evaluate totally new sorts (which we do), a lot of the data about how a product performs in a particular position will be missing. How can we approach this problem in a way that balances our need for information about different sorts with our desire to keep giving our customers a reasonably good experience?
Importance Sampling (IS)
+ Pareto smoothing
|Compare the relative probability of putting products in positions from the sorts we’ve seen to the ones we haven’t yet||Control the long tail / infinite variance problem inherent to IS by fitting a particularly useful distribution to it||Get more information about products in slightly different positions so that we can reason about performance|
What we can do with the data we actually have
Importance sampling is a well-studied [1,2] and intuitive way to estimate the properties of one probability distribution by observing another. In our case, this means estimating the way customers will react to a new, not-yet-used sorting algorithm (let’s call this the testing policy, or t) by observing their reactions to the sorting algorithm we currently have in place (the production policy, or p). We’ll do this by focusing on the cases where p happened to behave like t.
To make this a bit more concrete, imagine two simple ways of sorting three products:
Now imagine that we have some “true” percentage of people that will click on a particular product in a particular position. Generally higher positions on a product results page are better, but this position bias might make a bigger difference for some products than for others. Putting some numbers on this, let’s say the true Click-Through Rate (CTR) for each product in each position is:
Obviously, if we have a particular sorting policy in place on our site (say, policy p), we have quite a bit of customer behavior data from people interacting with our site every day. From this we can calculate the expected value Ep by simply multiplying the probabilities and summing them:
If we did an A/B test for each of these sorting policies, we could gather data to calculate the overall change in RewardR by simply taking the difference between the expected values:
Offline, however, we only have data from a single policy (p)! How do we figure what would have happened in the never-put-into-production policy t from this? Weight the value of each observation by the relative probability that it would be shown in t versus p:
where t(x | y) is the probability that we would have seen the product x given the position y under the policy t. In order to estimate the overall policy value for t from p, we will need this weight for every product and position pair that was actually ever observed in policy p .
In terms of the above example, we will be able to estimate the behavior for Product A in policy t from policy p behavior by saying
|Product||Position||P(t puts it here)||P(p puts it here)||Weight|
|A||1||11%||80%||0.11 / 0.8 = 0.1375|
|A||2||70%||15%||0.7 / 0.15 = 4.667|
|A||3||19%||5%||0.19 / 0.05 = 3.8|
… and so on for the other products.
Once we have these weights, we can multiply each observed interaction (of a customer with a given product in a given position) by its appropriate weight. This gives us the relative amount that we “would have” seen that product in that position under the new sorting rules. For example, any clicks we saw for product A in position 3 will be multiplied by 3.8, because t shows that product in that position 3.8 times as often.
To estimate the total number of clicks (C) we would have seen if the new policy had been in place during our observation period, we simply sum all of the clicks (or rewards, r) we saw, multiplied by the weight for the product that was clicked (x) in that position it was clicked (y):
Then if we want the “value,” or number of clicks per customer, we can expect from the new policy, we simply divide that total by the number of observations n:
Note that if we have a binary reward (like a click), we only need to consider cases where a customer actually clicked, because for non-clicks r = 0 and there’s no change to the sum. This can cut down on the data for this method considerably, since our key performance metrics (clicks, orders, etc.) are almost all relatively rare events.
This gives us an unbiased estimate of the reward we could expect for policy t from observations of policy p.
But we still have a problem: sparse data means large variance
Importance sampling is an unbiased estimator, meaning that errors where the estimate is too large will exactly cancel out the cases where the estimate is too small, given enough tries (or enough time). Sadly, that doesn’t mean it will actually give you the right answer on any given estimation.
If you think about the math, this makes sense: we’re giving the most weight to observations that are very rare in our current production sort but common in the new untested sort. In a perfect (noiseless) world this would be an accurate estimate; the problem is that in the real world the rarer an observation is, the noisier it’s likely to be. Giving too much weight to a few rare, possibly noisy observations makes our entire estimate unreliable. This kind of error actually had a substantial impact on some 2016 election forecasts, such as when one voter was given 300x as much weight as some others by a polling firm.
In fact, while the bias of this approach is zero, the variance can be arbitrarily large. So the estimates we get for any particular offline test can be arbitrarily bad, even though the errors would eventually cancel out if we ran the test enough times.
If we’ve never seen a product x in a position y, then there can be no known reward associated with it. In other words, from the equation for C (the estimated number of clicks, above), this means that the r (reward) value is always zero - we never saw it in production, so we never saw it clicked. This makes this particular (x, y) pair’s contribution to the total estimated value of the new sorting policy (V, above) effectively zero, even if the new policy would show it all the time.
There’s really no way around this other than to show each product in each position at least a few times. We’ll return to the practical considerations of what this requires below, but for now, consider that for base importance sampling, showing a given (x, y) pair just a few times can be even more dangerous to the estimation than never showing it at all. Let’s take as an example the probability that a given Product X will show up in each of the top 50 positions under two different policies, p and t:
We can calculate the IS weights and plot them as well:
As you can see, while most of the weights are between 0 and 1, weights for positions high in the sort order (say, positions 1-10) are between 10k and 200k. This means if we ever saw Product X in one of those positions and it actually got a click, our estimate for policy t would be dramatically changed by that one piece of information.
This means that under importance sampling, any given estimate could be massively incorrect. So... not actually very useful as an offline estimator.
Limiting the variance through Pareto smoothing
In e-commerce, there is often a lot of value in the “minimum-viable product” or “lowest-hanging fruit” — do the quickest thing that will get the job done, so that we can actually be using the algorithm even while we endeavor to improve it. Since the problem with importance sampling arises when the weights get too large, the simplest way to fix it is simply to cut off the weight of a given product+position pair if it rises above a certain value; this is known as “capping” the weights. The cap can be an actual number (say, 1) or a relative one for a given estimate (say, the 90th percentile of all the weights). Capping in this way does indeed reduce the variance of the estimate, as well as ensuring that variance is finite. So: problem solved, right?
Sadly, not quite. Capping introduces a large amount of bias because it fudges the most important data points we have (the ones with the largest importance weights). Our new estimate might treat two data points as equally important, even if one’s importance weight is 100x as large as the other—and recall that the really large weights are likely to be the noisiest, because those are the ones that are particularly rare in the data we’re actually collecting from policy p. I’m going to handwave through the testing we did to prove this was the case, in order to get to the more interesting piece: Pareto smoothing.
Since we need to keep some of the information contained in the largest weights without letting them take over the estimate completely, Vehtari, Gelman, and Gabry suggested smoothing these tail weights using a generalized Pareto distribution. Instead of clumsily setting all of the large weights to exactly the same value, this technique lets us rein the weights in more carefully, for a better bias-variance tradeoff. Additionally, this technique can alert us when our reward estimates are likely to be unreliable (see the next section). This technique has allowed us to create a useful offline estimator here at Wayfair.
Briefly, this method is as follows:
- Determine a reasonable number of importance weights that should be used to fit this Pareto distribution. For small samples they suggest this could be the largest [N / 5]; for our purposes when dealing with many different products even just in the first page of results, we find that the largest [3 * Sqrt(N)] gives a much smaller number that still represents the entirety of the dominant “heavy tail” of the weights.
- Fit a Pareto distribution to all values in this “heavy tail,” and estimate the shape parameter (k) for this distribution.
- If k > 0.5*, note that the tail estimates might be arbitrarily large, and so our overall estimate variance may be similarly large. For our purposes this means that we should collect more data before using the offline evaluation on this particular novel sorting algorithm. See below for a detailed description of why this matters.
- Replace the values in this tail with the order statistics of the fitted Pareto distribution (leaving all the other weights unchanged).
- Estimate the RISvalue as above, using the weights from step 4.
The Pareto family is a good choice for this smoothing because it can have very heavy tails, which is what we expect from our weight distributions in the worst case. By fitting a particular Pareto distribution to the largest weights, we get an estimate of how heavy the tail is, and how much it’s likely to interfere with our estimates, via a shape parameter called k (or sometimes ξ).
How Pareto’s shape parameter helps identify bad estimates
Remember, normal importance sampling estimates are unbiased because they’re almost always only slightly off in one direction, but occasionally hugely off in the opposite direction. Given that, a method that can flag “this might be one of those bad times” is incredibly useful for offline evaluation here at Wayfair. So how does it work?
As the shape parameter increases, the Pareto distribution puts more and more weight into its heavy tail. This has the effect that fewer and fewer of the normal statistical properties exist/operate as expected. Specifically, at k <= 0.1, all of our normal statistical tools work (for example, we can use the Central Limit Theorem to talk about the variance of our estimate). At k >= 1, the distribution has no fixed mean, and almost everything breaks: we can’t even rely on the law of large numbers to help us estimate our reward at this point. Between these two values of k, magic happens.
By fitting a Pareto curve to the excess IS weights, we can determine the shape parameter corresponding to this fitted curve, and thus know whether the distribution is heavy-tailed. If k is less than 0.5, we know that the fitted weights can at least be contained within a finite variance, and the resulting estimate is likely to be trustworthy. If k is greater than 0.5, we know that the fitted weights in fact have infinite variance, and we should be very suspicious of this particular estimate (read: gather more data and then estimate again). In practice, Vehtari and colleagues showed that in their tests, k values up to around 0.7 gave useful estimates.
What we can do when data are missing
One great thing we can do as a company with thousands of customers buying things from us every day is just go get more data. Coming from an academic background, as I do, it is almost unbelievable how fast we can generate new data when there’s a solid analysis showing that the benefits outweigh the costs of doing so. As such, let’s explore that analysis for PSIS.
One problem this Pareto smoothing method doesn’t address is: what if we’ve never actually seen a particular product in a given position before? While we can make some industry-specific heuristic guesses about the value of a products (murphy beds are less likely to be bought than platform beds, say) or positions (for customers who grew up reading languages left to right and top to bottom, positions above and to the left tend to be observed first and thus are more valuable), it is extremely difficult to create an industry-agnostic statistical way to make estimates about pairs that have never been observed. Importance sampling ignores them entirely. As such, it’s critical to any importance-sampling-based approach that we have some stochasticity in the observed policy, so that there is at least a possibility of seeing each product in each position. Of course, randomizing our products can be expensive, since customers usually don’t want to see items in random order. So the question then becomes, what is the cost of adding randomness, compared to the benefit of better importance sampling?
How much randomness is enough?
Products in the Wayfair catalog are generally sorted according to some ranking score, whether they’re personalized for a given customer, filtered according to some user search, or just displayed in the default order. I’m not going to give away any secret sorting sauce, but the issues and methods I’m talking about apply to these algorithms across the board. As such, let’s take a very simple example algorithm: imagine that our sorting score is simply a count of the number of times a product had been clicked over its lifetime in the catalog. This has a self-reinforcing property: the best (or at least, good enough) products will get more clicks, which will put them at the top of the sort, which will get them viewed by more customers, which will get them more clicks. Thus, the scores tend to follow a power law distribution, much more spread out at the top of the sort (for products in positions 1-10, say, that almost every customer who loads the page will see and have a chance to click) than further down (where only a relatively few dedicated customers go).
This gives us an easy method for controlling the amount of difference between a simple sort-by-score ordering of products and one that adds a bit of randomness to the score. By giving each score an identical standard deviation and sampling from a normal distribution, we can get a new score and re-rank the products by those sampled distributions. Since the scores at the top are so much more spread out, this means that they’re much less likely to change position than those lower down. For instance, the product in position 2 gets the distribution pictured in red—clearly it might move anywhere within the top ten or so, but not much lower than that. By contrast, the product in position 50 could easily move anywhere from position 5 to 1000+, but will almost never displace the products at the very top. This allows us to balance the amount of randomness we inject into the sort (and thus, the accuracy of our offline predictions of different sorting algorithms) with the likely cost of putting worse products near the top.
Simulated validation and comparison of methods
As with many methods, this offline evaluation of sorts would be potentially expensive to validate live on the site (which is one reason we’d like to have a good offline validation method). Thus, before putting it on the site, we tested it via Monte Carlo simulation to determine how much accuracy we could expect from a given amount of randomness injected into a given number of sorts. This has two benefits: first, it allows us to show that the PSIS method does at least as well as any other importance sampling method (in addition to giving us the shape parameter k that indicates whether the tail fitting can be trusted to have finite variance).
Second, these simulations can provide a cost/benefit analysis detailing exactly how much of a given performance metric (clicks, say, or dollars) we would lose in order to get a certain level of accuracy in our offline evaluations. Of course, these analyses are subject to their generating assumptions (particularly those involving how good the current observable sort is compared to the randomized one) —but we can at least use this to generate a maximum expected loss by assuming that the current sort is perfect and any deviation from it will cost us in terms of the specified metric. This puts the question as to how much we are willing to spend on improving our offline evaluations into the realm of a business decision, rather than a technical or mathematical one.
Result and application
These simulations show that the Pareto smoothing method does indeed give a good balance between the unbiased but unbounded-variance importance sampling and the fixed-variance but high-bias capped importance sampling. Given this, along with the maximum-loss estimate of our cost/benefit analysis, we can provide good offline evaluations of sorts close to the current one while limiting our cost. Thus, we will shortly be rolling out a framework for randomizing the product sort in a cost-controlled manner for limited numbers of customers. Future versions will allow particular stakeholders to randomize sets of products for individual customer groups, and thereby gather data on personalized sorting algorithms as well, letting us improve our offline evaluation of sorts based on customer browsing.
No model is an island, and the team here at Wayfair is hugely supportive and collaborative. I’d like to particularly thank the following folks for their extensive contributions:
Peter B. Golbus
Rachel Kirkwood Magnusson
I would also like to thank PAPIs (www.papis.io) for allowing us to present these methods at the 2018 conference.
 Gilotte, Calauzènes, Nedelec, Abraham, Dollé. Offline A/B testing for Recommender Systems. WSDM 2018, February 5–9, 2018, Marina Del Rey, CA
 Vehtari, Gelman, Gabry. Pareto Smoothed Importance Sampling. 2017
 Cohn. How one 19-year-old Illinois man is distorting national polling averages. New York Times, October 12, 2016. Accessed 5/29/19 from https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html