Bayesian Product Ranking at Wayfair

Wayfair has a huge catalog with over 14 million items. Our site features a diverse array of products for people’s homes, with product categories ranging from “appliances” to “décor and pillows” to “outdoor storage sheds.” Some of these categories include hundreds of thousands of products; this broad offering ensures that we have something for every style and home. However, the large size of our product catalog also makes it hard for customers to find the perfect item among all of the possible options.

At Wayfair, we are constantly working to improve our customers’ shopping experiences. If we are not able to personalize a customer’s experience, for example, because they are a first-time customer and we do not yet know their preferences, then it is important that we make it easy for them to find the products with the broadest appeal. This post features a new Bayesian system developed at Wayfair to (1) identify these products and (2) present them to our customers.

Fig. 1: With 25,702 shower curtains in our store, how did we decide to show our customers these three?

Building this system was a team effort, and we’ll discuss four aspects of the project below.

1. Determining products’ general appeal in a variable environment
2. Making the most of our data
3. Updating the model over time
4. Ranking products in real time

1. Determining products' general appeal in a variable environment

When Wayfair’s team of data scientists tries to identify the most appealing products, we immediately run into a problem: each product’s success in our store is heavily influenced by our own sorting algorithm. Products shown at the top of the first page have enormous visibility, and tend to be ordered more frequently regardless of their intrinsic appeal to a broad customer base. For example, we frequently see cases where a specialized product at the top of the page gets ordered twice as often as one with broad appeal but less exposure (illustrated below).

Fig. 2: An illustration of the power of position effects. The black curve shows how often an average product would be ordered, if shown at different positions on the page. The red product is ordered about twice as often as the blue one—despite being much less appealing than average—because it receives more attention at the top of the page. If we switched the positions of these two products on the page, more people would see the blue product, which is more appealing than average.

This figure suggests that we could model each product's intrinsic appeal as the vertical difference between (1) its order rate and (2) the average order rate of any given product in that position, indicated by the black curve. This would give us a notion of position-adjusted product performance, and allow us to explain each product's historical performance as a combination of “position effects” (black curve) and “product effects” (vertical offsets from the black curve). (In practice, we also adjust for a range of other factors, such as whether the web page has been filtered or whether it is being viewed on a mobile device, but we will ignore these factors for the purpose of this post.)

We implemented this approach with logistic regression, which represents the probability that a given customer will order product i when we present it in position j using an equation like this one:
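A form consistent with this description is sketched here. The symbol names (alpha for an intercept, beta_i for the product effect of product i, gamma_j for the position effect of position j) are assumptions introduced for illustration:

```latex
\[
\operatorname{logit} P(\text{order} \mid \text{product } i,\ \text{position } j)
  = \alpha + \beta_i + \gamma_j,
\qquad
\operatorname{logit}(p) = \log\frac{p}{1-p}
\]
```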

where “product effects” describe deviations between a given product’s performance and the average, and “position effects” describe the same thing for different positions on our site. In general, products whose “product effect” estimates are large and positive will tend to do well for a general audience, regardless of where they appear on the page.

As discussed below, we took a Bayesian approach to fitting this model, so that we can use our prior information about our products’ appeal to customers. More concretely, we use the pystan package in Python to find the maximum a posteriori (MAP) estimate for each coefficient, and then use a Laplace approximation to turn these estimates into full Gaussian posterior distributions.
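To make the MAP-plus-Laplace step concrete, here is a minimal standard-library sketch for a single product effect. It is not the pystan code used in production: it assumes the intercept and position effect are already known and folded into a per-impression `offset`, it finds the MAP estimate with Newton's method, and it uses the curvature at the MAP to form the Gaussian approximation. The function names, the prior, and the example numbers are all hypothetical.

```python
import math

def sigmoid(x):
    """Inverse logit: maps a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def grad_and_curvature(beta, observations, prior_mean, prior_sd):
    """First and second derivatives of the log-posterior with respect to beta."""
    prior_precision = 1.0 / prior_sd ** 2
    grad = -(beta - prior_mean) * prior_precision   # Gaussian prior term
    curvature = -prior_precision
    for offset, ordered in observations:            # Bernoulli likelihood terms
        p = sigmoid(offset + beta)
        grad += ordered - p
        curvature -= p * (1.0 - p)
    return grad, curvature

def map_and_laplace(observations, prior_mean=0.0, prior_sd=1.0, max_iter=100):
    """Return (MAP estimate, Laplace-approximation standard deviation).

    observations: list of (offset, ordered) pairs, where offset is the assumed
    known intercept + position effect and ordered is 1 or 0.
    """
    beta = prior_mean
    for _ in range(max_iter):
        grad, curvature = grad_and_curvature(beta, observations, prior_mean, prior_sd)
        step = grad / curvature
        beta -= step                                # Newton update
        if abs(step) < 1e-10:
            break
    # Laplace approximation: variance = -1 / (second derivative at the MAP).
    _, curvature = grad_and_curvature(beta, observations, prior_mean, prior_sd)
    return beta, math.sqrt(-1.0 / curvature)

# Made-up example: 200 impressions at a position with offset -4.0, 8 orders.
example = [(-4.0, 1)] * 8 + [(-4.0, 0)] * 192
estimate, sd = map_and_laplace(example)
print(f"product effect ~ Normal({estimate:.3f}, sd={sd:.3f})")
```

The full model estimates many coefficients jointly, so the curvature becomes a Hessian matrix, but the idea is the same: the posterior mode supplies the mean and the curvature supplies the spread.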

2. Making the most of big data, spread thinly

Given sufficient data, we could just use the logistic regression model without further changes. Wayfair handled more than 9 million orders last quarter alone, which initially might sound like more than enough. However, those orders were spread out among millions of products, yielding just a few orders per product at most. Small integers like these can be extremely noisy, so we always have to worry that one product simply seems better than another because of random chance. For example, it is hard to tell if a product that happened to attract three orders is actually any better than one that happened to attract two, or if it just got lucky.

Our new Bayesian algorithm takes two complementary approaches to this problem: (a) empirical Bayesian regularization and (b) incorporating other signals besides orders.

A. Empirical Bayesian regularization

If we simply ranked products by their historical order rates, we would immediately run into trouble. Imagine a shower curtain that has an order rate of 100% after being shown to exactly one customer. If we took the “100%” number seriously, we would end up pushing this product to the top of the first page, leapfrogging past thousands of shower curtains with a long track record of enticing our customers. Clearly, we need some way of anchoring our estimates so that they do not fly off toward 0% or 100%.

In our Bayesian approach, we use our prior information about the variation among products in our catalog to ensure that our estimates fall within a reasonable range. For example, our historical data might suggest that a very good product’s order rate might be ten times higher than average, but that a factor of ten thousand would be unreasonable. When we take this prior information into account, we can generate a posterior distribution that gives us a reasonable answer, even in the case of noisy data implying 100% order rates.

In the hypothetical illustration below, the purple curve represents the range of order rates seen across the catalog (our prior), and the green curve represents the range of plausible order rates for the shower curtain discussed above. Despite the noisy data, we can still produce a reasonable range of possible estimates for the shower curtain. The fact that it was ordered the very first time it was shown to a customer is enough to shift the distribution to the right, but not enough for us to expect anything like a 100% order rate going forward.

Fig. 3: An illustration showing how a prior distribution like ours (purple) can be updated in response to new information on a product’s performance.
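The shrinkage effect can be reproduced with a few lines of arithmetic. The sketch below swaps in a conjugate Beta prior on the order rate, which is simpler than the Gaussian priors on logistic-regression coefficients described in this post but shows the same behavior. The prior parameters (a Beta(2, 198) prior with a mean of 1%) and the example counts are made up for illustration.

```python
def posterior_order_rate(orders, impressions, prior_a=2.0, prior_b=198.0):
    """Posterior mean order rate under a Beta(prior_a, prior_b) prior.

    The prior acts like prior_a pseudo-orders and prior_b pseudo-non-orders
    added to the observed counts.
    """
    post_a = prior_a + orders
    post_b = prior_b + (impressions - orders)
    return post_a / (post_a + post_b)

# A new shower curtain: one impression, one order (raw order rate of 100%).
new_product = posterior_order_rate(orders=1, impressions=1)

# An established product with a long track record (hypothetical counts).
established = posterior_order_rate(orders=300, impressions=10_000)

print(f"new product:  {new_product:.4f}")
print(f"established:  {established:.4f}")
```

With these assumed numbers, the single lucky order nudges the new product above the prior mean but leaves it well below the established product, which is the qualitative behavior the figure illustrates.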

B. Incorporating other signals

Our Bayesian approach reduces the noise in our estimates, but it doesn’t do anything to amplify the signals we get from customer behavior. Fortunately, products don’t just get “ordered” or “not-ordered”. They can also be clicked, added to cart, added to registry, saved for later, etc. These other behaviors provide a much richer source of information about customers’ product preferences, since they can be orders of magnitude more common than orders.

In order to use this additional information, we took our initial goal (to show products that potential customers want to order) and broke it down into steps. In order for a customer to order a product that we’ve shown them, the key steps are:

1. Clicking the product
2. Adding the clicked product to cart
3. Ordering the added product

If a product tends to get clicked and added to cart, then that is generally a good sign about its future order rate. Conversely, if customers never even click on a given product, then it is impossible for them to order it. Because each step depends on the previous one, we can quantify the contributions of each step to our overall goal mathematically using the chain rule of probability:
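Written out for the three steps listed above, the factorization takes the following form. Each conditioning event implies the earlier steps, because a product cannot be added to cart without being clicked, or ordered without being added to cart:

```latex
\[
P(\text{order} \mid \text{shown})
  = P(\text{click} \mid \text{shown})
    \times P(\text{add to cart} \mid \text{click})
    \times P(\text{order} \mid \text{add to cart})
\]
```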

This provides a very natural way for us to integrate customer behavior at all three stages: instead of fitting one big logistic regression model, we can fit three smaller ones and then multiply the results together. If a product hasn’t been on the site long enough for us to reliably estimate its overall order rate, we can still make an educated guess based on how often people add it to their carts. But with enough data, we can also change course if we find that a given product tends to be abandoned in people’s carts, and start showing customers something that they’re more likely to order.
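As a small illustration of multiplying the three stage models together, the sketch below assumes each stage's logistic regression has already produced a linear predictor on the logit scale for a given product and position. The function names and the example values are hypothetical.

```python
import math

def inv_logit(x):
    """Convert a logit-scale linear predictor into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def order_probability(click_logit, cart_logit, order_logit):
    """P(order | shown) as the product of the three conditional stages."""
    p_click = inv_logit(click_logit)   # P(click | shown)
    p_cart = inv_logit(cart_logit)     # P(add to cart | click)
    p_order = inv_logit(order_logit)   # P(order | add to cart)
    return p_click * p_cart * p_order

# Made-up logits for one product at one position.
print(order_probability(click_logit=-2.0, cart_logit=-1.5, order_logit=-0.5))
```

Because the click and add-to-cart stages see far more events than the order stage, their estimates firm up sooner, which is what makes the educated guess for a new product possible.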

3. Updating the model over time

So far, we have only talked about generating one estimate of each product’s general appeal, but e-commerce is dynamic and we’d like to generate new estimates as often as possible. Fortunately, incremental learning is an area where Bayesian methods excel. Instead of re-training on a big chunk of historical data each day, we can encode the historical information in our prior distribution. Then, we can update the prior in light of today’s data to form a new posterior distribution. Even though this updating step only consumes one day of data, the final result still contains the historical information from our prior, along with the fresh data from today.

But how exactly should we form our priors each day? We briefly considered the simplest approach: using the previous day’s posterior distribution as the new prior. This approach would have let us retain all of our information about products’ historical performance. But if we expect products’ performance to change over time, then we should not pay too much attention to the older data. For this reason, we chose not to use 100% of the information in our old posteriors to form our new priors; we use most of it, and intentionally “forget” the rest. Since our priors and posteriors are all Gaussian, we define “forgetting” in terms of an autoregressive process, slowly allowing each product’s posterior distribution to diffuse towards the original prior over all products.

Fig. 4: A sketch of our daily update loop. Each day, our logistic regression model combines our observations of customer behavior with our prior knowledge to produce a posterior distribution. This posterior distribution then informs the next day’s prior, but we intentionally “forget” a small portion of the information we have obtained, so that the model doesn’t become too fixated on past performance.
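One way to express this kind of forgetting for a single Gaussian coefficient is a first-order autoregressive step, sketched below. The post does not give the exact formula, so the update rule, the retention parameter `rho`, and the default values are assumptions. The step is chosen so that repeated forgetting with no new data drifts back to the catalog-wide prior, Normal(`base_mean`, `base_sd`).

```python
import math

def forget(post_mean, post_sd, base_mean=0.0, base_sd=1.0, rho=0.95):
    """Turn yesterday's posterior into today's prior via an AR(1) step.

    rho = 1 keeps all historical information; rho = 0 resets to the
    catalog-wide prior. Values in between pull the mean toward base_mean
    and widen the variance toward base_sd ** 2.
    """
    prior_mean = base_mean + rho * (post_mean - base_mean)
    prior_var = rho ** 2 * post_sd ** 2 + (1.0 - rho ** 2) * base_sd ** 2
    return prior_mean, math.sqrt(prior_var)

# A product we were fairly sure about yesterday (hypothetical numbers).
mean, sd = forget(post_mean=2.0, post_sd=0.2)
print(f"today's prior: Normal({mean:.3f}, sd={sd:.3f})")
```

Today's prior is slightly less extreme and slightly less certain than yesterday's posterior, so fresh data carries more weight than it would under a straight posterior-to-prior handoff.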

4. Ranking products in real time

Once we have trained our model, we can use it to make recommendations to our customers. Our main goal is to show appealing products near the top of each page, but we have some additional goals as well. In particular, it is important that we explore a range of different possible product rankings, so that we can gather more valuable data about product performance. This creates an interesting trade-off: on the one hand, we want to exploit information we have already acquired about product appeal; on the other hand, we want to explore and gather more information.

Exploration is important to us for at least two reasons:

If we show each product in a range of positions, then we are better equipped to estimate (and account for) the position-related effects discussed in Section 1.
If we show a range of different products, then we will have more opportunities to find great products that have not previously received much attention.

For our purposes, pure exploitation would mean that we always ranked products according to P(order|product shown); likewise, pure exploration would mean showing them in a different, completely random order for each customer. We chose a variant of Thompson sampling as our middle ground. Setting aside some engineering details, this is what happens when a customer browses wayfair.com:

First, we generate random samples from our products’ posterior distributions, yielding slightly-randomized estimates of each product’s appeal.
Then, we re-rank the products so that the products with the largest sampled values are shown on top.

This enables us to show each product in a range of different positions each day, while still making sure that the best products tend to be shown prominently for most customers.
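The two steps above can be sketched in a few lines. This simplified version assumes each product's appeal is summarized by a single Gaussian posterior given as a (mean, standard deviation) pair; the product names and numbers are hypothetical, and the engineering details mentioned earlier are ignored.

```python
import random

def thompson_rank(posteriors, rng=random):
    """Rank products by one random draw from each product's posterior.

    posteriors: dict mapping product id -> (mean, sd) of a Gaussian posterior.
    Returns product ids ordered from highest to lowest sampled appeal.
    """
    sampled = {pid: rng.gauss(mean, sd) for pid, (mean, sd) in posteriors.items()}
    return sorted(sampled, key=sampled.get, reverse=True)

# Hypothetical posteriors: a proven product, a mediocre one, and a new,
# highly uncertain one.
posteriors = {
    "proven_curtain": (1.0, 0.1),
    "mediocre_curtain": (0.0, 0.1),
    "new_curtain": (0.3, 1.0),
}
print(thompson_rank(posteriors))
```

Products with wide posteriors, such as new arrivals, occasionally draw a high sample and get shown near the top, which is how the system gathers data on them without giving up the usual ordering for most customers.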

The future of product ranking at Wayfair

Our new Bayesian ranking system helps us understand which products are most appealing to customers, and helps our customers find the perfect items for their homes. Even better, our system is designed to keep improving over time. As customers continue to peruse and purchase items from our catalog, our understanding of the performance of different products continues to improve. Our logistic regression model disentangles each product’s actual appeal from its visibility on our sites (i.e., accounting for sort-order biases, see Section 1). Combining empirical Bayesian regularization with multiple data sources on customer behavior allows us to make efficient use of new data (Section 2). Bayesian updating makes use of historical data, while still being flexible enough to adapt to new trends (Section 3). Finally, Thompson sampling explores a range of good rankings, to help us find the best products to show future customers (Section 4).

But we are never done! Given the huge success we have seen with this approach, we have started exploring how to commoditize the techniques described here so that they can be applied in a straightforward way to a variety of optimization and ranking problems we face at Wayfair. Additionally, we have been extending this approach to incorporate product features and non-linear interactions between customer features. Adding product features improves the estimation of product effects, especially for products with few observations, by generalizing performance across similar products. The non-linear interactions between customer features produce better estimates of product effects.

Feature Image by OpenClipart-Vectors