Improving Recommendations with the Wayfair Algorithm Simulator Program (WASP)

Out of a zillion options, customers want to find the one item perfect for them. The recommendations team makes that happen by leveraging advanced machine learning techniques and our immense datasets to provide ever-improving recommendations. A key step in this process is experimentation, i.e. the design, launch and analysis of A/B tests to understand which algorithm provides the best customer experience. This blog post describes a simulation tool that our Analytics team has built and now uses prior to every test launch. This tool has dramatically improved our test success rate and testing velocity, and has led to new insights into, and improvements to, our recommendations.

Introduction

At Wayfair we pride ourselves on having a zillion products. Indeed, our value proposition is that “We’ve got what you need, carrying the widest and deepest selection, so you can find the one thing that reflects you, your life, and the people you share it with.” But with unparalleled selection comes great responsibility. Indeed, if a customer wanted to browse our entire wall art catalog (>400k products), it would take them 24 hours! At Wayfair, we want people to find that one product they love in a matter of minutes, not hours.

Cue the Search and Recommendations team, whose job is to make that task easier. Our mission is to make sure that out of a zillion options, customers find the one perfect for them.

We achieve that mission by leveraging our huge datasets and cutting-edge machine learning algorithms to make great recommendations. You can find some examples of our work in our tech blog, such as our Bayesian ranking algorithm or our graph-based algorithm for some of our carousels.

A key element common to the development of these algorithms is experimentation. One needs to run careful A/B tests to measure the impact of a new recommendation algorithm and make sure that it leads to a better customer experience. This process is not without its challenges, including:

1. We are testing ML algorithms, which tend to be harder to interpret than simpler layout changes.
2. We test these algorithms within the constraints of a real business environment where we want to move fast, minimize losses, etc. This calls for slightly different approaches and best practices than, say, a randomized drug trial.

This blog post will focus on the first issue and explain how our Wayfair Algorithm Simulation Program (WASP) has helped us make sense of the output of these algorithms, increase our testing velocity and reduce the risk of “bad bets” dramatically.

Product cycle and the need for pre-test simulations

In a somewhat simplified view, the product cycle of a recommendation algorithm goes as follows.

1. Ideation: Somebody from the team has an idea to improve our recommendations and some data to back that up. For example, we have strong evidence that customers care about the price of the items we recommend. We call this idea “price aware recommendations”.
2. Model development: Our Data Scientists start working on a new model. In the fictional example above, our Data Scientists incorporate explicit price information in the model.
3. Model productionization: Our Data Scientists work with our engineers to productionize the model and make sure it works seamlessly in our production environment.
4. Experimentation: We A/B test the model. If it improves the customer’s experience, say by increasing conversion rate, we roll it out.

The steps above might lead somebody to think that experimentation comes last and is relatively easy (just a yes or no decision). Reality, as always, is a lot more complicated, and a lot can go wrong as we move from step 1 to step 4.

Just think about it: we go from an idea to a prototype of an algorithm to a fully productionized version of that algorithm that can work for millions of customers every day. That is a lot of code to write in several different languages (SQL, Python, Java), and it involves a number of different teams and people! On top of that, the final product is very hard to evaluate.

To understand this, think first about a traditional UI test, e.g. a test where we change the color of our add-to-cart button (from yellow to green). To check that the change is working as expected, one loads the site (on different browsers/devices) and checks that the button is now green. Now think of recommendations, as shown in Figure 1. One sees different products in control and variation. But are those products “better”? I personally really like the first sofa in the variation, but will customers agree with me? These types of questions have made our QA particularly challenging. In the past we would manually load a few pages and simply gut-check the recommendations we saw. This allowed us to spot egregious problems, such as showing only $20,000 sofas. But it was impossible to spot more nuanced issues. For example, recommending products that are 25% more expensive is enough to impact the customer’s experience, but it is hard to spot simply by looking at a few pages.

The challenges described above meant that quite a few times we would launch a test, run it for a few weeks, discover that the variation was underperforming, analyze our test data and only then find the problem. If only we could have known that before launching the test… Cue our Wayfair Algorithm Simulation Program (WASP), a tool designed by our Analytics team to solve this problem (and many more).

Figure 1: visual examples of products shown in variation and control.

For additional context, remember that testing time at Wayfair (or any other tech company) is a scarce resource that is in high demand. Testing an algorithm over a longer period of time could hold another team back from testing another idea, which is a huge problem. So our team decided to take on the challenge, with the prediction that if we could test twice as many ideas, we could get twice as many wins.

WASP

WASP is, conceptually, made up of two key components: a Simulator and an Analytics engine. These work as follows.

Simulator

The simulator performs two steps:

1. It creates a list of customers who are looking at specific pages. E.g. Julia is looking at Area Rugs while Samir is looking at wall art.
2. For each of these customers, it pings our recommendations API and gets the products that these customers would see in control (our current experience) and variation (the new algorithm we are about to test).
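
The two steps above can be sketched roughly as follows. This is a minimal illustration, not WASP's actual code: the names `SimulatedVisit`, `build_visits` and `simulate` are hypothetical, and the two `fake_*` functions are stand-ins for calls to the real recommendations API, whose interface is not described in this post.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: a recommender maps (customer_id, page) to an ordered list of SKUs.
Recommender = Callable[[str, str], List[str]]


@dataclass
class SimulatedVisit:
    customer_id: str
    page: str
    control: List[str]    # SKUs served by the current experience
    variation: List[str]  # SKUs served by the new algorithm


def build_visits(customers: List[str], pages: List[str], n: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Step 1: sample n (customer, page) pairs to simulate."""
    rng = random.Random(seed)
    return [(rng.choice(customers), rng.choice(pages)) for _ in range(n)]


def simulate(visits, get_control: Recommender, get_variation: Recommender) -> List[SimulatedVisit]:
    """Step 2: request both experiences for every simulated visit."""
    return [
        SimulatedVisit(c, p, get_control(c, p), get_variation(c, p))
        for c, p in visits
    ]


# Stand-ins for the recommendations API (assumption: not the real client).
def fake_control(customer_id: str, page: str) -> List[str]:
    return [f"{page}-sku-{i}" for i in range(1, 4)]


def fake_variation(customer_id: str, page: str) -> List[str]:
    return [f"{page}-sku-{i}" for i in range(3, 0, -1)]


visits = build_visits(["julia", "samir"], ["area-rugs", "wall-art"], n=1000)
results = simulate(visits, fake_control, fake_variation)
```

Passing the recommenders in as plain functions keeps the sketch independent of any particular service client, so the same loop could be pointed at a stub or at a real endpoint.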

The output of the Simulator is two lists of products (control and variation) for each customer. We usually simulate a large number of customers, from at least 1,000 to as many as 50,000. Just imagine doing this manually: it would be virtually impossible.

While this is the basic functionality of the Simulator, we can actually do a lot more, such as talking to other services at Wayfair and adding more data to our simulations. For example, we can ping our delivery time estimate service and add real-time data about when we think the recommended products can be delivered to our customers. Or we can do the same with our pricing service and get real-time pricing and profitability information.
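
As a rough illustration of this enrichment step, the sketch below attaches extra fields to each recommended SKU. Everything here is assumed for the example: the `enrich` helper is hypothetical, and the two dictionaries are made-up stand-ins for the delivery estimate and pricing services, which in practice would be called over the network.

```python
from typing import Callable, Dict, List


def enrich(skus: List[str], lookups: Dict[str, Callable[[str], object]]) -> List[dict]:
    """Attach one field per lookup to every recommended SKU, keeping grid order."""
    return [
        {"sku": sku, **{name: fn(sku) for name, fn in lookups.items()}}
        for sku in skus
    ]


# Stub data with made-up values (assumption: real services would be queried here).
PRICES = {"rug-1": 120.0, "rug-2": 340.0}
DELIVERY_DAYS = {"rug-1": 2, "rug-2": 9}

rows = enrich(
    ["rug-1", "rug-2"],
    {"price": PRICES.get, "delivery_days": DELIVERY_DAYS.get},
)
```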

Analytics Engine

The Analytics Engine takes the output of the Simulator and calculates aggregate statistics of interest. For example, we look at the average price of the products recommended to customers in control and variation. A fictional example is given in Table 1. In this case we are testing two new algorithms. One can quickly see that Variation 1 is recommending products that are much more expensive than control ($500 vs. $280), which is a bit of a red flag.

Table 1: example of WASP output for one test. Note how variation 1 recommends products that are significantly more expensive than control.
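
A per-arm aggregate like the one in Table 1 could be computed along these lines. This is a sketch under assumptions: the row layout (`arm`, `price`) and the function name `average_by_arm` are invented for illustration, and the prices are made up so that the averages echo the fictional figures in the text.

```python
from collections import defaultdict
from statistics import mean


def average_by_arm(rows, metric):
    """Average `metric` over all recommended products, per test arm."""
    values = defaultdict(list)
    for row in rows:
        values[row["arm"]].append(row[metric])
    return {arm: mean(v) for arm, v in values.items()}


# Made-up rows: one per recommended product shown to a simulated customer.
rows = [
    {"arm": "control", "price": 260.0},
    {"arm": "control", "price": 300.0},
    {"arm": "variation_1", "price": 450.0},
    {"arm": "variation_1", "price": 550.0},
]
summary = average_by_arm(rows, "price")
```

The same function works for any numeric field on the rows, for example review rating or estimated delivery days, if those have been added to the simulation output.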

An advantage of WASP is that it is a highly flexible tool that allows analysts to do custom data cuts to deep dive into any such issue. In this fictional example, an analyst's next step would be to run a “positional” analysis, which incidentally is one of the most insightful views we can have on our data. A positional analysis shows a specific KPI for each slot on our product grid, which we number from 1 (top left SKU on the page) to 48 (last SKU on the page), as illustrated in the top panel of Figure 2. This allows us to better understand how our recommendations behave on the page. An example is shown in the bottom panel of Figure 2. One can quickly spot what the problem is: Variation 1 is recommending very expensive products (> $2,000) at the end of the page, while the rest of the page looks fine. This is a very surprising finding that hints at some potential issues with the model implementation in Variation 1. Knowing about this issue and where it happens is incredibly helpful for our Data Scientists and Engineers, who can fix it before the new model goes live.
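
A positional analysis of this kind can be sketched as an average per grid slot. The example below is illustrative only: it uses made-up 4-slot pages instead of the 48-slot grid, the helper name `price_by_position` is hypothetical, and the "more than 2x control" cutoff is an arbitrary choice for the example rather than a rule we use.

```python
from collections import defaultdict
from statistics import mean


def price_by_position(pages):
    """Average price per grid slot (1-based) across simulated pages."""
    by_slot = defaultdict(list)
    for prices in pages:
        for slot, price in enumerate(prices, start=1):
            by_slot[slot].append(price)
    return {slot: mean(v) for slot, v in sorted(by_slot.items())}


# Made-up pages: each inner list holds product prices in grid order.
control_pages = [[100.0, 200.0, 300.0, 250.0], [120.0, 180.0, 280.0, 270.0]]
variation_pages = [[110.0, 190.0, 310.0, 2400.0], [130.0, 170.0, 290.0, 2600.0]]

control = price_by_position(control_pages)
variation = price_by_position(variation_pages)

# Slots where the variation's average price is more than 2x control's (arbitrary cutoff).
suspicious = [slot for slot in control if variation[slot] > 2 * control[slot]]
```

With this toy data only the last slot is flagged, which mirrors the pattern described above: the page looks healthy until the final positions.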

The ability to thoroughly simulate the results of our new recommendation algorithms allows us to find most issues before we launch an A/B test, and to do so in a matter of hours, which in turn increases our testing velocity significantly. Indeed, since the introduction of WASP, we have increased the number of tests we run per year by roughly 10x.

Figure 2: (top) explanation of how we number slots in our product grid for our positional analysis; (bottom) average product price for each position in the grid. It is apparent that, in this example, we have some very expensive products shown towards the bottom of the page, which is concerning.

Other Applications of WASP

While WASP started out as a QA tool, with the goal of ensuring that we are testing promising new algorithms, it quickly evolved into something more. We now also use it to do Exploratory Data Analysis and answer “what if” scenarios. One great example is shipping speed: WASP enabled us to validate that our models implicitly prioritized items that shipped quickly, which let us remove manual overrides. This improvement led to a higher conversion rate and made our pages load faster (because the overrides introduced some latency). And all of this only required changing a couple of lines of code. Talk about low-hanging fruit!

This use of WASP is something that our team now does on a regular basis. Indeed, at the time of writing, we are discussing how to use WASP to tweak another parameter in one of our models.

Some Additional Details and Context

The first thing we want to point out is that Wayfair’s microservice architecture made WASP possible (see my simplified description of our recommendation service in a previous post). This allowed us to pass information and request the recommendations that would be served to a real customer. This is also an example of why cross-functional collaboration is a key part of our process and enables success at Wayfair. In order to build WASP, we had to work closely with our engineers and spend several hours at our desks, going over the nitty-gritty of our server architecture and the information needed in the payloads to request recommendations.

Second, we want to clarify the difference between WASP and the off-line evaluation that our Data Science team does for their algorithms. One is not a replacement for the other! Our Data Science team thoroughly evaluates their algorithms using off-line KPIs such as nDCG or prediction accuracy (watch this video for more info). They also look at some KPIs, such as price, similar to what we described above. However, their evaluation is usually done off-line and with a prototype of the algorithm. Furthermore, the products you see on our page do not come from a single algorithm but are the combination of several (~10!) algorithms, so one wants to know how the new algorithm will interact with the rest. This is where WASP really shines, as it takes a very customer-centric view, by looking at the products that a customer would see on a page (the final output) and by focusing on customer-centric metrics, such as product price, review ratings, etc.

Another interesting point of discussion is what good looks like in WASP. While the fictional example we shared above was very clear-cut (we were recommending exceedingly expensive products at the end of the page), reality is sometimes more ambiguous. For example, if the products recommended by the new algorithm have slightly lower review ratings, is that OK? What threshold should we use to define good/bad? The answer here is imperfect. While we have developed an intuition of what is healthy, there are plenty of gray areas. Our approach in these cases is to test (we are OK with taking risks) but to monitor these tests closely, since we know they are risky. There is a lot of power in knowing which tests are risky and acting swiftly if things don’t look good.
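
One way to make "which tests are risky" explicit is to compare each metric against a tolerance and flag the ones that move too much. The sketch below is purely illustrative: the metric names, the numbers and the tolerances are all made up, and `flag_risky_metrics` is a hypothetical helper, not a description of thresholds we actually use.

```python
def flag_risky_metrics(control, variation, tolerances):
    """Return metrics whose relative change vs. control exceeds its tolerance."""
    flagged = {}
    for metric, tol in tolerances.items():
        change = (variation[metric] - control[metric]) / control[metric]
        if abs(change) > tol:
            flagged[metric] = change
    return flagged


# Made-up aggregates and tolerances (assumptions for the example only).
control = {"avg_price": 280.0, "avg_rating": 4.5}
variation = {"avg_price": 290.0, "avg_rating": 4.0}
tolerances = {"avg_price": 0.10, "avg_rating": 0.05}

flagged = flag_risky_metrics(control, variation, tolerances)
```

Here the price moves by under 4% and passes, while the rating drops by about 11% and is flagged. A flag in this sense would not block a launch; it would mark the test as one to monitor closely.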

Conclusions

A/B testing recommendation algorithms comes with a variety of challenges that our Analytics team needs to creatively solve on a daily basis. In this post, we described how we came to build WASP, a tool that has made our pre-test QA a lot easier, faster and more effective. The results have surpassed our expectations. WASP is now a key step in any A/B test that we launch (as well as for cases where we just roll out new changes). It has allowed us to dramatically increase our testing velocity by catching most issues before an A/B test launches, and has therefore allowed our recommendations team to make more frequent and more successful improvements to our algorithms. It has also become a vital EDA tool to better understand our recommendations and come up with new ideas for how to improve them.

Acknowledgements

Wayfair is a highly collaborative environment, and a number of teams contributed to WASP in its current state, including Analytics, Data Science, Engineering and Product.