# Introduction

At Wayfair, we make millions of daily customer-level marketing and advertising decisions aimed to drive customer acquisition, conversion, and long-term loyalty, all while enhancing our customer experience and Wayfair’s top and bottom line. These decisions include: whom to show, what message/content to show, where to show this content, and how much to pay, etc. Since it would be infeasible for humans to make these individual decisions effectively, we’ve designed and developed a scalable algorithmic decision-making platform called
WayLift for paid media marketing. In this blog post, we will provide an overview of the theoretical foundation of WayLift — the contextual bandit algorithms for optimizing marketing decisions.

# Background

A contextual bandit chooses an action from a set of candidate actions based on a learned policy for a specific context, such that in the long run we strike a balance between exploitation and exploration and maximize cumulative rewards. The contextual bandit also updates its policy based on the feedback it receives as it interacts with the environment.

There are many different contextual bandit methods available, and the specific methods we use are provided in the online active learning library called
Vowpal Wabbit. Vowpal Wabbit, or VW in short, is an open-source project initially developed by researchers at Yahoo! and Microsoft and it provides a blazingly fast and scalable solution for contextual bandit learning. The main feature that sets it apart from other contextual bandit implementations is that it uses reduction techniques to reduce the complex problem of contextual bandit learning into a set of more traditional machine learning problems that we readily have solutions for^{[2]}. The key reduction method in VW’s bandit implementation allows us to recover an accurate causal estimate of how good an action is if we were to play it. The immediate benefit is that we can use production data to update the model in an unbiased fashion. We will dive deep into this in the “Reward Estimation” section. VW also provides a large set of exploration algorithms for us to constantly explore and respond to changes in the environment; these algorithms are discussed in more detail in the “Exploration Algorithms” section.

Although the contextual bandit model can work with a large set of candidate actions, for simplicity, we will use the motivating example of deciding who to show ads to throughout this post. In this case, we have only two actions to choose from: to show ads to a given customer or to not show.

# Key Concepts

Before we discuss how a contextual bandit works, it would be helpful to go over the key concepts in contextual bandit learning:

- Context/Features/x: The context features, which are the information about the state of the environment, and/or the action-dependent features, which describe attributes of the treatments themselves.
- Action/Treatment/Arm/a: The bandit algorithm chooses an action from a set of feasible actions for each unit of engagement.
- Probability/Propensity/p: The contextual bandit stochastically chooses an action according to a probability mass function, and actions with higher rewards are more likely to be chosen. The probability/propensity usually refers to the probability associated with the chosen action.
- Reward/Outcome/r: These two terms sometimes are used interchangeably but have slightly different meanings. The outcome is what we directly observe. The reward can be one particular outcome but also can be a derivative of the outcome/set of outcomes, for example, a weighted sum of a list of outcomes. Sometimes, we have delayed rewards that are not immediately observable; for example, a customer might make a purchase days after seeing an ad. In this case, one can build a model to forecast the delayed rewards and use the predicted reward as the bandit learning label. We’ll cover Demeter, our in-house delayed reward forecasting engine, in a future blog post.
- Policy/𝛑: The policy is a function that takes in a context and outputs an action. It can evolve and improve throughout bandit learning.

# Overview of Process

Below are the steps in the learning process:

- Observe context, x
_{t}, at time t - Select action, a
_{t}∈ {1, . . . , K} , with the help from a policy/policies, 𝛑_{t}, where 𝛑_{t}(x_{t}) = a- Estimate the reward of each action conditional on the context
- Generate a probability mass function (pmf) over the actions based on their estimated rewards
- Randomly choose the action according to probability, p
_{a}

- Observe the reward, r
_{t}(a_{t}) (alternatively, can be denoted as loss l_{t}(a_{t})= -r_{t}(a_{t})) - Update the learner’s policy/policies based on the quadruple (x
_{t}, a_{t}, p_{a_t}, r_{t}) to obtain 𝛑_{t+1} - Repeat steps 1-4

To put the process above in simple terms, let’s use the problem of determining who to show ads to as an example and see how a single learning episode runs (1) we observe the context, for example, the user feature; (2) we can either target the customer or not target her. The contextual bandit model will estimate the rewards for each option and output a list of probabilities that sum up to 1, with higher probabilities for the action that is expected to generate more profit; (3) we decided to target the customer after a random selection based on the probabilities; (4) we observe the reward, and in this case, the user converts on Wayfair and generates profit; (5) we update the contextual bandit’s policy to complete one episode of learning.

# Reward Estimation

Now that we have gone through the high-level steps in the contextual bandit learning process, let’s take a closer look at how we model the reward of each action and obtain the reward estimates in step 2. Then we will discuss in more detail how the underlying reward estimators are updated in step 5.

The reward estimators take in the context vector and the action as inputs and output the reward estimation. We will introduce two linear formulations of the reward model. The first approach builds one linear regressor per action: *f(x,a) = θ _{a}*

^{T}

*x*, and we will fit one set of 𝜽

_{a}per action a. The second approach fits one model for the combined context and action-dependent feature vector x

_{a}:

*f(x,a)=𝜽*

^{T}

*x*. We typically use the first formulation when the set of actions is fixed and the number of actions is small. The second formulation is more suitable for the scenario where we have a changing set of actions characterized by action-dependent features and the action space is large. With these two reward formulations, we can easily reduce the problem of contextual bandit learning to the problem of fitting one or more linear regression models. When we need to make a decision given the context, we can input context and each of the actions one at a time into the model, and we can choose the action to play based on predicted rewards corresponding to each action.

_{a}Now that we have the reward model, the question then becomes how do we estimate the 𝜽’s using either historically collected data or new data that come in during production? One natural intuitive solution would be to fit the regression model to minimize the MSE between the estimated rewards and observed true rewards.

However, there’s a catch! Correlation does not imply causation. If we don’t account for how likely a particular type of customer will be exposed to a certain type of action in the reward estimation, the predicted rewards themselves will be biased. We will use a simple example (borrowed from Susan Athey’s presentation on Counterfactual Inference’s
__slide 74__) below to illustrate the problem and then discuss the solution to recover causality.

Imagine a hypothetical scenario where we have at our disposal two actions: blue and red, and the context is a single scalar value x. The reward of choosing a particular action at a different level of x is determined by the two data-generating models, denoted by the red and blue curves. We can’t observe the data-generating models but we can observe the red and blue dots which represent the manifested samples. The goal is to fit two models to approximate the true data-generating models.

Now, if we deploy a reinforcement learning solution using reward estimators trained with the objective of the naive MSE minimization, then we will run into the problem of biasing the estimator. Let me unpack it a little bit. Suppose we train two models which are represented by the two dashed lines below. We will learn over time that the blue action is better for low values of x, and the red action is better for high values of x, and we will play blue when x is low and red when x is high. Consequently, we can see the training data become biased: we have many blue dots in the low x region and many red dots in the high x region. When we fit the red model, it will essentially be fitted to reduce the MSE for the cluster of red dots in the high x region and ignore those dots in the low x region. Similarly, the blue model will suffer the same problem. If we then use the learned models to draw a decision boundary based on the x value above which we choose the red action and below the blue action, the learned decision boundary will be to the left of the optimal decision boundary (not visible to us in real life) and we will erroneously decide to treat customers in the region between the two boundaries with the red action (the red dashed line is above the blue dashed line), while in fact, the blue action has the higher reward (the blue solid curve is above the red solid curve).

One reduction method to debias the estimation is called Importance Weighted Regression (IWR)^{[2]}. It modifies the traditional regression loss of MSE by adding the inverse propensity weight term. This weight will increase the influence of the data points with a small probability of choosing the logged action:

One can also use another reduction method called cost-sensitive classification (CSC)^{[2]}. The intuition is to reduce the problem to the task of multi-class classification. The classes represent the actions that can be chosen. We train the multi-class classifier by associating a different cost value for each action for a given sample. The cost for each action can be estimated using the Inverse Propensity Scoring estimator (IPS)^{[7]} or the Doubly Robust (DR) estimator^{[4]}. To keep the post short, we won’t discuss the details of these two types of estimators.

# Exploration Algorithms

Once we have the reward estimates for each action, the contextual bandit exploration algorithm assigns different probabilities to each action according to its reward estimate and randomly (but not uniformly) chooses an action to play based on the probability mass function over the actions.

There are many different exploration algorithms available. One simple approach is to choose the action with the highest reward estimate 1- ε of the time and for the remaining ε of the time, choose an action uniformly at random^{[8]}. This is known as the epsilon-greedy algorithm (example with ε = 0.1 below).

There is one inherent drawback with the Epsilon-greedy algorithm: it assigns the same probability to all the actions that don’t have the highest reward estimate; this will lead to inefficiencies caused by exploration on suboptimal actions. In the example above, our reward estimator believes not targeting the customer will generate 0.01 dollars less profit than targeting the customer. The two outcomes are virtually indistinguishable and not targeting can easily contribute to higher profit in reality. But epsilon-greedy will by design assign a much lower probability for the not targeting action.

Slightly more advanced algorithms like the Softmax^{[3]} or SquareCB^{[6]} explorers overcome this problem by assigning probabilities proportional to the actions’ reward estimates, so actions with higher rewards are more likely to be played. They also have a parameter to tune the level of exploration. By setting the parameter to a large number, the algorithm will start to always pick the action with the highest reward estimate. Alternatively, by setting the parameter close to zero, the algorithm will pick an action uniformly at random (example below).

There are also exploration algorithms that rely on an ensemble of policies to collectively assign treatment probabilities. They include the Bootstrap Thompson Sampling algorithm (Bagging)^{[1]}, and the Cover algorithm^{[1]}. Besides, another class of algorithms represented by RegCB^{[5]} assigns probabilities based on the lower confidence bound of each action. For the sake of brevity, we won’t discuss them in-depth here.

# Summary

In this post, we provided a high-level overview of the contextual bandit algorithms that power our WayLift marketing decision platform. The contextual bandit model consists of a reward estimator and an exploration algorithm. In the “Reward Estimation” section, we went over how the reward estimator estimates the expected reward through various reduction techniques. Then in the “Exploration Algorithms” section, we introduced a few exploration algorithms that probabilistically assign a treatment so that the probability of choosing a treatment is proportional to its estimated reward. In a future blog post, we will describe the architecture of our production reinforcement learning system and how we handle edge cases. Stay tuned!

## References

[1] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. International Conference on Machine Learning, PMLR, 1638-1646.

[2] Alberto Bietti, Alekh Agarwal, and John Langford. 2018. A contextual bandit bake-off. arXiv:1802.04064.

[3] David Cortes. Adapting multi-armed bandits policies to contextual bandits scenarios. 2018. arXiv:1811.04383.

[4] Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly robust policy evaluation and learning. arXiv:1103.4601.

[5] Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert Schapire. 2018. Practical contextual bandits with regression oracles. International Conference on Machine Learning, PMLR, 1539-1548.

[6] Dylan Foster, and Alexander Rakhlin. 2020. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning. PMLR.

[7] Daniel G. Horvitz, and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. 1952. Journal of the American Statistical Association, 47.260, 663-685.

[8] John Langford, and Tong Zhang. 2007. Epoch-Greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems (NIPS 2007) 20, 1.