By Sarah Cotterill
It’s critically important for Wayfair to understand the causal impact of various customer actions, like downloading our mobile app, subscribing or unsubscribing from our email list, or purchasing in different product classes, on customers’ future spend. These estimates not only inform high-level strategic business decisions, such as what the value proposition of various offerings should be; they also inform our marketing efforts at the customer level. For example, when deciding whether to speak to customers about our in-home Assembly, extended Warranty, or Wedding Registry offerings, we’d like to consider both how relevant these offerings are to the customer and the longer-term incremental value associated with successfully cross-selling them.
One of the challenges in estimating the longer-term impact, or halo effects, of such actions is that we often cannot conduct randomized experiments (i.e. we cannot randomly assign some customers to purchase in-home assembly services, and withhold those services from others). Instead, we have to employ statistical techniques to try to recover causal estimates from observational data.
The second big challenge is that these causal estimates should be produced in a highly consistent and scalable way, even as the portfolio of events for which we'd like to measure halo effects grows to 200+. In addition, we’d like to produce updated estimates on a frequent basis with little or no dedicated analytic resources for each successive refresh. In this post, we discuss these challenges in more detail, as well as the output of our efforts, our Halo Effects Platform, which offers a number of benefits:
- Automation, efficiency, and standardization: The platform we’ve built automatically generates halo effect estimates for a portfolio of key customer “events” on a monthly basis. It also standardizes the approach to measurement, so that halo effects across events are robust & comparable.
- Better Insight: Because the platform generates estimates each month, we are able to gain long-term insight into how the halo effect of each event changes with shifts in the business and marketing strategy, the underlying customer base, etc.
- Scalability: Because the platform we’ve built is modular and generalizable, it can be easily extended to accommodate additional use cases and marketing events where self-selection complicates causal inference (e.g., customer service calls, purchases in product class, etc.).
Why not run an experiment?
As noted above, these are causal questions: we are hoping to understand, for example, how much more the in-home assembly service causes customers to spend over time. The gold standard for answering causal questions is the randomized experiment, where customers are randomly assigned to different treatments and one measures differences on a KPI of interest across treatment groups. The genius of randomization is that, with a sufficient number of customers, it ensures the only factor differing on average across treatment and control groups is the treatment itself. In other words, randomization balances all other covariates (measured and unmeasured) across groups, allowing us to isolate the causal effect of the treatment on the post-treatment KPI. If we do observe a difference on our outcome metric, we can then be more confident it’s due to the causal effect of the treatment.
Unfortunately, as mentioned in the overview, in many situations it is difficult or impossible to run a randomized experiment. We can’t, for obvious reasons, randomly assign new customers to buy a baby crib vs. an area rug as a first purchase (nor would giving cribs or area rugs away for free work, both because of practical constraints and because receiving an item as a windfall is psychologically different from freely deciding to purchase it). It’s also the case that people self-select into these crib purchases in non-random ways. For example, customers purchasing cribs might be more likely on average than customers purchasing area rugs to be expanding their families and to need more nursery furniture in the near future. It then becomes difficult to say whether any lift in revenue we observe post crib purchase is due to the effect of the crib, per se, or to the pre-existing differences in life stage across the groups of customers that purchase cribs vs. area rugs.
We therefore have to work to recover causal estimates from observational (non-randomized) historical data. Many approaches involve computing a set of features and then matching like-for-like customers, some of whom happen to opt into an event and some of whom don't. However, there is substantial variability in 1) the selection of features used for matching, and 2) how the matching is actually done.
What do we match on?
Which set of variables should you match on? Should you, for example, find customers with equivalent likelihood of being treated (i.e., match on features predictive of treatment)? Should you find customers with equivalent pre-treatment propensity to make purchases (i.e., match on features predictive of outcome)? Here we leveraged recent work on causal inference suggesting that matching on features strongly predictive of the treatment and weakly predictive of the outcome can actually enhance bias (see also here and here). Intuitively, this is because by reducing to zero the amount of variance these features can explain in the treatment, 1) we reduce bias with respect to these variables, but 2) we force the variance that remains in the treatment variable to be explained entirely by the remaining, and perhaps unobserved, confounders.
The recommendation from this work is therefore to match on features strongly predictive of the outcome, i.e., of long-term gross revenue. In this case, we generated ~300 “lifetime” features that capture the bulk of customers’ interactions with Wayfair (e.g., from the first time they arrived on site until the date at which the features are generated). These features span three broad interaction types: orders, visits, and views. From these features, we identify a subset most predictive of future 12-month gross revenue. These are the features we use for matching: they ensure that customers who take an action (“positives”) and those matched “negatives” (customers who don’t) are roughly equivalent, prior to the date on which the event of interest occurred, with respect to factors predictive of the metric we hope to measure in the long term.
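The feature-selection step can be sketched as follows. This is a toy stand-in that ranks features by absolute Pearson correlation with the future-revenue outcome; the post doesn’t specify the predictive model the team actually uses, so the helper names and data here are purely illustrative:

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_matching_features(features, outcome, k=2):
    """Keep the k features most strongly associated with the outcome
    (future 12-month revenue); these become the matching covariates."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], outcome)),
                    reverse=True)
    return ranked[:k]

# Toy data: 'prior_orders' tracks revenue closely, 'random_noise' does not.
random.seed(0)
outcome = [10, 20, 30, 40, 50, 60]
features = {
    "prior_orders": [1, 2, 3, 4, 5, 6],             # strong outcome predictor
    "random_noise": [random.random() for _ in range(6)],
}
print(select_matching_features(features, outcome, k=1))  # ['prior_orders']
```

In practice one would rank the ~300 lifetime features with a proper predictive model and holdout validation rather than raw correlation, but the selection logic is the same: keep what predicts the outcome.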
It’s not possible to match exactly on all features, as we’d run into the curse of dimensionality: finding exact matches across our positives and negatives becomes increasingly difficult, if not impossible, as the number of features used in matching grows. Researchers have proposed a number of matching algorithms meant to overcome the curse of dimensionality; however, they vary in their efficiency, power, and ability to reduce imbalance across groups. For example, one popular approach is propensity score matching, where you first regress the treatment on your set of covariates and then compute a predicted score for each customer, corresponding to the “propensity to treat.” Propensity score matching then identifies customers with equivalent propensity scores across the positive and negative customer pools. There are, however, a number of known issues with propensity score matching, including the fact that it approximates the completely randomized experiment, where covariates are balanced on average across groups, rather than a fully blocked experiment, where covariates are exactly equivalent across groups. In this way, propensity score matching is relatively inefficient; by collapsing information along many dimensions to a single dimension, it throws away quite a bit of information that could be used to further reduce imbalance across groups.
We instead employ coarsened exact matching (CEM), a simple yet powerful method that approximates the fully blocked experiment. In essence, CEM creates bins for each feature and then matches exactly on those bins. For example, we might create a feature for all previous orders, and then “coarsen” the space by binning customers into 0 previous orders, 1-2 previous orders, 3-4 previous orders, etc. Likewise, we could create a feature for previous page views and again create bins (0 page views, 1-5 page views, 6-10 page views, etc.). The combinations of these bins are called “strata.” For example, customers with 1-2 previous orders and 6-10 previous page views form one stratum. We then prune strata where customers are exclusively either positive or negative, leaving only strata in which negatives are matched (across all covariates) with positives. For more information on why CEM is particularly powerful, and why it’s more efficient at reducing imbalance than other matching approaches like propensity score matching, see this paper.
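The coarsen-then-prune logic of CEM can be captured in a few lines. The sketch below is a minimal pure-Python version, assuming hand-chosen cutpoints; production implementations (e.g., the `cem` R package) also automate bin selection:

```python
import bisect
from collections import defaultdict

def coarsen(value, cutpoints):
    """Map a raw value to a bin index given ascending cutpoints.
    With bisect_left, a value equal to a cutpoint falls in the lower bin,
    so cutpoints [0, 2] give bins {<=0}, {1-2}, {3+}."""
    return bisect.bisect_left(cutpoints, value)

def cem_match(customers, bins):
    """Coarsened exact matching: bucket each customer's features into bins,
    form strata from the bin combinations, and keep only strata containing
    both treated ('positive') and untreated ('negative') customers."""
    strata = defaultdict(list)
    for c in customers:
        key = tuple(coarsen(c[f], cuts) for f, cuts in bins.items())
        strata[key].append(c)
    matched = []
    for members in strata.values():
        treated = [c for c in members if c["treated"]]
        control = [c for c in members if not c["treated"]]
        if treated and control:          # prune one-sided strata
            matched.extend(members)
    return matched

# Toy coarsening: orders binned 0 / 1-2 / 3+, page views binned 0-5 / 6+.
bins = {"orders": [0, 2], "views": [5]}
customers = [
    {"id": 1, "orders": 1, "views": 7, "treated": True},
    {"id": 2, "orders": 2, "views": 6, "treated": False},  # same stratum as id 1
    {"id": 3, "orders": 9, "views": 0, "treated": True},   # no control match: pruned
]
print(sorted(c["id"] for c in cem_match(customers, bins)))  # [1, 2]
```

Customer 3 sits in a stratum with no negatives, so the whole stratum is dropped; the estimate is then computed only within strata where like-for-like comparison is possible.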
Halo effect engineering pipeline
The first step in the engineering pipeline is to find historical examples of positive and negative events. Here it is important to think through the choice of the appropriate counterfactual pool of negatives: customers who are eligible to be matched with the positives. Matching finds customers who have equivalent histories with Wayfair (equivalent numbers of page views, orders, etc.); however, it’s important to also identify customers at similar points in their consideration cycle. This is especially true because our products have relatively long interpurchase times (customers don’t, for example, shop for a new couch every week). Systematically selecting customers for our negative pool who have lower time-varying intent would bias halo effect estimates upward, as we would be comparing our positive examples to negatives who have similar long-term histories at Wayfair but happen to be at a point in their short-term consideration cycle where they aren’t in-market. We are therefore judicious in how we define our counterfactual pool. For example, to estimate the halo effect of complementary service purchases, we pull all orders in the same timeframe as our positives’ service purchases, but from customers who didn’t purchase the service. Likewise, for estimating the halo effect of an email acquisition (e.g., a customer giving us their email address) during a visit, we take customers who had a visit in the same time period but weren’t acquired.
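The counterfactual-pool rule for the service-purchase example can be sketched as a simple filter. The function and record layout below are hypothetical; in production this would be a query over order tables rather than in-memory lists:

```python
from datetime import date

def counterfactual_pool(orders, service_buyers, window):
    """Negatives: customers who placed an order in the same window as the
    positives' service purchases but never bought the service themselves.
    Ordering in the same window proxies for being similarly in-market."""
    start, end = window
    return sorted({o["customer"] for o in orders
                   if start <= o["date"] <= end
                   and o["customer"] not in service_buyers})

orders = [
    {"customer": "a", "date": date(2020, 3, 2)},
    {"customer": "b", "date": date(2020, 3, 9)},
    {"customer": "c", "date": date(2020, 5, 1)},   # outside the window: excluded
]
positives = {"a"}  # bought the complementary service
print(counterfactual_pool(orders, positives, (date(2020, 3, 1), date(2020, 3, 31))))
# ['b']
```

Restricting negatives to customers transacting in the same window is what keeps short-term intent comparable across the two pools before matching even begins.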
Once we pull our positive and negative pools for each event, we generate for each customer in the pools the features used for matching. To avoid post-treatment bias, we consider information up to the date of the event, but not after, in computing these features. In other words, we find customers who are approximately equivalent, prior to the event occurring. We can then be more confident that any differences we observe post-treatment are due to the treatment itself.
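The as-of-date cutoff that prevents post-treatment bias amounts to filtering the activity log before computing features. A minimal sketch with hypothetical event records and feature names:

```python
from datetime import date

def features_asof(events, event_date):
    """Build matching features from activity strictly before the event date,
    so no post-treatment information leaks into the match."""
    prior = [e for e in events if e["date"] < event_date]
    return {
        "orders": sum(e["type"] == "order" for e in prior),
        "views": sum(e["type"] == "view" for e in prior),
    }

events = [
    {"type": "order", "date": date(2020, 1, 5)},
    {"type": "view",  "date": date(2020, 2, 1)},
    {"type": "order", "date": date(2020, 6, 1)},  # after the event: excluded
]
print(features_asof(events, date(2020, 3, 15)))  # {'orders': 1, 'views': 1}
```

The June order never enters the feature set, so it cannot make the customer look artificially similar to (or different from) their match on post-event behavior.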
We perform coarsened exact matching for each event to identify our matched positive and negative customers, and then estimate the average treatment effect within these matched samples. Finally, the platform writes the results to a Hive table.
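The within-matched-sample estimation step can be sketched as follows: compute the treated-minus-control mean difference within each stratum, then aggregate across strata weighted by the number of treated customers. This treated-weighted aggregation is a standard CEM-style estimator of the effect on the treated; the post doesn’t spell out the exact estimator used, so treat this as illustrative:

```python
from collections import defaultdict
from statistics import fmean

def att_from_strata(matched):
    """Average treatment effect on the treated across matched strata:
    per-stratum treated-minus-control mean revenue difference, weighted
    by the number of treated customers in the stratum."""
    strata = defaultdict(lambda: {"t": [], "c": []})
    for cust in matched:
        bucket = "t" if cust["treated"] else "c"
        strata[cust["stratum"]][bucket].append(cust["revenue"])
    num, den = 0.0, 0
    for s in strata.values():
        if s["t"] and s["c"]:  # only strata with both sides contribute
            num += (fmean(s["t"]) - fmean(s["c"])) * len(s["t"])
            den += len(s["t"])
    return num / den if den else None

matched = [
    {"stratum": "A", "treated": True,  "revenue": 120.0},
    {"stratum": "A", "treated": False, "revenue": 100.0},
    {"stratum": "B", "treated": True,  "revenue": 80.0},
    {"stratum": "B", "treated": True,  "revenue": 60.0},
    {"stratum": "B", "treated": False, "revenue": 50.0},
]
print(att_from_strata(matched))  # 20.0
```

Both strata show a $20 lift here, so the weighted average is $20 regardless of weights; with heterogeneous strata, the weighting determines whose effect the final number represents.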
How do we know if we are approximating the underlying true causal effect of an event? Over how long a time period can we reasonably and accurately estimate halo effects? 30 days post event? 90 days? As interpurchase cycles are long and it can take some time for revenue to land, we’d ideally observe revenue spent by the matched groups over a relatively long time frame to get as close to the “true” long-term halo effect of an event as possible. However, the further out one goes, the more likely it is that the matched groups drift apart for reasons unrelated to the event, making it harder to isolate the effect of the event per se on gross revenue. Being able to identify the longest window at which we can confidently produce estimates is therefore quite important.
To answer these questions, we use two approaches: A/A testing and A/B backtesting.
The logic of A/A testing is to match two groups of customers that should be the same in the absence of a treatment, for example by matching in situations where customers opted into the same event, or by matching and looking at gross revenue across groups prior to one group later self-selecting into an event. We then observe whether there are differences in gross revenue across the matched groups at different intervals from the date of matching (e.g., 30 days post-matching, 90 days post-matching, 180 days post-matching, etc.). This is helpful because if we see no significant differences across groups in the absence of a treatment 90 days out, it suggests any differences we do observe in the presence of a treatment/acquisition 90 days out are due to the treatment itself, and not to the quality of the matches degrading over time. If, conversely, we observe no significant differences 90 days out but significant differences 180 days out, then we know the maximum post-event observation window for which we can confidently generate estimates is 90 days.
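The per-window A/A check reduces to a difference in means with a confidence interval that should cover zero. A minimal sketch using a normal approximation (the post doesn’t state which test is used; the data here are invented):

```python
from statistics import fmean, stdev

def aa_gap(group_a, group_b):
    """Difference in mean revenue between two matched groups plus a
    normal-approximation 95% CI; in an A/A setting the CI should cover 0."""
    diff = fmean(group_a) - fmean(group_b)
    se = (stdev(group_a) ** 2 / len(group_a)
          + stdev(group_b) ** 2 / len(group_b)) ** 0.5
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Hypothetical 90-day revenue for two matched, untreated groups.
a = [100, 110, 95, 105, 90, 108]
b = [98, 112, 93, 107, 91, 104]
diff, (lo, hi) = aa_gap(a, b)
print(lo <= 0 <= hi)  # True: no detectable gap, the matches look healthy
```

Running this check at 30, 90, and 180 days post-matching is exactly how one would locate the longest window at which the matched groups remain comparable.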
Below we show some results from our A/A testing: in particular, estimates from matching customers who all had a click in a predefined time window (i.e., they opted into the same event). On the x-axis, we show the different intervals for observing revenue (30 days post-matching, 90 days post-matching, and 180 days post-matching). On the y-axis, we show differences in revenue for these time frames across the matched groups. To get a sense of their reliability, we show results for four different run dates of the Platform (January, February, March, and April). The results make three things apparent:
1) In general, across different time windows, we observe small differences in gross revenue between the matched groups, suggesting the matching is doing a good job of finding like-for-like customers.
2) There aren't significant differences between groups in terms of average gross revenue 90 days post-matching, and there are significant but not practically meaningful differences 180 days out (i.e., only a $1 difference).
3) These estimates are quite stable across time.
On the basis of these results, we decided to output estimates for each event, each month, over two timeframes—90 days and 180 days. The 90 day estimates are more accurate, although they observe revenue over a shorter window; the 180 day estimates are somewhat less accurate but allow more time for revenue to land.
As a second form of validation, we conducted A/B backtesting. The logic here was to run an A/B test to estimate the causal effect of an event—an event for which a robust A/B test exists—and to simultaneously use the Halo Effects Platform (HEP) to estimate the causal impact of the same event using observational (non-randomized) data. We then look to see whether the estimate HEP produces falls within the 95% confidence interval of the estimate from the A/B test. If it does, we conclude that the HEP estimate adequately approximates, within the error bounds, the A/B test estimate.
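The validation criterion itself is a one-line containment check. The helper below is a hypothetical sketch, shown with the 90-day figures reported later in this post:

```python
def backtest_passes(hep_estimate, ab_ci):
    """The observational (HEP) estimate is validated when it falls inside
    the randomized A/B test's 95% confidence interval."""
    lo, hi = ab_ci
    return lo <= hep_estimate <= hi

# 90-day backtest: HEP estimate of -$7.38 vs. the A/B test CI.
print(backtest_passes(-7.38, (-14.34, 11.00)))  # True
```

Note this is a necessary rather than sufficient check: a very wide A/B confidence interval makes the test easy to pass, which is why the A/A results are needed alongside it.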
We show the 90 day A/B backtest results below. The A/B test estimate shows that 90 days post event, customers in the Treatment group spend essentially no more or less than customers in the Control group (the 95% confidence interval includes zero; note that because we are measuring revenue, rather than, for example, conversion rate, the estimates are a bit noisy, reflected in the somewhat wide confidence bounds). We also see that the HEP point estimate of -$7.38 falls within the confidence interval of the A/B test.
90 Day A/B Backtest Results
| | Estimate (Gross Revenue) | Confidence interval |
|---|---|---|
| A/B test treatment effect estimate | -$1.68 | [-$14.34, $11.00] |
| HEP treatment effect estimate | -$7.38 | [-$32.98, $18.21] |
We reach a similar conclusion when considering the 180 day results—once again, the HEP estimate falls within the 95% confidence interval of the A/B test estimate—although note the confidence interval is wider still, reflecting greater uncertainty in the estimates stemming from the longer observation window.
180 Day A/B Backtest Results
| | Estimate (Gross Revenue) | Confidence interval |
|---|---|---|
| A/B test treatment effect estimate | -$4.80 | [-$21.40, $11.79] |
| HEP treatment effect estimate | -$15.70 | [-$61.09, $29.62] |
These results suggest that HEP is able to approximate, within the bounds of error, the results of the gold-standard A/B test for the event in question. There remains, of course, the question of how well these results would generalize to estimates for other events. But coupled with the similarly encouraging results from A/A testing, these findings build confidence that the platform is able to correct for biases in the raw observational data and can serve as an adequate stand-in for estimating causal effects in situations where running A/B tests is not possible.
The Halo Effect Platform is currently running in production, allowing us to estimate the incremental causal impact of customer actions and events. This in turn allows us to optimize for longer term growth of the business; by speaking to customers about products and services that resulted in positive experiences for past customers (as evidenced by their returning and shopping with us again), new customers are themselves likely to come back and shop with us again—a win for customers and a win for the business!