At Wayfair, data scientists help optimize our marketing channel performance by rapidly and iteratively developing machine learning algorithms and data-driven strategies. When it comes to measuring model efficacy, AB testing plays a central role and serves as a gold standard for strategic decision making within marketing. Due to the natural complexities of marketing operations, marketing-related test design and measurement require special techniques to account for multidimensional, sparse, seasonal, and autocorrelated observations. In this blog post, we discuss these challenges and introduce how Wayfair’s advanced marketing test platform, Gemini, remedies them.
A Brief Overview of the AB Test Framework
Typically, an AB test consists of two phases: design and measurement. During the test design phase, experimental subjects are assigned to control/treatment groups, with primary Key Performance Indicators (KPIs) pre-specified and sample size/test duration determined through power analysis. During the measurement phase, treatments are applied to subjects accordingly, and treatment effects and their statistical significance are measured through hypothesis testing. For example, to understand whether a new web page style is more appealing, customers would be randomly shown either the old or the new style when the pages load. Such a design is referred to as a completely randomized design (CRD). To measure the attractiveness of the two styles, add-to-cart rate and total browsing time could be used as the primary KPIs, quantified with a two-sample proportion test and a Student’s t-test, respectively. While CRD paired with a t-test is one of the most commonly used AB test approaches, it is not always efficient, economical, or even possible to employ. This is especially true in the context of marketing tests, due to the following challenges.
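As a concrete illustration, the two hypothesis tests mentioned above can be sketched in a few lines of Python. The data below is simulated and all counts and scales are hypothetical, not Wayfair figures:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated browsing times in minutes; scales are made up for illustration
control = rng.exponential(scale=5.0, size=1000)
treatment = rng.exponential(scale=5.5, size=1000)

# Welch's t-test for the continuous KPI (total browsing time)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Two-sample proportion z-test for the binary KPI (add-to-cart rate)
successes = np.array([120, 95])    # hypothetical add-to-cart counts
trials = np.array([1000, 1000])
p_pool = successes.sum() / trials.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / trials[0] + 1 / trials[1]))
z = (successes[0] / trials[0] - successes[1] / trials[1]) / se
p_prop = 2 * stats.norm.sf(abs(z))
```

Both tests assume independent observations, which is exactly the assumption the challenges below will break.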
Challenge 1: Dealing with multidimensional, collinear, and skewed KPIs
Marketers are typically interested in measuring more than one KPI in a test. For example, to compare two bidding algorithms in Search Engine Marketing, marketers would usually like to quantify the outcome by net revenue, ad cost, and traffic. Therefore, good marketing test design relies on even splits of test subjects into treatment groups across multiple KPIs. Interestingly, these KPIs are often highly skewed and intercorrelated. For instance, most of the visits from the Search Engine Marketing channel do not generate any revenue, so the observations are zero-inflated. At the same time, revenue and ad costs are usually positively correlated. A completely randomized design often fails to make even splits across multiple KPIs efficiently, especially when cost constraints limit the sample size. In this case, stratified sampling together with principal component analysis speeds up the test design process.
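To make this concrete, here is a minimal sketch of combining PCA with stratified sampling on simulated KPIs. The KPI distributions are invented for illustration, and PC1 is computed from scratch with an SVD rather than any particular library's PCA routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical correlated, zero-inflated KPIs: traffic, ad cost, revenue
traffic = rng.poisson(20, size=n).astype(float)
ad_cost = traffic * rng.uniform(0.5, 1.5, size=n)
revenue = np.where(rng.random(n) < 0.7, 0.0,
                   traffic * rng.uniform(1.0, 3.0, size=n))

# First principal component (PC1) of the standardized KPIs via SVD
X = np.column_stack([revenue, ad_cost, traffic])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Z @ vt[0]

# Stratified sampling: split 50/50 at random within each PC1 quartile
strata = np.digitize(pc1, np.quantile(pc1, [0.25, 0.5, 0.75]))
group = np.empty(n, dtype=int)
for s in np.unique(strata):
    idx = np.where(strata == s)[0]
    rng.shuffle(idx)
    half = len(idx) // 2
    group[idx[:half]] = 0   # control
    group[idx[half:]] = 1   # treatment
```

Because PC1 summarizes the correlated KPIs in one number, stratifying on it tends to balance all of them at once instead of balancing each KPI separately.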
Challenge 2: Dealing with observations with seasonal variations
Marketing KPIs tend to have very strong seasonal variations. For example, Search Engine Marketing traffic usually doubles on weekends compared to weekdays, and revenue grows month over month as the business grows. It is crucial to assign treatments orthogonally to such seasonality so that the estimates of treatment effects remain unbiased. A good practice is to avoid simple pre-post comparisons and to make sure treatment groups are represented within each seasonal period. In addition, we recommend modeling the seasonality as covariates during hypothesis testing to increase the power of the test.
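The idea of modeling seasonality as covariates can be sketched with a small simulated regression. The weekend lift and treatment effect sizes below are made up; the point is that including day-of-week dummies in the design matrix removes the seasonal signal from the treatment estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
days = 28
dow = np.arange(days) % 7                           # day of week, 0 = Monday
treated = (np.arange(days) % 2 == 0).astype(float)  # treatment alternates daily
weekend_lift = np.where(dow >= 5, 10.0, 0.0)        # made-up weekend seasonality
y = 50.0 + weekend_lift + 2.0 * treated + rng.normal(0, 1, days)  # true effect = 2

# Design matrix: intercept, treatment indicator, day-of-week dummies
dummies = (dow[:, None] == np.arange(1, 7)[None, :]).astype(float)
X = np.column_stack([np.ones(days), treated, dummies])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
treatment_effect = beta[1]   # recovers a value close to the true 2.0
```

Without the day-of-week dummies, the weekend lift would inflate the residual variance and reduce the power of the test.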
Challenge 3: Dealing with autocorrelated observations
Due to constraints on marketing operations, it is common to have longitudinal measurements where observations are not independent of each other. For example, to test two Search Engine Marketing optimization models, the test subjects are restricted to the search keywords on which our bidding algorithms are applied. The same keyword has to stay in the same treatment group throughout the test period because of the marketing attribution process. Therefore, the day-to-day measurements come from the same batch of keywords and are autocorrelated. In this case, special statistical models are needed to account for such autocorrelation; failing to do so increases the probability of drawing false positive conclusions (inflated type I error).
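The type I error inflation is easy to demonstrate by simulation. In the hypothetical setup below there is no true treatment effect at all, yet a naive t-test that treats every keyword-day as an independent observation rejects the null far more than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_keywords, n_days, n_sims = 40, 30, 200
naive_fp, agg_fp = 0, 0

for _ in range(n_sims):
    # Keyword-level effect shared across days induces autocorrelation;
    # there is NO true treatment effect in this simulation
    kw_effect = rng.normal(0, 2, n_keywords)
    group = np.repeat([0, 1], n_keywords // 2)
    y = kw_effect[:, None] + rng.normal(0, 1, (n_keywords, n_days))

    # Naive t-test treats every keyword-day as independent
    _, p_naive = stats.ttest_ind(y[group == 1].ravel(), y[group == 0].ravel())
    naive_fp += p_naive < 0.05

    # Aggregating to one mean per keyword restores independence
    _, p_agg = stats.ttest_ind(y[group == 1].mean(axis=1),
                               y[group == 0].mean(axis=1))
    agg_fp += p_agg < 0.05

naive_fpr = naive_fp / n_sims   # far above the nominal 5%
agg_fpr = agg_fp / n_sims       # close to the nominal 5%
```

Aggregation is the simplest remedy, but it throws away the day-level information; the mixed effects model described below keeps it while still controlling the error rate.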
The Gemini Solution to Marketing Test Design and Measurement
At Wayfair, we developed the Gemini test platform to remedy these challenges. The Gemini framework consists of two components: 1) a test design phase, which generates control and treatment groups based on multiple KPIs; and 2) a test measurement phase, which evaluates treatment effects while controlling for seasonality and autocorrelation. In addition, the built-in train-validate-calibrate-evaluate logic enables quality control and power analysis, which improves test design quality and test measurement accuracy. We elaborate on each component below.
Gemini improves test design quality with Multi-Group Palindromic Split (MGPS) based on the first principal component
As discussed above, many marketing tests require even splits of control and treatment groups across multiple KPIs. Fortunately, most of these KPIs are correlated with each other, so the first principal component (PC1) can capture the information carried by these KPIs and serve as the key metric for stratified sampling. On top of PC1, Gemini applies Multi-Group Palindromic Split (MGPS), a special version of stratified sampling, to assign control and treatment groups (Fig-1). Our practice over the past year has demonstrated that applying MGPS based on PC1 is a robust and efficient way to generate even control and treatment groups. In cases where the primary KPIs are independent, the PCA approach would not work; however, MGPS can still be applied based on a user-defined summary metric, although it might take more tries to get satisfactory splits. Notably, even when the splits are not satisfactorily even, the Gemini measurement model introduced below is robust enough to adapt.
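The exact MGPS algorithm is internal to Gemini, but the core idea of a palindromic split can be sketched: sort subjects by the summary metric, then assign them in a mirrored, repeating pattern (A, B, B, A for two groups) so that every group samples each region of the distribution evenly. The function below is our illustrative interpretation on simulated data, not Gemini's implementation:

```python
import numpy as np

def palindromic_split(metric, n_groups=2):
    """Assign subjects, sorted by `metric`, in a mirrored repeating
    pattern (e.g. 0,1,1,0 for two groups). Illustrative sketch only."""
    order = np.argsort(metric)
    pattern = np.concatenate([np.arange(n_groups), np.arange(n_groups)[::-1]])
    groups = np.empty(len(metric), dtype=int)
    groups[order] = np.resize(pattern, len(metric))
    return groups

rng = np.random.default_rng(3)
pc1 = rng.lognormal(0.0, 1.0, 1000)   # skewed summary metric, e.g. PC1 scores
groups = palindromic_split(pc1)
```

Because adjacent subjects in the sorted order land in mirrored groups, the group means of the metric end up nearly identical even for heavily skewed distributions.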
Gemini enhances test measurement accuracy with linear mixed effects model
The central component of the Gemini framework is a linear mixed effects model, in which each primary KPI is modeled as a function of the treatments (as fixed effects) together with other covariates (as random effects) to account for seasonality and longitudinal measurements (Fig-2). Specifically, time or date is encoded as a categorical variable to account for its nonlinear relationship with the primary KPIs. The subgroup from MGPS is used to account for autocorrelation, which helps prevent false positive conclusions.
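A minimal version of such a model can be sketched with statsmodels' `mixedlm` on simulated data (this is our illustration, not Gemini's production code): the treatment enters as a fixed effect, date as a categorical covariate, and a random intercept per keyword absorbs the autocorrelation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_keywords, n_days = 30, 14

df = pd.DataFrame({
    "keyword": np.repeat(np.arange(n_keywords), n_days),
    "day": np.tile(np.arange(n_days), n_keywords),
})
df["group"] = (df["keyword"] < n_keywords // 2).astype(int)
kw_effect = rng.normal(0, 1, n_keywords)        # keyword-level autocorrelation
df["revenue"] = (
    100.0
    + 5.0 * df["group"]                         # true treatment effect
    + 3.0 * (df["day"] % 7 >= 5)                # weekend seasonality
    + kw_effect[df["keyword"]]
    + rng.normal(0, 1, len(df))
)

# Treatment as a fixed effect, date as a categorical covariate,
# and a random intercept per keyword to absorb autocorrelation
model = smf.mixedlm("revenue ~ group + C(day)", df, groups=df["keyword"])
result = model.fit()
effect = result.params["group"]                 # close to the true value of 5
```

The random intercept plays the same role as the keyword-mean aggregation in the earlier simulation, but it retains the daily observations and therefore the power contributed by the seasonal covariates.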
Improving Results Overall: Gemini validates test designs and calibrates test measurements
Another highlight of Gemini is its built-in train-validate-calibrate-evaluate framework, inspired by the well-known train-validate-test logic in machine learning (Fig-3). During the test design phase, historical data is used to conduct an AA backtest that measures the quality of the test design before the test is launched. Specifically, MGPS should be performed on the training period only, and an independent validation period is used to measure split evenness. As a bonus, data points from the validation period can be used for simulation-based power analysis to estimate the required sample size and test duration. During the test measurement phase, if there are inherent baseline differences between the control and treatment groups during the validation period, we may modify the test design matrix and encode the treatment indicator as a time-dependent variable, so that the estimated treatment effect offsets/calibrates any pre-test baseline differences.
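A sketch of what an AA backtest and a simulation-based power analysis might look like on historical data follows. Everything here is simulated and the lift, alpha, and sample sizes are hypothetical choices, not Gemini defaults:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical historical daily KPI values per subject (training data)
n_subjects, n_days = 200, 21
history = rng.gamma(2.0, 5.0, (n_subjects, n_days))
group = rng.permutation(np.repeat([0, 1], n_subjects // 2))

# AA backtest: with no treatment applied, the measured "effect"
# should be statistically indistinguishable from zero
a = history[group == 0].mean(axis=1)
b = history[group == 1].mean(axis=1)
_, p_aa = stats.ttest_ind(a, b)

# Simulation-based power analysis: inject a hypothetical +5% lift and
# estimate how often the test detects it at alpha = 0.05
lift, alpha, n_sims = 1.05, 0.05, 200
detections = 0
for _ in range(n_sims):
    g = rng.permutation(group)
    a = history[g == 0].mean(axis=1)
    b = history[g == 1].mean(axis=1) * lift
    _, p = stats.ttest_ind(a, b)
    detections += p < alpha
power = detections / n_sims   # if too low, extend the test or add subjects
```

A significant AA p-value would flag an uneven split before launch, and the simulated power tells us whether the planned sample size and duration can detect the smallest lift we care about.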
In this blog post, we have discussed the challenges of test design and measurement in the context of marketing use cases, and presented the Gemini solutions to these challenges, in which the principal component-based Multi-Group Palindromic Split method serves as the centerpiece of the test design phase and the linear mixed effects model plays the central role in the test measurement phase. Notably, while Gemini was developed in the context of marketing tests, our experience at Wayfair over the past year has demonstrated its value as a general and robust test framework with extensive use cases in merchandise, storefront, and logistics. We are excited to announce that we are currently productionizing the Gemini framework into R/Python packages, and we will keep this post updated going forward.