Wayfair and its subsidiary brands operate globally, and we send out millions of marketing emails every day to communicate with our customers. To improve customer satisfaction and increase customer engagement, we have developed a new generation of daily sales email models (Nightingale) to make email sending decisions. The Nightingale model is built via a scalable retraining pipeline to serve different business needs across all regions and brands.
Wayfair sends out daily emails (Daily Sales Email) to millions of customers through our Notification Platform. Our team leverages customer signals such as page views, add-to-cart rates, email clicks, orders, and historical preferences to determine 1) how we expect each customer to respond if we send them an email, 2) what time (e.g. 9 am, 6 pm, or other times) we should send out this email, 3) to whom we should send the email, and 4) how frequently we should send the email (e.g. once per week or seven times per week).
To divide and conquer the complex business requirements (e.g. fast-changing customer behaviors and global business expansion), our team focused on building a new version of the ML model that handles the targeting of these emails (Nightingale) to better promote customer engagement and reduce unsubscribe rates among highly engaged customers. Nightingale estimates the benefit of sending an email (the Core value model module in the chart). The Nightingale model is the main topic of this article, and we will also briefly introduce the final sending decision process (the Governance module in the chart). Send Time Optimization will not be covered here; stay tuned for a future article for more details.
Nightingale is a family of machine learning models developed for daily sales ("batch") email, for the purpose of optimizing email cadence (how often we should send daily sales emails to customers). These models leverage historical customer interaction data to predict the probability a customer will convert or unsubscribe if we send them a daily sales email. The probabilities output by these models are combined with expected revenue gain/loss of conversion/unsubscription to generate a total net value score for each customer which guides send decisioning.
Background and Motivation
Previous daily sales email models set up a great foundational model-driven approach to target customers based on their past behavior and increase customer engagement. However, as our user base has grown significantly in recent years and our business environment is rapidly changing, we require a new generation of daily sales email models which will:
- Balance customer engagement benefits and unsubscribe costs: Our previous model could not estimate the potential revenue and benefits from our daily sales emails and the opportunity costs if customers unsubscribed from our emails. Therefore, the new daily sales email model suite should be capable of estimating these two metrics and assessing the trade-off for every sending decision.
- Scale efficiently and foster automation: Previous models were trained in Jupyter notebooks, could not be expanded to all stores quickly, and required manual retraining processes. An automatic and scalable retraining pipeline is required to retrain our models more frequently and capture fast-changing customer behaviors across multiple regions and brands.
- Address Decay in Model Performance: The previous model was trained on randomized data collected in 2019. Setting up an automatic retraining pipeline lets us leverage the latest customer data to achieve better performance. EDA (Exploratory Data Analysis) demonstrated at least a 4% performance lift from using the latest data compared to the outdated data.
Nightingale is a family of models that predicts whether each customer will 1) make a purchase, 2) unsubscribe, or 3) do nothing after receiving our daily sales emails. After we predict the probability of each action for each email-subscribed customer, we combine these probabilities with our estimated customer potential GRS (Gross Revenue Stable) and unsubscribe costs to generate our final Nightingale score. We have different versions of the model for each region and brand.
We have separate models to predict the customer level expected GRS (Horus) and expected unsub cost (Osiris).
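To make the scoring concrete, here is a minimal sketch of how per-customer class probabilities (Nightingale) might be combined with expected revenue (Horus) and expected unsubscribe cost (Osiris) into a single net value score. The field names and the simple expected-value formula are illustrative assumptions, not Wayfair's actual production formula.

```python
# Hypothetical sketch: combine per-customer probabilities with revenue
# and cost estimates into a net value score. All names are illustrative.

from dataclasses import dataclass

@dataclass
class CustomerScores:
    p_convert: float      # P(purchase | send), from Nightingale
    p_unsub: float        # P(unsubscribe | send), from Nightingale
    expected_grs: float   # expected gross revenue if they convert (Horus)
    unsub_cost: float     # opportunity cost if they unsubscribe (Osiris)

def net_value(c: CustomerScores) -> float:
    """Expected benefit of a send minus its expected unsubscribe cost."""
    return c.p_convert * c.expected_grs - c.p_unsub * c.unsub_cost

engaged = CustomerScores(p_convert=0.05, p_unsub=0.001,
                         expected_grs=120.0, unsub_cost=40.0)
at_risk = CustomerScores(p_convert=0.01, p_unsub=0.02,
                         expected_grs=80.0, unsub_cost=40.0)

print(round(net_value(engaged), 2))  # 5.96
print(round(net_value(at_risk), 2))  # 0.0
```

A higher net value means a send is more likely to pay for its unsubscribe risk, which is exactly the trade-off the governance step acts on.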
We selected more than 500 customer features to serve in the Nightingale models, by using multiple feature selection techniques including correlation analysis, random forest feature importance scores, hyperparameter tuning, step-wise selection, etc.
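As a small illustration of one of the simpler techniques listed above, the sketch below ranks candidate features by their absolute correlation with the target and keeps the top k. The data is synthetic; in practice this filter would be combined with random forest importance scores and step-wise selection as described.

```python
# Correlation-based feature filter on synthetic data: only features
# 0 and 2 actually drive the target, so they should rank on top.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
y = X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Absolute Pearson correlation of each candidate feature with the target.
corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top_k = sorted(np.argsort(corrs)[::-1][:2].tolist())
print(top_k)  # [0, 2]
```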
In order to integrate with our Daily Sales Notification Platform, which actually executes customer email sends, we must translate our model scores into a consumable decisioning format (Governance). After we generate the combined net value (Nightingale score), we bin customers by ranked model score and map each bin to a business-logic-defined "applicability" group that determines the send decision: customers in the highest bins (highest Nightingale net score) receive 7 sends per week, customers in middle bins receive 5 or 3 sends per week, and customers in the lowest bins receive 1 send per week. The applicability cadences were chosen based on historical marketing analysis and previous A/B tests.
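The governance step above can be sketched as a simple rank-and-bin function. The bin edges and weekly cadences here are example values only, not Wayfair's production configuration.

```python
# Illustrative governance sketch: rank customers by Nightingale score,
# split into quartile bins, and map each bin to a weekly send cadence.

def assign_cadence(scores: dict[str, float]) -> dict[str, int]:
    """Map each customer id to sends-per-week based on score rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    cadence = {}
    for i, cust in enumerate(ranked):
        pct = i / n  # 0.0 = best score
        if pct < 0.25:
            cadence[cust] = 7   # highest bins: daily sends
        elif pct < 0.50:
            cadence[cust] = 5
        elif pct < 0.75:
            cadence[cust] = 3
        else:
            cadence[cust] = 1   # lowest bins: one send per week
    return cadence

scores = {"a": 9.1, "b": 4.2, "c": 1.3, "d": -0.5}
print(assign_cadence(scores))  # {'a': 7, 'b': 5, 'c': 3, 'd': 1}
```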
Automated Retraining Capability
We built a new retraining pipeline, which works as follows:
- We leverage a set of dashboards to assess model performance daily. When we observe model performance decay (approximately every 6 months for Nightingale, based on our model performance monitoring analysis), we confirm with our marketing stakeholders that there are no ongoing blocking marketing campaigns, and kick off the retraining process defined in the following steps.
- Send emails at random frequencies to a random holdout group of customers for 4 weeks to collect unbiased retraining data.
- Leverage the Mercury platform (our feature platform) to regenerate training features, and leverage our new Nightingale Retraining Pipeline modules to retrain the model. The retraining pipeline takes a seed table of unique customers, generates Mercury customer features, trains the model, and outputs performance metrics.
- Backtest model performance with more historical data and roll out the new model into the production scoring DAG. (For the first migration, we conducted an A/B test because of the significant changes between the existing model and our new generation model; future revisions of Nightingale will most likely not require A/B testing since the model changes will be much smaller.)
- Continue to monitor model performance until the next retraining.
These steps are illustrated below:
Features of the Retraining Pipeline (RP)
We designed the retraining pipeline to be:
Flexible: All components are parameterized such that we can easily apply the pipeline to different models and geos/stores. For example, we can select the Nightingale Wayfair US parameter or Horus Wayfair Canada parameter to retrain different models in the Nightingale suite.
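A hedged sketch of what this parameterization might look like: a single config object selects the model, brand, and region, and the same pipeline code runs for any combination. The class and field names are hypothetical, not the actual pipeline's API.

```python
# Hypothetical parameterized entry point for the retraining pipeline.

from dataclasses import dataclass

@dataclass(frozen=True)
class RetrainConfig:
    model: str    # e.g. "nightingale", "horus", "osiris"
    brand: str    # e.g. "wayfair"
    region: str   # e.g. "US", "CA"

    def seed_table(self) -> str:
        """BigQuery seed table name derived from the parameters."""
        return f"seed_{self.model}_{self.brand}_{self.region.lower()}"

us_nightingale = RetrainConfig(model="nightingale", brand="wayfair", region="US")
ca_horus = RetrainConfig(model="horus", brand="wayfair", region="CA")
print(us_nightingale.seed_table())  # seed_nightingale_wayfair_us
print(ca_horus.seed_table())        # seed_horus_wayfair_ca
```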
Comprehensive: The Retraining Pipeline can support the full model retraining workflow including BigQuery seed table generation, feature generation and selection, data pre-processing, data sampling, model training, parameter tuning, model calibration, and model performance monitoring or backtesting.
Extensible: The RP is also easy to maintain and extend. We use object-oriented design to modularize the retraining components, so incorporating a new model into the RP is straightforward. For instance, to introduce a model for a new notification channel, we could create a channel-specific seed table BigQuery script and reuse the existing components to retrain the model. To support models beyond the existing PySpark models in the repo, we could add a new class that inherits from the existing model classes, keeping model training and evaluation methods consistent across models. For example, if we wanted to use a Spark ML model other than the existing random forest classifier, or a framework such as TensorFlow, we could simply add the corresponding modeling classes to the pipeline.
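A minimal sketch of this object-oriented extensibility: a shared base class fixes the train/evaluate contract, and supporting a new model type only requires a new subclass. The class and method names are illustrative, not the actual repo's API, and the trivial majority-class model stands in for a real learner.

```python
# Hypothetical base class enforcing a consistent train/evaluate contract.

from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Contract every model in the retraining pipeline must satisfy."""

    @abstractmethod
    def train(self, features, labels) -> None: ...

    @abstractmethod
    def predict_proba(self, features) -> list[float]: ...

    def evaluate(self, features, labels) -> float:
        """Shared evaluation logic: accuracy at a 0.5 threshold."""
        preds = [p >= 0.5 for p in self.predict_proba(features)]
        return sum(p == bool(l) for p, l in zip(preds, labels)) / len(labels)

class MajorityClassModel(BaseModel):
    """Trivial stand-in model: always predicts the training positive rate."""

    def train(self, features, labels) -> None:
        self.p = sum(labels) / len(labels)

    def predict_proba(self, features) -> list[float]:
        return [self.p] * len(features)

m = MajorityClassModel()
m.train(features=[[0], [1], [2], [3]], labels=[1, 1, 1, 0])
print(m.evaluate([[4], [5]], labels=[1, 0]))  # 0.5
```

Any new framework-specific model (Spark ML, TensorFlow, etc.) would subclass `BaseModel` and inherit the same evaluation path, which is what keeps metrics comparable across model types.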
Robust to upstream data changes: The retraining pipeline directly leverages Mercury (our feature platform), moving us off legacy Hive scripts. This ensures that our feature definitions are consistent with the rest of the organization and that we are insulated from changes in upstream data sources, which are monitored and handled by the Mercury team.
Model Result and Learning Points
The Nightingale model shows AUCs of ~0.8+ for the attributed-revenue and null classes, and ~0.7+ for the unsubscribe class (a ~4% performance improvement over the previous version based on our model decay analysis), which means we are better able to predict whether a customer will make a purchase or unsubscribe from our emails.
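For readers unfamiliar with per-class AUC, the sketch below computes a one-vs-rest ROC AUC from scratch using the Mann-Whitney formulation: the probability that a randomly chosen positive example outranks a randomly chosen negative one. The labels and scores are toy data, not Nightingale outputs.

```python
# One-vs-rest ROC AUC computed from scratch on toy data.

def auc(labels: list[int], scores: list[float]) -> float:
    """Probability a random positive outranks a random negative
    (Mann-Whitney formulation of ROC AUC; ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One-vs-rest evaluation for a single class (e.g. "unsubscribe"):
unsub_labels = [1, 0, 1, 0, 0]
unsub_scores = [0.9, 0.2, 0.6, 0.4, 0.7]
print(auc(unsub_labels, unsub_scores))  # 5/6 ≈ 0.833
```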
To test model performance in a real business environment, we launched a 4-week A/B test where we compared the performance of the new Nightingale model against the previous version, and the results demonstrated that the new model reduced unsubscribe (opportunity) cost by 7% without impacting our revenue and main customer engagement metrics.
The following plot shows that our new Nightingale model, with its customized potential revenue and unsubscribe cost estimation, deprioritizes sending emails to customers with higher unsubscribe costs. In this chart, a higher percentile ranking represents a lower combined Nightingale score (a higher probability of unsubscribing), so we send emails less frequently to these customers. The gap between the BAU (Business As Usual) and New Nightingale Model curves shows that customers ranked between 10-90% by the new Nightingale model contribute less to our unsubscribe costs: we are able to drive down email unsubscriptions for the majority of customers without negatively affecting other customer engagement metrics.
Launching the Nightingale model is a major win for our team: it enhances the customer experience with Wayfair while simultaneously improving email marketing channel performance.
While a success, we identified several areas for improvement as we move forward:
Reduce developer costs incurred by integrating multiple ML models
We separate the training pipeline and offline model evaluation across the three models (Nightingale, Horus, and Osiris) to prevent any mixing effects. Before launching the A/B test, we used a backtesting approach to simulate send-decision performance on historical data, giving us a degree of confidence before moving to online A/B testing. We also launched multiple test branches to evaluate the new models with and without the customer-level revenue/cost estimation: for example, the core Nightingale score without the other two models, with one of them, or with both. If performance was not ideal, we would know which model to investigate. However, this approach is costly: each additional model multiplies the number of model combinations that must be tested. We are looking into a more scalable methodology for integrating multiple models.
Reduce operational costs incurred by testing bespoke sending decision thresholds
To align with our notification sending platform configuration, we selected a small set of Nightingale score thresholds for email sending decisions. All of these thresholds were validated in multiple A/B tests, and maintaining the pipeline while scaling to more and more stores incurs significant operating costs in the send-decision space. In the future, we will look into leveraging Reinforcement Learning techniques such as Contextual Bandits to automatically select the best email sending thresholds. These RL agents would let us scale our models across different stores with minimal cost, maximizing customer satisfaction while meeting our business goals.
Enrich our feature space with additional customer and product feature embeddings
The Wayfair data science team has created a large number of customer-level and product-level features. Although we migrated our retraining pipeline to leverage more features in this new generation of the Nightingale daily sales email model, further collaboration with our product recommendation engine team and customer sequencing models could enhance model performance by adding deep-learning-generated signals. As mentioned above, there is a trade-off between increasing model performance and increasing model pipeline complexity.
Overall, our new generation of Nightingale daily sales email sending decision models brought improvements for both our customers and our business. With the new email sending strategies, we reduced unsubscribe rates among highly engaged customers while maintaining customers' motivation to engage with Wayfair through email. At the same time, we rearchitected our send decisioning platform to accelerate model retraining and experimentation. We look forward to continuously improving our marketing notification channel performance by building more efficient model-driven solutions.