To B2B, or not B2B, that is the question.

When you think of Wayfair, you might think of a storefront that caters exclusively to consumers. However, Wayfair Professional, our site that caters exclusively to business shoppers, continues to grow year after year. Wayfair Professional customers are able to take advantage of a number of benefits such as an expanded product catalog, personalized services, bulk orders, multiple address delivery, special pricing, and more.

Business customers might be office managers looking to furnish a corporate location, or a hotel chain that wants to furnish rooms in elegant and consistent ways. However, many business customers might not be aware of Wayfair's business-to-business (B2B) offerings. To engage these customers in meaningful and relevant ways, Wayfair’s Data Science & Machine Learning team developed B2B customer identification models, also known as Hamlet models, to help us identify and engage B2B customers who browse or shop on the regular consumer site.

Leads (customers we seek to enroll in the Wayfair Professional program because we believe they are B2B shoppers) are generated through self-identification, interaction with B2B marketing and on-site messaging, or machine learning (ML) models that classify shopping behavior as “B2B-like”. These ML models are an important source of B2B leads for Wayfair. We nicknamed this suite of models “Hamlet”, as they help us answer the question inspired by Shakespeare’s soliloquy: “to B2B or not to B2B?” This post focuses on Hamlet Order Confirmation (OC), which is triggered after an order is placed on the website and estimates the probability that the customer is shopping for a business.

Design Goals and Challenges

Hamlet OC triggers in real time whenever a customer places an order, and predicts the likelihood that the order was actually placed on behalf of a business. Our goal is to accurately differentiate between B2C shoppers (consumers) and B2B shoppers (businesses). High-scoring customers (those with a high likelihood of being a business) are contacted by our sales team to confirm their status as a business and to present them with the benefits of enrolling in Wayfair Professional.

Solving this problem presented several key challenges. In this article, we’ll focus on one key modeling challenge and one key engineering challenge:

Modeling challenge: labeling with incomplete ground truth information. The target variable “y” for the Hamlet models is whether or not the customer is a business customer. Since we do not know whether an arbitrary site visitor is shopping for a business, we took a three-step approach to creating ground truth labels for model training, which is described in the next section.
Engineering challenge: meeting real-time SLAs without compromising on feature richness. We decided on a real-time architecture for Hamlet OC (as opposed to batch next-day processing) because customers are 3x as likely to enroll in Wayfair Professional if we reach out immediately after an order is placed (through a phone call, an email, or on-site messaging if they remain on site after ordering). Real-time models must meet strict execution SLAs in order to effectively drive downstream outreach. Meeting these SLAs with a relatively large feature set that includes both historical and real-time features posed an engineering challenge.

Challenge 1: Labeling with incomplete ground truth information

Because many “true” business customers browse and shop our consumer storefront without enrolling in Wayfair Professional, we cannot assume that all shoppers in the B2C experience should be negatively labeled for model training. To solve this problem, we followed a three-step approach:

Label known B2B and B2C customers: positively label existing Wayfair Professional customers, and negatively label customers who self-identified as consumers (for example, by opting out of Wayfair Professional marketing in the past).
Augment the positive label set using record linkage algorithms: use an in-house record linkage algorithm to match customers to known businesses based on shipping and billing addresses. Known business data is acquired through public records or third-party firmographic data providers.
Complete the label set with synthetic labels: the previous steps provide high-quality labels, but at a low volume compared to the overall Wayfair customer universe. To grow the training set for the real-time model, we trained a neural network to estimate “isB2B” probabilities for all of the previously unlabeled customers. Finally, we positively labeled customers with P(isB2B) > c, and negatively labeled customers with P(isB2B) < 1-c. Here, c is our confidence threshold for synthetic labeling, which we treat as a hyperparameter in the overall model training exercise.
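The synthetic labeling rule in the third step can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: the function names and customer IDs are hypothetical, the requirement that c lie between 0.5 and 1 is an assumption made so that the positive and negative regions cannot overlap, and leaving customers in the middle band unlabeled is likewise an assumption.

```python
from typing import Dict, Optional


def synthetic_label(p_is_b2b: float, c: float) -> Optional[int]:
    """Map a predicted P(isB2B) to a synthetic label using confidence threshold c.

    Returns 1 (business), 0 (consumer), or None when the score is not
    confident enough in either direction.
    """
    # Assumption: c must exceed 0.5 so the two labeled regions stay disjoint.
    if not 0.5 < c < 1.0:
        raise ValueError("c must lie strictly between 0.5 and 1")
    if p_is_b2b > c:
        return 1
    if p_is_b2b < 1.0 - c:
        return 0
    return None


def build_synthetic_labels(scores: Dict[str, float], c: float) -> Dict[str, int]:
    """Keep only the customers whose scores are confident enough to label."""
    labels = {}
    for customer_id, p in scores.items():
        label = synthetic_label(p, c)
        if label is not None:
            labels[customer_id] = label
    return labels


# Hypothetical scores from the offline neural network.
scores = {"cust_a": 0.97, "cust_b": 0.50, "cust_c": 0.02}
labels = build_synthetic_labels(scores, c=0.9)
print(labels)
```

Raising c yields fewer but cleaner synthetic labels, while lowering it yields more labels with more noise, which is why it makes sense to tune c alongside the other hyperparameters.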

Figure: a flow diagram showing how we first use heuristics, then a neural network, to separate positive labels from negative labels. We used a combination of heuristics and an offline neural network to separate B2B and B2C customers for our training dataset. Note that the neural network used here does not have the SLA constraint of our real-time model, so we can use it to "bootstrap" our way into a lightweight online model.
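To give a feel for the address-based record linkage used to augment the positive labels, here is a toy sketch. It is not our in-house algorithm: the abbreviation table, the similarity threshold, the business name, and the addresses are all hypothetical, and the matching simply compares normalized address strings with Python's standard `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

# Hypothetical abbreviation table; a real system would use a fuller address parser.
_ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "suite": "ste",
    "road": "rd",
    "boulevard": "blvd",
}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and abbreviate common street words."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", raw.lower()).split()
    return " ".join(_ABBREVIATIONS.get(token, token) for token in tokens)


def match_business(
    address: str,
    known_businesses: Dict[str, str],
    min_similarity: float = 0.9,  # assumed cutoff, chosen only for illustration
) -> Optional[str]:
    """Return the best-matching known business name, or None if nothing is close enough."""
    target = normalize_address(address)
    best_name, best_score = None, 0.0
    for name, business_address in known_businesses.items():
        candidate = normalize_address(business_address)
        score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_similarity else None


# Hypothetical known-business record and customer shipping addresses.
known = {"Acme Design Studio": "12 Harbor Street, Suite 400, Boston, MA"}
print(match_business("12 Harbor St. Ste 400 Boston MA", known))
print(match_business("98 Elm Road, Springfield, IL", known))
```

A customer whose shipping or billing address matches a known business would be added to the positive label set; everyone else stays unlabeled at this stage.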

Challenge 2: Meeting real-time SLAs without compromising on feature richness

Wayfair Professional serves a diverse array of clients ranging from small businesses such as designers and contractors, to institutional clients such as universities and hotel chains. With so many different shopping patterns, we need a large feature set to capture salient signals across such a wide variety of customers. In total, Hamlet uses over 100 features pulled from real-time site browsing activity, order data, marketing interactions (e.g., ad clicks or email engagement), public databases, and third-party firmographic data vendors.

Meeting real-time service level agreements (sub-second inference + response time) with a large feature set and a mix of real-time and historical features was a key challenge. We worked with Wayfair’s Machine Learning Platforms team to build a reusable framework for creating and retrieving these features:

Streaming features: computed from data collected within the last several minutes or hours. Examples include current order details and on-site search history. These values are served from Aerospike, a low-latency key-value database that we use as a real-time cache.
Precomputed features: computed from historical data (more than a few hours old, up to several years old). Examples include lifetime order count, historical classes browsed, and average order value. Using Vertex AI Feature Store, we precompute feature values every 12 hours and retrieve them in real time when the model is triggered. Precomputing many of our feature values and then retrieving them at inference time is a critical enabler of our sub-second SLA.
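The way the two feature sources come together at inference time can be sketched as follows. This is an illustration under stated assumptions, not our framework: plain dictionaries stand in for Aerospike and the feature store (no client libraries are called), the feature names and default values are hypothetical, and letting a streaming value override a precomputed one on a name clash is an assumption.

```python
from typing import Any, Dict, Mapping

# Hypothetical in-memory stand-ins for the streaming cache and the batch feature store.
STREAMING_STORE = {
    "cust_a": {"order_total": 1840.0, "searches_last_hour": 6},
}
PRECOMPUTED_STORE = {
    "cust_a": {"lifetime_order_count": 14, "avg_order_value": 412.5},
}

# Hypothetical defaults used when a customer has no value for a feature.
FEATURE_DEFAULTS = {
    "order_total": 0.0,
    "searches_last_hour": 0,
    "lifetime_order_count": 0,
    "avg_order_value": 0.0,
}


def assemble_features(
    customer_id: str,
    streaming: Mapping[str, Mapping[str, Any]],
    precomputed: Mapping[str, Mapping[str, Any]],
    defaults: Mapping[str, Any],
) -> Dict[str, Any]:
    """Build one model-ready feature row from both sources.

    Lookups are cheap key reads: the expensive aggregation for historical
    features has already happened in the batch job.
    """
    features = dict(defaults)
    features.update(precomputed.get(customer_id, {}))
    # Assumption: fresher streaming values win if both sources define a feature.
    features.update(streaming.get(customer_id, {}))
    return features


row = assemble_features("cust_a", STREAMING_STORE, PRECOMPUTED_STORE, FEATURE_DEFAULTS)
print(row)
```

Because every historical feature is a single key lookup at request time, the latency budget is spent on retrieval and inference rather than on scanning years of order history.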

Figure: a diagram showing both real-time and batch features stemming from a set of recorded events and being served to the Hamlet OC application. Some Hamlet features are served in real time from a cached database (Aerospike), while others are precomputed using Vertex AI Feature Store.

Model Selection & Training

We evaluated several different model frameworks to identify candidates that could meet our sub-second inference requirement with 100+ features, and ultimately selected the xgboost library. We then trained and tuned a model with the objective of maximizing PR AUC: the area under the precision-recall curve. This metric is a good choice for several reasons:

B2B customer identification is a highly imbalanced classification task (there are far more B2C customers than B2B customers), so precision- and recall-based metrics are more useful than accuracy in this case.
The whole PR curve is of interest, not just one operating point. Different business applications require different minimum precision standards, so maximizing PR AUC helps us optimize for all of these applications simultaneously.
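To make the metric concrete, the sketch below computes average precision, a common estimator of the area under the precision-recall curve, in plain Python. It is not our evaluation code: the labels and scores are made up, and the function ignores tied scores for simplicity.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average of the precision values measured at the rank of each positive example."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")
    # Rank customers from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` customers
    return total / n_pos


# Made-up data: 2 businesses among 6 customers.
y_true = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
ap = average_precision(y_true, scores)
print(round(ap, 4))
```

On the same data, a model that labels every customer as a consumer would be correct for 4 of 6 customers while finding no businesses at all, which is why accuracy is a poor guide for an imbalanced task like this one.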

Classification Threshold Selection

Ultimately, we need to convert Hamlet’s outputs — P(isB2B) — into binary true/false classifications that will drive concrete business actions. We’ll do this by setting thresholds on the model’s probability outputs; i.e., if the model score is above some threshold X, classify the customer as a business. For a given business application, the optimal threshold depends on our desired level of precision and recall.

For example, when reaching out to customers via phone calls, we want a high level of precision (most of the people we call are really businesses), because making phone calls is relatively expensive. On the other hand, showing on-site messaging is low cost as long as it doesn’t slow down page load time, so we prefer high recall (show messaging to most of the true business shoppers on the site).

Figure: a precision-recall curve showing the tradeoff between precision and recall at different classification thresholds. Different thresholds on our precision-recall curve can be used for different messaging applications.
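One simple way to turn a precision requirement into a threshold is to pick the lowest score cutoff that still meets the required precision on a validation set, since lowering the cutoff can only keep or increase recall. The sketch below shows that rule; the validation data and the two precision targets are hypothetical and are not the values we use in production.

```python
from typing import Optional, Sequence, Tuple


def pick_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    min_precision: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision is at least min_precision, or None if no threshold qualifies."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    best = None
    # Walk thresholds from strict to lenient; the last qualifying one has the highest recall.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 0)
        precision = tp / (tp + fp)
        recall = tp / n_pos
        if precision >= min_precision:
            best = (threshold, precision, recall)
    return best


# Hypothetical validation set: 3 businesses among 6 customers.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]

phone = pick_threshold(y_true, scores, min_precision=0.9)   # expensive outreach
onsite = pick_threshold(y_true, scores, min_precision=0.5)  # cheap outreach
print(phone, onsite)
```

The high-precision target for phone calls yields a stricter threshold and lower recall, while the relaxed target for on-site messaging yields a lower threshold that reaches every business in this toy dataset.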

End-to-End Architecture

The following flowchart shows the model's end-to-end flow. To productionize the model, we developed infrastructure to retrieve feature values from the triggering order event, Aerospike, and Vertex AI Feature Store. These values are then feature engineered and passed to the model for inference. Based on the model's output score, we decide how to engage the customer: different scores lead to different types of outreach or treatment to encourage the customer to join Wayfair Professional.

Figure: a flow diagram showing the logical events that occur between a customer placing an order, batch and real-time features being served to a model, and the model generating a probability that the customer is B2B. Our full model inference pipeline is kicked off whenever a customer places an order, and terminates in our model giving a predicted probability that the customer is a business.
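The final step, mapping a model score to an outreach decision, can be sketched as a simple lookup. The channel names and threshold values below are hypothetical placeholders chosen for illustration; the only idea taken from the pipeline is that costlier channels require higher scores.

```python
from typing import Dict, List

# Hypothetical per-channel thresholds: costlier outreach demands a higher score.
OUTREACH_THRESHOLDS: Dict[str, float] = {
    "phone_call": 0.85,
    "email": 0.60,
    "onsite_message": 0.20,
}


def choose_outreach(
    score: float,
    thresholds: Dict[str, float] = OUTREACH_THRESHOLDS,
) -> List[str]:
    """Return every outreach channel whose threshold the score meets."""
    return [channel for channel, cutoff in thresholds.items() if score >= cutoff]


print(choose_outreach(0.90))
print(choose_outreach(0.30))
print(choose_outreach(0.10))
```

Keeping the thresholds in configuration rather than in the model lets each outreach channel adjust its own precision and recall tradeoff without retraining.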

Conclusion

In order to build Hamlet OC (order confirmation) we:

Used record linkage (matching) and synthetic labeling to expand our label set for better classifier training.
Deployed a real-time scoring application without compromising on feature richness by using Vertex AI Feature Store to precompute as many features as possible.
Selected different thresholds appropriate for different customer engagement methods.

When we benchmark Hamlet OC's performance, we find the model is twice as effective as prior methods that relied on simpler business rules. Developing Hamlet OC required significant innovations on the scientific and engineering fronts, and we are delighted to be part of a team that helps businesses around the world design great spaces in cost-effective and efficient ways.

Hamlet: Wayfair's ML Approach to Identifying Business Shoppers