Let’s look at how we at Wayfair use custom embedding models to capture complex customer behavior patterns for advanced fraud prevention.
At Wayfair, we are committed to providing a safe and secure shopping experience for our customers. One of the key challenges in achieving this goal is detecting and preventing fraud on our platform, and to this end Wayfair has developed several machine learning models for fraud detection. These efforts matter because fraudsters can directly harm our customers by compromising their personal and financial information, making unauthorized transactions, and potentially damaging their credit. Fraud also affects customers indirectly, through higher product prices and reduced trust in the platform, which diminishes the overall quality of the online shopping experience. By proactively addressing these issues, Wayfair helps ensure a secure environment where everyone can shop with confidence and peace of mind.
However, as fraudsters continue to evolve their tactics, these models need to be constantly improved to stay ahead of the curve. This is why we have developed a new embedding system called Melange, based on self-supervised representation learning, which helps our fraud detection systems adapt organically to fraudsters' new behavior patterns.
Our goal was to develop a system that captures complex patterns and dependencies within time-series data. Our feature store already contains custom, manually created tabular features that represent specific statistics of customer behavior, such as average visits or orders over the last 7 or 30 days. However, manually created features like these carry only a limited amount of information and may miss more complex patterns. Creating them can also be time-consuming and may require domain expertise, whereas sequence modeling can derive features automatically from the raw data. Manual features may also introduce biases or oversights, while sequence modeling can uncover more nuanced relationships in the data. Overall, sequence modeling provides a more comprehensive and accurate representation of the data [1] and can potentially improve the accuracy and robustness of fraud detection systems.
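To make the idea of a manually created feature concrete, here is a minimal sketch of a trailing-window statistic such as average daily visits over the last 7 or 30 days. The function name, feature names, and sample dates are illustrative assumptions, not the actual contents of our feature store.

```python
from datetime import datetime, timedelta

def average_daily_visits(visit_times, now, windows_days=(7, 30)):
    """Average number of visits per day in each trailing window ending at `now`.

    Hypothetical helper: the real feature definitions are not shown in this post.
    """
    features = {}
    for days in windows_days:
        window_start = now - timedelta(days=days)
        count = sum(1 for t in visit_times if window_start < t <= now)
        features[f"avg_daily_visits_{days}d"] = count / days
    return features

# Made-up visit timestamps for a single customer.
now = datetime(2023, 6, 30)
visits = [
    datetime(2023, 6, 29),
    datetime(2023, 6, 25),
    datetime(2023, 6, 10),
    datetime(2023, 5, 1),   # older than 30 days, so it is ignored
]
manual_features = average_daily_visits(visits, now)
```

Each such feature is a single hand-picked number, which is why it cannot express the order in which a customer moved through the site.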
This system already shows promising results in improving the performance of our fraud models, and we believe it has the potential to be a valuable tool for many other applications as well. So, let’s look deeper into our customer behavior embedding system.
What is an embedding system?
Before moving forward, let’s clarify some definitions. An embedding system is a machine learning model that generates vector representations, or embeddings, of data points in a high-dimensional space. The goal of an embedding system is to create a space in which similar data points are located near each other, while dissimilar data points are far apart. This can be useful for a variety of tasks, such as classification, clustering, and similarity matching.
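As a small illustration of "near" and "far" in an embedding space, the sketch below compares toy vectors with cosine similarity. Cosine similarity is one common choice of closeness measure and is an assumption here, as are the three made-up vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, -1 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (invented values for illustration only).
customer_a = [0.9, 0.1, 0.3]
customer_b = [0.8, 0.2, 0.4]    # behaves much like customer_a
customer_c = [-0.7, 0.9, -0.2]  # behaves very differently

sim_ab = cosine_similarity(customer_a, customer_b)
sim_ac = cosine_similarity(customer_a, customer_c)
```

In a well-trained space, `sim_ab` would be high and `sim_ac` low, which is what makes the vectors useful for classification, clustering, and similarity matching.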
In the context of our task, the Melange embedding system is designed to generate customer embeddings from customer session history and to capture the key features of the customer journey during those sessions. These embeddings can then be used as external features in downstream models, such as fraud detection models, to improve their performance. The Melange embedding system is based on self-supervised representation learning.
Self-supervised representation learning
The goal of representation learning is to create data representations that simplify the process of extracting valuable information when building classifiers or other predictors [2]. Self-supervised representation learning is a type of representation learning where the goal is to learn useful representations of data without the need for explicit labels, which can be time-consuming and expensive to obtain.
The training process in a self-supervised approach is usually split into two steps:
- Pretext task, which is an auxiliary task designed to learn a useful intermediate representation without relying on labeled data. The task is built from some transformation of the data itself, such as predicting a missing word in a sentence, colorizing a grayscale image, or predicting the rotation applied to an image. Pretext tasks enable models to learn meaningful features and representations without requiring large amounts of labeled data.
- Downstream task, which is the task we actually care about, such as classification or regression. It reuses the representations learned in the pretext task.
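The pretext step can be sketched with the missing-word example: the labels come from the data itself, so no manual annotation is needed. The helper name and the `<mask>` token below are illustrative assumptions.

```python
MASK = "<mask>"  # placeholder token, chosen here for illustration

def masked_examples(tokens):
    """Turn one unlabeled sequence into (masked_input, target) training pairs.

    Each pair hides exactly one token; the hidden token becomes the label.
    """
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        examples.append((masked, target))
    return examples

sentence = ["the", "cat", "sat"]
pretext_pairs = masked_examples(sentence)
# pretext_pairs[1] is (["the", "<mask>", "sat"], "cat")
```

A model trained on pairs like these has to learn something about how tokens relate to each other, and that learned representation is what the downstream task reuses.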
Developing and training our self-supervised representation learning system
Now that we have clarified the basics, let’s look at the development and training process of our new system. The training process for the Melange embedding system involves collecting customer session data and using it to generate embeddings.
An example of a customer journey during a browsing session may look as follows. The customer lands on our homepage and starts a search for 'TV Stands'. After being presented with a grid of results, they click on a particular product of interest and then, if they like the product, proceed through the checkout process:
We trained the Melange embedding model with a sequence learning approach, that is, by training a model to recognize, understand, and predict patterns or relationships within sequences of data. The primary goal of sequence learning is to create models that can accurately predict future values or events in a sequence based on previous data points. In our case, the model predicts the next type of page the customer will visit based on their previous interactions in the session.
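The next-page-type objective can be pictured as turning each session into (context, target) pairs. The sketch below only prepares such pairs; it is not the Melange model itself, and the page type names, helper names, and `max_context` parameter are assumptions made for illustration.

```python
def build_vocab(sessions):
    """Assign an integer id to each page type, in order of first appearance."""
    vocab = {}
    for session in sessions:
        for page_type in session:
            vocab.setdefault(page_type, len(vocab))
    return vocab

def next_page_pairs(session, vocab, max_context=5):
    """For each position i > 0, pair the preceding page ids with the page id at i."""
    ids = [vocab[page_type] for page_type in session]
    pairs = []
    for i in range(1, len(ids)):
        context = ids[max(0, i - max_context):i]
        pairs.append((context, ids[i]))
    return pairs

# Hypothetical sessions expressed as sequences of page types.
sessions = [
    ["homepage", "search", "product_grid", "product_page", "checkout"],
    ["homepage", "product_page", "homepage"],
]
vocab = build_vocab(sessions)
training_pairs = [pair for s in sessions for pair in next_page_pairs(s, vocab)]
# The first session yields ([0], 1), ([0, 1], 2), ([0, 1, 2], 3), ([0, 1, 2, 3], 4).
```

A sequence model would then be trained to predict each target id from its context, with no fraud labels involved at this stage.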
After training, we saved the embedding layers from the sequence model for embedding inference. With this model, we can now process customer-site interactions and convert them into a single vector that encodes essential behavioral information in a high-dimensional space. This vector, also known as an embedding, serves as a compact and meaningful representation of a customer's interaction history with the website. While typical embedding systems aim to reduce dimensionality, our primary goal here is to capture the underlying patterns and relationships between different interactions, even if the source space might not be as large as in other applications (e.g., millions of words in the vocabulary of a language).
In our case, the high-dimensional embedding space allows us to account for the complex combinations and relationships between different interactions, despite the lower cardinality of page types. This approach ensures that we can better understand and analyze customer behavior and use these vectors as features for downstream models (like fraud machine learning models).
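To show what "converting interactions into a single vector" can look like, here is a deliberately simple sketch that looks up a vector per page type and averages them. Mean pooling, the 4-dimensional table, and its values are assumptions; the actual Melange embedding layers come from the trained sequence model and are not reproduced here.

```python
# Hypothetical embedding table: one small vector per page type (invented values).
EMBEDDING_TABLE = {
    "homepage":     [0.1, 0.3, -0.2, 0.0],
    "search":       [0.4, -0.1, 0.2, 0.5],
    "product_page": [-0.3, 0.2, 0.6, 0.1],
    "checkout":     [0.2, 0.4, 0.2, -0.2],
}

def embed_session(page_types, table=EMBEDDING_TABLE):
    """Mean-pool the vectors of the known page types in a session into one vector."""
    vectors = [table[p] for p in page_types if p in table]
    if not vectors:
        raise ValueError("session contains no known page types")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

session_vector = embed_session(["homepage", "search", "product_page", "checkout"])
```

Note that plain averaging discards the order of the pages, which a sequence model retains; that is one reason the learned layers are more expressive than this sketch.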
We host a batch inference pipeline on Google Cloud Platform's Vertex AI Pipelines to generate customer embeddings.
The inference pipeline runs every hour and has two separate steps:
- Collect session information for the last three sessions of customers who interacted with the Wayfair website in the last hour
- Calculate the aggregated customer embeddings based on the collected customer sessions
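The two steps above can be sketched in plain Python as follows. This is not the Vertex pipeline code: the session record layout, the function names, and the use of a simple mean as the aggregation are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def collect_recent_sessions(sessions, now, lookback=timedelta(hours=1), per_customer=3):
    """Step 1: for customers active within `lookback`, keep their latest sessions."""
    active = {s["customer_id"] for s in sessions if now - lookback <= s["start"] <= now}
    collected = {}
    for s in sorted(sessions, key=lambda rec: rec["start"], reverse=True):
        if s["customer_id"] in active and s["start"] <= now:
            recent = collected.setdefault(s["customer_id"], [])
            if len(recent) < per_customer:
                recent.append(s)
    return collected

def aggregate_embeddings(collected, embed_session):
    """Step 2: average per-session vectors into one vector per customer (assumed mean)."""
    result = {}
    for customer_id, customer_sessions in collected.items():
        vectors = [embed_session(s) for s in customer_sessions]
        dim = len(vectors[0])
        result[customer_id] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return result

# Made-up records; each session carries a precomputed toy vector.
now = datetime(2023, 6, 30, 12, 0)
sessions = [
    {"customer_id": "A", "start": datetime(2023, 6, 30, 11, 30), "vector": [1.0, 0.0]},
    {"customer_id": "A", "start": datetime(2023, 6, 30, 9, 0), "vector": [0.0, 1.0]},
    {"customer_id": "A", "start": datetime(2023, 6, 29, 18, 0), "vector": [1.0, 1.0]},
    {"customer_id": "A", "start": datetime(2023, 6, 28, 18, 0), "vector": [5.0, 5.0]},
    {"customer_id": "B", "start": datetime(2023, 6, 30, 8, 0), "vector": [9.0, 9.0]},
]
collected = collect_recent_sessions(sessions, now)
customer_embeddings = aggregate_embeddings(collected, lambda s: s["vector"])
```

In this toy run only customer A was active in the last hour, so only A receives a refreshed embedding, built from A's three most recent sessions.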
The current use case for Melange embeddings is our fraud models, which take the customer embeddings generated by Melange as additional input features. See plot below.
Adding these embeddings as features led to significant improvements in fraud model performance. In some downstream tasks we saw relative improvements of up to 18%, as measured by the area under the precision-recall curve (PR-AUC). Note that our fraud domain includes a wide variety of downstream models that we haven't discussed in this blog post; these models address different aspects of fraud detection.
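For readers less familiar with the metric, the sketch below shows how a relative improvement in PR-AUC can be computed. It uses average precision, a common estimator of the area under the precision-recall curve, which is an assumption about the exact estimator; the labels and scores are toy values, not our fraud data or results.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives

def relative_improvement(baseline, candidate):
    """Relative change of `candidate` over `baseline`, e.g. 0.18 means 18%."""
    return (candidate - baseline) / baseline

# Toy example: 1 marks a fraudulent order, 0 a legitimate one.
labels = [1, 0, 1, 0, 0]
baseline_scores = [0.9, 0.8, 0.3, 0.4, 0.1]   # hypothetical model without embeddings
candidate_scores = [0.9, 0.5, 0.6, 0.4, 0.1]  # hypothetical model with embeddings

baseline_ap = average_precision(labels, baseline_scores)
candidate_ap = average_precision(labels, candidate_scores)
improvement = relative_improvement(baseline_ap, candidate_ap)
```

PR-AUC is a natural choice for fraud because positives are rare, and a relative figure makes results comparable across models with different baselines.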
Summary and Conclusions
In this blog post, we introduced Melange, a customer session embedding system that is based on self-supervised representation learning. Melange generates customer embeddings based on their session history. We described the architecture of Melange and the inference batch pipeline for generating customer embeddings, and how they can be used as external features to improve the performance of downstream models. We showed that the integration of Melange has already improved the performance of our fraud models.
In conclusion, Melange is an effective customer session embedding system that can be used to improve the performance of downstream models. By generating customer embeddings from session history, Melange captures the underlying patterns of customer behavior that are relevant for detecting fraud. The batch inference pipeline lets us generate these embeddings efficiently and feed them as external features into our downstream models.
There are several opportunities for further work with Melange. One area of interest is to apply Melange embeddings to new downstream tasks, such as churn prediction, customer segmentation, or personalization. We are also working towards enhancing our embedding system to achieve near-real-time performance. Another area of interest is to experiment with more sophisticated models, such as transformers or graph neural networks, to see how they perform on customer journey embeddings. Furthermore, a contrastive learning approach (an approach in representation learning that aims to create an embedding space where similar sample pairs are positioned closely together, while dissimilar pairs remain distant from one another) could further improve the performance of Melange embeddings. Overall, we believe that Melange has great potential for improving the performance of downstream machine learning models in various domains, and we look forward to exploring its possibilities further.
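As a rough illustration of the contrastive idea, the sketch below implements an InfoNCE-style loss for a single anchor. InfoNCE is one widely used contrastive objective and is chosen here only as an example; the post does not commit to a specific loss, and the vectors and temperature are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """-log softmax probability of the positive among the positive and the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: the loss is small.
loss_well_separated = info_nce_loss(anchor, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
# Positive far from the anchor, a negative close to it: the loss is large.
loss_poorly_separated = info_nce_loss(anchor, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
```

Minimizing such a loss pulls similar pairs together and pushes dissimilar ones apart, which matches the definition of contrastive learning given above.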
1 We see this also in one of our personalization models: