Wayfair Tech Blog

MARS: Transformer Networks for Sequential Recommendation

Three couches of varying styles, stitched together to form one

Customers’ tastes may change over a period of time. How can we leverage their browse history to make sure our recommendations reflect their most up-to-date preferences?

At Wayfair, we want to recommend the right products to customers so that they can find what they are looking for. However, customers may change their preferences over time, both based on external factors (e.g. customers may shift towards more premium materials if their purchasing power increases) and in-house factors (e.g. they may see a new kind of design/material/style on Wayfair that they really like, but were previously unaware of). Moreover, customers may look for similar characteristics (e.g. value, style) across different types of furniture that they buy. In this post, we will present a new Multi-headed Attention Recommender System (MARS) that uses sequential inputs to learn changing customer tastes, and hence provide recommendations that better match customers’ latest preferences.

Motivation for MARS Architecture

Wayfair’s catalog is constantly changing - and so are customer tastes. Moreover, when customers change their preferences for a certain type of furniture (e.g. different material, style or value), it is likely that their new preferences will carry over to other product types that they may order in the future. If we want our models to be able to adapt to customers’ changing preferences, we need to use a model architecture that can handle sequential input data. For example, a customer may, after browsing some types of bed (e.g. traditional wooden beds) decide they actually want a different style (e.g. modern metal/wood hybrid beds). However, because the bulk of their browse history was traditional beds, a model that does not take into account the sequence of their browse history will primarily recommend traditional beds, as it cannot distinguish more recent customer preferences.

Transformer networks are well suited to this problem. Because they handle sequential inputs via self-attention, they can learn changes in customer preferences. Unlike other sequential models such as recurrent neural networks, transformers can be parallelized and hence train much faster. MARS is a transformer network largely based on an open-source recommender model called SASRec.

The input to MARS is very simple - just a sequence of items that a customer has browsed. We do some cleaning to remove very rarely-encountered items, and also remove adjacent duplicate SKUs (so itemA → itemB → itemB → itemC → itemA becomes itemA → itemB → itemC → itemA).
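The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration rather than Wayfair's actual pipeline: the function names, the `min_count` threshold, and the choice to drop rare items before collapsing duplicates are all assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def find_rare_skus(sequences, min_count=5):
    """Return SKUs seen fewer than min_count times (threshold is hypothetical)."""
    counts = Counter(sku for seq in sequences for sku in seq)
    return {sku for sku, count in counts.items() if count < min_count}


def clean_sequence(skus, rare_skus=frozenset()):
    """Drop rare SKUs, then collapse runs of adjacent duplicate SKUs."""
    kept = [sku for sku in skus if sku not in rare_skus]
    # groupby yields one group per run of equal adjacent values.
    return [sku for sku, _ in groupby(kept)]


history = ["itemA", "itemB", "itemB", "itemC", "itemA"]
cleaned = clean_sequence(history)
print(cleaned)
```

Only adjacent repeats are collapsed, so the second visit to itemA at the end of the sequence is preserved.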

MARS (Fig. 1) uses self-attention to learn similarity between different products, which is stored in an item embedding. Similarly, the positional information (whether the product is viewed first, second, third etc.) is stored in a positional embedding. This has the same dimensionality as the item embedding so that the two embeddings can be added together. We could also concatenate them, but this would lead to much greater model complexity (and hence longer training time). The positional embeddings need not be learned: the original transformer paper used predefined positional embeddings, though subsequent work (e.g. BERT) uses learned positional embeddings, as MARS does. The summed embeddings are passed through self-attention layers and a final fully-connected layer with sigmoid activation, and the model is trained with binary cross-entropy loss. A residual connection after each attention block helps keep gradients stable during backpropagation. The final output is a list of scores for all items in the training set; these can be ranked to provide the top n recommendations. Note that the recommended SKUs may include both previously viewed items and new items in previously-unseen classes of products (e.g. desks and table lamps in Fig. 1).

An architecture diagram shows 9 distinct pieces of furniture, passed into a set of item embeddings plus positional embeddings. Those embeddings are passed to self-attention layers and a multilayer perceptron layer, and are output as 9 new distinct pieces of furniture which should be displayed next.
Fig. 1. MARS architecture. The input (ordered sequence of customer’s browsing history) is encoded with learned item and positional embeddings (here shown with hidden dimension = 4); these are summed and the result is passed through the self-attention layer(s) in the form of query (Q), key (K) and value (V) matrices. After each attention block, the output is passed through a multilayer perceptron (MLP) layer. The final output is a list of scores for all items in the training set; these can be ranked to provide the top n recommendations, as shown here. Layer normalization, dropout and nonlinear activation steps are omitted for clarity.
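To make the data flow in Fig. 1 concrete, here is a minimal NumPy sketch of a single forward pass with one attention block. It is not the MARS implementation: the weights are random stand-ins for learned parameters, the toy sizes and all variable names are hypothetical, and the causal mask (each position attends only to itself and earlier positions) is an assumption borrowed from SASRec-style models. As in the figure, layer normalization and dropout are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, seq_len, d = 50, 8, 4  # toy sizes; d = 4 mirrors Fig. 1

# Randomly initialized stand-ins for parameters that would be learned.
item_emb = rng.normal(scale=0.1, size=(n_items, d))
pos_emb = rng.normal(scale=0.1, size=(seq_len, d))
W_q, W_k, W_v, W_1, W_2 = (rng.normal(scale=0.1, size=(d, d)) for _ in range(5))
W_out = rng.normal(scale=0.1, size=(d, n_items))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def forward(seq):
    """seq: seq_len item ids, oldest first. Returns one score per catalog item."""
    x = item_emb[seq] + pos_emb                    # summed embeddings, (seq_len, d)
    q, k, v = x @ W_q, x @ W_k, x @ W_v            # query, key, value matrices
    logits = q @ k.T / np.sqrt(d)                  # scaled dot-product attention
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    logits = np.where(causal, logits, -np.inf)     # assumed causal mask
    h = x + softmax(logits) @ v                    # attention + residual connection
    h = h + np.maximum(h @ W_1, 0.0) @ W_2         # MLP (ReLU) + residual connection
    # Final fully-connected layer with sigmoid, applied to the latest position.
    return 1.0 / (1.0 + np.exp(-(h[-1] @ W_out)))


seq = rng.integers(0, n_items, size=seq_len)
scores = forward(seq)
top3 = np.argsort(-scores)[:3]  # ids of the 3 highest-scoring items
```

Ranking the scores with `np.argsort` corresponds to the "top n recommendations" step in the figure.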

To prevent overfitting, MARS uses standard dropout and L2-regularization methods. To keep the gradients stable during backpropagation, we use layer normalization and a residual connection.


1. MARS lift for substitutable recommendations by learning sequential behavior

We first train MARS on product views from just one class (here, beds), and compare it against a baseline matrix factorization method that uses the same training data as MARS, except that it is aggregated rather than sequential. Adding positional information could help the model adapt to changing customer preferences, as more recent views may be better indicators of a customer’s current preferences. In Fig. 2, we see that MARS improves recall (the proportion of targets that are successfully recommended) for the top 6 recommendations by 67% relative to the matrix factorization baseline. This suggests that the positional information is, in fact, very useful for making more relevant recommendations.

Graph of 3 recall curves based on number of candidate shows Mars with the best curve, the middle curve for baseline matrix factorization, and the lowest curve for bestsellers
Fig. 2. Hit rate/recall for top n recommendations for MARS and a baseline matrix factorization method (using the same input information as MARS, but without sequential awareness), showing considerable lift (e.g. 67% lift in recall for top 6 recommendations). This means that the top 6 recommendations from MARS correctly identify a bed that customers order in the future 50% of the time (vs. 30% for matrix factorization, and 13% when just showing the top selling items from the previous 60 days without any personalization).
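The 67% figure is a relative lift, which can be checked from the recall values quoted in the caption. The sketch below also shows one way recall at n might be computed; it assumes a single target (future order) per customer, and the function names and toy data are hypothetical.

```python
def recall_at_n(ranked_lists, targets, n):
    """Fraction of customers whose target item appears in their top-n list."""
    hits = sum(target in ranked[:n] for ranked, target in zip(ranked_lists, targets))
    return hits / len(targets)


def relative_lift(new, baseline):
    """Relative improvement of new over baseline (0.25 means +25%)."""
    return new / baseline - 1.0


# Toy example: two customers, one hit within the top 2 recommendations.
toy_recall = recall_at_n([["a", "b", "c"], ["d", "e", "f"]], ["b", "f"], n=2)

# Recall values quoted in Fig. 2 for the top 6 recommendations.
mars, matrix_factorization = 0.50, 0.30
lift = relative_lift(mars, matrix_factorization)
print(f"{lift:.0%}")
```

Going from 30% to 50% recall is a gain of 20 percentage points, or roughly 67% relative to the baseline.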

As a further check, we can look at the learned positional embeddings to see how they converged (Fig. 3). We can clearly see that positions that are nearer to each other have more similar embeddings. This is exactly what we want to see, as it shows that the sequence is important and relevant. The fuzziness of the diagonal line in Fig. 3 shows that the model is robust to slight perturbations of the sequence. This is desirable: if we switch two adjacent items in a long sequence, we should not expect vastly different recommendations. We can also see that the fuzziness decreases as the sequence approaches the present (position 100 is the most recently-viewed item). Sequences for customers who have viewed more than 100 items are truncated to the most recent 100; sequences with fewer than 100 views are 0-padded to 100.
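The truncation and padding rule can be written as a short helper. This sketch assumes padding is added on the left, so that the most recent view always sits in the final position, and that item id 0 is reserved for padding; the text states only that sequences are 0-padded, so both details are assumptions.

```python
def to_fixed_length(seq, max_len=100, pad_id=0):
    """Keep the most recent max_len items and left-pad shorter sequences."""
    recent = seq[-max_len:]
    return [pad_id] * (max_len - len(recent)) + recent


short = to_fixed_length([11, 12, 13], max_len=5)
long = to_fixed_length(list(range(1, 8)), max_len=5)
print(short, long)
```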

A heatmap of cosine similarity shows highest positive similarity on the diagonal and lowest negative similarity at the 80x20 positions
Fig. 3. Cosine similarity of the learned positional embeddings for each position from 0 to 100, where 100 represents the most recently-viewed item. Here, the positions that are closer in time are also more cosine-similar, suggesting that MARS is learning positional effects. The exact position is also less relevant for items viewed further in the past, as the diagonal line is slightly fuzzier/wider for position 0 than position 100.
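A heatmap like this one is built from the pairwise cosine similarity between rows of the positional embedding matrix. The sketch below shows that computation under stated assumptions: the embeddings are random stand-ins (so they will not reproduce the diagonal band that trained embeddings show), and the embedding width of 16 is arbitrary.

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T


rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(100, 16))  # stand-in for learned positional embeddings
sim = cosine_similarity_matrix(pos_emb)
```

Entry `sim[i, j]` is the similarity between positions i and j, so the diagonal is always 1 and the matrix is symmetric.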

2. Transferability of learned customer style preferences

Because customers commonly browse certain classes together, we can expect the learned item embeddings to reflect this. Indeed, in Fig. 4 we can see that commonly co-ordered classes (e.g. beds and nightstands) and co-viewed classes (e.g. sofas and sectionals) are clustered together, even though no class information is provided at training. However, within each class, are similar items (e.g. items of similar style) also being grouped together? In other words, we want to see whether MARS’ item embeddings can learn to connect not only similar items within the same class, but also similar items across unrelated classes.

Animated gif of the embedding space shows products as a cursor hovers over clusters of products. Shows that Sectionals, Futons, and sofas are clustered near each other, and sheet sets and bedding set clusters overlap.
Fig. 4. The learned item embeddings projected into 2D using UMAP (click here for full interactive figure). The learned item embeddings are dominated by the class signal, and similar classes (e.g. beds, nightstands, sheets) are grouped together. Within each class, there is some sense of similar styles being grouped together, though this is more clearly shown in Fig. 5.

Commonly co-viewed, or “substitutable”, classes (e.g. sofas and sectionals) have very similar item embeddings, so customer preferences are very easy to transfer between them. Here, however, MARS may well be learning more tangible features like color, shape, size and price instead of style. For commonly co-ordered, or “complementary”, classes (e.g. sofas and coffee tables) there are fewer common features like material, shape, or size to transfer, so if we can show that customers are browsing similar complementary items, this is much more likely to demonstrate style transfer.

However, because the class signal is so strong, it is hard to compare browsed items with random items if they are all from the same class, as the cosine similarity will be similarly high in both cases. To control for this, we can subtract the mean embedding for each class, which should remove the class signal and leave behind other item characteristics (e.g. style or value). After doing this, we can see in Fig. 5 that co-browsed items are much more likely to have similar characteristics, even when excluding the class signal, which strongly hints that MARS is learning transferable customer preferences.
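The demeaning step can be illustrated with a small NumPy sketch. The embeddings here are hand-built toy vectors, not MARS output: the first two dimensions play the role of a strong class signal and the third a weak "style" signal, purely for illustration. Function names are hypothetical.

```python
import numpy as np


def demean_by_class(item_emb, class_ids):
    """Subtract each class's mean embedding from the items in that class."""
    out = item_emb.astype(float).copy()
    for c in np.unique(class_ids):
        mask = class_ids == c
        out[mask] -= out[mask].mean(axis=0)
    return out


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy items: two classes, two "styles" (encoded in the last dimension).
raw = np.array([
    [10.0, 0.0, 1.0],   # class 0, style +1
    [10.0, 0.0, -1.0],  # class 0, style -1
    [0.0, 10.0, 1.0],   # class 1, style +1
    [0.0, 10.0, -1.0],  # class 1, style -1
])
class_ids = np.array([0, 0, 1, 1])
demeaned = demean_by_class(raw, class_ids)

same_class_raw = cosine(raw[0], raw[1])                 # high: class dominates
same_style_demeaned = cosine(demeaned[0], demeaned[2])  # high: shared style
diff_style_demeaned = cosine(demeaned[0], demeaned[1])  # low: opposite style
```

Before demeaning, two items from the same class look nearly identical regardless of style. After demeaning, similarity reflects the remaining signal, which is what allows co-browsed and random items to be compared fairly.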

Two cosine similarity graphs compare the browsed item to other items in the same class, complementary class, and random classes. graphs show that the cosine similarity is higher for other browsed items than random items, regardless of class.
Fig. 5. Cosine similarity of the demeaned embeddings is higher when comparing co-browsed items than when comparing with a random item from the same class (left); similarly, the cosine similarity for co-browsed complementary items is also higher than for a random item, either from the same complementary class or from a random class (right). This suggests that MARS can learn customer preferences that carry over to other browsed classes. These preferences may be style, value, or other item characteristics.


By using a transformer model, we are able to take a very simple input - a list of browsed items, with no other information - and greatly improve on the success rate of recommendations served up on the main browse pages for each class. For customers with browse history in that class, MARS increased recall for the top 6 recommendations by ~67% over the matrix factorization baseline in our experiment on beds. For customers without any browse history in that class, we have shown that MARS is able to learn transferable item characteristics (e.g. style) that carry across from previously browsed classes. Future iterations of this model will include more complex inputs, such as customer demographic information and product image embeddings, which should further improve our recommendation quality. MARS is a standout star of our recommendation models and certainly has a very bright future!