How to make deliveries faster without touching any packages
Wayfair’s product pages (see figure below) largely display information about the given product, but when looking closely, we also see some information inherent to the supply chain. In particular, we tell our customers how long it will likely take for their goods to arrive at their homes. This is important because customer satisfaction is greatest when deliveries are fast (speed) and the promised time is accurate (reliability).
In this situation, a great customer experience relies on two separate components: a supply chain that can deliver the goods quickly and reliably, and an accurate delivery promise, so that orders actually arrive within the promised timeframe.
Meeting this promised timeframe is important. By using machine learning, we improve our promised delivery dates, substantially boost the customer experience, and garner additional customers. And we can do all that without having to physically speed up our supply chain.
The increased satisfaction is directly reflected in key business metrics: on average, customers are more likely to purchase products that are promised to be delivered soon, so any reduction in promised times translates directly into higher sales. The challenge that comes with reducing the promised times is maintaining the same level of reliability: missed promises decrease customer satisfaction and measurably reduce customers' willingness to purchase from Wayfair again in the future.
One simple way to measure the speed of our deliveries and of our promises is to look at the average number of days from order placement to order delivery, and then compare that to the average number of days stated on the product pages. A straightforward way to track reliability is the “delivery rate”: the fraction of orders that arrive on or before the promised date. These two quantities (promises and reliability) can be traded against each other: adding padding time to the promise increases the stated days but also the delivery rate, and vice versa. In this regard, the prediction behaves similarly to a classification problem, where precision and recall can be traded against each other along a precision-recall curve. One way to optimize this problem is to aim for the fastest promise at a given fixed delivery rate. With that target in mind, it’s time to find a predictive model.
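The trade-off between padding and delivery rate can be illustrated with a small sketch. The numbers below are made up for illustration, and the function names are not taken from any Wayfair system.

```python
from statistics import mean

def delivery_rate(promised_days, actual_days):
    """Fraction of orders delivered on or before the promised day."""
    on_time = sum(a <= p for p, a in zip(promised_days, actual_days))
    return on_time / len(actual_days)

def mean_gap(promised_days, actual_days):
    """Average number of days by which the promise exceeds the actual delivery time."""
    return mean(p - a for p, a in zip(promised_days, actual_days))

# Toy data (invented): actual delivery times in days and an unpadded promise.
actual = [2, 3, 3, 4, 5, 6, 4, 3]
base_promise = [3, 3, 3, 4, 4, 5, 4, 4]

# Adding padding raises the delivery rate, but also the stated days.
for padding in (0, 1, 2):
    promised = [p + padding for p in base_promise]
    print(padding, delivery_rate(promised, actual), mean_gap(promised, actual))
```

In this toy data set, one day of padding is enough to make every order arrive on time, at the cost of promises that are on average one day slower than the actual deliveries.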
In this article I’ll describe what we had to do to cut the gap between promised and realized delivery times in half, while keeping deliveries just as reliable as before.
Predicting the delivery time
The previous method for predicting delivery time relied on composing the expected delivery time from a number of different components, such as the timeframe in which a supplier is contractually obligated to fulfill an order. The sum was adjusted by additional padding factors at several places to avoid late deliveries. But this method has two major drawbacks. First, splitting the prediction into a number of separate pieces makes it hard to account for correlations between these pieces. Second, the previous method relies strongly on hand-curated inputs, which turn out to be a serious bottleneck in keeping the algorithm up to date, considering the five-digit count of unique suppliers and warehouses we are working with. For this reason, we decided to explore options that would predict the delivery times end-to-end, going directly from the time the order is placed to the moment it is delivered. Running a single model also helps us achieve low-latency computations, an important factor when we run the model as a live service for inference.
Choosing a model
Research [1] (and our own experience) has shown that learning from tabular data is often well served by gradient boosted decision trees (GBTs). Combined with the fact that GBTs are easy to set up using one of several well-established packages, it was an easy choice to start exploration using this architecture. As the training objective, we use quantile regression, because a quantile loss aligns closely with our goal of achieving a given delivery rate: a model trained on the 90% quantile, for example, aims for a promise that roughly 90% of orders meet.
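To see why a quantile loss matches a delivery-rate target, here is a minimal sketch of the standard quantile ("pinball") loss in plain Python. The delivery times and the 0.9 target are invented for illustration and are not the values used in production.

```python
def pinball_loss(observed, predicted, alpha):
    """Mean quantile (pinball) loss for a target quantile alpha in (0, 1)."""
    total = 0.0
    for y, q in zip(observed, predicted):
        diff = y - q
        # A late delivery (y > q) costs alpha per day,
        # excess padding (y < q) costs (1 - alpha) per day.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(observed)

def best_constant_promise(days, alpha, candidates):
    """Constant promise (in days) with the lowest pinball loss."""
    return min(candidates, key=lambda q: pinball_loss(days, [q] * len(days), alpha))

# Invented delivery times in days for eleven orders.
delivery_days = [2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 9]

best = best_constant_promise(delivery_days, 0.9, range(1, 11))
on_time = sum(d <= best for d in delivery_days) / len(delivery_days)
print(best, on_time)
```

With these numbers, the loss is minimized by a six-day promise, which ten of the eleven orders (about 91%) meet. Minimizing the loss at quantile alpha therefore pushes the promise towards a delivery rate of roughly alpha.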
Among the available options for software tools, we ultimately decided on the catboost [2] package, as it provides three key strengths:
- Highly optimized compute performance, which is important for us to service the tens of millions of product views every day without undue delay.
- Excellent treatment of categorical variables, which can otherwise be problematic in GBTs. We use features such as the supplier-identifier as categorical features, so that the model can implicitly learn about processes internal to suppliers and carriers which aren’t directly visible to us.
- Well-chosen defaults and heuristics for many hyper-parameters. In our previous experience with catboost, these result in high-performing models with very little tuning, allowing us to reach a mature model all the sooner.
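A minimal sketch of such a setup with catboost might look as follows. The column names and the 0.9 quantile are assumptions made for illustration; the article does not list the real feature set or the target delivery rate. All other hyper-parameters are left at the package defaults.

```python
# Hypothetical categorical columns; the real feature set is not given in the article.
CATEGORICAL_FEATURES = ["supplier_id", "carrier_id"]
TARGET_QUANTILE = 0.9  # assumed value for illustration

def catboost_params(alpha=TARGET_QUANTILE):
    """Parameters for a quantile-regression GBT; everything else keeps catboost defaults."""
    return {
        "loss_function": f"Quantile:alpha={alpha}",
        "cat_features": CATEGORICAL_FEATURES,
        "verbose": False,
    }

def train(features, delivery_days, sample_weight=None):
    """Fit a CatBoostRegressor on a pandas DataFrame that contains the columns above."""
    # Imported lazily so the parameter sketch can be read without catboost installed.
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**catboost_params())
    model.fit(features, delivery_days, sample_weight=sample_weight)
    return model
```

Declaring the identifiers as `cat_features` lets catboost apply its own encoding of categorical values, so no manual one-hot encoding of thousands of supplier IDs is needed.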
Selecting suitable features represents a major part of the work. We started out with the inputs used by the previous system, which largely relies on look-ups for some highly granular features, such as the supplier identity, various warehouse IDs, or zip codes. While it is technically possible for a machine-learning model to abstract the important information from such identifiers, this representation hides a lot of the similarities between different orders.
Providing information that more explicitly connects similar orders is key, because it ultimately allows the model to generalize more effectively to orders that it has not seen before in every detail. In particular, two pieces of information had the biggest impact on model performance:
- The average time a supplier took to go from order to shipment over the recent past
- The physical distance between the supplier warehouse and the customer
These two features alone distill important knowledge about two major steps of the delivery process that we cannot directly observe, because they are undertaken by partnering companies. Reducing the reliance on high-cardinality categorical features in this way makes it easier for the model to handle newly onboarded suppliers, warehouses, and similar items. Nevertheless, the supplier identity remains an important feature, as it allows us to capture details about our suppliers' business processes that we cannot otherwise obtain.
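Both features are cheap to compute. The sketch below shows one possible way to do so, under assumptions that are not from the article: the distance is taken as the great-circle (haversine) distance between two coordinates, and "recent past" is simplified to the last `window` orders of a supplier.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def recent_mean_ship_days(ship_days, window=30):
    """Mean order-to-shipment time over the last `window` orders of one supplier.

    Returns None for a supplier without any history.
    """
    recent = ship_days[-window:]
    return sum(recent) / len(recent) if recent else None

# Example: a warehouse near Boston and a customer near Los Angeles.
print(great_circle_km(42.36, -71.06, 34.05, -118.24))
print(recent_mean_ship_days([5, 1, 2, 3], window=3))
```

Unlike a raw zip code or warehouse ID, both values are meaningful for a supplier or customer location the model has never seen before.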
Beyond these major features, additional information fed into the model consists largely of two types: 1) more details about the suppliers, such as the variability of their lead times, and 2) a number of temporal features used to mimic a forecast with a regression model. With the latter in particular, we try to capture known cyclical dependencies in delivery times. For example, if you order on a Friday, a quick delivery may not happen because the supplier is closed over the weekend. To include these effects, we prepare a number of features, including details of the order date (time of day, day of the week, etc.) and delivery impediments that we know in advance (e.g. a carrier not delivering on Sundays in the target area, or known supplier closures). The latter are particularly helpful for anticipating the effects of major holidays on the supply chain.
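As an illustration of such temporal features, the following sketch derives a few calendar values from the order timestamp. The feature names, the three-day horizon, and the assumption that suppliers are closed on weekends are all hypothetical.

```python
from datetime import date, datetime, timedelta

def temporal_features(order_time, closed_dates=frozenset(), horizon_days=3):
    """Calendar features for one order; names and horizon are illustrative."""
    following = [order_time.date() + timedelta(days=i)
                 for i in range(1, horizon_days + 1)]
    # Count upcoming days on which the supplier is assumed not to work:
    # weekends (Saturday = 5, Sunday = 6) and closures known in advance.
    closed = sum(d.weekday() >= 5 or d in closed_dates for d in following)
    return {
        "hour_of_day": order_time.hour,
        "day_of_week": order_time.weekday(),  # Monday = 0 ... Sunday = 6
        "closed_days_ahead": closed,
    }

friday_order = temporal_features(datetime(2024, 3, 1, 15, 30))
monday_order = temporal_features(datetime(2024, 3, 4, 9, 0))
print(friday_order)
print(monday_order)
```

An order placed on a Friday afternoon sees two closed days in its three-day horizon, while a Monday order sees none, which gives the model a direct handle on the weekend effect described above.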
When training the model, there is a notable tension between reactivity (i.e. picking up on new trends in the supply chain's performance) and stability (i.e. the capability to ignore short-term fluctuations). One concern is sales events around public holidays. By testing the model on historical data, we concluded that training data that includes last year's comparable holidays is instrumental to good performance during these events. However, to avoid having the model learn too much from last year's supply chain, we employ recency weighting. This places greater emphasis on more recent data, so that overall we obtain a fair compromise between a long lookback in the training data, which promotes stability, and a focus on recent data, which helps with reactivity.
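One common way to implement recency weighting is an exponential decay of the sample weights with the age of each order. The article does not state which weighting scheme is used, so both the exponential form and the 90-day half-life in this sketch are assumptions.

```python
def recency_weights(ages_in_days, half_life_days=90.0):
    """Exponentially decaying sample weights: the weight halves every `half_life_days`."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]

# Orders placed today, three months ago, six months ago and one year ago.
weights = recency_weights([0, 90, 180, 365])
print(weights)
```

Weights like these can be handed to the training routine as per-sample weights (catboost's `fit` accepts a `sample_weight` argument). Last year's holiday season then still contributes to training, but with far less influence than last month's orders.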
We also do a little bit more to squeeze some extra performance out of the model. Because of different business needs for different delivery speeds, it makes sense to allow different delivery rates for different speed groups.
The simplest solution would be to train separate models for the different quantiles and then use post-processing logic to stitch the predictions together as needed. The major concern here is that the additional processing could become problematic given our very tight compute latency constraints.
Ultimately, we chose to implement a loss function similar to the basic quantile regression discussed above, but with a separate quantile for each delivery speed group. Ideally, this separation would be defined via the prediction itself. However, making the goal of the training dependent on the result of the prediction comes with some difficulties. For that reason, we use the observed delivery times as a proxy, resulting in a fairly straightforward loss function. Although catboost has an interface for custom loss functions, this use case was a bit too esoteric for it, so we ultimately implemented it directly in the C++ source code. This was made possible by the open-source nature of the catboost package and has the advantage of high compute performance.
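The idea can be sketched in plain Python. This is not the C++ implementation mentioned above: the three-day cutoff, the two quantiles, and all function names are hypothetical and serve only to show how the observed delivery time selects the quantile for each sample.

```python
def alpha_for(observed_days, fast_cutoff_days=3, alpha_fast=0.95, alpha_slow=0.85):
    """Target quantile, chosen from the observed delivery time (the speed-group proxy)."""
    return alpha_fast if observed_days <= fast_cutoff_days else alpha_slow

def grouped_pinball_loss(observed, predicted):
    """Mean pinball loss where each sample uses the quantile of its speed group."""
    total = 0.0
    for y, q in zip(observed, predicted):
        alpha = alpha_for(y)
        diff = y - q
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(observed)

def grouped_pinball_gradient(observed, predicted):
    """Derivative of each sample's loss with respect to its prediction.

    At y == q the loss has a kink; the right-hand value (1 - alpha) is used there.
    """
    return [-alpha_for(y) if y > q else 1.0 - alpha_for(y)
            for y, q in zip(observed, predicted)]

# One fast order promised a day too late, one slow order promised a day too early.
print(grouped_pinball_loss([2, 6], [3, 5]))
print(grouped_pinball_gradient([2, 6], [3, 5]))
```

Because the quantile depends only on the observed value and not on the prediction, the loss stays piecewise linear in the prediction and its gradient is as simple as that of an ordinary quantile loss.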
With a good model in hand, we need to present the resulting delivery date estimates to our customers every time a product page is loaded. Because the predictions depend on details such as the customer’s address and the current time of day, we recalculate the expected time for each page load. It’s well known across the industry that long page load times drive site abandonment, and Wayfair is no exception to this effect. This is why all computations necessary to display customer-facing pages need to be highly optimized. Due to these time constraints, we are required to answer a request for a delivery promise in less than a few dozen milliseconds for 99.99% of requests.
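A requirement of this kind means that at most one request in ten thousand may exceed the time budget. A minimal check over a sample of measured latencies could look like the sketch below, where the 30 ms budget is a placeholder and not the actual figure.

```python
def meets_latency_budget(latencies_ms, budget_ms, allowed_slow_per_10k=1):
    """True if at most `allowed_slow_per_10k` in 10,000 requests exceed the budget.

    Integer arithmetic avoids rounding issues when working with 99.99%.
    """
    slow = sum(1 for t in latencies_ms if t > budget_ms)
    return slow * 10_000 <= allowed_slow_per_10k * len(latencies_ms)

# Invented measurements: 10,000 requests with one or two slow outliers.
one_outlier = [5] * 9_999 + [500]
two_outliers = [5] * 9_998 + [500, 500]

print(meets_latency_budget(one_outlier, budget_ms=30))
print(meets_latency_budget(two_outliers, budget_ms=30))
```

Such a tight tail requirement is dominated by rare slow requests rather than by the typical case, so the average latency says little about whether it is met.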
The core computation of the prediction with catboost easily fits into this budget, due to the high level of optimization in its code. However, there is a significant amount of overhead caused by feature look-ups, network round trips, etc. Bringing this overhead down required serious and innovative engineering to achieve the desired timing at high reliability.
We started the final pre-rollout A/B test in January. At a delivery rate comparable to that of the old model, we were able to reduce the gap between stated days and actuals by about half. The increased conversion rate and the corresponding increase in revenue more than pay for creating, maintaining, and running the model.
This success was not guaranteed and there are a few ingredients that appear in hindsight to be especially important to the success of the project.
Probably our first and most important insight was to drop the previous model's approach of splitting the prediction into a set of distinct parts and adding those up, in favor of a holistic end-to-end approach. This is better suited to machine learning because it gives the model the opportunity to learn about the hidden correlations between the different steps of the product's journey.
Ultimately, we needed to invest a large amount of time and brain-power into really understanding the details of our supply chain. Most advances we made when developing the model were not necessarily linked to intricate machine-learning techniques (we use a pretty standard GBT after all), but rather to insights about the supply chain and additional features that would reflect these details.
In the feature engineering work, we also invested time into finding features that would generalize well and highlight the similarity between product journeys. Technically, a machine-learning model should be able to pick up on information like the distance between customer and supplier, just by being provided with the correct zip-codes and enough training data. However, it turns out that learning converges faster and on smaller training sets when features are prepared to be highly relevant.
Last but not least, good communication with our business stakeholders and engineering partners was instrumental in getting all the tiny details and special cases right. This is the major final step that allowed us to turn a fairly academic prediction model into a real infrastructure component, powering Wayfair’s business.
The journey does not stop here. There are two avenues we would like to pursue in order to further elevate the promises to our customers.
The current prediction is specifically designed for small parcel (items that can be carried by a single person) and dropship (items shipped directly from supplier to customer) orders in the US market. With this big chunk of Wayfair’s business working well, it should be fairly straightforward to tackle additional scopes, like the European market or orders that originate from Wayfair-owned warehouses.
We may also try to improve the accuracy of our model by employing other machine learning methods. Some initial trials with transformer architectures (inspired by Uber's drive-time estimation) look promising in this regard.
With that I would like to extend a special thank you to everyone who was part of this effort including:
- Christoph (special thanks)
- Ilia, Sunil
- Florian, Tamas
- Eldar, Leo, Christian, Mahmoud, Artem
- Prasad, Rupesh, Sam and everyone else in FOPT