Wayfair has over nine million unique items in its catalog, and tens of thousands of new ones are added every day. These products come from 8,000 different suppliers who sometimes sell the exact same products -- two completely unrelated suppliers could be buying the exact same area rug from a manufacturer, for example, but giving it slightly different names and even pricing it differently. One of the challenges of maintaining our catalog is therefore the risk of product duplication. At Wayfair, we pride ourselves on giving our customers a great shopping experience, and seeing items that look almost-but-not-quite identical on the site is confusing and might seem unfair, particularly if they have different prices.
Further, many of our products are slight variants of the same core product. For example, our customers can select from tons of upholstery options for many of the sofas we sell. No one wants to see 50 nearly identical sofas on one web page when browsing Wayfair, so to streamline the browsing experience we want all these variants consolidated into single entities on the site.
Our procedure for finding duplicates and product variants in the catalog used to be slow and tedious: offshore contract workers used spreadsheets to compare the SKUs manually, which was time-consuming and not terribly accurate. Earlier this month, we released a shiny new internal tool, the Duplicates Review Tool, that gives our offshore workers a smooth user interface for identifying duplicate items and is based on a product matching algorithm that was developed using machine learning and computer vision.
The tool clearly lays out product details and photos, and the matching algorithm feeds pairs of SKUs to it directly. Contract workers then approve or reject the pairs, and those responses both help us improve the catalog and feed back into the deduplication algorithm so it will get even better with time.
The matching algorithm arose out of a major problem for merchandising operations: identifying duplicate and product variant SKUs in a catalog of nine million items. The Data Science - Operations team decided to address it, and built a machine-learning algorithm to improve the process.
The algorithm takes a number of product details into account for all items in the catalog:
- Part Numbers
For one class of products, wall art, the algorithm also compares product images: the embeddings of top-ranked product images from Wayfair’s visual search algorithm are compared using cosine similarity.
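As a rough illustration of the image comparison described above, here is a minimal cosine similarity sketch in plain Python. The function name, the zero-vector handling, and the tiny example vectors are assumptions made for illustration; they are not taken from Wayfair's visual search system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero embedding is treated as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings; real image embeddings are much longer.
image_a = [0.2, 0.1, 0.7, 0.0]
image_b = [0.1, 0.1, 0.8, 0.1]
print(round(cosine_similarity(image_a, image_b), 3))
```

Because cosine similarity depends only on the direction of the vectors, two embeddings that differ only in overall magnitude score as identical.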
To train the model, we developed our own training data by pulling pairs of known duplicates as “positive” duplicate examples. For the “negative” data set, we selected random pairs of products from within each product type (sofas, dining tables, etc.). These pairs were evenly sampled across all types of products so that the matching model generalizes well to Wayfair’s entire catalog.
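The negative sampling described above could look something like the following sketch. Everything here is hypothetical: the function name, the `(sku, product_type)` input format, and the choice to exclude known duplicates from the negatives are assumptions, and enumerating every pair within a type only works for small illustrative inputs.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_negative_pairs(products, pairs_per_type, known_duplicates=frozenset(), seed=0):
    """Draw the same number of random same-type pairs from every product type.

    products: iterable of (sku, product_type) tuples.
    known_duplicates: set of frozenset({sku_a, sku_b}) to keep out of the negatives
    (an assumption; the post does not say how this was handled).
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for sku, product_type in products:
        by_type[product_type].append(sku)

    negatives = []
    for product_type in sorted(by_type):
        candidates = [
            pair
            for pair in combinations(sorted(by_type[product_type]), 2)
            if frozenset(pair) not in known_duplicates
        ]
        # Even sampling: the same quota for every type, capped by availability.
        quota = min(pairs_per_type, len(candidates))
        negatives.extend(rng.sample(candidates, quota))
    return negatives


# Hypothetical toy catalog.
catalog = [("S1", "sofa"), ("S2", "sofa"), ("S3", "sofa"),
           ("T1", "dining table"), ("T2", "dining table"), ("T3", "dining table")]
print(sample_negative_pairs(catalog, pairs_per_type=2,
                            known_duplicates={frozenset(("S1", "S2"))}))
```

Pairing only within a product type keeps the negatives hard: a sofa paired with a dining table would be trivially easy to reject and would teach the model little.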
To measure how likely a pair of products is to be a match, we calculate a similarity metric between the two products for each of the product features above. For example, the percent difference between the products’ prices gives a metric of price similarity. The set of similarity metrics feeds into a Random Forest classifier, which provides a duplicate match score (between zero and one) for each product pair. Any pair with a score greater than 0.5 is sent to the Duplicates Review Tool to be evaluated by a person.
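To make the price metric and the review threshold concrete, here is a small sketch. The post does not define "percent difference" precisely, so the symmetric form below (difference relative to the mean of the two prices) is an assumption, as are the function names and the example scores. The classifier itself is not shown; the scores are simply made up to stand in for its output.

```python
def percent_difference(price_a, price_b):
    """Symmetric percent difference: |a - b| relative to the mean of a and b."""
    mean = (price_a + price_b) / 2
    if mean == 0:
        return 0.0  # assumption: two zero prices count as identical
    return 100 * abs(price_a - price_b) / mean


REVIEW_THRESHOLD = 0.5  # from the post: scores above 0.5 go to human review


def pairs_for_review(scored_pairs, threshold=REVIEW_THRESHOLD):
    """Keep only (sku_a, sku_b, score) triples whose score exceeds the threshold."""
    return [(a, b, score) for a, b, score in scored_pairs if score > threshold]


# Hypothetical match scores standing in for classifier output.
scored = [("SKU1", "SKU2", 0.91), ("SKU3", "SKU4", 0.50), ("SKU5", "SKU6", 0.12)]
print(percent_difference(90.0, 110.0))
print(pairs_for_review(scored))
```

Note that the comparison is strict, matching the "greater than 0.5" wording: a pair scoring exactly 0.5 is not sent for review.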
Implementation and What's Next
To handle the scale of the matching problem (naively, we’d have to perform 100 trillion product comparisons to catch all duplicates in the Wayfair catalog!), the matching is performed on Wayfair’s in-house Spark cluster. Each class of products is deduplicated individually, with job scheduling and data flow managed using Airflow, Airbnb’s open source scheduler. The pipeline cycles through the entire catalog about every three weeks, better than the once-a-month estimate we offered the merchandising team before we started.
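The arithmetic behind the scale problem, and the payoff from deduplicating each class separately, can be sketched as follows. The nine million figure comes from the catalog size given earlier; the split into 1,000 equally sized classes is purely hypothetical and only serves to show the effect. Counting each unordered pair once gives roughly 40 trillion comparisons, and counting ordered pairs gives about 81 trillion, the same order of magnitude as the rough figure quoted above.

```python
def unordered_pairs(n):
    """Number of distinct pairs among n items: n choose 2."""
    return n * (n - 1) // 2


def blocked_pairs(class_sizes):
    """Comparisons needed when items are only compared within their own class."""
    return sum(unordered_pairs(n) for n in class_sizes)


CATALOG_SIZE = 9_000_000  # catalog size stated in the post

# Hypothetical split: 1,000 product classes of 9,000 items each.
hypothetical_classes = [9_000] * 1_000

naive = unordered_pairs(CATALOG_SIZE)
blocked = blocked_pairs(hypothetical_classes)

print(f"naive comparisons:   {naive:,}")
print(f"blocked comparisons: {blocked:,}")
print(f"reduction factor:    {naive / blocked:,.0f}x")
```

Real product classes are unevenly sized, so the actual savings would differ, but the quadratic growth means that even a coarse split removes most of the work.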
Our offshore workers have had to be trained on using the tool, including on how to handle edge cases, but overall acceptance has been good and the process of removing duplicates from the catalog is much less laborious.
While we’re happy with this initial version of the project, we of course have some ideas on how to improve and expand it. We would like to improve the accuracy of the algorithm for certain challenging classes. Shower curtains, for example, are pretty much all the same size, weight, and price, leading to many false-positive matches.
It would also be great to incorporate other product information into the algorithm, such as product tags, descriptions, reviews, and images. This information should allow the matching algorithm to find even more duplicates and product variants, especially when other features are missing or incorrect.
Finally, this process would ideally happen much closer to suppliers as they add SKUs to the catalog, so that duplicates don’t make it into the system in the first place. But in the meantime, we no longer have to use spreadsheets to identify duplicates, and our ongoing categorization is building up a dataset that will help the model get better and better.
Interested in learning more about the ways that Wayfair incorporates computer vision into many areas of our business? Find us at CVPR! We'll be at booth 224 in the expo hall, and our data scientists and engineers will be there all week.