Wayfair releases WANDS, the largest and richest publicly available dataset for e-commerce product search relevance

Search relevance – the relationship between users’ queries and the products returned in search results – is one of the most important performance indicators for e-commerce storefronts. However, the sheer volume of data makes evaluating and improving search relevance difficult. To give just one example, the Wayfair catalog has millions of products, which makes it impractical to create relevance-focused evaluation datasets manually.

Earlier approaches to this problem mined user click logs to create datasets. Such logs fall short of capturing the many dimensions of customer interaction: they capture only a slice of user behavior and do not provide a complete set of candidates for annotation.

We are excited to announce a discriminative, reusable and fair human-labeled dataset for e-commerce scenarios. The Wayfair Annotation Dataset (WANDS) introduces an important cross-referencing step to the annotation process, which significantly increases dataset completeness.

The WANDS dataset includes details such as product title, product description, primary classes, product category hierarchy, various product attributes such as size and color, average customer ratings, and review numbers. It also contains the richest descriptions of the products and queries in the English language.

The experimental results indicate that our process significantly extends the scalability of human annotation efforts, in addition to being effective in evaluating and discriminating between different search models.

WANDS contains the largest number of relevance labels for query-product pairs. To our knowledge, WANDS is the biggest publicly available search relevance dataset in the e-commerce domain.

A break from the past

WANDS marks an important milestone in a series of efforts to improve search relevance. In prior years, Microsoft Bing and Sogou have released high-quality datasets to help drive search relevance. However, these datasets are not appropriate for evaluating product search relevance, since they target the ranking of web pages rather than products.

Datasets such as those released by Home Depot and Crowdflower contain relevance labels suited to e-commerce scenarios. However, WANDS is significantly larger. There’s another important difference – WANDS also includes the complete annotation guidelines, both to ensure reproducibility and to share best practices with future data collectors.

We stuck to a carefully developed set of tenets while developing WANDS. The first was that WANDS should be reusable and apply to a wide variety of systems. To keep it fair, we also made WANDS agnostic to the systems being evaluated.

It was also important that WANDS be discriminative, that is, able to tell apart the performance of different product search engines using metrics such as Normalized Discounted Cumulative Gain (NDCG), which account for both the relevance and the position of search results. Finally, we made a break from the past by prioritizing completeness over dataset size. Completeness is the property that, within a relevance dataset, all relevant documents for a given query are known.
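To make the role of NDCG concrete, here is a minimal sketch of one common formulation (linear gain with a log2 position discount). The article does not say which NDCG variant was used, and the gain values in the usage note are hypothetical. Note that the ideal ranking is built from all labeled products for the query, which is why completeness matters: a relevant product that was never judged cannot contribute to the ideal score.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg(ranked_gains: Sequence[float], all_gains: Sequence[float], k: int) -> float:
    """NDCG@k for one query.

    ranked_gains: gains of the products in the order a search engine returned them.
    all_gains:    gains of every labeled product for the query, in any order.
    """
    ideal = dcg(sorted(all_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

For example, with hypothetical gains `labels = [1.0, 1.0, 0.5, 0.0]`, an engine that returns the products in the order `[1.0, 1.0, 0.5, 0.0]` scores higher than one that returns `[0.0, 0.5, 1.0, 1.0]`, even though both retrieved the same set of products.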

An innovative annotation process

To annotate data, we started by drawing a stratified sample of search queries from a pool of historical customer queries stored in the e-commerce customer behavior logs. Specifically, we segmented search queries along several dimensions that are key indicators of customer behavior, such as:

On-site organic searches as compared to marketing-redirected searches
Searches that resulted in customer engagement (e.g., added products to cart) versus searches that didn’t result in customer engagement
Product popularity over the past two years
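
The segmentation above can be sketched as a simple stratified sampler. This is an illustration only: the field names (`source`, `engaged`, `popularity`) and the equal-allocation rule are assumptions, since the article does not describe how many queries were drawn from each segment.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(queries: List[Dict], per_stratum: int, seed: int = 0) -> List[Dict]:
    """Draw up to `per_stratum` queries from every combination of segment values."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for query in queries:
        # Hypothetical segment keys mirroring the three dimensions listed above.
        key = (query["source"], query["engaged"], query["popularity"])
        strata[key].append(query)

    sample = []
    for key in sorted(strata):  # deterministic iteration order
        bucket = strata[key]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample
```

Sampling within each segment, rather than from the log as a whole, keeps rare segments from being crowded out by the most frequent kind of query.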

To construct the product pool, we collected products that were relevant to one or more of the selected queries, including clearly relevant, clearly irrelevant and hard-to-determine almost-relevant products. We used several sources to mine such product-query pair information, such as alternative search algorithms and historical search log information. Each of these systems embeds a different approach to approximate relevant-product retrieval, which gave us more opportunities to increase our selection of almost-relevant products and mitigated bias towards any one system.

Once the query and product pools were constructed, we performed iterative product mining to identify the query-product pairs to be annotated. At this stage, we would send the selected pairs to human annotators for evaluation.
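
As a rough illustration of pooling candidates from several sources, the sketch below takes the top results from each system and records which systems surfaced each product. The retriever interface, the system names, and the `depth` cutoff are all hypothetical, and this is not a description of the pipeline actually used.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interface: a retriever maps a query string to a ranked list of product ids.
Retriever = Callable[[str], List[str]]


def build_pool(query: str, retrievers: Dict[str, Retriever], depth: int) -> Dict[str, Set[str]]:
    """Map each candidate product id to the set of systems that returned it in their top `depth`."""
    pool: Dict[str, Set[str]] = {}
    for name, retrieve in retrievers.items():
        for product_id in retrieve(query)[:depth]:
            pool.setdefault(product_id, set()).add(name)
    return pool
```

Products returned by only one system are often the interesting almost-relevant cases, and drawing on several systems keeps the pool from simply mirroring the ranking of whichever engine is later evaluated.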

However, given the size of our sample, judging every query-product pair would require over 60 million annotation judgments, so only a subset of pairs could be sent for annotation. To reduce the number of unjudged but relevant query-product pairs that such a subset leaves behind, we iteratively mined the entire product pool for unjudged but potentially relevant products for each query, a step we call cross-referencing. Three annotators then provided independent judgments on the selected query-product pairs, according to the annotation guidelines.
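
The scale of exhaustive annotation can be checked with quick arithmetic. The sketch below uses the 480 queries, 42,994 products and roughly 233K labels reported for the released dataset, and assumes that every pair would be judged by three annotators and that each label corresponds to one query-product pair. This is one reading that is consistent with the "over 60 million" figure, not a statement of how that figure was derived.

```python
NUM_QUERIES = 480          # reported for the released dataset
NUM_PRODUCTS = 42_994      # reported for the released dataset
NUM_LABELS = 233_000       # approximate ("233K") number of relevance labels
ANNOTATORS_PER_PAIR = 3    # assumption: every pair is judged three times

total_pairs = NUM_QUERIES * NUM_PRODUCTS
total_judgments = total_pairs * ANNOTATORS_PER_PAIR
labeled_fraction = NUM_LABELS / total_pairs

print(f"all query-product pairs:  {total_pairs:,}")
print(f"exhaustive judgments:     {total_judgments:,}")
print(f"share of pairs labeled:   {labeled_fraction:.1%}")
```

Under these assumptions the labeled pairs cover only about one percent of all possible pairs, which is why the choice of which pairs to judge matters so much for completeness.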

We measured the agreement between the human annotators using two objective quality metrics: Cohen’s Kappa and the overlap percentage of agreement (OPA). Both metrics are computed from the judgments the raters make. OPA describes how frequently annotators agree with each other, while Cohen’s Kappa additionally corrects for the agreement that would be expected by chance.
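
Both metrics are easy to compute for a pair of annotators. The sketch below is a minimal illustration: Cohen's Kappa is defined for two raters, so averaging it over all annotator pairs is an assumption about how three annotators might be handled, and the label names in the usage note are illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Hashable, Sequence

Labels = Sequence[Hashable]


def overlap_agreement(a: Labels, b: Labels) -> float:
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Labels, b: Labels) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = overlap_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_expected = sum(counts_a[l] * counts_b[l] for l in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_expected == 1:
        return 1.0  # convention: both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def mean_pairwise(metric: Callable[[Labels, Labels], float], ratings: Sequence[Labels]) -> float:
    """Average a two-rater metric over every pair of annotators."""
    pairs = list(combinations(ratings, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)
```

For instance, `mean_pairwise(cohens_kappa, [rater_1, rater_2, rater_3])` summarizes three lists of labels such as `"Exact"`, `"Partial"` and `"Irrelevant"` in a single number. A high overlap with a low kappa usually means one label dominates, so much of the agreement could have arisen by chance.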

Dataset performance

The main contribution of this paper is the WANDS dataset itself. We collected a total of 480 queries, 42,994 products, and 233K annotated query-product relevance labels. Table 1 shows a summary of WANDS relative to the Home Depot and Crowdflower datasets.

WANDS contains the largest number of relevance labels for query-product pairs. It also contains the richest descriptions of the products and queries in the English language. It includes details such as product title, product description, the primary classes that a product belongs to (e.g., chair), product category hierarchy, various product attributes such as size and color, average customer ratings, and review numbers.

The table and the graph below present a comparison of the mapping of labels across each dataset to a standardized set of scores for metric computation.

Conclusions

With the release of the WANDS dataset, we have made several contributions to the scientific community: making the dataset available in the public domain, introducing the annotation process and releasing the annotation guidelines we used so that the work can be reproduced, and sharing our proposal of cross-referencing as a way to improve dataset completeness while keeping the annotation problem tractable. To the best of our knowledge, WANDS is the largest search relevance dataset targeted at e-commerce applications.