Wayfair sells products in the USA (wayfair.com), Canada (wayfair.ca), UK (wayfair.co.uk), and Germany (wayfair.de), and works with manufacturers from several countries. Consequently, business is conducted in many languages. When suppliers submit product data in a language other than that of the market where the product is sold, we translate the content into the local language. For example, if a British supplier submits data in UK English and the product is sold in both the Wayfair UK and German stores, we translate the text to show it in German on wayfair.de.
Given our current international markets, we mainly translate product content from US English to Canadian French, UK English to German, and German to UK English. However, in the long run we might need to translate from and to multiple additional languages. Therefore, the relevance of product translations will only grow as the company expands to new markets.
Product translations are performed through a mix of human and machine translation. Content translated by humans is of guaranteed high quality, but naturally more expensive than content translated by machine. Our goal is to use Data Science to enable the scaling of product translations at Wayfair, maximizing savings by using more machine translation while minimizing the risk of showing bad translations on site. But how can we know if a machine translation is good or bad in a scalable and cost-efficient way? This article explains how we are tackling this problem by using a quality estimation model based on OpenKiwi to algorithmically assess machine translation quality.
Machine translation quality has been traditionally assessed at Wayfair in the following ways:
- Human judgment: a person reads the machine translation and provides a quality rating (e.g. a 1-4 quality score, 1 being the best and 4 the worst)
- Automated machine translation metrics: algorithmic scores that quantify how ‘close’ a machine translation is to a human reference translation. The Translation Edit Rate (TER) metric (Snover et al., 2006), explained in Fig. 1, is one such score and will be relevant throughout this post.
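As an illustration, a simplified TER can be computed as the word-level edit distance between a machine translation and a human reference, divided by the reference length. This sketch omits the block-shift edits that the full metric also counts:

```python
def ter(hypothesis, reference):
    """Simplified Translation Edit Rate: word-level edit distance
    (insertions, deletions, substitutions; block shifts omitted)
    divided by the number of reference words. Lower is better."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(hyp)][len(ref)] / len(ref)
```

For example, `ter("the house is small", "the house is very small")` needs one insertion against a five-word reference, giving a score of 0.2.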
The above quality assessment methods are very effective, but both require a human in the loop, either to directly provide a quality rating or to create a reference translation to compare against, which is time intensive (and costly). To assess quality in a scalable and efficient way, we need a system that can evaluate whether a machine translation is good or bad without a person being involved in the process. Luckily, the field of quality estimation comes to the rescue for this problem.
The goal of quality estimation is to evaluate the quality of a machine translation system without access to human input. Quality estimation can take place at several levels of granularity, and the following will be relevant for this blog post:
- Word-level: assigns quality labels (OK or BAD) to each token (i.e., words and gaps between words) in the machine translation.
- Sentence-level: predicts the quality of a full machine-translated sentence, usually by estimating an aggregate score such as TER or by inducing a sentence-level score from the word-level quality predictions.
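To make the two granularities concrete, the sketch below lays out the word-level positions (words plus the gaps between and around them) and shows one simple way a sentence-level score can be induced from word-level BAD probabilities. The helper names are our own illustration, not OpenKiwi’s API:

```python
def tokens_with_gaps(words):
    """Word-level QE labels both words and the gaps between/around
    them, so a 3-word translation yields 7 labeled positions:
    <gap> w1 <gap> w2 <gap> w3 <gap>."""
    positions = ["<gap>"]
    for word in words:
        positions += [word, "<gap>"]
    return positions

def sentence_score(bad_probs):
    """One simple way to induce a sentence-level score from the
    word-level predictions: the mean P(BAD) over all word and gap
    positions. Lower means better quality."""
    return sum(bad_probs) / len(bad_probs)
```

A two-word translation thus produces five labeled positions, and `sentence_score([0.25, 0.75])` returns 0.5.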
Quality estimation is a well-researched field, and the implementation we are using at Wayfair is OpenKiwi, a state-of-the-art open-source framework available as a PyTorch-based package.
Translation workflow improvement
Until recently, product translations at Wayfair followed an “all human” or “all machine” translation approach, or relied on simple business rules to decide which products were high priority and required high-quality human translations vs. low priority and could be left machine-translated. An example of a simple business rule is that highly visited products would always be translated by humans, whereas seldom-visited items would be translated by machine.
With the introduction of a translation quality estimation model that algorithmically tells us when machine translations are good or bad, we will be able to extend the use of machine translation in our catalog while minimizing the risk of showing low-quality translations on site. Given a product and the machine translation associated with it, the quality estimation model signals whether the translation is of good enough quality and ready to be shown on our store, or whether it falls short of the accepted standards and thus needs to be checked and corrected by a human.
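The resulting routing decision can be sketched as follows; the threshold value here is purely illustrative and would in practice be calibrated against human judgment:

```python
def route_translation(qe_score, threshold=0.3):
    """Route a machine translation based on its estimated quality.
    `qe_score` is a sentence-level badness estimate in [0, 1]
    (lower is better); the 0.3 threshold is purely illustrative.
    Returns the channel the translation should go through."""
    if qe_score <= threshold:
        return "publish"          # good enough to show on site
    return "human_post_edit"      # flag for human review and correction
```

Raising the threshold means more machine translations go live unreviewed (more savings, more risk); lowering it routes more content to humans.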
Introduction to quality estimation via OpenKiwi
The translation quality estimation model used at Wayfair is based on OpenKiwi, an open-source implementation of the winning systems of the Conference on Machine Translation (WMT) 2015-18 word- and sentence-level quality estimation tasks. The PyTorch-based package includes the following three systems, or submodels, all consisting of deep learning architectures:
- Quality Estimation from Scratch (QUETCH)
- Neural Quality Estimation (NuQE)
- Predictor-Estimator (Predest)
To learn more about OpenKiwi and these three submodels, check out its creators’ full paper (Kepler et al., 2019).
Below, we give a brief introduction to the architecture of QUETCH, based on Kepler et al. (2019) and Kreutzer et al. (2015), to further understand how a neural network tackles a word-level quality estimation task. QUETCH is a simple model consisting of a multilayer perceptron (MLP). The input features consist of the source sentences in the original language, their machine translations in the target language, and the source-translation word alignments for each sentence pair (we generate the alignments using the IBM Model 2 statistical aligner, trained on Wayfair data plus an external corpus, to find the correspondence between words in the source text and its translation). The predicted output for the word-level task consists of the OK/BAD labels for each token in the machine translation.
Given a source sentence and its machine translation, QUETCH’s deep learning architecture is illustrated in Fig. 5 and described below:
- Input layer: for each position in the target machine translation, a window around that position and a windowed representation of aligned words from the source text are concatenated and provided as input.
- Hidden layers
- Lookup-table layer: each of the words in the concatenated input is then represented by a pre-trained word vector in the lookup-table matrix M. All the corresponding word embeddings are then concatenated into a single vector. Matrix M is initialized with word2vec representations for all words in the vocabulary and continues to be optimized during training.
- Linear layer + non-linear transformation: a linear layer followed by a tanh non-linearity
- Output layer: scores OK/BAD probabilities for each token in the machine translation
The model is trained by optimizing the log-likelihood of the training data through back-propagation and stochastic gradient descent (Kreutzer et al., 2015).
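To make the architecture concrete, here is a minimal sketch of a QUETCH-style forward pass for a single target position. It is written in NumPy purely for illustration (the actual OpenKiwi implementation is in PyTorch), and all dimensions and weights below are made up rather than OpenKiwi’s real hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not OpenKiwi's actual hyperparameters.
vocab, emb_dim, window, hidden = 1000, 50, 3, 100
in_dim = 2 * window * emb_dim  # target window + aligned source window

# Lookup-table matrix M: in QUETCH it is initialized with word2vec
# vectors and fine-tuned during training; random here for illustration.
M = rng.normal(0, 0.1, (vocab, emb_dim))
W1 = rng.normal(0, 0.1, (in_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, 2));      b2 = np.zeros(2)

def quetch_forward(target_window_ids, source_window_ids):
    """Score one target position: look up and concatenate embeddings
    of the target window and the aligned source window, apply a
    linear layer with a tanh non-linearity, then a softmax over the
    two labels. Returns [P(OK), P(BAD)]."""
    x = np.concatenate([M[target_window_ids].ravel(),
                        M[source_window_ids].ravel()])
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Toy word ids for one target position and its aligned source window.
p = quetch_forward([4, 17, 9], [12, 3, 8])
```

Training would then adjust M, W1, b1, W2, and b2 by back-propagating the log-likelihood of the OK/BAD labels, as described above.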
NuQE (Kepler et al., 2019; Martins et al., 2016) has a similar architecture to QUETCH, as it also contains a lookup-table layer that assigns embeddings to target words and their aligned words in the source text. The word vectors are concatenated and then fed into a set of feed-forward and bi-directional Gated Recurrent Unit (GRU) layers. Finally, the output layer applies a softmax activation that estimates the OK/BAD probabilities.
Predictor-Estimator (Kepler et al., 2019; Kim et al., 2017) has a completely different architecture, and actually consists of two models: a predictor, trained to predict the target translation’s tokens given the source, and an estimator, which classifies each word in the machine translation as OK or BAD using features produced by the predictor. Both models are mainly based on a set of Long Short-Term Memory (LSTM) layers.
Using OpenKiwi at Wayfair
We trained QUETCH, NuQE and Predictor-Estimator on Wayfair data, which required us to extract and/or generate the features and labels listed in Fig. 6. We additionally built an ensemble model that averages the three submodels’ word-level predictions. However, sentence-level prediction is more relevant to us, as we want to flag potentially bad machine translations for human upgrades when needed. Hence, we induce the sentence-level score used for our decision making by averaging the word-level ensemble predictions.
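The ensembling and sentence-score induction described above amount to a simple averaging scheme. A minimal sketch, assuming each submodel outputs one P(BAD) per token position for a given sentence:

```python
def ensemble_sentence_score(word_preds_per_model):
    """Average the word-level P(BAD) predictions of the submodels
    (e.g. QUETCH, NuQE, Predictor-Estimator) position by position,
    then induce a sentence-level score as the mean over positions.
    `word_preds_per_model` is a list of equal-length lists, one per
    submodel. Lower is better."""
    n_models = len(word_preds_per_model)
    per_token = [sum(col) / n_models
                 for col in zip(*word_preds_per_model)]
    return sum(per_token) / len(per_token)
```

For instance, with three submodels predicting `[0.0, 1.0]`, `[0.5, 0.5]`, and `[1.0, 0.0]` over a two-position sentence, the induced sentence score is 0.5.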
There are limitations to our training data: the OpenKiwi submodels were designed for data based on the comparison between machine translations and their human post-edited versions, but most of the human translations available in Wayfair databases were created from scratch, without translators seeing a starting machine translation. This causes us to overestimate the number of bad translations, as the human gold standard we’re currently using may not always comparatively reflect the quality of a machine translation.
For example, given a source sentence, its machine and human translations may have no words in common without the machine translation actually being wrong. Moreover, human translators could add creative descriptions or changes, which would affect edit distance when comparing against the machine translation without necessarily reflecting machine translation quality.
Wayfair translations have already started migrating towards a machine translation post-edit workflow, in which translators first see a starting machine translation and then perform the minimum number of edits needed to transform it into an understandable and fluent sentence. This will improve our training data and help us overcome the bias that is currently present. Additionally, while we wait for a larger number of post-edited translations to become available for model training, we regularly calibrate the current quality predictions against human judgment and maintain a set of risk cases so we can be aggressive or conservative as needed in the trade-off between savings and quality.
So far, we have implemented the first iteration of a translation quality estimation model based on OpenKiwi. Natural next steps will be to improve model predictions as more post-edited data becomes available, to extend the quality estimation model to future language pairs, and to continue researching further quality estimation methodologies that will help us exploit machine translation while protecting the quality of the content on our website.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation.
Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, and André Martins. 2019. OpenKiwi: An Open Source Framework for Quality Estimation.
Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. 2015. QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation.
André F. T. Martins, Ramón Astudillo, Chris Hokamp, and Fabio Kepler. 2016. Unbabel’s Participation in the WMT16 Word-Level Translation Quality Estimation Shared Task.
Hyun Kim, Jong-Hyeok Lee, and Seung-Hoon Na. 2017. Predictor-Estimator using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation.