Wayfair increased its Deploy Frequency - the rate at which we ship new functionality to customers - by 490% between January 2019 and June 2021.
What is Deploy Frequency, and why do we care about it?
At Wayfair, we want to deliver new functionality to customers fast. Increasing Deploy Frequency - the rate at which we ship new functionality to customers - provides more opportunities for experimentation and feedback. More experimentation allows us to learn what customers find valuable more quickly.
Deploy Frequency is also one of five key DevOps metrics popularized in Google’s 2019 State of DevOps Report. The research behind the report has established that higher Deploy Frequency positively influences key business metrics, including profitability and customer satisfaction.
What was slowing us down?
Throughout much of 2018, the typical engineer deployed code to production only once a month. At this time, the majority of Wayfair’s software comprised a single monolithic application. Early in our growth, a monolithic architecture allowed us to move quickly by providing paved pathways for all common use cases. These paved pathways kept engineers focused on solving customer problems instead of logging, data access, caching, security, messaging, and other technical requirements.
During this time, our engineering team grew quickly. Unfortunately, the rapid growth put tremendous strain on our ability to deploy changes to production. Our codebase, development environment, and deploy processes didn’t evolve fast enough to keep up with the growing number of engineers. As a result, friction in the development cycle increased enough that Deploy Frequency per engineer decreased across 2017 and 2018.
Because our team kept growing, total Deploy Frequency continued to increase, effectively masking the decline in individual productivity. To handle the ever-increasing volume, we deployed pull requests in batches using an in-house tool called Integrator.
While batching allowed us to push more changes to production, we were still limited to a single deploy pipeline. Batches had to be built and deployed sequentially to ensure each set of changes compiled and didn’t inadvertently add or remove something from production. Preparing and deploying a batch was quite time-consuming due to the size of the monolith. Consequently, we were limited to about ten batches per day.
Less than 4% of changes deployed to the monolith are defective and require reverting off the batch, but larger batches have a greater likelihood of containing at least one defective change. As our change volume increased, so did our batch sizes. At one point, as many as one in four batches included a defective change. Identifying which change introduced a defect is a time-consuming manual process whose cost grows with batch size; larger batches contain more changes to review when a regression is introduced. Deployments are paused while the defective change is identified and reverted.
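A quick back-of-the-envelope calculation shows how batch size drives failure likelihood. Using the roughly 4% per-change defect rate above and assuming (for illustration) that defects occur independently:

```python
def batch_defect_probability(batch_size: int, p: float = 0.04) -> float:
    """Probability a batch contains at least one defective change,
    assuming each change is independently defective with probability p."""
    return 1 - (1 - p) ** batch_size

# At ~4% per change, a batch of 7 already has about a 25% chance of
# containing a defect -- roughly the "one in four batches" we observed.
for size in (1, 7, 15):
    print(f"batch of {size:2d}: {batch_defect_probability(size):.0%}")
```

The batch sizes here are illustrative, but the shape of the curve is the point: modest growth in batch size compounds quickly into frequent defective batches.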
We consider it a change failure when a change introduces a regression to production that requires immediate resolution. If a defective change is not detected until its batch has reached production, we roll back to the previous healthy deployment, effectively removing the batch, and every change on it, from production. We call the healthy changes removed in such a roll-back "collateral damage" and include them in our change failure count. Although we have reduced collateral damage by 75% since then, it remains our largest source of change failures.
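The counting rule above can be made concrete with a small sketch. This is illustrative only, not Wayfair's actual tooling; the `Change` type and `change_failures` function are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Change:
    id: str
    defective: bool = False

def change_failures(batches):
    """Return (defective, collateral) counts across all batches.

    A batch containing any defective change is rolled back, so its
    healthy changes are removed from production too and counted as
    collateral damage.
    """
    defective = collateral = 0
    for batch in batches:
        bad = sum(1 for change in batch if change.defective)
        if bad:
            defective += bad
            collateral += len(batch) - bad
    return defective, collateral
```

For example, a three-change batch with one defect contributes one defective change and two collateral-damage failures, which is why larger batches amplify the change failure rate.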
Enabling separate deployments
The situation described above is a purposeful simplification. Integrator had many optimizations to improve Deploy Frequency. However, none of those could mitigate a fundamental limitation of a monolithic architecture. There’s only one pipeline to production, the pipeline has limited capacity, and a single defective change impacts everyone.
We wanted multiple pipelines to production. To do that, we needed to allow teams to create separately deployable artifacts. The first step on that journey was to enable standalone services that could be developed and deployed independently. We also wanted to retain some of the benefits of the monolith, namely providing paved pathways for everyday use cases that minimize the time engineers spend on infrastructure concerns. To facilitate this, our platform teams provided libraries that took care of the heavy lifting for data access, logging, caching, and other technical requirements.
Despite the initial set of supported use cases being relatively small, the offering received strong support. As a result, Deploy Frequency more than doubled during 2019 from 1.1 to 2.5 deployments per engineer per month.
Making separate deployments easy
Entering 2020, we had early signs that something exciting was happening, but creating new services was still complex and time-consuming. We had templates to generate service code, but it took weeks to go from code to having even a simple application running in production.
At one point, we measured more than 35 manual steps required to deploy an application to production, from creating a code repository to setting up SSL certificates to establishing ownership of the new application for compliance. Engineers who had done this a few times could sometimes complete the steps in a few days, but it took 4-6 weeks for most people to go through it the first time. Sometimes the process took this long even for engineers who had been through it before, if they had to wait on infrastructure tickets or took even half a step off the paved path.
We solved this problem by creating Mamba - our in-house application scaffolding tool. We created Mamba to get new applications to production as fast as possible by automating the manual steps. When it launched, we automated just a few steps, but the improvement was palpable. Today, new applications are scaffolded with everything they need for development and deployed to production in less than the time it takes to get a cup of coffee. We now scaffold more than 100 new services every month.
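We can't reproduce Mamba's internals here, but the underlying pattern, replacing a manual checklist with an ordered pipeline of automated steps, can be sketched. Every name below (`scaffold`, the step functions, the example hostnames) is illustrative, not Wayfair's actual API:

```python
from typing import Callable, Dict, List

Step = Callable[[Dict[str, str]], None]

# Each function automates one formerly manual task from the checklist.
def create_repository(ctx: Dict[str, str]) -> None:
    ctx["repo"] = f"git.example.com/{ctx['name']}"  # hypothetical host

def provision_certificates(ctx: Dict[str, str]) -> None:
    ctx["tls"] = f"{ctx['name']}.internal.example.com"

def register_ownership(ctx: Dict[str, str]) -> None:
    ctx["owner"] = ctx["team"]  # record owning team for compliance

def scaffold(name: str, team: str, steps: List[Step]) -> Dict[str, str]:
    """Run every automated step in order; no tickets, no waiting."""
    ctx = {"name": name, "team": team}
    for step in steps:
        step(ctx)
    return ctx

app = scaffold("checkout-service", "payments",
               [create_repository, provision_certificates, register_ownership])
```

The key design property is that each step is automated and ordered, so adding automation for one more manual task shortens everyone's path to production without changing the workflow.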
In the six months between Mamba’s launch and our next large-scale improvement, Deploy Frequency increased by 40% from 2.5 deployments per engineer per month to 3.5. We also saw a 130% increase in the number of services running in production.
Making it hard not to deploy
At this point in our journey, our teams could easily create new services, and they were! Two years after the introduction of Kubernetes, we were creating 100 new services every month. However, deploying those services was still a cumbersome, multi-step, manual process. The friction in the process meant teams were still batching multiple changes into each deployment, negatively impacting their change lead time and increasing deployment risk.
We set out to remove as much friction in the process as possible and make continuous deployment the default option. Continuous deployment is a release process that automatically deploys every code change to production, provided it passes all stages of the build and deploy pipeline. It is desirable because it mandates very low friction in the build and deploy process and promotes frequent delivery of small changes. Small changes reduce deployment risk and can provide dramatic improvements to Change Lead Time.
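Stripped to its essentials, continuous deployment is a simple control flow: every merged change runs through the pipeline stages, and a pass means an automatic deploy with no manual approval gate. A minimal sketch, with illustrative names:

```python
def run_pipeline(change, stages) -> bool:
    """Return True only if every stage (build, test, scan, ...) passes."""
    return all(stage(change) for stage in stages)

def on_merge(change, stages, deploy) -> str:
    """Deploy automatically when the pipeline passes; never ask a human."""
    if run_pipeline(change, stages):
        deploy(change)
        return "deployed"
    return "blocked"
```

Because the only gate is the pipeline itself, the pipeline must be fast and trustworthy, which is exactly the "very low friction" property that makes shipping many small changes practical.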
Continuous Deployment proved to be a tremendous catalyst for unprecedented growth in Deploy Frequency. In less than a year, deployments increased 65%, from 3.5 per engineer per month to 6.5. Continuous Deployment has had other benefits as well. For example, teams practicing it take half the time to review and deploy a pull request compared with teams working in our monolith.
We believe there are still opportunities for us to increase our velocity further. For example, we’re adding additional use cases and capabilities to our services platform and enabling the last teams to migrate away from the monolith. Our migration to the cloud also opens up a broad range of exciting opportunities.
If you want to be part of a team building innovative and exciting solutions, head over to our Careers page to see our open roles.