In earlier installments we covered how our cloud journey started, talked about testing and validation, and discussed the internal processes that positioned us to iterate rapidly and deliver predictably.
It’s now time to talk about the tech, validation in production, and our successful Way Day 2019 event!
We had a lot of exciting projects focused on improving internal automation. These projects reduced the time to spin up a machine and get it ready to serve traffic from days to tens of minutes.
However, if we wanted to use the elasticity of the cloud and keep costs in line, we needed more automation and a comprehensive strategy for our next-gen provisioning, configuration, and credential-management platform.
It was clear we needed to push the envelope and start more aggressively adopting parts of the newer stack we used for our first cloud deployments. The team worked tirelessly to define the desired state. Finding the sweet spot between “lift and shift” and a full re-platforming was one of the biggest challenges.
How could we implement as many improvements as possible without risking the project timeline? After careful consideration, we made auto-scaling a core requirement for the majority of our stateless systems.
In parallel, to aid the process and increase the overall level of automation, we also made the following improvements:
- Expanded adoption of our service discovery and membership systems to reduce management overhead.
- Removed all external dependencies on boot, a crucial property for environments that constantly scale up and down.
- Switched to immutable, preconfigured images for everything stateless. This speeds up system startup, which must be short (under a few minutes) for efficient use of an auto-scaler. It also removes configuration management systems from the critical path during scaling events.
- Implemented auto-scaling for the configuration management systems themselves, for cost-effective management of stateful machines.
- Redesigned clustered systems that relied on IP mobility (shared or floating IP addresses) to instead use HashiCorp Consul and its DNS interface.
- Updated credential management flows to support auto-provisioning of secrets (passwords, TLS keys, etc.) and bootstrapped migration efforts to HashiCorp Vault.
- Updated our deploy systems to introduce a new path for deploying code on boot, in addition to the existing one. This removes the dependency on deploy systems during scaling events and improves engineers' quality of sleep.
- Reached a 95%+ automation level with Terraform; aside from rare common-sense exceptions, no manual provisioning was used.
- Integrated Stackdriver Logging with our core Elasticsearch-based logstream.
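The scaling math behind all of this is conceptually simple: keep average utilization near a target by resizing the group proportionally, which is why short startup times matter so much. Here is a minimal, hypothetical sketch of a target-tracking decision (illustrative only — the function name, thresholds, and bounds are assumptions, not our production autoscaler):

```python
import math

def desired_instances(current_instances: int,
                      avg_utilization: float,
                      target_utilization: float = 0.65,
                      min_instances: int = 2,
                      max_instances: int = 100) -> int:
    """Target-tracking scaling decision: resize the group so that
    average utilization moves toward the target, clamped to bounds."""
    if current_instances <= 0:
        return min_instances
    desired = math.ceil(current_instances * avg_utilization / target_utilization)
    return max(min_instances, min(max_instances, desired))

# Scale out: 10 instances running at 90% against a 65% target
print(desired_instances(10, 0.90))  # -> 14
# Scale in: 10 instances running at 30%
print(desired_instances(10, 0.30))  # -> 5
```

If new instances take an hour to boot, a loop like this is useless during a traffic burst; with preconfigured images booting in minutes, it becomes practical.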
As we focused on integrating Terraform and GCP with our flows, we realized we had many opportunities to enable the functionality we needed and contribute to the capabilities of Terraform, Puppet, and Vault at the time.
We were excited to share some of this work with the broader open source community:
- Rolling update support for GCP Instance Group Manager (IGM) and Regional Instance Group Manager (rIGM) (HashiCorp Terraform)
- Google Compute Engine (GCE) service account data source (HashiCorp Terraform)
- Google Cloud Storage (GCS) service account data source (HashiCorp Terraform)
- Google project data source (HashiCorp Terraform)
- GCS notifications resource (HashiCorp Terraform)
- GCS default object ACL resource (HashiCorp Terraform)
- Distribution policy for rIGMs (HashiCorp Terraform)
- Custom certificate extension as Certificate Authentication Constraint (HashiCorp Vault)
Validation and Integration
When the new systems started coming online and passing internal QA, the team was excited to try out our new traffic rebalancing feature.
This new functionality allowed us to move any percentage of traffic from any datacenter to the new GCP locations being validated. When a customer hit one of our DCs, we’d dynamically determine whether to reroute them to a different location by integrating with our CDNs, while ensuring persistence of the customer’s session.
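One common way to get both a precise traffic percentage and session persistence is to hash a stable session identifier into a bucket, so the same customer always gets the same routing decision during a rollout. A minimal sketch of that idea (the names and the two-location model are illustrative assumptions, not our production routing code):

```python
import hashlib

def route_for_session(session_id: str, gcp_fraction: float) -> str:
    """Deterministically send a fixed fraction of sessions to the new
    location. Hashing the session ID keeps each customer's routing
    decision stable across requests (session persistence)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "gcp" if bucket < gcp_fraction else "legacy-dc"

# The same session always gets the same answer for a given rollout level:
assert route_for_session("sess-42", 0.10) == route_for_session("sess-42", 0.10)
```

Dialing `gcp_fraction` up or down moves a predictable slice of sessions without flapping individual customers between locations.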
As covered in previous installments, per-component validation allowed us, at the cost of increased latency or page load time, to reduce the cycles spent troubleshooting more complex website traffic switchovers. It’s how we validated all funnels end to end and built a basic understanding of the performance improvements we needed to make.
Performance is always tricky
Website performance can only be as good as your underlying hardware and the code it's running. Caching can make dramatic improvements, and our systems rely on multiple layers of cache.
It could be a local cache on the application servers, databases, or search engines, or a centralized remote cache. While some caches can be pre-populated, the rest are filled by so-called “lazy” population from natural traffic (i.e. as customers browse the website, the systems populate the cache with the products they interact with). It sounds a bit backwards, but the more traffic you serve, the closer page performance gets to its ideal state as measured by our SLOs. When caches are cold, we need to be mindful of the customer experience.
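The “lazy” population described above is the classic cache-aside pattern: the first request for an item pays the full backend cost, and every subsequent request is served from cache. A generic sketch (not our actual caching code, and a real deployment would add TTLs and eviction):

```python
from typing import Any, Callable, Dict

def cache_aside_get(key: str,
                    cache: Dict[str, Any],
                    load_from_origin: Callable[[str], Any]) -> Any:
    """Lazy ("cache-aside") population: serve from cache when warm,
    otherwise pay the origin cost once and store the result."""
    if key in cache:
        return cache[key]          # warm path: fast
    value = load_from_origin(key)  # cold path: slow, hits the backend
    cache[key] = value
    return value

cache: Dict[str, Any] = {}
origin_calls = []

def fetch_product(pid: str) -> Dict[str, str]:
    origin_calls.append(pid)       # stand-in for a DB or search query
    return {"id": pid, "name": f"product {pid}"}

cache_aside_get("sku-1", cache, fetch_product)  # cold: hits origin
cache_aside_get("sku-1", cache, fetch_product)  # warm: served from cache
print(len(origin_calls))  # -> 1
```

This is why cold caches hurt: after a traffic switchover, every key is on the slow path until natural traffic warms the cache back up.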
In collaboration with our Product and Analytics teams, every validation event was analyzed for UX impact by measuring the number of pages visited, time spent on the site, conversion rate and many other metrics.
Understanding the impact, or lack thereof, helped us decide how fast we wanted to go, when to schedule validation events, and how successful each outcome was.
After dozens of rounds of validation, we finally reached a state where we were ready to start switching larger portions of our global traffic to the new DCs and updating operational procedures for our global engineering teams.
All of this work helped us build confidence as we got closer to our biggest event of the year!
Way Day 2019
A tremendous amount of teamwork had finally paid off, and the teams were eager for the event to start!
While the level of confidence was high, the stakes were even higher.
What if we had missed something? What if the system didn’t perform up to spec with more live traffic than we had been able to validate with? What if something else happened … ?
The teams started gathering in the Command Room and connecting to virtual bridges for communicating with our extended global teams and Google partners. The incident command was in place and everybody had a healthy level of anxiety, excitement and determination while waiting for our biggest sales day of the year.
Engineers were watching the charts, investigating even the smallest anomalies. The atmosphere was great: everybody was excited to see our new systems reacting to traffic, adding and removing capacity as needed. It was a big step for Wayfair, a large company that had operated predominantly in static private data centers for years.
As we got closer to our peak hours and handled new record levels of traffic, we worked closely with Google's team to make sure there was enough capacity in our zones and cells to handle the bursts. Even in the cloud, at a certain scale, you have to be mindful and collaborate with your partners on predicting demand, including DR capacity, to ensure it is available when needed.
Overall, everything worked as it should! There were more “firsts” for us than we could count, which was a huge relief and a source of pride for all Wayfairians involved!
We. Did. It. We went through the biggest sale of the year with a vastly upgraded stack for running our monolithic apps, ensuring outstanding customer experience.
Stay tuned for more exciting stories from Wayfair Engineering!
Want to work at Wayfair? Head over to our Careers page to see our open roles. Be sure to follow us on Twitter, Instagram, Facebook and LinkedIn to get a look at life at Wayfair and see what it’s like to be part of our growing team.