Have you ever wondered how engineering teams tackle the challenge of ensuring reliability and performance amid increasing site traffic? If so, I hope the information in this four-part series will be of interest to you.
At Wayfair, we’ve seen a tremendous amount of growth, increasing our revenue and traffic exponentially in the past ten years. We operate at a massive scale, with shoppers visiting the Wayfair site and app more than 4 billion times in 2020. With more than 71% of orders coming from repeat customers, we’ve built a seamless, dependable online shopping experience that keeps customers returning to our site time after time.
As a member of our Infrastructure organization and the lead architect of our Storefront-to-cloud migration journey, I’d like to provide some “behind-the-curtain” insight into organizational and technical aspects of one of our most complex infrastructure projects. The decision to scale up our cloud adoption was made in 2017, so let’s start from there…
How did it start?
The majority of Wayfair Storefront systems were located in our private (“colo”) data centers, running predominantly monolithic PHP applications.
Being in a private data center typically means you procure hardware well in advance and use it for years to come. The primary focus is on reducing both the operational costs of deploying the systems and the time it takes to provision a system.
In parallel, we were exploring options for elastic compute in public cloud environments for some specialized systems. Limiting the exploration to those systems let us narrow the research to components of immediate interest, gather data, and validate system designs faster.
After researching three of the most popular public cloud offerings, we chose Google Cloud Platform (GCP) based on a combination of business factors and technical requirements.
We felt good about the way GCP networking and storage components were implemented. GCP’s combination of simplicity, performance and scalability impressed us and best suited our needs.
Our first cloud component
In 2017, one of the systems that required excessive maintenance and had reliability issues was our Image Processor.
It’s a system that performs image manipulations such as resizing and quality adjustments. One of its components, the Tier 2 (T2) cache for already-resized images, kept running out of space as we scaled up our media content. After outgrowing NFS and Riak object storage, we decided to move the T2 cache alone to Google Cloud Storage (GCS), improving scalability, reducing maintenance costs, and putting it closer to our content delivery networks (CDNs).
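To make the role of a cache for already-resized images concrete, here is a minimal Python sketch of a read-through cache. It is illustrative only and not the actual Image Processor code: the names `cache_key` and `ResizedImageCache` are hypothetical, the key layout is an assumption, and an in-memory dictionary stands in for an object store such as a GCS bucket.

```python
import hashlib
from typing import Callable, Dict


def cache_key(image_id: str, width: int, height: int, quality: int) -> str:
    """Build a deterministic object name for one resized variant."""
    raw = f"{image_id}:{width}x{height}:q{quality}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ResizedImageCache:
    """Read-through cache for resized images.

    The dictionary below is a stand-in for an object store bucket;
    a real deployment would read and write objects over the network.
    """

    def __init__(self, resize: Callable[[str, int, int, int], bytes]) -> None:
        self._resize = resize
        self._store: Dict[str, bytes] = {}

    def get(self, image_id: str, width: int, height: int, quality: int = 80) -> bytes:
        key = cache_key(image_id, width, height, quality)
        if key not in self._store:
            # Cache miss: do the expensive resize once and keep the result.
            self._store[key] = self._resize(image_id, width, height, quality)
        return self._store[key]
```

Each distinct combination of image, dimensions and quality becomes its own stored object, which is why a cache like this grows with the media catalog and benefits from storage that scales without manual capacity planning.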
This migration was a huge success and gave us valuable experience and confidence running production systems in GCP.
First cloud system and new Infrastructure model
After a couple of months, the team was eager to take on a project to move the rest of the Image Processor application, which was running on VMs, to GCP. Architecting the system to scale in response to demand was our primary goal.
Moving an application involves much more than moving its storage component. It includes provisioning and configuring the VM environment, deploying the code, getting the data to GCP, and reviewing security and connectivity.
This project was an opportunity for us to come up with an updated architectural model for future deployments.
In our private data centers, we use stateless automation for compute provisioning, Puppet for configuration of the environment, and an internally developed code deploy system tied to our internal databases and backend components.
While this setup was functionally sufficient, provisioning a system and getting it ready to serve traffic could take hours and required a lot of manual intervention. That wasn’t compatible with elastic models, where systems need to be ready within minutes and able to scale up and down dynamically.
To address the issue, we needed an updated stack of frameworks and flows. As huge fans of the HashiCorp automation suite, we ended up with a model that used:
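As a rough illustration of what scaling up and down dynamically means in practice, the Python sketch below computes a desired instance count from observed load using a generic target-utilization rule. This is an assumption for illustration, not a description of the policy used for the Image Processor; the function name, parameters and default limits are hypothetical.

```python
def desired_instances(
    current: int,
    observed_pct: int,
    target_pct: int,
    min_instances: int = 1,
    max_instances: int = 100,
) -> int:
    """Return how many instances would bring utilization back to target.

    Utilization values are whole percentages (for example 75 for 75%),
    which keeps the arithmetic in integers and avoids rounding surprises.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in the range 1..100")
    if observed_pct < 0:
        raise ValueError("observed_pct must not be negative")

    # Ceiling division: round up so the fleet is never undersized.
    needed = -(-(current * observed_pct) // target_pct)

    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))
```

A rule like this is only useful when new instances become ready within minutes; with provisioning times measured in hours, the load would have changed long before the added capacity arrived.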
- Packer for image building
- Terraform for DNS, CDN and GCP provisioning
- Vault for credential management
- A “server-less” Puppet model for image configuration
- Jenkins for CI/CD automation
For the application itself, we applied the base principles of reliability engineering and operational simplicity:
- To keep the scope narrow, only deploy what you absolutely need
- Simplify the architecture
  - Moving a database server with replication would add more dependencies and expand the scope of management to additional teams.
  - Reworking the application to use REST and query an existing geo-redundant endpoint in our private DCs solves the issue without sacrificing practical reliability.
- Consume artifacts and deployment states from locally populated copies
  - This is preferable to deploying, maintaining or creating boot dependencies on remote systems.
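The idea of querying geo-redundant endpoints over REST instead of moving a replicated database can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the endpoint URLs in the usage note are placeholders, and trying endpoints in a fixed order is just one simple failover strategy.

```python
import urllib.request
from typing import Callable, Optional, Sequence


def http_get(url: str, timeout: float = 2.0) -> bytes:
    """Fetch a URL with the standard library and return the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def query_with_failover(
    endpoints: Sequence[str],
    path: str,
    fetch: Callable[[str], bytes] = http_get,
) -> bytes:
    """Try each endpoint in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for base in endpoints:
        url = f"{base.rstrip('/')}/{path.lstrip('/')}"
        try:
            return fetch(url)
        except OSError as exc:
            # urllib's URLError and socket timeouts are OSError subclasses,
            # so a failed data center simply moves us on to the next one.
            last_error = exc
    raise RuntimeError(f"all {len(endpoints)} endpoints failed") from last_error
```

A call might look like `query_with_failover(["https://dc1.example.com", "https://dc2.example.com"], "/v1/images/42")`, where both hosts and the path are placeholders. The application keeps no database of its own, so there is nothing extra to replicate or manage in the cloud environment.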
The transition to our new system went great, with no production downtime or service degradation.
It was a big milestone for Wayfair Infrastructure to reach that level of automation, even for an isolated use case at the time.
Joy was in the air.
In the next installment in this series on our road to the cloud, we’ll go over our most ambitious cloud initiative to date and how we went big, building on the foundation put in place by our first cloud systems. Stay tuned!