
The Wayfair Forecast Calls for Google Cloud and Clear Skies Ahead


On Tuesday, October 11, 2022, Google announced in a press release that Wayfair had completed the migration of all our data center applications and services to Google Cloud Platform (GCP). This was a significant move that will provide huge benefits, enabling us to increase agility, accelerate technical innovation, and more. You can read the full press release [here].

As the Program Sponsor and Initiative Lead, I’d like to share some of the details behind this incredible effort, what prompted Wayfair’s decision to shift from a hybrid strategy, and some of the ups and downs we experienced along the way.

Why Cloud? Why Now?

The decision to go all-in on the public cloud began back in late 2020. At the time, we had a well-defined hybrid strategy as well as some incredible success moving critical components of our application stack to GCP. But that wasn’t all. By running 100 percent of our storefront services and most of our analytics environments in GCP, we experienced firsthand the value of the cloud.

The journey to GCP also helped us define our initial cloud strategies while teaching us how to work with large and complex cloud-based workloads. We were amazed at how our cloud-based systems were more portable and could scale rapidly to meet dynamic changes in our traffic.

The simple advantage of “traffic-based capacity on demand” was a critical factor in our decision to go all-in on the public cloud, and here’s why. During the initial months of the pandemic, we could scale down as the macro environment experienced uncertainty, and then massively scale up to meet the onslaught of demand that came when an entire workforce moved to home-based offices.

During this period, the retail industry, and specifically home goods, saw unprecedented hyper-growth, and our storefront services adapted rapidly to those demand changes. This scaling happened seamlessly and on demand, and was a huge departure from how things had been done previously in our on-premise Data Centers (DCs).

After an entire summer of high traffic and high revenue events where everything performed flawlessly, we knew GCP was the answer.

Lessons Learned Earlier

A factor we had in our favor at the beginning of this migration was the recently completed effort to transform our Data Centers into a zone-based design in 2019. For that effort, we implemented an Availability Zone (AZ) concept in our DCs that carved the physical footprint up into multiple 18-rack isolation zones for compute, network, and storage. I was also one of the leads on that initiative, which broke the IP address dependency for roughly 17,000 of the 23,000 systems we moved into those new zones. Once complete, we had not only improved our pipelines and automation, but the team had also gained critical firsthand knowledge about the cadence and velocity of that migration, which in turn informed our GCP approach.

Getting Started

Once the decision was made, the starter’s gun went off. Initially our plan was to use all of 2021 to perform transformations and as much decoupling as possible (more on that in a minute). We would then execute the heavy lifting in 2022. That all changed when we learned our on-premise Data Center colocation contract was ending in November of 2022. At that point, we realized there were serious cost incentives available if we could hit that date. Success required compromises, including moving much of our application stack, along with commitments that were already in flight, forward into 2021. If that wasn’t challenging enough, we needed to modify the plan quickly.

Like any large organization, getting information into the hands of decision-makers and the people doing the work can take time, which we didn’t have. We needed to reach the middle layers of the organization quickly with news of the accelerated timeline, incorporate their feedback into the program, and ensure their buy-in.

To begin, we reached out to select application owners, inviting them to small-format initial brainstorming sessions that we call fireside chats. These yielded some important nuggets:

  1. Paved Roads: Teams needed to understand their options and have a path they could follow, including guidelines and reference architectures. 
  2. Communication: Not everyone involved in the work knew the timelines had changed, including those responsible for the heavy lifting. The word had to get out fast.
  3. Coordination: The program would require a high-touch, coordinated approach. Our existing program methodology was well-tuned to deliver on agile/CI development of features and platform roadmaps, but coordination for initiatives of this scale just wasn’t common. For this effort, we needed to create interactive work funnels between large portions of the organization that would allow developers and platform owners to work together in real time, while providing us with a way to track the distributed efforts. 
  4. Scaffolding: Since we were so early in the process, we had not yet built out the basic program scaffolding. That would need to be done in parallel with all of the other efforts underway. The scaffolding had to include reporting and telemetry, as well as workstreams, a schedule, and a dashboard that would allow us to track burndown. 

Closing The Gaps

1) Paved Roads: First, we reached out to every platform owner in the Infrastructure organization and had them document what we called “Paved Roads.” These were essentially lists of approved services and migration strategies that teams could use for their move, including references to FAQs outlining what to expect when leveraging shared infrastructure services in the cloud. We pulled these together VERY quickly and published them in a “Migration Guidance and Best Practices” guide. 

2) Communication: Once the Paved Roads were documented, we needed comprehensive and consistent communication. Here we used a combination of release notes, office hours, and “Ask The Experts” sessions. In every instance, we hammered home the same messages over and over. We also requested “Platform Champions” from each infrastructure platform team and brought these SMEs together as the authoritative voice of their platforms.

Next, we held migration road shows and town halls. These were technical meetings and all the platform SMEs were in attendance. All meetings started with a slide deck overview of the program guidance, the schedule, and the paved roads with the remainder set aside for Q&A. We tailored the content to the Line Managers, Application SMEs, and Product Managers and had massive audiences in attendance throughout the series.

3) Coordination: The roadshows helped us make new connections between the infrastructure platform and development engineering teams. We leveraged these to form bi-directional work funnels. We broke the program down into workstreams by starting with the CTO’s direct reports (Directors), who gave us top-level workstreams (1-3 per org) for their organizations. From there, we expanded the champion model by pairing application SMEs with Infrastructure SMEs, whom we called “InfraChamps.” There were 30 workstreams in total. 

The “InfraChamps” became cloud enablers for their assigned workstreams. Here we picked SMEs from the Infrastructure teams who had development experience and could help influence their application counterparts. They were dedicated embedded SREs responsible for coordinating platform needs and clearing infrastructure blockers on behalf of the workstream. They also paired up with the “AppChamps” (application SMEs) and joined their sprints. This enablement mission became the core of the delivery model, and InfraChamps also filled in on status and project management duties.

4) Scaffolding: As the program got underway, the program-management scaffolding started to come together. We built inventory tooling and custom Data Studio dashboards to show the burndown and provide real-time and historical progress. Next, we consolidated documentation into central repos, created executive dashboards, scheduled monthly stakeholder meetings, and set up weekly self-reporting so teams could report their overall status. 

Decoupling First

In 2019, Wayfair had already commenced a microservices and decoupling journey. Teams had been rapidly decomposing the monolithic application and decoupling applications to Kubernetes, a process that was part of our Engineering Platform Manifesto and one that continues to this day. It also meant creating APIs for these services using new standards that ensured deployment consistency.

Moving to containers obviously helped us a lot with application portability. First, it meant that code could be deployed and tested independently of the monolith. Second, it meant teams were now deploying through a modern and consistent pipeline. We had very good parity between our on-premise K8s and cloud GKE environments, which meant that container namespaces could easily be moved back and forth at will. We encouraged teams that were already decoupling to continue with those efforts provided they could wrap up by Jan 1, 2022. This naturally helped our cause, and by the time we were done with the move, microservices made up ~45 percent of our application footprint.
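To make that parity concrete, here is a minimal sketch (not our actual pipeline) of the idea: with config parity between clusters, the same declarative manifests can target either the on-premise K8s cluster or GKE just by switching kubectl contexts. The context names and manifest path below are hypothetical.

    # Minimal sketch, assuming kubectl is configured with contexts for both clusters.
    # Context names and the manifest path are hypothetical placeholders.
    import subprocess

    CONTEXTS = {"onprem": "onprem-k8s", "cloud": "gke-prod"}   # hypothetical contexts
    MANIFEST_DIR = "k8s/checkout/"                             # hypothetical manifest path

    def deploy(target: str) -> None:
        """Apply the namespace's declarative manifests to the chosen cluster."""
        subprocess.run(
            ["kubectl", "--context", CONTEXTS[target], "apply", "-f", MANIFEST_DIR],
            check=True,
        )

    deploy("cloud")      # move the workload into GKE...
    # deploy("onprem")   # ...or back on-premise if a rollback is needed

In this model the pipeline, not the cluster, is the source of truth for the workload definition, so “moving” a namespace is largely a matter of where the manifests are applied and where traffic is pointed.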

High Level Schedule

Teams were given a number of “paths to cloud,” as noted earlier in the section on Paved Roads. As the application teams began the efforts to get their services ready for the move, our Platform groups got to work replicating any remaining Infrastructure Shared Services that were not already in the cloud. Our Platform SRE teams handled the basic migrations of services like Jenkins, Kafka, Caching, and Load Balancing, all of which had to be scaled as more teams migrated their systems to GCP. Below is a 50,000-foot overview of the timeline.

  • 2021, April - May: Early Adopter Migrations - Getting the word out, selecting pilot applications and defining the database and object storage migration schedule. SQL Server migrations start.
  • 2021, June - September: Early/Mid Majority Migrations - Pre-scaling all infrastructure shared services, the majority of monolith applications tested and moved, ongoing decoupling, and parachuting tiger teams in for problem solving.
  • 2021, October - December: Mid/Late Majority Migrations - Application migrations in full swing. K8s footprint crosses the threshold of more containers in the cloud than on premise. Application migrations are ~70 percent complete by EOY. 
  • 2022, January - March: Late Majority and Delayed Migrations - All databases are in the cloud. Storage migrations are completed, and 100 percent of business application workloads are in the cloud. On-prem K8s is retired. Infra block-out testing starts. 

* Note: Block-out testing is where a service is shut down for progressively longer intervals over a period of time to ensure there are no remaining applications accessing it. After remediation, the service is retested. (A minimal sketch of this pattern appears after the timeline.)

  • 2022, April - July: Infrastructure End of Service - All shared infra services in GCP are scaled up. All on-premise infra services are block-out tested and shut down. 
  • 2022, August - October: Decommission - Final decommissioning of the Data Center physical plant. Migration of network backbone and WAN services. Charitable donations begin. Sites prepared for turnover to the colocation vendor.
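As a rough illustration of the block-out pattern described in the note above, here is a minimal sketch. The window lengths and the stop/start/check helpers are hypothetical stand-ins for real orchestration and observability hooks, not our actual tooling.

    # Minimal sketch of block-out testing: take a service offline for progressively
    # longer windows and watch for anything that still depends on it.
    import time

    BLOCKOUT_WINDOWS_MIN = [15, 60, 240, 1440]   # hypothetical ramp: 15 minutes up to a day

    def stop_service(name: str) -> None:
        print(f"blocking out {name}")            # stand-in for real orchestration

    def start_service(name: str) -> None:
        print(f"restoring {name}")

    def access_attempts(name: str) -> list:
        return []                                # stand-in for a logs/metrics/alerts query

    def block_out_test(service: str) -> bool:
        """Return True only if nothing touched the service across every window."""
        for minutes in BLOCKOUT_WINDOWS_MIN:
            stop_service(service)
            time.sleep(minutes * 60)             # the block-out window itself
            hits = access_attempts(service)
            start_service(service)               # restore before the next, longer window
            if hits:
                return False                     # remediate the callers, then retest
        return True                              # safe to shut the service down for good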

Some Recommendations for Large Cloud Deployments

  1. Decouple as much as possible first: As mentioned earlier, the container migrations helped tremendously because these services were independent and portable. Also, moving a lot of junk into the cloud only kicks the tech debt down the road and delays your real cloud transformation. Do as much transformation upfront as your timeline and non-cloud environment will allow. 
  2. Get the databases and storage moving early: The databases and storage are some of the most technically complex moves and require the most coordination. Carve up the work and assign dedicated Project Managers to this effort. The earlier you get started, the faster things will begin moving in general. For our effort, we used SQL availability groups to move database instances to GCP and fail back when needed (a minimal failover sketch appears after this list). Also, prepare for pushback on setting a fixed schedule. Moving the databases early can be an incentive for dependent application teams to move on the same schedule as their databases. 
  3. Prepare yourself for latency: During our move, we could not keep everything colocated with its dependencies. There were way too many moving parts. In some cases we temporarily introduced 16-25ms of geographical network latency (which translated to upward of 200ms of transactional latency) where previously these systems were <1ms away from their counterparts. To solve this, we built latency tooling to inject synthetic latency between applications and their DBs or dependencies (see the netem-style sketch after this list). Teams used this to benchmark their expected performance and identify where applications could be moved ahead of their databases. This also drove application tuning initiatives and gave teams the incentive to leverage connection pooling and other performance techniques. 
  4. Burst to the cloud where possible: One of the biggest challenges was validating against real production workloads to verify functionality. Here we used our internal load balancers and traffic redirection methods to run partial production workloads in the cloud. We were able to direct a configurable percentage of our traffic to the cloud, so we could deploy a small parallel production cloud footprint and get our validation done with minimal risk (a simple traffic-split sketch follows this list). Once we were comfortable with application functionality and performance, we would scale up (burst) into GCP and then scale down on premise. This gave teams confidence that their applications were functioning in GCP. Logging and metrics are key. 
  5. Track your success and your costs: This was a tough one for us and required some tooling, and here’s why: there were multiple disparate inventory databases and no single source of truth for everything, so we needed to create our own. We addressed this by building our own real-time inventory polling system, mapping it to application owners, and then using that for forecasting and burndown. To validate our burndown against actual spend, we used metadata tags that were instantiated at runtime when the pipeline ran for any migration-related infrastructure. This allowed us to pull reports on migration-related spend and differentiate it from the consumption of the systems that were already in GCP (a billing-label sketch follows this list). 
  6. Reserve a bench of tiger team engineers: Throughout the program we had multiple escalations, blockers, and panic events. We had established solid escalation hygiene and were able to address problems quickly. We also reserved significant capacity in our sprints for fires and had talented people on the bench ready to dive into tiger teams as needed to solve problems. In some cases we had to retrain people along the way, and ultimately it came down to smart and committed humans working together.
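To illustrate the availability-group approach from recommendation 2, here is a minimal, hypothetical sketch of a planned failover issued from Python via pyodbc. The host names, availability-group name, and driver string are assumptions, not our production tooling, and the target replica must already be a synchronized secondary.

    # Minimal sketch, assuming SQL Server Always On availability groups with a
    # synchronized secondary replica already running in GCP (names are hypothetical).
    import pyodbc

    def failover_to(replica_host: str, ag_name: str) -> None:
        """Promote replica_host to primary for the given availability group."""
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            f"SERVER={replica_host};Trusted_Connection=yes;",
            autocommit=True,   # the failover statement should not run inside a transaction
        )
        try:
            # Run on the replica that should become the new primary
            conn.execute(f"ALTER AVAILABILITY GROUP [{ag_name}] FAILOVER;")
        finally:
            conn.close()

    # failover_to("gcp-sql-replica-01", "storefront_ag")   # move the primary into GCP
    # failover_to("onprem-sql-01", "storefront_ag")        # fail back if needed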
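For the latency recommendation (3), the sketch below shows one common way to inject synthetic latency toward a single dependency using Linux tc/netem; it requires root, and the interface, database IP, and delay value are hypothetical. Our internal latency tooling was considerably more controlled than this, but the effect is the same.

    # Minimal sketch: delay only the traffic destined for the database's IP,
    # simulating the 16-25 ms of cross-region latency described above.
    import subprocess

    IFACE = "eth0"             # hypothetical NIC carrying DB traffic
    DB_IP = "10.20.30.40/32"   # hypothetical database address
    DELAY = "20ms"             # synthetic delay to add

    def tc(*args: str) -> None:
        subprocess.run(["tc", *args], check=True)

    def inject_latency() -> None:
        # Priority qdisc at the root so only matched traffic is delayed
        tc("qdisc", "add", "dev", IFACE, "root", "handle", "1:", "prio")
        # netem delay attached to band 3
        tc("qdisc", "add", "dev", IFACE, "parent", "1:3", "handle", "30:",
           "netem", "delay", DELAY)
        # Steer packets destined for the DB into the delayed band
        tc("filter", "add", "dev", IFACE, "protocol", "ip", "parent", "1:0",
           "prio", "3", "u32", "match", "ip", "dst", DB_IP, "flowid", "1:3")

    def remove_latency() -> None:
        tc("qdisc", "del", "dev", IFACE, "root")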
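For bursting (4), the core idea reduces to a weighted traffic split that you ramp up over time as logging and metrics stay clean. The sketch below is purely illustrative and is not our load-balancer configuration; the pool names and percentages are assumptions.

    # Minimal sketch of a configurable traffic split between the parallel footprints.
    import random

    cloud_fraction = 0.05   # hypothetical starting point: 5% of traffic bursts to GCP

    def pick_backend() -> str:
        """Route a request to the GCP pool with probability cloud_fraction."""
        return "gcp-pool" if random.random() < cloud_fraction else "onprem-pool"

    # Hypothetical ramp: raise the fraction only after validating each step.
    for step in (0.05, 0.25, 0.50, 1.00):
        cloud_fraction = step
        # ...observe error rates and latency before moving to the next step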
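Finally, for cost tracking (5), spend can be attributed to a runtime-applied label if you use the standard GCP billing export to BigQuery. The sketch below is a hypothetical illustration; the project, dataset, table, label key, and label value are placeholders, not our actual tagging scheme.

    # Minimal sketch: sum spend carrying a hypothetical "migration-wave" label
    # from a standard GCP billing export table in BigQuery.
    from google.cloud import bigquery

    QUERY = """
    SELECT SUM(cost) AS migration_cost
    FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
         UNNEST(labels) AS label
    WHERE label.key = 'migration-wave'
      AND label.value = 'dc-exit'
      AND usage_start_time >= TIMESTAMP('2021-04-01')
    """

    def migration_spend() -> float:
        client = bigquery.Client()
        row = next(iter(client.query(QUERY).result()))
        return row.migration_cost or 0.0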

16 Months Later…

Well, it worked! After 16 months, we managed to move 100 percent of our applications and shared services to the cloud and shut down our on-premise systems. To give you a sense of the scale of this effort, consider that we migrated the following to GCP:

  • 23,000 physical operating instances moved or retired
  • 330,000 compute cores moved or retired
  • 8,900 applications moved to GCP
  • 5,700 Kubernetes namespace applications moved or retired
  • Our entire monolith moved by Q1’22
  • 200 SQL database instances moved by Q2’22

It’s not just the scale of this project that’s worth noting but also the timeframe: 16 months may seem like a long period, but it’s an incredibly short runway for a project of this scale. Our team was committed and determined, and we supported them and responded quickly. Overall the program team included sixty champions (acting as Technical Project Managers), seven program-level managers, an overall program lead, hundreds of platform owners, and at times as many as 2,500 engineers actively working on the program.

What’s next? At the moment, our teams are shifting focus to the next phase of our platform journey: enabling developer self-service, finishing our decoupling efforts, and leveraging cloud-native services where it makes sense. The cloud is transforming how we work, giving our engineering teams greater autonomy and velocity. This transformation allows our application owners to respond with agility to the needs of our customers and rapidly shift focus to where the opportunities are. You’ll hear more about this in the future, but for now our journey continues…
