In the previous installment, we talked about testing smaller components first, expanding the scope gradually, and learning more about the platform of choice in the process.
Today, we’ll focus on the Wayfair Storefront to show how we moved forward and used this knowledge to solve one of the biggest scaling challenges we ever had.
Expanding the scope to Storefront
Storefront is one of the most important components in the e-commerce industry. Typically, if you can’t show your products or accept orders, your revenue generation is severely affected.
In retail, website traffic is shaped by seasonality, events, item promotions, and external factors that influence demand.
Imagine a product being featured on a popular news website, a groundbreaking sale running on a collection of items, or your site being one of the first that shoppers visit during sale holidays. For many brands, such events can lead to significant swings in traffic, sometimes doubling or tripling it.
The challenge on the infrastructure side in a private datacenter is to always keep excess capacity (A LOT of excess capacity) on hand to absorb these potential spikes in traffic. That translates to expensive equipment, tight deadlines, and a lot of extra physical data center space and power.
If you don’t have this capacity available, you’re risking the site going down or losing a large portion of your revenue.
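To make the cost of that excess capacity concrete, here is a minimal back-of-the-envelope sketch. All of the numbers and names in it (requests per second, per-server throughput, the 25% headroom) are hypothetical and chosen only for illustration; they are not Wayfair figures.

```python
import math


def servers_needed(peak_rps: float, rps_per_server: float, headroom: float = 0.25) -> int:
    """Servers required to serve peak_rps with a safety margin (hypothetical model)."""
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    return math.ceil(peak_rps * (1 + headroom) / rps_per_server)


def idle_fraction(baseline_rps: float, spike_multiplier: float,
                  rps_per_server: float, headroom: float = 0.25) -> float:
    """Share of a fleet sized for the spike that is not needed on a normal day."""
    provisioned = servers_needed(baseline_rps * spike_multiplier, rps_per_server, headroom)
    needed = servers_needed(baseline_rps, rps_per_server, headroom)
    return 1 - needed / provisioned


# Hypothetical inputs: 10,000 rps on a normal day, 500 rps per server, traffic triples.
normal_day = servers_needed(10_000, 500)
spike_day = servers_needed(10_000 * 3, 500)
unused = idle_fraction(10_000, 3, 500)
```

Under these assumed inputs, a fleet sized for a tripled peak leaves roughly two thirds of the hardware unused on a normal day, which is the kind of gap elastic capacity is meant to close.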
Given our tremendous growth pattern, we were looking for more elastic options to handle such demand.
Our existing cloud deployments had been working great for many months, so we had some reliability data collected.
Expanding to GCP seemed like a good option, but it was important that we experiment on a smaller scale first. Building a narrow and controlled POC gave us invaluable data. Expanding this work to hundreds of systems is where the hardest work typically begins.
In keeping with our traditional test-and-learn strategy, we decided to keep the scope controlled and focus on only one datacenter (DC). This allowed us to work on software and data center components, instead of jumping into rather complex cross-geo routing scenarios involving traffic redistribution and CDN networking.
However, our customers had the final say...
The wonder of Way Day
Some of you may remember “The Wonder of Way Day” post written by our CTO at the time, John Mulliken. He described the heroic engineering effort that led to the successful Way Day event of 2018. Being on the front line of the situation, we couldn’t stop thinking about future events and what they would mean for the company.
In the aftermath of that successful event, and after careful consideration, the leadership wanted to target all our DCs for potential expansion to the cloud before the next Way Day.
I was both excited and anxious about the amount of responsibility this put on our teams. Failure to deliver something that big could result in significant losses, either in the cost of the required equipment or in revenue lost to a degraded customer experience.
It was full steam ahead, with no lack of technical challenges to overcome...
How did we approach the process?
To move a significant portion of the traffic to GCP in time, we needed to move fast. We started looking into what it would actually take and what our “plan B” would be.
In the meantime, the list of questions kept growing:
- How does this affect customer experience?
- Is it financially viable?
- Do we have the resources in Engineering to make it happen before the next Way Day?
- What is our deadline for a go or no-go decision?
- Can we still order required hardware and rack / stack / configure it if we find something we can’t overcome?
A reader could ask, “Why set a goal first when so many questions are still unanswered?”
The truth is, in engineering, you can’t answer all the questions beforehand. In cases like this, when the number of systems is high and the timeline is short, we realized that answers wouldn’t come from modeling and synthetic testing alone. One could spend countless days building a model, only to realize that it can’t account for all the variables and doesn’t exhibit the behavior of the system in practice.
Planning, doing the work and validating with increasing amounts of live traffic were essential for finding the answers.
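As an illustration of validating with increasing amounts of live traffic, here is a minimal sketch of a stepwise ramp. The function name `ramp_traffic`, the step percentages, and the `is_healthy` check are all hypothetical; they stand in for whatever routing controls and health signals a real environment provides, and are not a description of our actual tooling.

```python
from typing import Callable, Iterable


def ramp_traffic(steps: Iterable[int], is_healthy: Callable[[int], bool]) -> int:
    """Walk through traffic percentages in order and stop at the first failed check.

    Returns the highest percentage that passed validation (0 if none did).
    """
    last_good = 0
    for percent in steps:
        if not 0 <= percent <= 100:
            raise ValueError(f"invalid traffic percentage: {percent}")
        if not is_healthy(percent):
            # Stop ramping; the caller would hold or roll back to last_good.
            break
        last_good = percent
    return last_good


# Hypothetical check: pretend the system stays healthy up to 25% of traffic.
reached = ramp_traffic([1, 5, 25, 50, 100], lambda percent: percent <= 25)
```

Each step answers a question that a model could only estimate, and a failed check limits the exposure to the last increment rather than the whole site.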
As an architect, I worked closely with the teams to prioritize rapid iterations over modeling wherever possible. I also found it very useful to participate in hands-on work early in the process. It allowed me to understand the nuances of the work better, stay ahead of the issues the teams could face, and have a better grip on planning and timelines.
Getting business buy-in and aligning the teams
Given the cross-team and cross-organizational work required, it seemed like an impossible task unless our internal priorities were aligned and execution was streamlined.
The leadership team worked tirelessly with business owners and Engineering leads across the company to make sure we worked together towards the shared goal: making our customer experience great, regardless of projected traffic levels.
It’s often difficult to find a balance between prioritizing features and infrastructure needs:
- Do we spend more on infrastructure and get more in return because we delivered more features on time?
- Do we invest in the future and reduce the risk of outages causing potential revenue loss?
As a big and fast-growing company, we needed to leverage the power of public cloud infrastructure to react better to customer demand. After highly collaborative discussions, we reached a consensus to move forward.
The Infra and Software collaboration began.
In the next installment in this series on our road to the cloud, we’ll go over the execution challenges of broad tech initiatives, and provide some insight into working with a large, dynamic production environment. See you soon!