As we head into Black Friday, Cyber Monday, and retail’s peak season for 2018, I wanted to reflect on some excitement we had earlier in the year at Wayfair: specifically, how we reacted and built, as a tech team, toward challenges we hadn’t seen before. I hope other e-commerce and tech leaders who are on a scaling journey will be interested in what we hit, and frankly, it was an adventure that anyone in technology might find interesting.
As one of the largest online destinations for home furnishings, we live in a very seasonal world, as all retailers do. We have big days around Thanksgiving in the United States, and big holiday shopping days throughout the year. We work toward these days, and they’re typically exciting, but they’re mostly not the focus in engineering. There’s a cadence of developing new features to drive customer satisfaction and revenue, seeing how they affect performance, and then tuning them so that our pages load as quickly as anyone’s in the industry, on top of ensuring that our servers are rock solid. In tandem, we use a methodology of real-world testing, including high-stress tests of small components, deferred and then concentrated processing of orders, failover, and planned and unannounced chaos events, supported by capacity planning throughout the year so that we’re constantly preparing for the holiday rush. This is the way we’ve operated at Wayfair for years, long before I had the opportunity to lead our technology team.
Back in April 2018, I’d been Wayfair’s CTO for a year, leading a rapidly growing team of nearly 1,900 engineers who solve the prickliest of problems in the e-commerce space. Near the end of that month I received my baptism by fire, and had an experience that reminded me what a privilege it is to be part of an organization like this.
April 25th this year was a day like any other day for most people out there. At Wayfair, we’d been planning a new retail event called Way Day for a number of months, promoted with significant TV investments and even the surprise of white Adirondack chairs delivered to residents of the eight streets named Wayfair around America. While we were ready for significant volume, my main concern for the engineering team in the lead-up to Way Day was to maintain our velocity in delivering new products. Our most optimistic models had us operating at the capacity we’d achieved on Cyber Monday 2017, and our systems and applications had weathered that load handily based on all the preparation we’d completed back then. Apart from limiting deploys to a small number of key areas, it was meant to be an ordinary work day, just like the day before. We were excited to see how the efforts of our colleagues in Category Management, Supply Chain, and PR would turn out, but we were very comfortable that the stress tests we’d run on our systems to smoke out any regressions were sufficient.
But that’s not how it was going to turn out...
I awoke briefly at 4:30am before the sun was up and blurrily checked my phone to see if there had been any issues overnight. I was assuming I’d head back to sleep before getting up and making breakfast for my kids. I shook my phone and squinted when I looked at the overnight revenue report. Our midnight hour of revenue had dwarfed Cyber Monday’s equivalent hour, and not only had revenue grown, but the ratio of sales vs. that hour on Cyber Monday had grown in the few hours since. I soon realized that my “business as usual, don’t slow down” ethos sure wasn’t going to be enough. I kissed my wife goodbye and ran into work, tucking my shirt in as I jumped in my car. The streets were empty.
When I got to our office in Copley Square, the sky was just getting light. Our Wayfair Operations Center (we call it the ‘WOC’, but it’s similar to a NOC most places) was alive with the kind of positive tension we have during Cyber5, but we hadn’t yet gathered a team in our War Room to coordinate. Our systems were handling the load with ease, but the realization was dawning that if we kept the same sales trajectory, then they probably wouldn’t: we’d exceed both our projections and anything our systems had experienced by a decent margin. My team and I don’t like admitting defeat, especially considering that our colleagues had done such an amazing job generating interest with TV, block parties across America, and great deals. Now, we were sailing into the unknown – our systems had never experienced real-world load like this. We only had a number of synthetic tests simulating this kind of load, and never across all of our systems in concert.
Senior folks from all our teams soon started to trickle into the War Room – SRE, Data Tech, Marketing and Storefront Engineering, the heads of our infrastructure teams, and more. By 9:00am some of our systems were already sweating – we were failing over some of our customer shards between data centers, and we could see the path to our Storefront boxes getting way too hot. Our Datacenter and Servers teams swung into action knowing that what we had online wouldn’t be enough. We had several servers on hand thanks to our capacity planning and preparation for growth in late 2018, but those machines were still in their boxes on the loading dock. We needed them a lot sooner than we thought.
Red Alert! All hands on deck!
The call went out to everyone in Engineering: What capacity could we afford to redeploy during the day? Nothing was sacred – dev boxes, spare tracking servers… what capacity did we have on top of what was already being deployed? As the morning rolled on, we’d already started kicking some systems that had begun to sweat – SQLRelay, our MSSQL customer shards, our core PHP, etc. And it was becoming clear that while we had some real levers to pull, a lot of them were untested.
My Linux builds and Server Engineering teams were running hard to bring every single spare physical box online and deploy it in the course of hours. We’d identified a number of virtual servers we could shut down, stealing capacity for more critical services. We isolated some applications based on their traffic patterns and shifted those to our less loaded data center so that we could commandeer the boxes in our primary. We cut off non-emergency software deploys late in the morning; we didn’t need to introduce any more changes into our environment given how hot it was running.
By 2:00pm I had only left the War Room once for a brief interview and to organize lunch for the team, and boy, was it sizzling in there – a lot of laptops and a lot of people were too much for the air conditioning to keep up with, making it feel like a physical manifestation of what was going on in our data centers. What was also new to us was that we were running a number of flash sales that ended throughout the day, so traffic patterns were different from anything we’d ever seen. Systems had started to pop more frequently – and while you handle each failure as it comes, it’s never as stately as you’d imagine. There were a number of times we thought we wouldn’t make it, but just enough capacity came online and we frankly dodged a bullet – load peaked, leveled off, got higher, a few systems fell over, and we made it. A sigh of relief went through the room as traffic moderated right at 3:00pm.
The rest of the day was going to be tough; if what we had just experienced was anything to go by, traffic was only going to increase, and we’d just squeaked through that early peak. About a dozen of us had been in a hot room together for six hours, but more capacity was coming online, with our teams racking more physical servers, and freeing up more virtual capacity as each hour passed. Would it be enough? Had we added it all in time?
My boss, our CEO, dropped by at one point for moral support and so did a lot of folks from other parts of the company – all happy to get pizza, trade a few jokes, and keep the atmosphere light even though the pressure was high. As the evening progressed, we kept identifying more levers we could pull. One of my favorites was that my Frontend Infrastructure team had identified a number of server-side operations which could be feature-toggled off to give more breathing room to the Storefront boxes. Not only were we expanding our private cloud in real time, but we were, in essence, offloading more compute to our customers’ browsers. We were switching traffic and capacity back and forth in real time in a test of nearly all of our switchover and failover capabilities. Software teams and infrastructure teams that always work nearby, but don’t always get the chance to truly collaborate, had to do so in the moment just to keep our core systems up. And even though our Storefront was making it, the order volumes we were seeing were piling up inside our order processing pipeline in a way that made it clear the pipeline would be truly tested before the night was over. Databases that hadn’t ever broken a sweat before were revealing small configuration anomalies that showed differences in performance, and we were failing them back and forth with DBAs on the front line.
Surviving our own retail storm
In the end, late in the night when traffic and sales were peaking, not every option we’d been able to concoct had been needed. We didn’t know if we were going to make it, but it never got chaotic. At one point, my Head of SRE pulled me aside and sagely said, “John, I realize that we’ve built this arsenal of options, but even if some of them would give us headroom and we’ve already done a number of untested things today, we still don’t know the concrete outcome. Let’s keep our powder dry on a couple of these.” I’m glad he did – we made it through the biggest test we could ever imagine of our systems, but more so of our varied Engineering teams working together. Our core SRE teams were online for 20 hours straight. Our builds teams had used automation that they’d been building over the last year to do, in a single day, what previously would have taken weeks. As much as I want to always talk about predictable delivery, we experienced a real element of pure, plain, and simple heroism.
As a side note, a lot of people might ask: “Why aren’t you in the public cloud?” That’s a question for a different day, and sure enough we’re experimenting with a lot of workloads as I write this. That decision also needs to take into consideration the amount of re-architecting and replatforming we’d face – we’re not a startup starting from scratch, but a company with well over a thousand developers and fifteen years in the game. A number of our systems are already in the public cloud where the economics of the workload were obvious. But however you cut it, on the morning of April 25th, we didn’t have the option of going to the public cloud during that day, and running our systems as usual would have been courting disaster. Had we built enough of a private cloud, with enough spare capacity to handle the load, and enough power tools to allow us to swing it into service fast enough to get through the day? We had – we took our core principles of velocity, collaboration, and engineering ingenuity in the pursuit of hard business problems, and we got what we needed built and deployed.
We’d been perfectly prepared for a huge day, but not for the magnitude of day that we were faced with. It’s common to run a post-mortem after a major outage, but while we learned a lot about our systems and how we reacted to their performance, I didn’t want to call this a post-mortem because we didn’t have an outage! Our usual post-mortem approach wasn’t going to cut it – this had been a success that we wanted to unpack in order to figure out how to keep that spirit and capability alive. When a number of my leaders recently got together to reflect on Way Day, we recapped everything we’d learned and what we’re now putting in place. On top of it all, it was a celebration of the teamwork, heroics, and automation that made this crazy day a success.
The systems and applications we build here at Wayfair are like a race car; we spend every day building, rebuilding, and adding capabilities – we tune and re-tune but we don’t let ourselves floor it. On Way Day, we took the car out on the road, and added a turbocharger while we were going down the highway – faster than we’d ever driven before. We scuffed some paint, and dented the bumper, but it performed really well and we won that race. In fact, we showed just how prepared we actually were – the perfect dress rehearsal for Holiday 2018.