How our microservice governance layer helped us survive a crazy data center outage
Saturday morning: a great time to catch up on the weekly news, enjoy a cup of coffee, maybe get some chores done around the house. Or, if you were a member of the Wayfair Technology Team, you were rapidly responding to a developing issue at a major data center. An accidental cold-water shutoff led to an (almost literal) meltdown. Hundreds of microservices hosted at the data center started failing as the hardware overheated. Customers were experiencing outages on our website, and alerts were lighting up every internal Slack channel.
So what did we do? We kept calm and deleted our pods.
Wayfair’s Operations Center (lovingly nicknamed the “WOC”) is our first line of defense against unforeseeable outages of this kind, and they were very busy bringing our hardware back online, rerouting requests, and helping restore our system health. Unfortunately, until fairly recently this team had also been our second line of defense, helping us manage our Kubernetes clusters and supporting operational concerns like pod deletions.
The reason the WOC operated in this way was that we didn’t have a good grasp of how we should delegate permissions for pod deletions. For a given namespace “foo”, which engineer or engineers should be allowed to delete “foo” pods? That was the sort of question that could sometimes take a day or two to answer, using clues from git repositories and the organizational chart. To be sure, in some cases it was easy - find the git repository, look through the last few commits, send a few Slack messages - and many times that worked out fine. But what if the last commits were from a few years ago, and the engineer in question had since left the company, or the team had dissolved? What if the mapping from a Kubernetes namespace to a git repository was unclear due to odd naming choices? The process of finding namespace owners was tricky for engineers doing one-off searches, and became a real mess at the programmatic level.
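To make the fragility concrete, the old commit-based heuristic can be sketched in a few lines. Everything here is illustrative - the function name and the idea of ranking recent committers are assumptions, not our actual tooling - and the sketch fails in exactly the ways described above: departed engineers still rank highly, and there is no guarantee the repo even maps to the namespace.

```python
# Hypothetical sketch of the old heuristic: guess a namespace's owners
# from the most frequent recent commit authors in its (assumed) git repo.
# This breaks when those authors have left the company or the team has
# dissolved -- which is exactly why a proper ownership record was needed.
from collections import Counter

def guess_owners(commit_authors: list[str], top_n: int = 2) -> list[str]:
    """Return the most frequent recent committers as likely owners."""
    counts = Counter(commit_authors)
    return [author for author, _ in counts.most_common(top_n)]

# e.g. a log of recent commit authors for the repo behind namespace "foo"
likely_owners = guess_owners(["alice", "bob", "alice", "carol", "alice", "bob"])
```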
Late last year, we began the process of putting together a proper governance layer for the Wayfair microservice ecosystem. This effort included a wide cross-section of our engineering team. Our Platform-as-a-Service Team helped record ownership data as feature teams spun up new microservices. The Identity and Access Management Team created an access request workflow to allow teams to change ownership records on the fly. The Kubernetes Infrastructure Team created a custom controller which applied ownership policies as role bindings against namespaces. The end result: a robust data set relating engineers to the microservices they own. More importantly, self-service tools allowed engineers to delete pods autonomously, without imposing any operational burden on the WOC.
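As a rough sketch of what the custom controller produces: given an ownership record tying a namespace to a team’s group, it renders a Kubernetes RoleBinding granting that group pod-deletion rights in the namespace. The record fields, group name, and the pre-existing “pod-deleter” Role below are illustrative assumptions, not our actual schema.

```python
# Sketch: render a namespaced RoleBinding manifest from an ownership record.
# The "pod-deleter" Role (assumed to already grant delete on pods) and the
# group naming convention are hypothetical, not Wayfair's actual setup.

def role_binding_for(namespace: str, owner_group: str) -> dict:
    """Build a RoleBinding giving the owning team's group the
    "pod-deleter" Role within its own namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            "name": f"{namespace}-pod-deleter",
            "namespace": namespace,
        },
        "subjects": [
            {
                "kind": "Group",
                "name": owner_group,
                "apiGroup": "rbac.authorization.k8s.io",
            }
        ],
        "roleRef": {
            "kind": "Role",
            "name": "pod-deleter",  # assumed pre-existing Role
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }

binding = role_binding_for("foo", "team-foo-engineers")
```

Scoping the binding to a Role (rather than a ClusterRole subject-wide) keeps each team’s delete permission confined to its own namespace.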
It was a big, ambitious project, but with some careful coordination and lots of testing, we managed to make the whole thing work. And our timing was impeccable! Our self-service tooling went live on a Thursday afternoon, and our data center (near-)meltdown occurred not 48 hours later on a Saturday morning. Immediately we saw dozens of different teams take up our new self-service tools, healing their services as the necessary hardware came online.
In some cases we were seeing Kubernetes pods enter “CrashLoopBackOff” status because secrets, or some other necessary dependency, were not available when the underlying containers were initially deployed. Once Secret volumes came back up, a simple “delete pods” operation caused Kubernetes to re-create the containers and re-mount secrets - thereby restoring application health.
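The recovery step above amounts to a small filter over pod statuses: find the pods with a container stuck waiting in CrashLoopBackOff, then delete them so their controller re-creates them. The sketch below shows only the selection logic over plain dicts shaped like the Kubernetes pod status API; the actual delete call is left out.

```python
# Sketch: select pods whose containers are stuck in CrashLoopBackOff,
# so they can be deleted and re-created (now with Secret volumes available).
# Pods are plain dicts following the Kubernetes pod status field names.

def crashlooping_pods(pods: list[dict]) -> list[str]:
    """Return names of pods with any container waiting in CrashLoopBackOff."""
    stuck = []
    for pod in pods:
        for status in pod.get("status", {}).get("containerStatuses", []):
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                stuck.append(pod["metadata"]["name"])
                break  # one stuck container is enough to flag the pod
    return stuck
```

Deleting the flagged pods works because pods owned by a Deployment or ReplicaSet are replaced automatically, and the fresh containers mount their dependencies cleanly.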
Whereas our teams would have had to work closely with infrastructure providers to delete pods during previous outages, our self-service tooling made an otherwise difficult weekend that much more tolerable. It also allowed us to respond more efficiently and nimbly: engineering teams could self-heal, and our WOC team had one less fire to put out.
Of course, we’re never done, and we are always improving! The self-service Kubernetes tools were a great, and very well-timed, start to this project. But there are countless other tools we want to bring online - for Kafka, Google Cloud Platform, and more.
This experience is just a small taste of the kinds of incredible challenges we creatively work through every day at Wayfair. If this kind of opportunity sounds exciting to you and you want to be part of a team that is directly developing the technology that will enable our growth and success for years to come, check out our open roles here and apply today!