Wayfair Tech Blog

Multi-zone Cluster Management at Wayfair with Kubernetes


Introduction

Wayfair operates large data centers in addition to leveraging the public cloud to provide infrastructure services. Over time, Wayfair's Infrastructure teams chose to introduce multiple logical zones within a single data center to simplify network maintenance and to split a single failure domain into multiple smaller ones.

In this article, we present the architecture decisions and lessons learned from operating Kubernetes over the past few years. In particular, we detail our approach to building multi-zone Kubernetes clusters spanning different failure domains, with a focus on reliability and scalability in our networking decisions.

 

Background

Our first attempt at deploying Kubernetes was based on a simple, single-zone design built on the following key assumptions:

  • All worker nodes would be in the same Layer 2 network domain, allowing us to operate with minimal changes to network architecture.
  • Our CNI plugin would be Canal, with “host-gw” backend in Flannel. This setup avoids encapsulation as network packets are simply routed between the nodes.
  • Ingress would be implemented by chaining the MetalLB Controller and Speakers, which provide external IPs for Services in layer2 mode, with Nginx Ingress Controllers, which provide Kubernetes Ingress resource support (minimal sketches of the Flannel and MetalLB configuration follow this list).
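
For reference, here is a minimal sketch of how these two pieces fit together, using the stock Canal ConfigMap for the Flannel backend and MetalLB's original (pre-CRD) ConfigMap format. The pod CIDR and address range are hypothetical, not our production values.

```yaml
# Flannel backend selection inside the Canal ConfigMap.
# "host-gw" installs plain routes with the destination node as the next hop,
# so there is no encapsulation; however, it only works within one Layer 2 domain.
apiVersion: v1
kind: ConfigMap
metadata:
  name: canal-config
  namespace: kube-system
data:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": { "Type": "host-gw" }
    }
---
# MetalLB layer2 address pool (legacy ConfigMap format).
# In layer2 mode, a single Speaker at a time answers ARP for each external IP.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
  namespace: metallb-system
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.0.100.240-10.0.100.250
```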

The simplicity of the single-zone design enabled a number of use cases and helped us get off the ground quickly. However, we had the following concerns:

  • All incoming traffic was handled by a single Ingress Node, which was a possible bottleneck and could cause problems if its network interfaces became saturated. Even LACP bonds have their limits.
  • MetalLB Speaker failover would reset all established connections from the external network to Kubernetes applications.
  • In the case of a major network outage, the entire Kubernetes cluster would be affected.

These concerns led us to explore and adopt a multi-zone model.

 

Multi-zone Kubernetes Architecture

With a multi-zone model, every data center is split into multiple availability zones, each with its own isolated network equipment and power supply. Let’s take a quick look at the network diagram below.

Figure 1: Multi availability zone network topology.

Every availability zone is a separate Layer 2 domain. Unfortunately, this means our single-zone setup with the “host-gw” Flannel backend no longer works, since a next hop is only reachable within the same zone. Nodes in different availability zones get IPs from different subnets and are reachable only via Layer 3 routing. Given these constraints, we had two major options for implementing Pod-to-Pod networking: encapsulation (tunneling) or BGP-based dynamic routing.

We decided to leverage BGP for a number of reasons, the most important being:

  • Minimal performance impact: network packets are simply routed between the nodes, which is a basic Layer 3 network function.
  • Scalability: BGP and dynamic routing already operate at the scale of the Internet.
  • Simpler and easier troubleshooting: we can use the same tools we use to debug any other network issues, like “traceroute” and “tcpdump.”

 

BGP topology

The diagram below illustrates our BGP topology in a multi-zone network environment.

Figure 2: Multi availability zone BGP topology.

In the new design we had to switch from Flannel with the “host-gw” backend to Calico with the BGP backend in order to support multiple Layer 2 domains and dynamic routing between them. We decided to use one BGP AS per zone because, compared to an AS per node, it scales better and simplifies BGP configuration and management.
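
As an illustration, that switch boils down to two Calico resources: an IPPool with encapsulation disabled (packets are plainly routed and BGP distributes the routes), and a BGPConfiguration that turns off the default full node-to-node mesh so that peerings can be defined explicitly. This is a minimal sketch rather than our exact manifests; the pod CIDR is hypothetical.

```yaml
# Unencapsulated pod network: no IPIP or VXLAN, routes are distributed via BGP.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-pool
spec:
  cidr: 10.244.0.0/16   # hypothetical pod CIDR
  ipipMode: Never       # no tunneling, plain Layer 3 routing
  vxlanMode: Never
  natOutgoing: true
---
# Disable the default full BGP mesh between all nodes; peerings to the
# per-zone route reflectors are defined explicitly with BGPPeer resources.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  logSeverityScreen: Info
```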

The key components of our BGP setup are the BGP route reflectors, which run on the Kubernetes Master nodes. We run two Kubernetes Masters per zone, so every Core Switch peers over iBGP with the two Masters in its zone, and all Kubernetes worker nodes peer over iBGP with their local Masters via Calico BGPPeer resources. Using route reflectors provides better scalability, simplifies network configuration, and lets us fully automate dynamic BGP configuration (adding and deleting Kubernetes worker nodes) on the Masters' side.
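
The sketch below shows one way to express this topology with Calico resources; the node names, labels, addresses, and AS number are hypothetical. The per-zone AS number and the route reflector cluster ID are set on the Calico Node resource of each Master, and a BGPPeer then peers every node in the zone with those Masters.

```yaml
# Calico Node resource for a Kubernetes Master acting as an in-zone route reflector
# (in practice patched with calicoctl rather than applied from scratch).
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  name: k8s-master-az1-01
  labels:
    route-reflector: "true"
    zone: az1
spec:
  bgp:
    ipv4Address: 10.10.1.11/24        # hypothetical Master address
    asNumber: 64601                   # hypothetical per-zone AS
    routeReflectorClusterID: 10.255.0.1   # hypothetical cluster ID for this zone's reflectors
---
# Peer every node in zone az1 with the zone's route reflectors over iBGP.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: az1-peer-with-route-reflectors
spec:
  nodeSelector: zone == 'az1'
  peerSelector: route-reflector == 'true'
```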

We still use a set of dedicated Ingress Nodes to serve all external traffic. These nodes run Nginx Ingress Controllers to provide Kubernetes Ingress resource support. We also continue to use the MetalLB Controller for compatibility and seamless migration of applications from the legacy clusters to the new ones; it is responsible for allocating external IPs for Kubernetes Services of type “LoadBalancer”.
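
From the application's point of view nothing changes: a Service of type LoadBalancer still receives its external IP from the MetalLB Controller's address pool, and the Nginx Ingress Controllers behind it route HTTP traffic based on ordinary Ingress resources. A minimal, hypothetical example of the controller's own Service:

```yaml
# The ingress controller is exposed through a LoadBalancer Service;
# the MetalLB Controller allocates the external IP from its configured pool.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # keep traffic on the Ingress nodes themselves
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: 80
  - name: https
    port: 443
    targetPort: 443
```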

Due to known issues with running both Calico and MetalLB in BGP mode, we decided to drop the MetalLB Speakers and announce external IPs from the Ingress nodes using Calico. Every Ingress node announces the same set of /32 prefixes, one prefix for every LoadBalancer Service's external IP. This way, incoming external traffic to each IP is distributed across all Ingress nodes in all availability zones, because it is routed via equal-cost multi-path (ECMP).
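
With Calico, one way to achieve this is to list the LoadBalancer address range in the BGPConfiguration. Combined with externalTrafficPolicy: Local on the Service shown earlier, the /32 route for each external IP is advertised only from nodes that actually run an ingress controller pod, and the upstream switches then spread flows across those next hops with ECMP. A sketch with a hypothetical address range:

```yaml
# Advertise LoadBalancer external IPs over BGP instead of running MetalLB Speakers
# (same default BGPConfiguration as above, extended with the service IP range).
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  serviceLoadBalancerIPs:
  - cidr: 10.0.100.0/24   # hypothetical pool handed out by the MetalLB Controller
```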

 

Conclusion

The new design based on BGP has served us well. Thanks to ECMP, we can scale the ingress layer horizontally. In addition, we are able to expand our node footprint seamlessly and handle single-AZ outages gracefully. In the following posts, we will dig deeper into the configuration of Calico and our BGP/ECMP design.

 

Acknowledgements

I wish to thank all the people who contributed to this project:

  • The Kubernetes Team, for designing, building, and running the brand new Kubernetes multi-zone clusters.
  • The Network Engineering Team, for providing invaluable assistance during the design and implementation of this project.

 
