Managing Microsystems and Conway’s Law with Governance Layers and Trust

A few years back, Wayfair jumped into the microservices revolution feet-first. The results were dramatic. We had tons of activity—hundreds of Wayfair teams launched thousands of microservices. It was great to see so much enthusiasm for microservices, but it came at a cost. We had a lot of confusion about infrastructure permissions: for any given infrastructure resource, who was authorized to manage that resource?

The Confusion

The best way to explain the confusion I touched on above is through a hypothetical example. Let’s say we have a recommendation service, which uses a database and a Kubernetes namespace. The team in question wants to delete pods on the namespace, backup-and-restore the database, and otherwise manage their resources. Seems reasonable enough, right?

For this team, it most certainly is. But for infrastructure teams supporting our database servers and Kubernetes clusters, things begin to get a bit hazier. The big question staring them in the face is: which team should be allowed to manage the recommendation service’s resources? Answering it is fairly easy for a single service: ask a few people and poke around the source code until you figure out who the owners are. But imagine undertaking this hunt in an environment where there are thousands of microservices. What becomes immediately apparent is that poking around source code doesn’t scale.

What tends to happen next is that rather than giving all our developers full access to every infrastructure resource, infrastructure teams go the opposite direction. They lock down access to infrastructure, and use ticketing systems to process operation requests. Now, if a developer wants to perform a backup-and-restore on a database, she needs to file a ticket and wait for an administrator to execute it. That’s a perfectly prudent choice, and it does provide some necessary structure. But it can also reduce developer velocity.

This scenario is what played out at Wayfair, and we knew it was not sustainable. No worries, right? At Wayfair we are lucky to have a data-driven culture, which leads us to detect and measure these kinds of obstacles so that we can eliminate them. More than that, we have a culture driven by trust on every level: that was our secret weapon.

The Solution: Identifying owners at scale

The answer was to build a governance layer that would make life easier for infrastructure teams and developers. The key components were as follows:

A system for creating microservices, based on the Cookiecutter templating project.
A REST API for storing ownership records, based on the Casbin library. All ownership records take the form “subject, role, object.” For example, “Alice is the owner of the recommendation service.”
A REST API for storing infrastructure relationships. These relationships take the form “microservice, resource type, resource name.” For example, “Recommendation service uses the Kubernetes namespace named recs-svc.”
Our enterprise identity and access management solution, which serves as a source of truth for employee identifiers.
A background job for transforming ownership records into the permissions on the relevant infrastructure platform. For example, the job ensures that when Alice is listed as an owner of the recommendation service, she gains permission on the Kubernetes cluster to operate on the namespace “recs-svc.”
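The two record shapes and the background job's core step can be sketched in a few lines of Python. This is a minimal illustration, not Wayfair's implementation: the class names, the `desired_grants` function, the "viewer" role, and the rule that only owners receive access are all assumptions made for the example. It shows how joining ownership records with infrastructure relationships yields the permissions each person should hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnershipRecord:
    # One "subject, role, object" record (hypothetical class name).
    subject: str
    role: str
    obj: str  # the microservice the role applies to


@dataclass(frozen=True)
class InfraRelationship:
    # One "microservice, resource type, resource name" record.
    microservice: str
    resource_type: str
    resource_name: str


@dataclass(frozen=True)
class Grant:
    # A permission the job should ensure exists on an infrastructure platform.
    subject: str
    resource_type: str
    resource_name: str


def desired_grants(ownership, relationships):
    # Index resources by the microservice that uses them.
    resources_by_service = {}
    for rel in relationships:
        resources_by_service.setdefault(rel.microservice, []).append(rel)

    grants = set()
    for record in ownership:
        if record.role != "owner":  # assumption: only owners receive access
            continue
        for rel in resources_by_service.get(record.obj, []):
            grants.add(Grant(record.subject, rel.resource_type, rel.resource_name))
    return grants


ownership = [
    OwnershipRecord("alice", "owner", "recommendation-service"),
    OwnershipRecord("bob", "viewer", "recommendation-service"),  # hypothetical role
]
relationships = [
    InfraRelationship("recommendation-service", "kubernetes_namespace", "recs-svc"),
]

grants = desired_grants(ownership, relationships)
```

Here `grants` contains a single entry giving Alice access to the namespace "recs-svc". A real job would then translate each grant into the target platform's own permission model.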

In summary, we tied together two REST APIs so that important events in a microservice’s lifecycle could result in permission updates on infrastructure platforms. Examples of what we deemed an important event would be the creation of a microservice or the addition of an engineer to the team managing a microservice.
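One way to picture how a lifecycle event becomes a permission update is as a diff between the permissions the records call for and the permissions a platform currently holds. The sketch below is an assumption about how such a step could work, not a description of the actual job: `Grant`, `plan_updates`, and the engineer "bob" are hypothetical, and revoking stale permissions is a design choice the sketch adds for completeness.

```python
from typing import List, NamedTuple, Set, Tuple


class Grant(NamedTuple):
    # Hypothetical permission tuple: who may operate on which resource.
    subject: str
    resource_type: str
    resource_name: str


def plan_updates(desired: Set[Grant], current: Set[Grant]) -> Tuple[List[Grant], List[Grant]]:
    # Grants that the records call for but the platform lacks.
    to_add = sorted(desired - current)
    # Grants on the platform that no record justifies any more (assumed behavior).
    to_remove = sorted(current - desired)
    return to_add, to_remove


# State on the platform before the event: only Alice has access.
current = {Grant("alice", "kubernetes_namespace", "recs-svc")}

# Event: Bob is added to the team that owns the recommendation service.
desired = {
    Grant("alice", "kubernetes_namespace", "recs-svc"),
    Grant("bob", "kubernetes_namespace", "recs-svc"),
}

to_add, to_remove = plan_updates(desired, current)
```

In this run `to_add` holds Bob's new grant and `to_remove` is empty. Computing a plan before applying it keeps the job idempotent: running it again after the update produces two empty lists.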

Our Innovative Secret Sauce

It’s worth noting that the systems described are somewhat boring technologies: relational databases, standard API development frameworks, a little bit of Python magic, and so forth. As I hinted above, we had a secret weapon: trust.

The entire project was based on trust between infrastructure teams and the developers who use their platforms. We knew that infrastructure teams could not blindly give every developer global access to their platforms. But infrastructure teams could give developers suitable access to the pieces of infrastructure that served those developers’ microservices. Fundamentally, we believed that we could trust each other enough to self-govern our own microservices.

More than that, at a practical level our work required quite a bit of trust among the project participants. Stitching together the governance layer required a wide variety of teams to work together, each in a different domain and with markedly different goals and incentives. To succeed, we had to ensure that our systems were integrated properly. That required more than traditional API design and contract alignment. It required close collaboration to ensure that our code didn’t take down critical infrastructure or lead to any unpleasant surprises. We had to negotiate multiple tests, determine success criteria for each test, and recruit beta testers to participate at each stage. The teams involved built up a great deal of trust on the strength of shared expectations. We regularly updated those expectations, asking for feedback and critique, and pivoted as needed to accommodate a wide range of obstacles.

We released the first iteration of our governance layer last summer, and it was enormously successful: it helped dozens of teams self-heal their microservices during a severe outage. But the true success happened some months later, as we worked to tweak and improve the governance layer.

We discovered that the two REST APIs described above were a lot more interrelated than we had initially suspected. As our governance layer took on a broader set of requirements, we discovered that the APIs had to interact with each other more and more. Over time it became clear that they should both be owned by the same team. Otherwise, maintenance and improvements would become a real headache in the long run.

The trust we had built up during the project proved crucial to success at this stage. In a matter of weeks, the two teams involved made a joint decision to combine both APIs under the banner of a single team. We executed the transition quite smoothly: we held a series of trainings and code walk-throughs, freshened up the documentation, and set a calendar for an incremental change in ownership.

Working around Conway’s Law

What was really exciting about this experience was that it showed us we could defeat, or at least work with, Conway’s Law. If you’re not familiar with Conway’s Law, it’s named after computer programmer Melvin Conway and states that “organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” If your recommendation system is developed by two teams, it’ll have two components: an API and a background data job, for example.

Conway’s Law is the dread of engineering managers everywhere. It dooms them to produce systems that look a lot like their org charts. As a result, systems can be inefficient, difficult to maintain, or buggy. Some managers turn to the Inverse Conway Maneuver: they figure out the kind of system they want and design their org charts to produce that system.

It’s a clever trick, except when you’re building tricky systems. If you’re setting out to solve a problem that is something of an enigma and that requires the cooperation of a wide variety of engineers across your organization, then you don’t really know what kind of system to build in the first place! Some amount of trial and error is necessary; you’re certain to wind up with a system that is very different from the one you had initially designed.

With our governance layer, Conway’s Law could have been a real curse. We might have ended up with a governance layer that was simply the product of quirks in our org chart, rather than one which served the best interests of our microservice ecosystem.

Trust enabled us to work around this problem. We built the system that we thought made the most sense, and then we aligned our org chart to it. As a result, our teams were able to nimbly take on a host of new requirements, ranging from support for more infrastructure platforms to advanced reporting. Without a culture of collaboration, we would have been tied down by communications overhead, held back by Conway’s Law.

In the end, our microservices journey went from chaos to success. All it took was a little bit of technology and a healthy dose of trust.