Continuous Integration and Continuous Deployment (CI/CD) are industry-standard best practices for web-scale software engineering. Wayfair has practiced CI for well over a decade, although the ways in which we have performed it have changed over time. I’m here to share with you the system we have built to coordinate our deploys over the past few years, and how our processes have evolved for deploys at scale. As a consideration for this discussion, it’s worth noting that Wayfair has historically leaned towards a monolithic architecture for its web application development.
Background and Context
In everything we build, Wayfair is fiercely pragmatic. We only want to build software systems that we know will deliver value. From a deploy perspective, this means we refine and execute a manual process with our Wayfair Operations Centre (the WOC) until we see clear value in automating that process. At its core, CI/CD (for any software) is fairly straightforward:
- Step 1: Obtain business level direction to build out a feature based on the latest, up to date copy of your master source code branch.
- Step 2: Build said feature.
- Step 3: Deploy the feature to production systems and merge code changes into the master source code branch.
- Step 4: Monitor, Rinse, Repeat, Profit!
This is an oversimplification of the deploy process, but at its core it is what we do. The process becomes complex when you scale up your company fast (I’m talking 1000+ engineers in two years or less). At the same time, you want everyone to work together seamlessly. You want everyone to leverage technological economies of scale so that every software engineer isn’t re-implementing logging, databases, caching, and similar systems for every sub-group.
One of the ways that Wayfair has achieved its speed is by expanding a monolithic application space where the above systems are already built. You then arrive in a place where your software teams are ready to deploy a new feature every five minutes, or even less. On an average day, Wayfair deploys over 200 changes to its monolith, but on peak days (prepping for holiday and the like) we deploy over 500 unique changes. Even with deploys running for about 18 hours daily, due to the expanding working hours of our Berlin and Boston offices, a peak day works out to a new deploy request roughly every two minutes. So what type of system do you build to support that volume of deploys to a monolithic architecture, from what is currently hundreds, and soon to be thousands, of software engineers? And most importantly, how do you make it fast?
With the frequency and volume of changes we’re making to our monolith, it quickly became clear that we would be deploying batches of changes. You can do the math; even if every single deployment takes only 10 minutes total, we would still need to deploy in batches. To add to this, deploy requests don’t come in a balanced, even stream. This means we need a way to organize, coordinate, and orchestrate batch deploys.
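To make that math concrete, here is a rough back-of-the-envelope sketch. It uses the change counts, the 18-hour window, and the hypothetical 10-minute deploy quoted above, and it additionally assumes that deploys run strictly one after another. The variable names are illustrative only.

```python
import math

# Figures quoted in the article.
AVERAGE_CHANGES_PER_DAY = 200
PEAK_CHANGES_PER_DAY = 500
DEPLOY_WINDOW_HOURS = 18

# The article's "even if" scenario, not a measured deploy time.
MINUTES_PER_DEPLOY = 10

# Assumption: deploys run one at a time, back to back, for the whole window.
window_minutes = DEPLOY_WINDOW_HOURS * 60
deploy_slots_per_day = window_minutes // MINUTES_PER_DEPLOY

# Smallest batch size that still gets every change out within the day.
min_batch_average = math.ceil(AVERAGE_CHANGES_PER_DAY / deploy_slots_per_day)
min_batch_peak = math.ceil(PEAK_CHANGES_PER_DAY / deploy_slots_per_day)

print(f"deploy slots per day: {deploy_slots_per_day}")
print(f"minimum batch size on an average day: {min_batch_average}")
print(f"minimum batch size on a peak day: {min_batch_peak}")
```

Under these assumptions there are only 108 deploy slots in a day, so even an average day cannot be served one change at a time. Because requests arrive unevenly, real batches would need to be larger than these minimums at busy times.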
Integrator is Wayfair’s internal deploy coordination system that allows us to organize this activity. Before I explain the current state of Integrator, I want to dive into how we arrived at building this tool. Because again, fierce pragmatism.
Initially we organized batches of unique software changes with source code branches. We would create shared branches (or in our case 5-6 different daily branches), and have engineers merge their features onto shared branches when they were ready to deploy. Our WOC then deployed each branch, reverting any individual change if issues arose. This process worked for a time, until it didn’t.
As the velocity of our software engineers increased, we encountered a “race to merge” point. This is the point with Git where individual engineers were “racing” to merge feature A and failing, because someone else had merged feature B while feature A was being merged into the batch. This is a known scaling limit of Git, one that Microsoft has described in its own work with Git. It’s also worth noting that we follow a Git branching strategy at Wayfair where code only makes it into the master branch once it is successfully deployed to production.
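The race can be illustrated with a toy model. The sketch below is not real Git and does not reflect Wayfair's tooling; `SharedBranch`, `fetch`, and `push` are hypothetical names that mimic how a push is rejected when the branch head has moved since the engineer last fetched it.

```python
class SharedBranch:
    """Toy stand-in for a shared batch branch (hypothetical, not real Git)."""

    def __init__(self):
        self.head = 0          # increments on every accepted merge
        self.features = []

    def fetch(self):
        """Return the head revision the engineer currently sees."""
        return self.head

    def push(self, feature, based_on):
        """Accept the merge only if nobody else has merged since `based_on`."""
        if based_on != self.head:
            return False       # comparable to a rejected, out-of-date push
        self.features.append(feature)
        self.head += 1
        return True


branch = SharedBranch()

# Two engineers fetch the same head at roughly the same time.
seen_by_a = branch.fetch()
seen_by_b = branch.fetch()

b_ok = branch.push("feature-B", seen_by_b)           # B wins the race
a_ok = branch.push("feature-A", seen_by_a)           # A loses: head has moved
a_retry = branch.push("feature-A", branch.fetch())   # A must re-fetch and retry

print(b_ok, a_ok, a_retry)
```

With only two engineers the retry is cheap. With hundreds merging onto the same branch, each retry can itself lose to yet another merge, which is why the approach stopped scaling.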
Evolutions of Integrator
It was at this point we decided it was time to build a software system that would orchestrate deploys and merge requests. We built a deploy queue, where individual feature developers would enqueue their changes to join the next deploy batch. When ready, the Integrator would then merge changes onto a release branch and process that branch through deployment. At the same time we gave our WOC a reasonable degree of control to monitor, and potentially reject, breaking changes from a batch. We also retained abilities such as fast-rollbacks and hyper-urgent workflow interrupting hot-fix deploys.
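As a rough illustration of the queue-and-batch idea, here is a minimal sketch. It is not Integrator's actual design: the class names, the fixed batch size, and the `release-N` branch naming are all assumptions made for the example, and rollbacks and hot-fix deploys are left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Batch:
    """A group of changes destined for one release branch (hypothetical model)."""
    release_branch: str
    changes: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


class DeployQueue:
    """First-in, first-out queue that hands out batches of enqueued changes."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size   # assumed fixed for this sketch
        self._pending = deque()
        self._batch_count = 0

    def enqueue(self, change_id):
        """A developer asks for their change to join the next batch."""
        self._pending.append(change_id)

    def pending(self):
        return list(self._pending)

    def next_batch(self):
        """Take up to max_batch_size changes, in arrival order, into a new batch."""
        self._batch_count += 1
        batch = Batch(release_branch=f"release-{self._batch_count}")
        while self._pending and len(batch.changes) < self.max_batch_size:
            batch.changes.append(self._pending.popleft())
        return batch


def reject(batch, change_id):
    """Operations staff pull a breaking change out of a batch before it ships."""
    batch.changes.remove(change_id)
    batch.rejected.append(change_id)


queue = DeployQueue(max_batch_size=3)
for change in ["feature-1", "feature-2", "feature-3", "feature-4"]:
    queue.enqueue(change)

batch = queue.next_batch()
reject(batch, "feature-2")

print(batch.release_branch, batch.changes, batch.rejected, queue.pending())
```

Because the queue serializes merges onto the release branch, developers no longer race each other: they enqueue once and wait for their batch.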
You might think that all of these options are available in standardized, off-the-shelf open source software or hosted CI/CD solutions. But the reality is that Wayfair is breaking ground building a hybrid cloud infrastructure model, and embraces a build-it vs. buy-it mindset. At the end of the day, the core concepts of CD are not so complex that we felt it would be worth outsourcing the orchestration of our system. In particular, we want to retain the capability to optimize the user experience of our software developers.
Our current, fourth (and, we believe, long-term) version of Integrator is completely separate architecturally from the monolithic application that it deploys. One of the shortfalls of building earlier versions of Integrator into the monolith was that a poor deploy could take down the tool needed to recover from that very same poor deploy. At the time, the upside was that we could get those earlier versions launched in days or weeks, but once we validated that the strategy and concept had traction, it was important to rebuild on strong architectural foundations.
Another aspect of the iterative growth of our system is the fact that we’ve had the good fortune to keep some extremely talented engineers who’ve grown right alongside the systems we’ve evolved. In particular for Integrator, we have one engineer, Andrew Wilson, who’s worked through all four versions of the tool to-date, and has risen to take on the role of Tech Lead for our Coordinated Deploy (Integrator) Team.
I have the good fortune to work closely with Andrew as his manager, and I interviewed him to hear his inside perspective on the version of Integrator we have today.
Andrew Wilson: On Integrator
Andrew has worked on Wayfair’s deployment software since September of 2016, from the launch of the first version of Integrator in 2017 to our most recent version released in late 2018. Continue reading for some highlights of our conversation:
DT: What has been the most recurringly difficult technical aspect of Integrator?
Andrew: Integrator isn’t really that technically crazy. At its core it is managing Git operations and kicking off pipelines. The greatest complexity comes from the sheer number of permutations that a batch can go through. Complexity in Integrator comes from managing multiple repos with hundreds of changes across all of those, while also handling the compliance automation for it all.
DT: Release Engineering sits at the intersection of hundreds of developers trying to complete their work through a deploy. How do you handle the pressure of this number of engineers all turning to our team?
Andrew: Personally, I try to make decisions that are going to be the most beneficial for everyone, implement things that seem the most fair, and avoid making decisions for others as much as possible.
DT: If you had to do it all over again knowing what you know now, what would you do differently?
Andrew: I would definitely push away from the chat-ops stuff as much as possible. A web UI was wanted and needed, but chat-ops wasn’t as necessary.
DT: What would you say has been the most fun to work on for Integrator?
Andrew: The most fun has been writing the business logic for it. That’s been the part that’s kept me enjoying the work: it’s very process-based, and writing the code is trying to capture and automate all of the scenarios that can happen with a batch.
DT: What is your personal opinion of trying to scale deployments for hundreds of developers?
Andrew: I think Integrator has worked well. On the surface it doesn’t seem like it should be much more complicated than deploying for a smaller number of people. But when you really start getting down into the edge cases, the complexity really starts growing. In particular, handling deploys that span multiple repos adds another degree of freedom and complexity for the system. We are constantly soliciting feedback from all of our software engineers, and hiring a lot of experienced folks from a broad range of backgrounds. During this time, no one has suggested a seriously different solution for deploying to our monolith than Integrator.
More to Come
Wayfair has rapidly scaled up its engineering organization, and our team and codebase have inevitably grown along with it. While doing so, adopting a DevOps mindset and engineering culture has become a top priority across all parts of our Tech organization, to allow us to continue to move fast as a company. CI/CD is an integral part of that transformation, and I expect to share more soon about how the structure of our software systems has matured, and also how our software teams have improved their own processes to support the rest of our engineers.