
Wayfair’s Scaling of Infrastructure Management Processes (IMP)


At Wayfair, we operate at a massive scale, the result of tremendous growth over the last decade. During this time, we have increased our traffic more than twenty-fold. To keep pace and fulfill our technology platform needs, scaling has been vital. In fact, it’s been at the core of everything we do throughout this timeframe.

For me personally, scaling is a huge passion, and as the co-lead of Wayfair’s Platform as a Service (PaaS) initiative, I’m going to share my insights on how scaling concepts apply to technology and the processes around it.

Typically when a technologist mentions “scale,” the first things that come to mind are systems, software and hardware design. The common thread connecting these elements is that they help your applications serve customers efficiently and make their experience world-class.

At this point, it’s important to note that your technical platform is a product of both engineering work and the processes that enabled it.

In my previous blog, I focused on Wayfair’s journey to the Cloud, providing an overview of the business processes and technical design decisions that needed to be made to ensure its success. In this blog, I will shift my focus to Infrastructure Management Processes (IMP). These play an important role in developer enablement by helping our software and system teams increase their velocity and improve system stability.

What is an Infrastructure Platform?

Before diving into the IMPs, it’s important to first understand that the infrastructure platform is a system that provides computing resources and related components. Some of the common platform components for private and public clouds include:

  • Networking, storage
  • OS and its configuration
  • Support software components providing:
    • Traffic management (routing, distribution, quality of service)
    • Compute resource execution (process managers, virtual machines, container schedulers)
    • High availability
    • Secrets distribution
    • Telemetry: metrics, logging and tracing
    • Identity and access management

Some examples of the basic platform that the code runs on are (sketched in code below):

  1. A server connected to the network with some storage attached
  2. A Linux OS installed and its parameters tuned
  3. Java / Python / PHP packages and process managers
  4. SSH configured to accept connections and provide shell access to specific users
  5. Encrypted secrets placed in a directory for applications to use
  6. A MySQL service installed and configured on the same server
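For illustration only, here is a minimal sketch of how the first few items above might be expressed with an IaC tool such as Terraform (which comes up later in this post). All names, images and sizes are assumptions; items like OS tuning, SSH access, secrets placement and MySQL setup would typically be layered on via configuration management rather than this provisioning code.

```hcl
# Hypothetical sketch: one server with network connectivity and attached storage.
resource "google_compute_instance" "app_server" {
  name         = "app-server-01"       # assumed name
  machine_type = "e2-standard-4"       # assumed size
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11" # the Linux OS from item 2
      size  = 100                      # attached storage in GB (item 1)
    }
  }

  network_interface {
    network = "default"                # network connectivity (item 1)
  }

  # Installs language runtimes and a process manager (item 3); parameter
  # tuning, SSH, secrets and MySQL (items 2, 4-6) would follow separately.
  metadata_startup_script = "apt-get update && apt-get install -y openjdk-17-jre python3 php supervisor"
}
```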

What are Infrastructure Management Processes (IMP)?

The topic of IMP is rather broad and for the purpose of this article, I’ll focus on defining, provisioning and maintaining a platform for applications to run on or interact with. Here is an example of a simple process where a developer requests a system to run their code:

  1. A request to provision a Linux application stack is made, either verbally or via a ticket, to the operations team or to a person assigned the task.
  2. The operations team provisions the server, performs manual configuration of the support services and copies encrypted secrets to a directory.
  3. A record of these steps is added for auditing and inventory management purposes.

While this example helps illustrate the process, it does not necessarily meet the requirements of your environment or current best practices.

The challenges of IMP scaling

As is typical in most industries, one size does not fit all, especially when it comes to processes. While some may disagree, the fact is that a process which works for one company does not always work for another. What are the differentiating factors here?

First let’s review the main goals of standardized and optimized processes. These include:

  • Improving quality and consistency
  • Increasing productivity
  • Reducing workflow time to achieve better business outcomes

Before getting started, there are a few things to keep in mind:

  • Every process has a cost and an implementation time, which must be weighed against the potential benefits.
  • Just because a process is simple or more manual doesn’t mean it isn’t a fit for your environment. The same is true for modernized, highly automated processes that require significant investments of engineering resources to build and maintain.
  • Companies with a low volume of engineering work and few employees often do not require fully automated deploy, provisioning and management pipelines. Conversely, companies with large engineering teams may struggle to operate with processes that lack sufficient automation.
  • Leveraging existing open-source or commercial “cookie-cutter” processes can help simplify a deployment by removing some manual steps and standardizing on restricted, predefined flows.
  • Larger “technology first” companies are usually better off integrating a mix of products into their own flows, because attempts to adjust those flows to a specific product are rarely successful.
  • In the end, the biggest challenge is striking the right balance between simplicity and underlying complexity while also keeping pace with the growth of the business.

Evolution of Wayfair’s IMP

With Wayfair’s rapid revenue growth and the continued expansion of our technology department, we have evolved significantly. What was once a small “web operations” team with a dozen engineers has become an “Infrastructure and Platform Engineering” group of a few hundred engineers, partnered with a few thousand software developers!

At the same time, our system designs have continued to mature, allowing us to deliver the best experience to our customers. Over this period, our processes have gone through multiple transformations that reflect our growth and changing business requirements.

We can break these transformations into distinct stages, each with its own specific focus:

Early Formation

  • Dozens of technology employees, a few servers and configurations, and a low rate of change.
  • Mostly manual or semi-manual (scripted) configuration, provisioning and deploy processes. 
  • Verbal or ticket based change requests and manual change logging.

Continuous development and standardization

  • Hundreds of technology employees, hundreds of servers, dozens of configurations, and a moderate to high rate of change.
  • Automated and standardized configuration change pipelines, semi-manual provisioning, improved deploy processes (hundreds of deployments/day, deploy bundles for monolithic apps).
  • Ticket or Form based change requests, automatic change logging, and development of request processing SLAs.

Rapid scaling and optimization (current)

  • Thousands of technology employees, thousands of servers, hundreds of configurations, and a high rate of change.
  • Automated and standardized configuration change pipelines, automated provisioning by infrastructure teams, direct Infrastructure as Code (IaC) contributions, GitOps, and development of self-service options via “The Easy Button” (PaaS).
  • Ticket/form-based change requests and self-service options made available via UI/API or IaC contributions.

As you can see, the higher rate of change, the growing number of managed configurations and employee growth drove the need for process improvements and optimizations. This is a natural progression, reflecting the increasing cost of operations, the value of reduced turnaround time and the decline in the number of defects introduced. Taken together, this signifies the importance of investing in the development of more modern and complex IMPs.

Current State and Challenges

At Wayfair, our work is never done. Teams are always exploring the changing landscape of Infrastructure Operations and are always on the lookout for new tools that can make our flows more streamlined and cost efficient.

This change is happening outside of Wayfair as well. The industry has been hard at work providing new ways to manage and audit infrastructure. Take IaC and GitOps: both promise better integration into common software development flows, including version control and CI/CD pipelines.

GitOps for Infra Provisioning

When we started shifting to Infrastructure as Code and implementing GitOps, we liked the declarative nature of frameworks such as Terraform and the ability to more effectively observe our desired state. As our team built out 95% of our cloud deployments, we followed these principles, which helped us to a certain extent. That being said, along the way we also discovered some of the limitations of this approach.

Here are three examples:

Problem #1: Having the “state” of your Infrastructure in three different places

While GitOps promotes centralizing the state and history of your infrastructure in Git repositories, it only covers the intent (declaration) of what it should be, not what it actually is. The more complete representation of a component’s state, as seen by the API or framework structs, is stored somewhere else (see the sketch after this list).

  • Infra components can be rather complex and consist of more properties than your intended configuration describes. They are also prone to configuration drift, which can be caused by API or DataLayer (DL) changes, by changes made outside of your provisioning pipelines, or by race conditions inside them.
  • Credential management also happens outside of Git due to the nature of the process and more sophisticated access control and audit requirements. 
  • When doing a comprehensive change audit, you need to look at all sources and not just Git history. This can lead to more customizations and in-house development.
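To make the gap concrete, here is a small, hypothetical Terraform sketch (the bucket name is made up): the Git-tracked declaration covers only a few properties, while the live object the cloud API returns carries many more, any of which can drift when changed outside the pipeline.

```hcl
# Declared intent, versioned in Git: only a handful of properties.
resource "google_storage_bucket" "reports" {
  name     = "example-team-reports"   # hypothetical bucket name
  location = "US"
}

# The actual resource, as the API sees it, also carries labels, lifecycle
# rules, retention settings and generated metadata that this declaration
# never mentions. Changes made outside the pipeline surface as drift the
# next time `terraform plan` compares intent with reality.
```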

Problem #2: Good for Engineers at Small to Medium Scale. Tough for Automation or Large Scale

Depending on the provisioning framework being used and the number of deployed systems, your flows may not scale as needed. For instance, when managing Google Compute Instance Groups, the amount of code required to provision 100 systems is minimal: you pick a profile and write one block of code, with active cloud components taking care of the rest (sketched below). It works, and engineers can use it frictionlessly.

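As a rough illustration (assuming GCP and Terraform; the names and sizes below are made up), one instance template plus one managed instance group block is enough to ask for 100 similar systems, with the cloud keeping them running:

```hcl
# Hypothetical profile shared by all 100 systems.
resource "google_compute_instance_template" "web" {
  name_prefix  = "web-"
  machine_type = "e2-standard-4"

  disk {
    source_image = "debian-cloud/debian-11"
  }

  network_interface {
    network = "default"
  }
}

# One block; the managed instance group keeps 100 instances running.
resource "google_compute_instance_group_manager" "web" {
  name               = "web-mig"
  base_instance_name = "web"
  zone               = "us-central1-a"
  target_size        = 100

  version {
    instance_template = google_compute_instance_template.web.id
  }
}
```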

However, things become more complicated in cases where components are unique and cannot be easily grouped together. In that case, you need to write a lot of code to describe each entity individually. This is a significant issue for engineers because it may require a lot of manual data editing and copy-pasting, which inevitably leads to mistakes and hard-to-spot problems. At this point, many decide to templatize and use code generators (sketched below).

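A common way to templatize, shown here as a hypothetical sketch, is to move the per-entity data into a structured file and loop over it with for_each, so each unique component no longer needs a hand-written block (the file name and fields are assumptions):

```hcl
# buckets.yaml (maintained by hand or emitted by a code generator):
#   - name: team-a-reports
#     location: US
#   - name: team-b-exports
#     location: EU
locals {
  buckets = yamldecode(file("${path.module}/buckets.yaml"))
}

resource "google_storage_bucket" "unique" {
  for_each = { for b in local.buckets : b.name => b }

  name     = each.value.name
  location = each.value.location
}
```

This keeps the repetition out of the resource blocks, but it also shifts the problem into loosely structured data files and the tooling around them, which is where the issues described next tend to appear.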

Using code generators to instantiate your deployments works to a certain degree, but at some point it starts presenting another set of problems: decreased visibility, degraded performance for concurrent Git modifications, automation complexity due to data parsing, and challenges implementing CRUD functions.

Basically, you’re fighting with loosely structured data, dozens of different schemas and text-file processing. When looking at it holistically, it becomes clear that we keep implementing workarounds for something that is naturally better served by more advanced backends, such as SQL/NoSQL databases.

Problem #3: Every additional framework has a cost

Using IaC and GitOps helps remove some of the barriers to collaboration between Infra and Software teams. For example, having components in version control allows you to extend access to your software partners and improve visibility. The core benefit of this self-service approach is that now everybody can make changes without filing a ticket. This is definitely a step in the right direction.

While your infrastructure is now more open, the barrier to entry is still relatively high, because any engineer wanting to make a change needs to learn the frameworks being used, adhere to established standards, pass code reviews, and so on. From our experience, this works best for teams that want more flexibility and are willing to commit time and effort to mastering the established flows. For others, unfortunately, this is not a great option.

Platform as a Service (PaaS), AKA “The Easy Button”

In 2020 alone, our team opened, reviewed and merged more than 9,000 Git pull requests to provision, modify or delete components. Realizing the shortcomings of IaC/GitOps approaches at scale, we needed to supplement these options with a simpler way of provisioning. To make it intuitive and eliminate the requirement to learn new frameworks, while reducing complexity and time to deliver, we decided to create our own abstractions.

A natural question at this point would be, “Why not use the Cloud UI/CLI provisioners that are already available?” While it’s true that vendor-provided API wrappers, such as UI and CLI tools, are available, they rarely cover everything it takes to make such a component usable in your environment in one step.

For instance, when provisioning a Google Cloud Storage (GCS) bucket, users are usually required to apply a pre-approved set of parameters and policies. These prevent accidental data disclosure while granting access to your application identities and distributing credentials to your servers or k8s namespaces (a rough example follows).
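As a hedged sketch of what such pre-approved parameters and policies might look like in Terraform terms (the bucket, project and service account names are hypothetical, and credential creation and distribution happen in a separate flow not shown here):

```hcl
resource "google_storage_bucket" "app_data" {
  name     = "example-app-data"          # hypothetical bucket name
  location = "US"

  # Pre-approved guardrails against accidental data disclosure.
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
}

# Grant access to the application's identity rather than to individuals.
resource "google_storage_bucket_iam_member" "app_access" {
  bucket = google_storage_bucket.app_data.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:example-app@example-project.iam.gserviceaccount.com"
}
```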

But here’s the catch. This process requires a user to go through multiple steps of configuring components and then follow some other established flow for credential creation and distribution. While this is certainly possible, it increases the chances that things are not done correctly. It can also lead to the need to create flow-specific wrappers.

With many components facing the same type of challenges, we chose another direction—we decided to create our own internal platform consisting of UI/Internal API layers and functional providers.


This platform provides additional organization and visualization options for existing apps. For example, it takes into consideration user and API documentation, monitoring data and ownership information.

Next, to minimize the cost of in-house development, we tasked ourselves with reusing existing IaC frameworks and other open source components. This allowed us to maintain community or vendor support, utilize the existing knowledge our internal teams have and solve some of the GitOps problems I talked about earlier.

Specific examples include Backstage from the CNCF, which helped us build our application catalog, and Terraform Enterprise, which allowed us to use standard Terraform modules. The latter also provides full API coverage and the built-in ability to use IaC without instantiating every single component within your Git repository.


At this point, we started integrating the most typical operational tasks into the platform to further boost our self-service capabilities. One of the recent additions is k8s pod restarts, which my colleague Shai Sachs described in his great Keep Calm and Delete Your Pods post.

While we are only at the beginning of our Infra “Platformization journey,” our team has already received many positive signals from users who have indicated that we’re on the right track. As new functionality comes online and the user experience improves, we are seeing a consistent uptrend in both self-service workflow runs and active users engaging with the Platform.

There has also been tremendous adoption—hundreds of engineers are using the platform on a weekly basis, running thousands of workflows per month.


There are still many challenges and opportunities ahead of us but teams are approaching them with great excitement. After all, striving for continuous improvement is a part of Wayfair’s DNA!

Want to work at Wayfair? Head over to our Careers page to see our open roles. Be sure to follow us on Twitter, Instagram, Facebook and LinkedIn to get a look at life at Wayfair and see what it’s like to be part of our growing team.
