Our Self-Service Hybrid Performance Engineering Platform

At Wayfair, we pride ourselves on our speedy response times to issues during heavy web traffic to ensure a smooth customer experience. As we scale and continue adding products to our catalog, it is imperative that our backend infrastructure is able to handle increased traffic, especially during retail peaks and promotional sales.

As a performance engineer focused on our underlying infrastructure capacity, scalability, and reliability, I have realized the importance of executing real-life use cases against our production environment, particularly targeting specific loads to mimic holiday sales. Testing is important, even in production.

While the use of production systems for performance testing may earn the scorn of many performance aficionados, playing out typical user scenarios at expected peak holiday scale via mock drills to identify vulnerabilities is good praxis. The Site Reliability Engineering (SRE) team adheres to stringent service level objectives, even during times of intense customer traffic, so performance testing in production helps us uncover incredibly valuable information about the resiliency of our infrastructure.

How Do We Test at Scale?

In keeping with this rationale, my team designs performance tests that can be used to target and tune customer creation, product search, adding items to cart, and order placement, among other things. As we want to run controlled performance tests in production, we make sure we can produce near-identical synthetic traffic from our load generators that hits our Domain Name System (DNS), Content Delivery Network (CDN), and load balancers. For this purpose, we spin up multiple virtual load-generating instances in the cloud, distributed across geographies, to simulate web traffic originating from different regions.

To truly measure the resiliency of our backend systems, we also need component load tests that stress a specific service in isolation, typically via load testing internal endpoints or by mocking application calls going into the specific component under test. If the previous two approaches aren’t feasible, we can replay actual production logs. By understanding the load handling capabilities and performance of minor and major components in relative isolation, we can plan for any increase in infrastructure capacity.

With more than 2,000 engineers working for Wayfair, the challenge we face is maintaining a sane source code control system (SCCS). When the number of branches and commits into multiple repositories increases, so too does the latency experienced by our engineers. To address this, my team kick-started a performance-driven infrastructure project focused on ensuring our source control system also scales at the rate of engineers committing code – we’re currently pushing 30,000 commits per quarter. Working with smart engineers from multiple disciplines, I created load test scenarios that imitate a typical engineer’s workflow while using two different SCCSs for comparison, then executed them with permutations of the number of users, type of activity, hit rate, number of backend nodes in the cluster, and their CPU/memory configuration.
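To make the idea of running permutations concrete, here is a minimal sketch of how such a test matrix could be generated. The parameter names and values are hypothetical placeholders, since the real ones are not listed here; only the five dimensions come from the description above.

```python
from itertools import product

# Hypothetical values for illustration only; the actual test parameters
# are not given in this article.
USERS = [50, 100, 200]                  # concurrent simulated engineers
ACTIVITIES = ["clone", "fetch", "push"]  # type of source control activity
HIT_RATES = [1, 5]                       # requests per second per user
BACKEND_NODES = [2, 4]                   # nodes in the cluster
NODE_SIZES = [("4cpu", "16gb"), ("8cpu", "32gb")]  # CPU/memory configuration


def build_test_matrix():
    """Return one scenario dict for every combination of the parameters."""
    keys = ("users", "activity", "hit_rate", "nodes", "node_size")
    combos = product(USERS, ACTIVITIES, HIT_RATES, BACKEND_NODES, NODE_SIZES)
    return [dict(zip(keys, combo)) for combo in combos]


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(f"{len(matrix)} scenarios to execute")
    print(matrix[0])
```

Generating the matrix up front makes it easy to see how quickly the number of runs grows with each added dimension, and to prune combinations that are not worth executing.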

As on most e-commerce sites, the ability to search for specific products among the millions of SKUs available is key to converting search queries into actual orders. At Wayfair, we have deployed SolrCloud for most things that can be searched for by our customers and employees. As we continue to increase the variety of merchandise on our platform, fast and context-based search plays a crucial role. To test whether the platform is capable of handling more than the expected volume of traffic during peak holiday times, I devised a multi-pronged approach to performance measurement. I replayed production SolrCloud logs against the passive cluster in a distributed manner while throttling the request rate, then dialed the rate up gradually in steps to mimic 1x through 8x the volume of the previous peak baseline. In conjunction with this, I also hit other search endpoints that didn’t have any coverage in production logs, at desired request rates. This approach circumvented the need to simulate synthetic searches that may barely cover the variety and quality of data inputs from real customers.
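The stepped, throttled replay can be sketched roughly as follows. This is an illustration under stated assumptions, not the tooling we actually used: the function names, the baseline rate, and the step duration are all hypothetical, and the `send` callback stands in for whatever issues a logged query against the cluster.

```python
import time


def step_schedule(baseline_rps, multipliers=range(1, 9), step_seconds=300):
    """Build one step per multiplier (1x through 8x by default).

    baseline_rps is the request rate of the previous peak; step_seconds
    is an assumed duration for each step.
    """
    steps = []
    for m in multipliers:
        target = baseline_rps * m
        steps.append({
            "multiplier": m,
            "target_rps": target,
            "interval_s": 1.0 / target,  # pause between requests at this rate
            "requests": int(target * step_seconds),
        })
    return steps


def throttled_replay(log_lines, rps, send, sleep=time.sleep):
    """Replay logged queries at roughly `rps` requests per second.

    Simplification: the time spent inside send() is ignored, so the real
    rate is slightly below the target.
    """
    interval = 1.0 / rps
    sent = 0
    for line in log_lines:
        send(line)
        sleep(interval)
        sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical baseline of 500 requests per second.
    for step in step_schedule(500.0):
        print(step["multiplier"], step["target_rps"], step["requests"])
```

In a distributed setup, each load generator would run `throttled_replay` with its share of the step's target rate, so that the workers together add up to the intended multiple of the baseline.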

Performance Engineering as a Service

One of our biggest strengths as an engineering organization is the ability to ship code through our CI mechanism at breakneck speed. With Agile and CI practices, we have all learnt that functional test automation is crucial in the systems development life cycle (SDLC) process, but what about performance-based tests? For Wayfair’s scale, growth rate, and continued success, performance is everything. With an ever-evolving tech stack developed and maintained by many engineering teams, it isn’t practical for a single team to be the sole performance torchbearer across multiple backend systems.

After some introspection, I decided to create a self-service performance engineering platform that our entire development community could leverage. This would instill performance engineering discipline across all functions and create some community-based momentum behind performance engineering as a whole.

To roll this out, the Performance Engineering team embedded themselves with several application development teams to harvest accurate test data from databases and design the right tests. They then executed the tests and validated and analyzed the results before handing it all off to the respective team owners for future regression runs, in addition to baking these tests into the overall CI pipeline.

As Wayfair moves towards creating a hybrid model for deploying and maintaining infrastructure on private data centers and the cloud, Performance Engineering as a Service is being made available in the same way. The diagram below captures our hybrid performance testing, where the orchestration mechanism controls the execution of load tests, simulating real workflows using a controller and workers/load generators against our production servers while providing real-time metrics and dashboards. These can be executed from our data centers or in the cloud by spinning up Elastic load generation instances on demand, using state-of-the-art automation built by the sharp folks from our SRE team.

Yet another advantage of treating production systems like a performance lab is the ability to adjust the weights of a single server or a set of servers at the load balancing layer, relative to other servers in the cluster. This means stress testing the selected servers with live web traffic by dialing up the number of requests they receive relative to other servers in the cluster. This technique can be very useful when creating a formal synthetic load test isn’t feasible, when replaying logs wouldn’t be effective, or when you’re short on time.
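To get a feel for the arithmetic, here is a minimal sketch that assumes the load balancer distributes requests in proportion to server weights (as in weighted round robin). The server names and weights are hypothetical, and real load balancers may restrict weights to integers or a fixed range.

```python
def traffic_share(weights):
    """Return the expected fraction of requests each server receives,
    assuming traffic is split in proportion to the weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one server needs a positive weight")
    return {server: w / total for server, w in weights.items()}


def weight_for_target_share(weights, server, target_share):
    """Return the weight `server` needs to receive `target_share` of the
    traffic while every other server keeps its current weight.

    Solves w / (w + others) = target_share for w.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    others = sum(w for s, w in weights.items() if s != server)
    return target_share * others / (1 - target_share)


if __name__ == "__main__":
    # Hypothetical four-node cluster with equal weights.
    cluster = {"web1": 1, "web2": 1, "web3": 1, "web4": 1}
    cluster["web1"] = weight_for_target_share(cluster, "web1", 0.5)
    print(cluster["web1"])
    print(traffic_share(cluster))
```

Working out the target share beforehand helps keep the stressed server within a load level you are prepared to observe, since the remaining servers absorb correspondingly less live traffic.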

What’s Next?

Future automation initiatives coming from our Performance Engineering team include the adoption of open source tool sets, managed as code, to fit Wayfair’s hybrid model of infrastructure deployment. Enabling system-level benchmarking and performance assessment prior to installing application software is also on the cards, together with the capability to automate this process using machine learning algorithms that predict KPI values for hypothetical load levels. Over the next two quarters, we’re also looking into building effective automated performance tests that can report on intricate details at the component and system level. The Performance Engineering crew will be happy to report back on our progress!