## Introduction

At Wayfair, we have typically discussed scalability from an infrastructure perspective. For example, when considering how many servers to provision for an upcoming holiday season, our usual response has been to calculate capacity based on customer volume projections from the marketing team. While this data is no doubt useful in making such decisions, it is not always sufficient. Basing additional server capacity on projected customer volume assumes that throughput will continue to scale linearly at higher user loads and that adding hardware will always help absorb additional load. That is not always the case.

It’s easy to understand the temptation to account for increasing load with more hardware; as hardware has become less and less expensive, simply over-provisioning (adding more capacity than we will ever need) to avoid performance bottlenecks seems increasingly attractive. Unfortunately, performance bottlenecks are more likely to arise in the application design than in the hardware, and most software systems behave non-linearly at higher scale. As such, the temptation to simply throw more hardware at a bottleneck should be resisted: it will not necessarily improve performance^{1}.

However, even if additional hardware *did* always improve performance, the extra capacity would still dent the allocated budget. Software teams developing a new service often request more hardware than necessary from our infrastructure team because they have no historical data to draw on. Such unnecessary expenditures can be avoided by adopting a scientific, data-driven methodology for calculating capacity, such as the model Wayfair is currently developing.

A typical approach to determining necessary capacity is to run load tests at every user load level to find the peak throughput. The input load level at which throughput plateaus or begins to decline is identified as the maximum throughput/capacity of the system. This approach works quite well against a single server or a relatively small group of servers, where the peak throughput can be reached reasonably quickly. However, in distributed systems that are designed to be horizontally scalable, reaching the peak is not as straightforward. In such cases, we can use a mathematical model known as the Universal Scalability Law (USL), devised by Dr. Neil Gunther, which accounts for the fact that computer systems don’t scale linearly due to *queueing* effects.

In the sections below, I demonstrate the use of this model against a distributed system used at Wayfair to determine peak capacity, and track the effects on capacity while doubling the load with only a few data points obtained via load testing.

## Why systems don't always behave linearly

Linear scalability is ideal, yet systems that actually scale linearly are rare. It’s very useful to understand why, because a correct understanding of the reasons for and sources of sublinear scaling is key to building more scalable systems^{2}. A system is perfectly linear if, say, 1 node handles 2,000 transactions per second and 5 nodes handle 10,000. Such a system would be 100% efficient. In reality, there is almost always some efficiency loss, which means that systems are not linear. In fact, real systems not only tend to be sub-linear, but usually exhibit retrograde scalability at some point: as you scale up, the system performs less efficiently (see Figure 1). This nonlinearity is attributed to queueing or contention within computer systems and cross talk among them.

*Fig. 1: Linear vs sub-linear behavior with increasing (hardware) nodes*

## Queueing in distributed systems

So, what is queueing after all, and what does it have to do with capacity planning? We experience queues in everyday life—at grocery checkout counters, airports, traffic lights, sports and music events, and so on; even in seemingly trivial situations like these, you can see queueing theory at work. Consider the line of customers at the grocery store checkout lane. The *queue* includes not only the customers **waiting** but also the customer currently **in service**. So, if there are 5 checkout lanes and only 2 customers at the counters, there is still available capacity to handle more customers. But when there are a total of 5 customers, with 1 customer being serviced at each counter, the capacity or utilization is 100%. The 6th customer to arrive at any counter now has to wait until the current customer exits. During weekends, there are typically many more shoppers at the store, and customers’ arrival rate at the checkout counter is variable. The service time at the counter is also variable, because it depends on the number of items being purchased and how long the counter agent takes to scan them. The total time a customer spends during the checkout process (also called residence time) is the sum of the time spent waiting in the queue and the service time at the counter.

Queueing theory comes down to answering simple questions like the following^{3}:

- How likely is it that things will queue up and wait in line?
- How long will the wait be?
- How busy will the server/person/system servicing the line be?
- How much capacity is needed to meet the expected level of demand?

These questions can be rephrased as “how likely are we to lose business due to long waits or not enough capacity?”

Translating the grocery store analogy to computer systems: customers are the units of work that the system serves (for example, a web request or a database query). Cashiers at the counters are the servers that process those units of work (for example, a web server or a database server). Queues are where units of work wait when the servers are busy and cannot process work as it arrives. Residence time in a queueing system corresponds to response time, or latency.
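As a minimal, hypothetical sketch of the residence-time idea (not part of the original analysis), the classic M/M/1 queue model expresses residence time as R = S / (1 − λS), where S is the service time and λ the arrival rate:

```python
def mm1_residence_time(arrival_rate, service_time):
    """Residence time (queue wait + service) for an M/M/1 queue.

    arrival_rate: units of work arriving per second (lambda)
    service_time: seconds to service one unit of work (S)
    Utilization rho = arrival_rate * service_time must stay below 1.
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("unstable queue: utilization >= 100%")
    # R = S / (1 - rho): latency grows sharply as utilization approaches 1
    return service_time / (1.0 - utilization)

# At 50% utilization, residence time is already double the bare service time.
print(mm1_residence_time(arrival_rate=5.0, service_time=0.1))  # 0.2 seconds
```

This is why 100% utilization at the checkout counters is a warning sign: latency climbs steeply well before the queue fully saturates.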

Queueing causes contention for resources, thus reducing the available capacity—the USL attempts to capture its magnitude via a mathematical model. More regarding why systems often don’t scale linearly and why they sometimes show retrograde scalability is explained by the USL in the next section.

## Universal Scalability Law

*Fig. 2: Effects of contention and coherency on linearity*

Dr. Neil Gunther defines scalability through the universal scalability law (USL)^{4}. He provides three reasons for the observed non-linearity in the form of three C’s—**Concurrency**, **Contention**, and **Coherency**. Concurrency is the number of customers/processes/units of work that arrive at the server in parallel, illustrated by the slope of the linear-rising scalability shown in Figure 2, chart A. Contention is essentially the queueing up of requests, per the theory above. Coherency is the cost of the constant cross talk among servers in distributed systems that are highly available and follow the eventually consistent model, required to keep the data consistent.

These three C’s are captured in the form of a parametric model based on rational functions with three parameters. This model makes the statistical fitting process universal and also simplifies the overall procedure. The example use case discussed below has been modeled with three coefficients.

### The 3-parameter USL model

At any given load N, the scalability model as defined by the USL is:

X(N) = γN / (1 + α(N − 1) + βN(N − 1))

The numerator captures ideal parallel scaling: in terms of absolute throughput, X(N) = N X(1), i.e. the overall throughput X(N) increases in direct proportion to N, the user load. The single-user throughput X(1) doesn’t change and acts as the constant of proportionality, denoted by γ, so in the ideal case X(N) = γN. The parameter γ represents the slope of the line associated with ideal parallel scaling^{5}.

The second term in the denominator represents the degree of serialization of shared writable data and is parameterized by the constant α. The third term in the denominator represents the penalty incurred for maintaining consistency of shared writable data and is parameterized by the constant β. When α = 0 and β = 0, the system is linearly scalable.

The above equation holds for both software and hardware scalability. For software scalability, the independent variable N represents the number of users on a fixed hardware configuration. For hardware scalability, p represents the number of physical processors, and the number of processes executing per processor is held fixed, so that the ratio N/p remains constant.
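As an illustrative sketch (the coefficient values below are made up for demonstration, not Wayfair's fitted values), the 3-parameter USL throughput function can be written directly from the equation above:

```python
def usl_throughput(n, gamma, alpha, beta):
    """3-parameter USL: X(N) = gamma*N / (1 + alpha*(N-1) + beta*N*(N-1)).

    gamma: ideal single-user throughput (slope of linear scaling)
    alpha: contention (serialization) coefficient
    beta:  coherency (cross-talk) coefficient
    """
    return gamma * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# With alpha = beta = 0 the system is perfectly linear: X(N) = gamma * N.
print(usl_throughput(10, gamma=100, alpha=0, beta=0))  # 1000.0

# With beta > 0, throughput eventually turns retrograde: pushing the load
# well past the peak yields *less* throughput than a moderate load.
peak_load = ((1 - 0.02) / 0.0005) ** 0.5  # N_max = sqrt((1 - alpha) / beta)
print(usl_throughput(400, 100, 0.02, 0.0005) <
      usl_throughput(round(peak_load), 100, 0.02, 0.0005))  # True
```

Setting dX/dN = 0 gives the peak load N_max = sqrt((1 − α)/β), which is how the capacity ceiling is located once the coefficients are known.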

The USL model works for the multi-tier architectures used in e-commerce because performance metrics like throughput and response time are measured at the system level, as aggregate transactions rather than individual subtransactions^{6}. The key assumption for applying the USL to multi-tier architectures is homogeneity of the load test platform, i.e. the test executes a homogeneous workload and maintains a homogeneous configuration of fixed hardware components as the user load N is varied.

### Example

To demonstrate the effectiveness of the USL model, I used the SolrCloud platform that powers Wayfair’s customer-facing product keyword searches in addition to various internal applications. I obtained performance test data (user load vs. throughput) from a test SolrCloud cluster, evaluated for software scalability (a fixed number of hardware nodes in the cluster), and plugged it into the USL. The load testing preceded the application of the USL model; the charts below graph both the modeled and measured data, and their closeness is clearly visible. Note that all data presented here is for illustrative purposes only.

The open source software package R provides the built-in ‘nls’ (nonlinear least squares) solver, which determines the nonlinear (weighted) least-squares estimates of the parameters of a nonlinear model^{7}. The Nmax found with R is 602, with a projected throughput of 282. The R model also projects the throughput if we were to double the user load; in this case it shows retrograde scalability, i.e. the projected throughput at a load of 1280 users would be lower than that observed at 320 users.
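The article fits the model with R’s nls; as a purely illustrative alternative (synthetic data and a coarse grid search rather than a real nonlinear least-squares solver), the estimation idea can be sketched in Python:

```python
import math

def usl(n, gamma, alpha, beta):
    """3-parameter USL throughput model."""
    return gamma * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def fit_usl(data):
    """Coarse grid search for (gamma, alpha, beta) minimizing squared error.

    data: list of (load N, measured throughput) pairs from load tests.
    A real analysis would use a proper solver such as R's nls; this grid
    search is only a sketch of the estimation idea.
    """
    gamma = data[0][1] / data[0][0]  # seed gamma from the lowest-load point
    best, best_err = None, float("inf")
    for i in range(100):                      # alpha in [0, 0.099]
        for j in range(100):                  # beta in [0, 0.00099]
            alpha, beta = i / 1000, j / 100000
            err = sum((x - usl(n, gamma, alpha, beta)) ** 2 for n, x in data)
            if err < best_err:
                best, best_err = (gamma, alpha, beta), err
    return best

# Synthetic measurements generated from known coefficients, for illustration.
true_params = (2.0, 0.03, 0.0004)
points = [(n, usl(n, *true_params)) for n in (1, 40, 80, 160, 320)]
gamma, alpha, beta = fit_usl(points)
n_max = math.sqrt((1 - alpha) / beta)  # load at which modeled throughput peaks
```

With the fitted α and β in hand, Nmax = sqrt((1 − α)/β) gives the load at which the modeled throughput peaks — the same quantity R reports as Nmax in the analysis above.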

*Fig. 3: 3-parameter USL model*

### Analysis

From Figure 3 above, we can see the values the R model has calculated, displayed in the legend. Let’s understand their significance. The observed coefficients are α = 0 and β > 0, representing an *incoherent-only* system. Such a system has no significant contention in the workload (it is linearly scalable in that respect), but the nodes in the cluster exhibit cross talk/data exchange coherency, which contributes to retrograde scalability. This means that as we scale horizontally by adding more nodes, we will see a faster drop in peak throughput. Examples of systems that exhibit this behavior are OLAP, scientific computing, data mining, and decision support software. OLAP is usually concerned with complex aggregate queries across large data sets, such as querying all customers’ bank accounts, product keyword searches (our case study here), recommendations, etc. OLAP databases are usually populated via batch queries—all data is inserted at the same time—and the data usually originates from an OLTP application (e.g. SQL Server); a batch query scans the source system and imports the data into the OLAP database^{8}. SolrCloud may therefore be considered an OLAP system, as it facilitates business intelligence that lets users compute complex statistical aggregations over result sets.

It is important to realize that the max measured throughput of 295 was 6 times the projected traffic (requests/sec), and that adding one more node to this SolrCloud cluster of 90+ nodes would have only marginally improved linear scalability, if at all. Doing so would also have increased the effects of cross talk among nodes, thus reducing throughput.

Further diagnosis of the drop in throughput at a higher load level (8×) revealed opportunities for TCP/IP kernel parameter optimizations that should help level off the throughput. Stopping, or reducing the frequency of, the multiple jobs that populate SolrCloud indices with OLTP data via batch queries was also identified as a way to mitigate the effects of the extra load. However, these changes were deemed non-critical, given that the system could handle the expected customer traffic with some buffer capacity. The overall testing effort gave the software and infrastructure teams the necessary confidence ahead of the holiday season.

Though the USL points to retrograde scalability at higher load, per the above analysis that is expected behavior for SolrCloud clusters. For other systems, however, the same signature may instead indicate an application design flaw that warrants further investigation.

## Conclusion

From the analysis above, it should be evident that we cannot always solve scalability problems by merely adding more hardware. While the marginal cost of adding more hardware may be low in terms of actual dollar amounts, the benefits are almost entirely wiped out in our example scenario.

If engineers do not adopt a more precise, scientific approach to capacity planning, they will reach a point at which they are throwing more money at the problem while seeing diminishing returns due to the coherency penalty (β > 0). While the USL is not to be taken as gospel, it gives us a model for quickly analyzing a system’s capacity trend. Capacity and scalability problems should be solved through a combination of hardware and software optimizations, not either one alone. With just a few data points obtained via load testing or from production systems, the USL can estimate the optimum capacity at higher load levels, and additional load tests with the input load doubled can be executed to validate the model. The USL is still in its infancy at Wayfair, as teams are just beginning to implement it for their capacity planning and measure its effectiveness. Stay tuned for further updates on the subject!

## References

1. Gunther, N. J. (2007). *Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services*. New York, NY: Springer.
2. Schwartz, B. (2015). *Practical Scalability Analysis With The Universal Scalability Law*. Retrieved from https://cdn2.hubspot.net/hubfs/498921/eBooks/scalability_new.pdf?t=1449863329030
3. Schwartz, B. (2015). *Everything You Need To Know About Queueing Theory*. Retrieved from http://cdn2.hubspot.net/hubfs/498921/eBooks/queueing-theory_1-1.pdf
4. How to Quantify Scalability. Performance Dynamics. (n.d.). Retrieved from http://www.perfdynamics.com/Manifesto/USLscalability.html
5. USL Scalability Modeling with Three Parameters. The Pith of Performance. (n.d.). Retrieved from http://perfdynamics.blogspot.com/2018/05/usl-scalability-modeling-with-three.html
6. Gunther, N. J. (2007). *Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services*. New York, NY: Springer.
7. nls. (n.d.). Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/nls.html
8. What is OLAP? Database.Guide. (2017). Retrieved from https://database.guide/what-is-olap/