Accurate performance monitoring is crucial for Wayfair’s Storefront Engineering team. Each day we deploy hundreds of code changes to the web application for our customer-facing websites, and each change has the potential to impact performance for better or worse. For this reason, we carefully monitor KPIs such as page load times to catch regressions, identify opportunities for speedups, and verify that improvements work in the real world.
Client-side monitoring of page load times in the browser is known as Real User Monitoring, or RUM. In 2018 we gave RUM at Wayfair a major upgrade and migrated from our legacy Graphite backend to a next-generation InfluxDB backend. This article will take you through our journey and the challenges we overcame on the way.
What is RUM?
Web performance metrics can be classified into three broad groups: server-side timers, synthetic tests, and RUM. Server-side timers are straightforward to build, but they are blind to entire classes of issues that happen on the client. Synthetic tests give some insight into client performance, but they are limited by a predefined set of URLs and an artificial test environment.
Monitoring server-side load times is generally straightforward. Put a timer around the code that you want to track, send the value to a database, then graph it with a simple moving average.
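As a minimal sketch of that pattern, here is a timer wrapped around a block of code plus a simple moving average over the recorded values. The `recorded` list stands in for a real time series database, and the names `timer`, `simple_moving_average`, and `render_product_page` are illustrative assumptions, not Wayfair's actual code.

```python
import time
from collections import deque
from contextlib import contextmanager

# Stand-in for a time series database (assumption for illustration).
recorded = []

@contextmanager
def timer(name, sink=recorded):
    """Time the wrapped block and append (name, milliseconds) to the sink."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        sink.append((name, elapsed_ms))

def simple_moving_average(values, window):
    """Mean of each trailing window of up to `window` values."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

# Hypothetical metric name; the body is a placeholder for the code being measured.
with timer("render_product_page"):
    sum(range(100_000))

smoothed = simple_moving_average([1, 2, 3, 4], window=2)
```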
Client-side load times tracked by RUM are influenced by server-side load times, but they’re also heavily affected by client and network conditions such as latency and bandwidth.
RUM at Wayfair
Wayfair’s RUM system was first created in 2013 using a Graphite time series database for real-time charts and alerts. Since then, we’ve tweaked the system by adding improved validation, a Speed Index heuristic, and the ability to break out metrics by browser (Chrome vs. Safari vs. Edge vs. Firefox), but the core has remained largely unchanged until this year.
Over the years, our RUM dashboards have been incredibly useful at identifying issues, but there are quirks that make it difficult to interpret the data.
RUM is particularly challenging to measure and visualize due to the wide range of values and outliers. For example, a page with a median load time of 5 seconds will commonly see some page loads of 200+ seconds. Just a few outliers can drastically skew the mean, and the legacy Graphite backend was particularly susceptible to this problem. Graphite estimates and samples in a few places during data ingestion. In some common cases those errors compound, resulting in misleading graphs that waste time. The new InfluxDB backend doesn’t suffer from that problem, since it uses a true median calculation, one of many available operators, aggregators and selectors.
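To make the effect of outliers concrete, here is a small sketch with made-up numbers: 98 page loads at 5 seconds and two at 200 seconds. The sample is hypothetical and only illustrates why a mean is a poor summary for this kind of data.

```python
from statistics import mean, median

# Hypothetical sample: 98 page loads at 5 s plus two 200 s outliers.
load_times_s = [5.0] * 98 + [200.0] * 2

avg = mean(load_times_s)    # pulled upward by just two outliers
mid = median(load_times_s)  # unaffected by them

print(f"mean={avg:.1f}s median={mid:.1f}s")
```

Two outliers out of a hundred samples are enough to push the mean to nearly double the typical load time, while the median stays put.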
Daily Patterns: RUM metrics exhibit a daily pattern, where the value will rise and fall periodically due to natural traffic cycles. When traffic is low in the early morning (2-6am EDT for the US and Canada stores), outliers have a bigger influence, thus graphs appear more “spikey”. Traffic makeup also shifts throughout the day. For example, mweb customers in late evenings are more likely to be on fast WiFi when compared to traffic during “commute times” where slower 3G/4G networks are more common.
High-Cardinality: We currently record 495 distinct values for the script_name tag, which indicates which page is measured. Combine that with tags for client_type (desktop/mweb/tablet), store, browser, and datacenter, and you get a large number of unique series, or high cardinality. Query performance degrades as the number of series grows, so in our original implementation the dashboards were slow. We worked with engineers at InfluxData to find a solution: removing an unnecessary host tag reduced cardinality by 600x.
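A rough way to reason about this: the upper bound on the number of series is the product of the distinct values of every tag. In the sketch below, only the 495 page types, the three client types, and the eight stores come from this article; the browser, datacenter, and host counts are assumptions chosen for illustration.

```python
from math import prod

# Distinct values per tag. Only script_name, client_type, and store
# are taken from the article; the remaining counts are assumed.
tag_values = {
    "script_name": 495,
    "client_type": 3,  # desktop / mweb / tablet
    "store": 8,
    "browser": 5,      # assumed: Chrome, Safari, Edge, Firefox, other
    "dc": 3,           # assumed number of datacenters
    "host": 600,       # assumed number of web servers
}

def series_upper_bound(tags):
    """Worst-case series count: every combination of tag values occurs."""
    return prod(tags.values())

with_host = series_upper_bound(tag_values)
without_host = series_upper_bound(
    {k: v for k, v in tag_values.items() if k != "host"}
)
reduction = with_host // without_host
```

Because the bound is multiplicative, dropping a single tag divides it by that tag's number of distinct values, which is how one tag can account for a 600x difference.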
Scale: On a typical day, we’ll see more than 20 million RUM measurements across eight stores, hundreds of page types, and thousands of device types from phones and tablets to laptops and PCs.
We used a few strategies to work around these challenges and the limitations of the Graphite backend. In particular:
- Using Graphite’s movingMedian function with a large window (100 data points) helped tame outliers. This has a few downsides: slower Graphite queries, real-time alerts that are difficult to build, dashboards that lag by ~15 minutes, and unintuitive metric behavior for low-traffic pages.
- Manually dropping outliers with removeAboveValue. The downside here is that we may miss legitimate slowdowns, and it’s tedious to manually set the outlier threshold for each page.
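The two workarounds above can be sketched in a few lines. This is a simplified illustration of the idea, not Graphite's implementation: the helper names mirror the Graphite functions, while the sample series and the threshold of 60 are made up.

```python
from statistics import median

def moving_median(values, window):
    """Median of each trailing window; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(median(values[i + 1 - window : i + 1]))
    return out

def remove_above_value(values, threshold):
    """Replace datapoints above the threshold with None."""
    return [v if v is not None and v <= threshold else None for v in values]

# Hypothetical load times in seconds with one 200 s outlier.
series = [5, 6, 200, 5, 7, 6]

smoothed = moving_median(series, window=3)
clipped = remove_above_value(series, threshold=60)
```

The sketch also hints at the downsides: the moving median only reacts once more than half of the window has changed, which delays the signal, and a fixed threshold would hide a real slowdown that pushes load times above it.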
We recently introduced InfluxDB as our first-class time series database system, where we had the opportunity to work directly with InfluxData to ensure we were on a path that is scalable, robust, and in line with the future direction of their platform. As Jim Hagan of the Logging and Time Series team has previously stated, InfluxDB is architected in such a way that allows us to balance horizontal and vertical scaling approaches for both compute and storage. Our in-house time series technology review identified some key attributes that offered substantive advantages over the status quo. You can read a more in-depth article on our Tech Blog here.
InfluxDB Schema for RUM
Schemas are the biggest change for application developers who are familiar with Graphite. InfluxDB supports defined fields and indexed tags for each measurement, while Graphite uses a plain, unstructured dot-delimited string to identify each metric.
An example templated Graphite metric:
An example query:
The simple structure used by Graphite makes it easy to get started, but complexity grows quickly as you add facets. For example, in order to break out by browser, the same measure was stored multiple times, but it was limited to only certain pages to keep disk usage reasonable on Graphite hosts. Dashboards have to be created with these limitations in mind, and new views would often require deploying new metrics to duplicate measures.
An example templated Graphite metric with browser:
In contrast, designing your InfluxDB schema takes a little more time upfront, but it gives you more power and flexibility. Dashboards are generic and easy to customize. Introspection queries let you explore data in a natural way to find interesting cross-cuts and correlations.
Similar to a table in a SQL database, the schema has a field for each value that we measure on a page load:
Tags (similar to indexed columns in SQL) are used to filter and slice:
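As a rough sketch of how fields and tags come together in a single point, the snippet below formats one page load in InfluxDB line protocol. The tag keys and the rum measurement name match the dashboard query later in this article; the field names, tag values, and timestamp are hypothetical, and the helper skips the escaping that real tag values with spaces or commas would need.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol.

    Simplified: assumes float field values and no characters
    that require escaping in keys or tag values.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

# Tag keys come from the article; all values here are made up.
tags = {
    "script_name": "product_page",
    "client_type": "desktop",
    "store": "us",
    "browser": "chrome",
    "dc": "dc1",
}
# Hypothetical field names for values measured on a page load.
fields = {"page_load_ms": 5123.0, "speed_index": 2890.0}

line = to_line_protocol("rum", tags, fields, 1530000000000000000)
```

Every distinct combination of tag values creates its own series, while fields are stored as values within that series, which is why the choice of tags matters so much for cardinality.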
Optimizing the Schema
When we initially deployed the InfluxDB schema and started building dashboards, we ran into a problem: Slow queries and timeouts. This made it difficult to explore the data and impossible to view a time range of more than a couple of hours. Due to the daily patterns, the inability to chart the last 24 hours was a deal breaker, so we had to find a solution.
In May, engineers from InfluxData visited the Wayfair offices in Boston to put on a workshop for our application developers. This helped our team quickly get up to speed on the InfluxDB data model and best practices. We learned that query performance scales with series cardinality, and were able to diagnose the issue with our RUM schema.
We had inadvertently added a host tag to the schema, which recorded the web server that received the RUM metrics from the browser. There’s no value in filtering on that tag and, as mentioned previously, removing it reduced cardinality by 600x. After resetting the schema by moving the RUM measurement to a new cluster, query performance improved dramatically!
At Wayfair, we use Grafana to build dashboards and visualize data from a variety of sources including Graphite and Elasticsearch. Grafana’s extensive InfluxDB support helped make this a smooth transition.
Templated query used for this Grafana dashboard:
SELECT PERCENTILE(/^$metric$/, $percentile) AS p$percentile FROM "rawdata"."rum"
WHERE "dc" = '$dc' AND "script_name" = '$page' AND "store" = '$store' AND "client_type" = '$client_type' AND "browser" <> 'other' AND $timeFilter
GROUP BY time($__interval), "browser" fill(null)
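Conceptually, this query buckets the raw measurements by time interval and browser, then picks a percentile within each bucket. The Python sketch below mimics that on a handful of made-up points; the rounding rule in `percentile` is an assumption for illustration and is not guaranteed to match InfluxDB's implementation exactly.

```python
import math
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank style percentile: returns an actual observed value."""
    ordered = sorted(values)
    idx = int(math.floor(len(ordered) * p / 100.0 + 0.5)) - 1
    idx = max(0, min(idx, len(ordered) - 1))
    return ordered[idx]

def percentile_by_bucket(points, p, interval_s):
    """Group (unix_seconds, browser, value) points by time bucket and browser."""
    groups = defaultdict(list)
    for ts, browser, value in points:
        bucket_start = ts - ts % interval_s
        groups[(bucket_start, browser)].append(value)
    return {key: percentile(vals, p) for key, vals in groups.items()}

# Hypothetical load times in seconds, including one 200 s outlier.
points = [
    (0, "chrome", 4.0),
    (10, "chrome", 5.0),
    (20, "chrome", 200.0),
    (30, "safari", 6.0),
    (70, "chrome", 7.0),
]

p50 = percentile_by_bucket(points, p=50, interval_s=60)
```

Because the percentile is computed from the raw values in each bucket, the single outlier in the first Chrome bucket does not move the median.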
With Graphite RUM, the graphs were confusing and unintuitive to interpret without a certain amount of background expertise. InfluxDB has made these graphs more accessible to a broad audience of engineers and product managers across our organization, which is incredibly important and impactful for us as a whole. By the same token, some filters in Graphite (for example, breaking out by browser) were limited to certain pages. With InfluxDB, every tag is available on every page by default.
How do we use this to improve the customer experience? The Login Status dashboard, for example, shows when a page is slower for logged-in customers, so we can investigate the cause and fix the bottleneck.
With InfluxDB we are able to represent the full granularity of some of these metrics because of its different storage architecture; with Graphite we ran up against the issue of pre-allocated storage (which consumes space in direct proportion to the number of series). We had to lower our original metrics granularity or we would have exhausted all storage in our Graphite cluster! We’re also able to home in on specific areas of interest using the WHERE clause.
Measuring performance on high-traffic e-commerce web sites is challenging. Migrating to InfluxDB for RUM at Wayfair has meaningfully improved our visibility into the experience of real customers, and created a scalable platform that we’ll build upon for years to come.
The first thing we’ve added is an experimental metric for “Time to Interactive” (TTI). Using the new Long Task browser API we measure the time for the browser to reach idle, so that it’s ready to quickly respond to user input. By monitoring and optimizing TTI, we hope to improve the customer experience by reducing “janky” scrolling and unresponsive clicks. In the future we plan to evaluate additional metrics such as Philip Walton’s recently announced First Input Delay, another measure of site interactivity.
Working on similar challenges in your Engineering department? Let us know in the comments!