Excitement buzzed in the virtual air -- after multiple years of work, we were finally going live! As we ramped up our feature toggle, we were met with words no engineer wants to hear: “Your service isn’t performing fast enough. We need you to toggle off.”
Our team quickly pivoted to a week-long, all-hands-on-deck, performance-obsessed sprint. It was stressful -- we felt like we were throwing darts at a wall -- but in retrospect, it was fun to come together as a team and focus intensely on one metric. It was also productive -- we brought our p50 down from 27ms to 6ms and our p99 below our previously observed p50. Our service now serves hundreds of millions of requests per day. And, most importantly, we have lots of learnings to share from our journey. We’re excited to share some of those below.
Who are we? What do we do?
We are Dynamic Services, a Wayfair engineering team responsible for providing contextualized product data to Wayfair’s customer-facing e-commerce platform. One of our main services is the Dynamic Product Data Service (DPDS). We take in product data requests along with information on customer context (who the customer is, where they are on the site, and how they arrived there) and return dynamic product data tailored to the context in which the customer is browsing our site.
How We Initially Thought About Service Performance
Before we went live for the first time, we knew performance was important, since we are serving traffic to some of Wayfair's customer-facing pages. Initially, we measured performance by looking at average request duration for every request made to our service. We also had extremely granular metric logging in place that let us see timing metrics for almost every method down the stack that our service worked through. Unfortunately, when we went live, we realized that we had not measured data in the way that mattered to the consumers of our service.
Performance Issues Exposed During Initial Rollout to Product Display Page
When we went live, a few things became clear:
- Our consumers measured performance in percentiles. Since we had only been looking at averages, we were not meeting the SLAs they had defined for p50, p95, and p99 metrics. Percentiles are great because they immediately tell you how your service is performing for large buckets of consumers: if your p99 latency is low, you know that 99% of requests to your service complete at least that fast. Averages give no such visibility -- a handful of slow outliers can hide behind a healthy-looking mean.
- Our consumers assumed that we would perform the same as the previous way they had been getting product data, even though our service provided extra functionality. That assumption meant we ended up having expectation-setting conversations only after we had already missed their expectations -- a much harder conversation to have!
- Our service is built in .NET, but our clients are in PHP. We focused heavily on our .NET performance, which was a mistake because it overlooked the end-to-end experience of our PHP consumers.
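To make the percentile point concrete, here is a minimal sketch (in Python for illustration; our service itself is .NET) of why percentiles expose what averages hide. The durations are made up:

```python
import math

def percentile(sorted_ms, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p percent of observations are at or below it."""
    rank = math.ceil(p / 100 * len(sorted_ms))
    return sorted_ms[max(rank - 1, 0)]

# Hypothetical request durations in milliseconds.
durations = sorted([5, 6, 6, 7, 8, 9, 12, 25, 40, 90])

mean = sum(durations) / len(durations)
print(f"mean={mean}ms p50={percentile(durations, 50)}ms "
      f"p99={percentile(durations, 99)}ms")
```

Here the mean is 20.8ms, which looks acceptable, while the p99 of 90ms exposes the slow tail that the worst-served customers actually experience.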
Triage and Investigation
Our existing performance metrics and logging were not helpful in identifying our bottleneck, so all theories were fair game. We got on a shared Google Doc, brain-dumped all of our hypotheses on what might improve our performance, and worked through each theory one by one. After our daily scrum standup, we would stay on for an extra 10 minutes to check in -- which hypotheses had been tested, which were proven, which were debunked, and which to check next. Especially given our work-from-home setup, this workflow worked extremely well to keep us focused and organized.
Here are some of the things we did that didn’t work:
- Removed as much extra middleware as we could
- Changed our Aerospike calls from batched to un-batched
- Removed metric logging middleware altogether
- Upgraded Aerospike version
- Looked into nginx proxy buffering
- Explored kestrel sockets
- Looked into HTTP routing
In total, we tested over 20 theories!
While it felt like we were banging our heads against the wall, a couple of the theories checked out:
Reducing the number of logs:
We raised the default logging level threshold from DEBUG to WARN and saw a 3ms performance improvement!
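For a .NET service using Microsoft.Extensions.Logging, a change like this is typically a one-line configuration tweak (a sketch -- our actual configuration keys and providers may differ):

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Warning"
    }
  }
}
```

With the default set to "Warning", the Debug and Information entries that dominated our log volume are never emitted at all, rather than being filtered downstream.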
Removing DataDog APM:
By removing DataDog APM from our deployment, we saw another 2-3ms performance improvement.
The Smoking Gun
Our dependency resolution layer turned out to be our biggest performance bottleneck. Once we changed our stateless class registrations from per-request instances to single shared instances, we saw a 15-20ms performance improvement. We brought our p50 down from 27ms to 6ms and our p99 below the p50 we had observed during the initial PDP rollout.
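In terms of ASP.NET Core's built-in container, the registration change looks roughly like this (the `IProductDataResolver`/`ProductDataResolver` names are hypothetical -- this is a sketch of the pattern, not our actual code):

```csharp
// Before: a fresh instance (and its whole dependency graph) is
// constructed every time the service is resolved, i.e. per request.
services.AddTransient<IProductDataResolver, ProductDataResolver>();

// After: one shared instance for the lifetime of the application.
services.AddSingleton<IProductDataResolver, ProductDataResolver>();
```

The win depends on the class being genuinely stateless and thread-safe; a singleton that quietly holds per-request state is a correctness bug, not an optimization.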
Here’s a look at our current performance metrics:
Here are some key takeaways from our performance frenzy:
- We now think about performance in terms of percentiles rather than averages. We should have asked our customers how they measured performance before trying to go live.
- Since our code is generally very lean, we only have high-level metrics and dashboards. We will only add granular metric logging if we are investigating bottlenecks.
- Thin, stateless services that can be made singletons can be a huge performance win if they are handled correctly.
- Performance sacrifices can be made if the trade-off is worth it
  - For example, we decided to keep DataDog APM and accept the 2-3ms performance hit
All in all, we learned a lot. Though that week was a stressful and frustrating one, we had plenty of learnings to report at our end-of-sprint retro -- and a much faster service to show for it.
Interested in joining the Wayfair team? Explore our open roles here.