The Cost of a Crash

We are all too familiar with the unpleasant experience of a native app crash. You’re happily playing your favorite game, queuing up the next video, or researching the next piece of furniture to buy (hint: Wayfair may help!) when all of a sudden, POOF! The app is gone and you’re back on your phone’s home screen. Those who are less technically inclined might not even know that this was an app crash, or worse, they may think that their own actions were somehow responsible for the un-magical disappearing act. Android is slightly more helpful: after a crash, the user is presented with a dialog that tells them that unfortunately, their favorite app has stopped. Unfortunate, indeed. But just how unfortunate? This post explores the return on investment for fixing crashes. Keep reading to see what we found.

In order to understand the ROI for fixing crashes, I set out to first understand the impact that crashes have. I started with a number of hypotheses, including:

Crashed users generate less revenue
Users who experience a crash are probably more inclined to “give up and try later” rather than reopen the app immediately and keep going
At some point, there are diminishing returns from continuing to optimize crash rates

Wouldn’t it be great to know what that ideal threshold is, after which you should simply stop trying to optimize crash rates? Is it 95% crash-free? 99%? 99.9%? Why not fix them all? Having a formula to compute the return on investment for fixing a crash can help determine if it is more valuable to fix the crash or implement a new feature. For example, two weeks spent fixing crashes might save $1k of lost revenue over the next month, but two weeks spent implementing a new feature might unlock $100k of new revenue in the same time frame.

Analytics Infrastructure

As with any analytical investigation, it is great to start with a clear set of hypotheses and raw data sources that contain the answers. For us, there are two primary data sources: 1) the Google BigQuery replica of Firebase Analytics, which contains records for each of our crashed sessions; and 2) our click-stream tracking data set, which contains all of the screen views and clicks that users perform in our apps and the amount of revenue ultimately associated with those actions. Access to this click-stream data is restricted and is never shared with third parties. To derive any real insights from our click-stream logs and Firebase’s crash records, we needed a way to map records between the two. For this purpose, we attached a randomly generated UUID to each of our Firebase sessions, which can be directly mapped to a record in our click-stream data set.
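The mapping described above amounts to a join on the shared session UUID. A minimal sketch of that idea follows. The record types, field names, and sample rows are hypothetical and invented for illustration; they are not the actual Firebase or click-stream schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    session_uuid: str  # UUID attached to the Firebase session (hypothetical field name)


@dataclass(frozen=True)
class ClickEvent:
    session_uuid: str  # the same UUID, as logged in the click-stream data
    user_id: str
    screen: str
    revenue: float


def split_users_by_crash(crashes, clicks):
    """Return (crashed_user_ids, crash_free_user_ids) by joining on session UUID."""
    crashed_sessions = {c.session_uuid for c in crashes}
    all_users = {e.user_id for e in clicks}
    crashed_users = {e.user_id for e in clicks if e.session_uuid in crashed_sessions}
    return crashed_users, all_users - crashed_users


# Made-up records for illustration only.
crashes = [CrashRecord("s-2")]
clicks = [
    ClickEvent("s-1", "alice", "home", 0.0),
    ClickEvent("s-2", "bob", "product", 0.0),
    ClickEvent("s-3", "bob", "checkout", 120.0),
]
crashed, crash_free = split_users_by_crash(crashes, clicks)
print(sorted(crashed), sorted(crash_free))
```

A user counts as crashed here if any one of their sessions matches a crash record, which is one possible reading of "crashed users" in the analysis that follows.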

Crash Rates Correlated with Engagement

The first question we asked was how the average revenue per user for those who experienced a crash compared with that of those who did not. During the test period over which the analysis was conducted, crashed users had 9% more sessions per user than users who did not experience a crash. But crashed users are not 9% more engaged: when they experience a crash, their session ends immediately and they must start a new session to continue shopping. On the other hand, crashed users viewed 6.5% fewer screens and generated about 7% less revenue than users who had never experienced a crash. Our first hypothesis is confirmed.
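Each comparison above is a relative difference between two cohort averages. The sketch below shows that arithmetic on made-up revenue figures, chosen only so that the result lands on a readable number. The function names are hypothetical, and the same helper would apply to sessions or screen views per user.

```python
def mean_per_user(metric_by_user):
    """Average a per-user metric (revenue, screens viewed, ...) over a cohort."""
    return sum(metric_by_user.values()) / len(metric_by_user)


def relative_difference(crashed_mean, crash_free_mean):
    """Fractional difference of the crashed cohort against the crash-free cohort."""
    return (crashed_mean - crash_free_mean) / crash_free_mean


# Made-up revenue per user; not real data.
crashed_revenue = {"u1": 90.0, "u2": 96.0}
crash_free_revenue = {"u3": 95.0, "u4": 105.0}

diff = relative_difference(
    mean_per_user(crashed_revenue), mean_per_user(crash_free_revenue)
)
print(f"{diff:+.1%}")
```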

As with many other native app engineering shops, Wayfair uses Firebase to monitor crash rates. During our test, our average Firebase Crash-Free User rate was 99.74%. However, this single number does not tell the whole story. Figure 1 shows a trend that tends to be true for most apps: the more time a user spends in the app, the more likely they are to experience a crash. This is because 1) crashes are inevitable on mobile devices, and 2) the more screens, products, or features a user interacts with, the more likely they are to encounter a scenario that the development shop could not proactively test.

Figure 1. Crash frequency based on how long a user spends in the app

To illustrate this scenario, imagine you want to find all of the $20 bills that people randomly drop on the ground around the town where you live. Now imagine that you ask some friends to spend different amounts of time looking for these dropped $20 bills. The friend who spends the least time will be the least likely to find any. The friend who spends the most time searching will be the most likely to find some because they are able to cover more ground. After some amount of time, there will be diminishing returns on searching because the $20 bills dropped in the most frequently traveled places have already been found. Similarly, if our customers were people who randomly searched our apps for crashes, they too would find more crashes the longer they looked, but it would become harder to find new ones as they continued to search.
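One simple way to see why more time in the app raises the odds of a crash is a toy model in which every screen view carries the same small, independent chance of crashing. This model and the 0.1% per-screen rate are assumptions made for illustration; they are not the analysis behind Figure 1.

```python
def crash_probability(screens_viewed: int, per_screen_crash_rate: float) -> float:
    """Chance of at least one crash after a number of screen views.

    Assumes each screen view crashes independently with the same probability.
    """
    return 1.0 - (1.0 - per_screen_crash_rate) ** screens_viewed


HYPOTHETICAL_RATE = 0.001  # assumed 0.1% crash chance per screen view

for screens in (10, 100, 500, 1000):
    print(screens, round(crash_probability(screens, HYPOTHETICAL_RATE), 4))
```

Under these assumptions the probability keeps rising with use, but each additional batch of screens adds less than the one before it, which mirrors the diminishing returns in the $20 bill analogy.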

As with most consumer apps, users who spend a short amount of time in the app greatly outnumber users who spend a large amount of time in it. According to statista.com, the average shopping app session duration in 2019 was less than 10 minutes (from https://www.statista.com/statistics/243779/total-minutes-spent-on-average-app-by-category/, on 9/21/2021). The distribution follows an exponentially decreasing curve (as illustrated in Figure 2).

These findings are relevant for all app developers. If your app is likely to be used for longer periods of time, you may need to spend more time reducing crash rates and addressing their common causes, such as high memory consumption or high CPU usage.

Figure 2. The distribution of users and how long they spend in apps

The Moment of Impact

We hypothesized that a crash is an event that has the power to stop a user in their tracks. One shortcoming of our analysis so far is that revenue and engagement impact were measured over a broad time period, which makes it difficult to tell what the short-term changes in customer behavior are after a crash.

So far we have learned that revenue from our users was 7% lower in the months following their experience of a crash. But what is the short-term impact? Does the user jump right back into the app and attempt to shop some more, or do they give up? Next, I plotted screen views relative to the time that users first experienced a crash. If users come back to the app, then we should see some screen views after the crash. Figure 3 shows the histogram of these screen views relative to the timing of a crash (where the crash occurs at x=0). As you can see, this measure of user engagement rises exponentially as we approach the time of the crash, then drops off precipitously the moment after the crash occurs.

Figure 3. Screen views plotted relative to the time of a crash for 1 week prior to the crash and 1 week after the crash.
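A histogram like this can be built by bucketing each screen view by its offset from the user's first crash. The sketch below uses whole-day buckets and Unix timestamps in seconds. The bucket size and the function name are assumptions made for illustration, not details of the original analysis.

```python
from collections import Counter

SECONDS_PER_DAY = 86_400


def views_by_day_offset(crash_ts, view_timestamps, window_days=7):
    """Count screen views per whole-day offset from the crash.

    Offset 0 is the first 24 hours after the crash and offset -1 is the
    24 hours before it. Views outside the window are ignored.
    """
    counts = Counter()
    for ts in view_timestamps:
        offset = int((ts - crash_ts) // SECONDS_PER_DAY)  # floor division
        if -window_days <= offset < window_days:
            counts[offset] += 1
    return counts


# Made-up timestamps for one user whose first crash is at t = 1_000_000.
crash_ts = 1_000_000
views = [
    crash_ts - 10,                       # just before the crash
    crash_ts + 10,                       # just after the crash
    crash_ts + 3 * SECONDS_PER_DAY + 5,  # three days later
    crash_ts - 8 * SECONDS_PER_DAY,      # outside the one-week window
]
histogram = views_by_day_offset(crash_ts, views)
print(sorted(histogram.items()))
```

Summing these per-user counters across all crashed users, then dividing by the number of users, would give the average screen views per user at each offset.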

As Figure 3 shows, there are essentially four phases: 1) the pre-crash steady state, showing a relatively steady average number of screen views per user; 2) the high-engagement period spanning approximately two days up until the crash; 3) a post-crash re-engagement bump that lasts about 24 hours, where engagement equals the steady state; and 4) a post-crash hangover, where engagement wanes.

The pre-crash steady state has an average of 21 screen views per user. The period of high engagement prior to the crash has an average of 30 screen views, a rise of roughly 43% over the steady state. Interestingly, the 24 hours following the crash show continued engagement, but at pre-crash steady-state levels, before waning into the post-crash “hangover.” This hangover period is about 40% less engaged than the two days leading up to the crash, and 14% less engaged than the pre-crash steady state. Since we know that users who experience a crash are 6.5% less engaged over time, presumably engagement gradually recovers from this hangover, but not fully.
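The relationships between these phases can be checked from the rounded averages quoted in this paragraph. The sketch below takes the 21 and 30 screen-view averages and the 14% hangover drop as inputs. Because these inputs are rounded, the outputs are approximate.

```python
def relative_change(new, old):
    """Fractional change from old to new."""
    return (new - old) / old


steady_state_views = 21   # pre-crash steady state (rounded figure from the text)
pre_crash_views = 30      # high-engagement period (rounded figure from the text)
hangover_drop = 0.14      # hangover is 14% below the steady state (from the text)

rise_before_crash = relative_change(pre_crash_views, steady_state_views)
hangover_views = steady_state_views * (1 - hangover_drop)
hangover_vs_pre_crash = relative_change(hangover_views, pre_crash_views)

print(f"rise before crash:     {rise_before_crash:+.0%}")
print(f"hangover views:        {hangover_views:.1f}")
print(f"hangover vs pre-crash: {hangover_vs_pre_crash:+.0%}")
```

With these rounded inputs, the pre-crash rise comes out near 43% and the hangover sits about 40% below the pre-crash peak. Unrounded averages could shift either figure slightly.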

Impact on Customer Spend

While the chart in Figure 3 depicts the average user behavior over time, it does not show the effect of a crash on revenue. Using a similar method, we can compare the revenue these users generated before the crash with the revenue they generated after it. Over equal time periods on either side of the crash, the result shows a staggering 89% loss in revenue per user.

What Crash-Free User Rate is Right?

The Crash-Free User Rate that is right for our customers is 100%. No customer deserves to experience a crash. But as a former Android developer, I will admit that at times crashes can seem incredibly obscure and difficult to replicate when you don’t own the device that a crash was reported on. My advice? Dedicate time in every sprint to fix crashes, and focus on the highest-value crashes first.

For argument’s sake, let’s compare two possible projects: 1) a project to fix a crash that can avoid a 7% revenue loss; or 2) a new feature that your product manager promises will generate 7% more revenue. Which of these projects is more valuable? They both ought to be worth 7% of revenue for the affected users. Most crashes do not affect all users, though. If the crash affects 50% of users, then the revenue impact would be 0.5 * 7% = 3.5%. On the other hand, new features that are “promised to bring X% of revenue” rarely bring that much revenue. You should also consider the amount of time needed for each project. A new feature may take many weeks or months of investment, whereas most crashes take only a couple of days to fix. For this reason, I propose a basic formula that you can use to estimate the return on investment per day spent to fix a crash:
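The formula itself does not appear in this text. As a hedged reconstruction, one form that reproduces the figures quoted in this section is the share of users affected, multiplied by the revenue loss per affected user, divided by the days of work. The symbols below are labels chosen for this sketch, not notation from the original analysis:

```latex
\[
\mathrm{ROI}_{\text{per day}}
  = \frac{\dfrac{U_{\text{affected}}}{U_{\text{total}}} \times L}{D}
\]
% U_affected : users who experience the crash
% U_total    : all active users
% L          : fractional revenue loss per affected user over the chosen window
% D          : days of work needed for the fix
```

Under this reading, L would be the 89% one-week revenue loss measured above. The same shape applies to a new feature, with L replaced by the revenue gain the feature is expected to deliver.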

To make this concrete, suppose we wanted to compute the 1-week return on investment from fixing a crash at Wayfair that affects 10% of our customers. Suppose we had 1M daily active users, of whom 100k were affected.

In this case, the return is about 4.45% of revenue (0.0445 as a fraction) per day of investment. Now suppose the new feature promised a 7% return for 100% of users, but took 30 days to build.
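The comparison can be sketched in a few lines. This assumes the ROI per day is the share of users affected, times the revenue effect per affected user, divided by the days of work, and it assumes the crash takes 2 days to fix ("a couple of days"). Both assumptions are inferred from the numbers in this section and are not stated outright.

```python
def roi_per_day(affected_share, revenue_effect, days_of_work):
    """Fraction of revenue recovered or gained per day of engineering work."""
    return affected_share * revenue_effect / days_of_work


CRASH_REVENUE_LOSS = 0.89  # one-week revenue loss per crashed user, from this post
ASSUMED_FIX_DAYS = 2       # assumption: "a couple of days" to fix a crash

# Crash affecting 100k of 1M daily active users.
crash_fix = roi_per_day(100_000 / 1_000_000, CRASH_REVENUE_LOSS, ASSUMED_FIX_DAYS)

# Feature promising 7% for 100% of users, taking 30 days.
feature = roi_per_day(1.0, 0.07, 30)

print(f"crash fix: {crash_fix:.4f} of revenue per day")
print(f"feature:   {feature:.4f} of revenue per day")
print(f"ratio:     {crash_fix / feature:.1f}x")
```

Under these assumptions the crash fix returns 0.0445 of revenue per day against roughly 0.0023 for the feature, a ratio of about 19.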

The work to fix the crash is about 20 times more valuable than implementing the feature. If that same feature took only a single day to implement, then it would be more valuable. Finally, you can choose to think about your return over a period longer than 1 week. Since the post-crash hangover covers a limited time, the return on fixing a crash will likely shrink as you expand the ROI window.

Not All Crashes Are Created Equal

The analysis so far has been agnostic to the type of crash. Surely, a crash on the homepage which renders the app useless would be impactful on both user engagement and revenue. On the other hand, a crash in the checkout screen would likely not impact overall screen views much, but would be very impactful to revenue. For this reason, and because I am a fan of measuring the impact of platform work, one next step for this analysis would be to implement a crash prioritization scheme that is able to measure both the engagement and revenue impact of each crash.

Conclusion

We all know and dislike app crashes. For users, they disrupt the expectation of a smooth and seamless experience. For developers, they can be obscure and difficult to replicate. Nevertheless, my analysis of the business impact of a crash has shown that crashes can reduce revenue from engaged users by 89% and induce a period of reduced user engagement that does not fully recover. Even over the long term, these users end up engaging with our app less and spending less. App developers should continue to prioritize fixing app crashes, as this is one of the least costly ways to keep their customers happy and engaged.