Here’s a familiar experience: You’re trying to improve retention so you run a series of experiments. You end up releasing the same control experience to several cohorts, with dramatically different results each time. Your sample size was large, your source of users hasn’t changed, and the tests were close enough together that there shouldn’t be any seasonality effects. What’s going on?
It turns out that there’s a nuance in retention calculations that trips a lot of people up. Let’s call it “Bad Bucketing”, and even some analytics companies are getting it wrong.
Wait, isn’t retention just a standard calculation?
While most metrics have a straightforward, intuitive explanation, if you’ve ever rolled your own and done the actual calculations, you’ll quickly realize that calculating even basic metrics requires you to make numerous decisions.
(For example, for retention: Are we looking at all events, or only session-start events? Or for conversion: Are we calculating it as a percentage of our active users? Or only the ones who opened the app? Or only the ones that viewed the sales page?)
Frequently the right answer to these decisions is obvious. And sometimes the answer doesn’t really matter that much. But sometimes, the right answer is non-obvious and also REALLY matters. Calculating retention is one of those times.
Calculating retention
As a concept, retention is pretty intuitive. It answers the question of “Do people like my app enough to keep coming back to it?” Retention measures the percentage of users that come back to an app on a specified time scale: usually daily, weekly, or monthly. 1-day retention is frequently described as “What percentage of today’s users come back tomorrow?”, and 1-month retention as “What percentage of this month’s users come back next month?”
Retention is one of the most fundamental product metrics. It’s a proxy for product market fit, user lifetimes, and everything that is good. It is arguably the most critical metric to track for any product, but the most common and intuitive way of calculating retention has serious flaws, regardless of sample size.
The Bucket Blunder
The most intuitive way of calculating retention considers each day as a separate bucket. Count the number of new users in today’s bucket – that’s the cohort for today. Then calculate what percent return tomorrow. That percentage is your 1-day retention. This calculation is simple, intuitive… and wrong.
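To make that concrete, here’s a minimal sketch of the bucket-per-day calculation in Python. The data structures and function name (`installs` mapping user to install datetime, `activity` mapping user to a list of event datetimes) are my own hypothetical ones, not anyone’s actual schema:

```python
from datetime import timedelta

def bucketed_day1_retention(installs, activity, cohort_day):
    """Day-1 retention the intuitive way: everyone who installs on cohort_day
    is one bucket, and anyone seen on the next calendar day counts as retained."""
    cohort = [u for u, ts in installs.items() if ts.date() == cohort_day]
    next_day = cohort_day + timedelta(days=1)
    retained = [u for u in cohort
                if any(ev.date() == next_day for ev in activity.get(u, []))]
    return len(retained) / len(cohort) if cohort else None
```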
Treating all the users in your bucket the same glosses over the fact that users who show up early in the day have to come back much further out from their install to count as retained, compared to users who show up late in the day.
A more reliable way to calculate retention is to consider each user’s install time individually. A user counts as retained for one day if they show up between 24 and 48 hours after their initial install. In other words, instead of asking “What percentage of today’s users come back tomorrow?”, ask “What percent of people who install today come back 24 hours later?”
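Here’s a sketch of that per-user version, using the same made-up `installs` and `activity` structures as above:

```python
from datetime import timedelta

def rolling_day1_retention(installs, activity):
    """Day-1 retention with a per-user window: a user counts as retained only if
    some event lands between 24 and 48 hours after that user's own install."""
    if not installs:
        return None
    retained = sum(
        1 for u, installed_at in installs.items()
        if any(installed_at + timedelta(hours=24) <= ev < installed_at + timedelta(hours=48)
               for ev in activity.get(u, []))
    )
    return retained / len(installs)
```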
How serious is this problem, really?
I’ve seen retention measurements literally double because the user acquisition (UA) bursts happened to hit just right. This results in false celebration now, followed by a wild goose chase when the next test inevitably comes back far lower.
Consider two users: Early Ellie installs at 12:01 am on June 1, and Late Larry installs at 11:59 pm on June 1. Ellie has to engage 24 hours after install to count as retained for 1 day. Larry only has to return 2 minutes later. As a result, installs later in the day will show much higher retention numbers.
The size of this effect further depends on how you’ve defined “being active”. Does a user count as being active on June 2nd if we see any activity from him? Or are we only looking at session start events? If taking any action inside our app qualifies Larry as an active user (a reasonable assumption), then a single 2-minute first session running from 11:59pm to 12:01am is enough for our system to say he’s been retained for 1 day.
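Running Ellie and Larry through the two hypothetical sketches above makes the gap obvious:

```python
from datetime import date, datetime

installs = {
    "ellie": datetime(2015, 6, 1, 0, 1),    # Early Ellie: 12:01 am, June 1
    "larry": datetime(2015, 6, 1, 23, 59),  # Late Larry: 11:59 pm, June 1
}
activity = {
    "ellie": [],                             # Ellie never comes back
    "larry": [datetime(2015, 6, 2, 0, 1)],   # tail end of Larry's 2-minute first session
}

print(bucketed_day1_retention(installs, activity, date(2015, 6, 1)))  # 0.5 -- Larry looks retained
print(rolling_day1_retention(installs, activity))                     # 0.0 -- neither user is retained
```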
How analytics companies are calculating retention
If you don’t feel up for the challenge of rolling your own analytics, one of the benefits of an off-the-shelf solution should be that you don’t have to worry about any of this. Unfortunately, that’s not the case, because all the top analytics providers calculate retention differently.
Consider how incredible it is that after at least a decade of retention being widely recognized as the single most important product metric, there’s still no standardized way to calculate it and each of the top analytics-as-a-service providers is just using their own judgment.
Mixpanel calculates retention correctly. Hooray Mixpanel!
Flurry’s “return rate” metric gets it wrong, but it’s still an improvement over their awful retention calculation.
Amplitude changed their calculation recently and now calculates retention correctly for dates after August 18, 2015. They do round to the nearest hour (I’m not sure why, since we have these things called computers that are really good at dealing with clunky numbers) but that’s probably close enough.
Heap is unclear. Their description of daily retention looks correct, but their description of weekly retention looks incorrect. I’ve emailed them for clarification. (EDIT 7/13/2018: Heap was very helpful and it sounds like they’re calculating retention correctly, using the same methodology as Mixpanel. Hooray Heap!)
How concerned should I personally be?
This is a particularly serious problem if you tend to burst user acquisition when running your experiments.
If the UA faucet gets turned on early in the morning for experiment 1, but late in the day for experiment 2, experiment 1 will be full of Early Ellies, and experiment 2 will be full of Late Larrys. The product changes won’t even matter; experiment 2’s retention metrics will dominate experiment 1’s.
Turning on UA at the same time of day for each test doesn’t solve the problem either, because the time required for ad networks to ramp up the volume on your campaign varies from day to day and week to week.
This happens on longer timescales, too. Does your August cohort have great monthly retention? Maybe that’s because all your August users installed during back-to-school in the last week of August, so they only had to stick around for a few days to count as retained in September.
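The same per-user-window idea fixes this at monthly scale. Here’s one way it could look, approximating a month as a 30-day window (the 30-day choice is my assumption, not a standard, and the names again reuse the hypothetical structures from earlier):

```python
from datetime import timedelta

def rolling_month1_retention(installs, activity, days_per_month=30):
    """Month-1 retention with per-user windows: retained means some event lands
    between 30 and 60 days after that user's own install (treating a month as
    30 days, which is an approximation)."""
    if not installs:
        return None
    retained = sum(
        1 for u, installed_at in installs.items()
        if any(installed_at + timedelta(days=days_per_month)
               <= ev <
               installed_at + timedelta(days=2 * days_per_month)
               for ev in activity.get(u, []))
    )
    return retained / len(installs)
```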
Rolling retention and you
I’ve been referring to “What percent of people who install today come back 24 hours later?” as “rolling retention”, because of the rolling 24-hour buckets that are specific to each user. In rolling retention, any user counts as retained if she returns between 24 and 48 hours after her initial install, no matter what time of day she installed.
(Ideally, we would just call it “retention”, but until everyone starts calculating retention the same way, I guess we’re stuck qualifying the name somehow.)
“What % of people who install today return tomorrow?” is an intuitive question, but gives unreliable results. Instead ask, “What % of people who install today come back 24h later?” On the surface the two questions sound the same, but the latter gives much more trustworthy results.
If you start calculating retention this way, be aware that there will be some weirdness around the end of your retention curves.
You’ll now need to wait 48 hours to get your day 1 retention, to give the Late Larrys a full 24 hours to return. And while you’re waiting to see if Larry returns for his day 1 retention, there could be an Ellie from the same cohort who’s already come back for day 2.
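One way to handle that waiting period is to only put users whose 48-hour window has fully closed into the day 1 denominator. A sketch, again building on the same hypothetical structures (and assuming all timestamps share one clock):

```python
from datetime import datetime, timedelta

def rolling_day1_retention_matured(installs, activity, now=None):
    """Rolling day-1 retention computed only over users whose 48-hour window has
    already closed, so Late Larrys aren't counted as churned before they've had
    their full 24 hours to come back."""
    now = now or datetime.now()  # assumes naive timestamps on the same clock
    matured = {u: ts for u, ts in installs.items()
               if ts + timedelta(hours=48) <= now}
    if not matured:
        return None
    retained = sum(
        1 for u, ts in matured.items()
        if any(ts + timedelta(hours=24) <= ev < ts + timedelta(hours=48)
               for ev in activity.get(u, []))
    )
    return retained / len(matured)
```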
It’s a pretty minor nuisance, though, and well worth it to have retention metrics that you can actually rely on. Have you run into something similar? If so, I’d love to hear about it.