# Calculating Stickiness Using AppInsights Analytics

#### Update:

There is a new, simpler, better way to calculate usage metrics such as stickiness, churn and return rate.

In previous posts I  demonstrated some simple yet nifty tricks to get stuff done in app insights analytics – like extracting data from traces, or joining tables.

Those were mostly pretty simple queries, showing some basic Kusto techniques.

In this post I’m going to show something much more complex, with some advanced concepts.

We’re gonna take it slow, but be warned!

What I wanna do is calculate “Stickiness“. This is a measure of user engagement, or addiction to your app. It’s computed by dividing DAU (daily active users) by MAU (monthly active users) in a rolling 28 day window. It basically shows what percentage of your total user base is using your app daily.

Computing your DAU is pretty simple in analytics, and can be done using a simple dcount aggregation:

``````requests
| where timestamp > ago(60d)
| summarize dcount(user_Id) by bin(timestamp, 1d)
``````

But how do you compute a rolling 28-day window unique count of users? For this we’re gonna need to get familiar with some new Kusto operators:

hll() – hyperloglog – calculates the intermediate results of a dcount.

hll_merge() – used to merge together several hll intermediate results.

dcount_hll() – used to calculate the final dcount from an hll intermediate result.

range() – generates a dynamic array with equal spacing

mvexpand() – expands a list into rows

let – binds names to expressions. I’ve already shown a use for let in a past post.

It’s kind of a lot, but let’s get going and see how we’re gonna use each of these along the way.

Let’s do this in steps. Our goal is to calculate a moving 28 day window MAU. First thing, instead of dcount we’ll use hll, to get the intermediate results:

``````requests
| where timestamp > ago(60d)
| summarize hll(user_Id) by bin(timestamp, 1d)
``````

With the intermediate results in place, the next phase is to think about which dates will use each intermediate result. If we take 20/1/2017 as an example, well, we know that each subsequent day, 28 days forward, will want to use this hll for it’s moving window result. So we build a list of [21/1/2017, 22/1/2017 … 18/2/2017].

So what we do here, and this is a little dirty, is create a list of all the future dates that will need this result. We do this using the range operator:

``````requests
| where timestamp > ago(60d)
| summarize hll(user_Id) by bin(timestamp, 1d)
| extend periodKey = range(bin(timestamp, 1d), timestamp+28d, 1d)
``````

Now let’s turn every item in the periodKey column list, into a row in the table. We’ll do this with mvexpand:

``````requests
| where timestamp > ago(60d)
| summarize hll(user_Id) by bin(timestamp, 1d)
| extend periodKey = range(bin(timestamp, 1d), timestamp+28d, 1d)
| mvexpand periodKey
``````

So now, when sorting by periodKey, each date in that column has exactly 28 rows, each with an hll from a different date it needs to calculate the total dcount. We’re almost done! Let’s calculate the dcount:

``````requests
| where timestamp > ago(60d)
| summarize hll(user_Id) by bin(timestamp, 1d)
| extend periodKey = range(bin(timestamp, 1d), timestamp+28d, 1d)
| mvexpand periodKey
| summarize rollingUsers = dcount_hll(hll_merge(hll_user_Id)) by todatetime(periodKey)
``````

That’s the 28 day rolling MAU right there!

Now let’s make this entire query modular, so we can calculate any length rolling dcount we’d like – including a zero day rolling (DAU actually) – and calculate our metric:

``````let start=ago(60d);
let period=1d;
let RollingDcount = (rolling:timespan)
{
requests
| where timestamp > start
| summarize hll(user_Id) by bin(timestamp, period)
| extend periodKey = range(bin(timestamp, period), timestamp+rolling, period)
| mvexpand periodKey
| summarize rollingUsers = dcount_hll(hll_merge(hll_user_Id)) by todatetime(periodKey)
};
RollingDcount(28d)
| join RollingDcount(0d) on periodKey
| where periodKey < now() and periodKey > start + 28d
| project Stickiness = rollingUsers1 *1.0/rollingUsers, periodKey
| render timechart
``````

STICKINESS ON THE FLY!

# App Analytics: Using “Let”, and a really useful investigation query

So here’s just a small tidbit that can be useful.

First the “let” keyword – it basically allows you to bind a name to an expression or to a scalar. This of course is really useful if you plan to re-use the expression.

I’ll give an example that I use in real-life – a basic investigative query into failed requests. I’m joining exceptions and failed dependencies (similar to NRT proactive detection). I’m using the let keyword to easily modify the time range of my query.

Here it is, enjoy!

``````let investigationStartTime = datetime("2016-09-07");
let investigationEndTime = investigationStartTime + 1d;
requests
| where timestamp > investigationStartTime
| where timestamp < investigationEndTime
| where success == "False"
| join kind=leftouter(exceptions
| where timestamp > investigationStartTime
| where timestamp < investigationEndTime
| project exception=type , operation_Id ) on operation_Id
| join kind=leftouter (dependencies
| where timestamp > investigationStartTime
| where timestamp < investigationEndTime
| where success == "False"
| project failed_dependency=name, operation_Id ) on operation_Id
| project timestamp, operation_Id , resultCode, exception, failed_dependency
``````

# App Analytics Machine Learning: Autocluster

Don’t freak out about the title. I’m going to show some powerful machine-learning algorithms behind the scenes — But they are also super-duper easy to use and understand from analytics query results.

I’ll start with Autocluster(). What this operator does, is take all your data, and classify it into clusters. So we’re basically bunching your data into groups. This is very useful in a few scenarios:

1. Classify request failures – easily see if all failures have a certain response code, are on a certain role instance, a certain operation, or from a specific country etc.
2. Classify exceptions.
3. Classify failed dependencies.

This is actually the feature that is being used in the Near Real-Time Proactive Alerts feature to classify the characteristics of the request failure spike.

Let’s get to an example.

I just deployed my service, and checking the portal I see a huge spike in failed requests:

So I know something went terribly wrong, I just don’t know what.

Now, ordinarily what I would do in a situation like this is just take a random failed request, and try to trace the reason it specifically failed. But this can be wrong – several times I just happened to take a failed request that was completely not indicative of the real problem.

So this is where Autocluster() kicks in.

``````requests
| where success == "False"
| where timestamp > datetime("2016-06-09 14:00")
| where timestamp < datetime("2016-06-09 18:00")
| join (exceptions | project type, operation_Id ) on operation_Id
| project name , cloud_RoleInstance , type
| evaluate autocluster(0.85)``````

This is basically a query of all the failed requests in the specific timeframe, joined to exceptions. On top of this query I’m running the “evaluate autocluster()” command.

The result I’m expecting is bunching all these records into several groups, which will help me diagnose the common characteristics of my failures.

The results look like this:

!!!

So the autocluster algorithm went over all the data, and found that

• 71% of the requests failed due to 1 specific exception.
• The exception is found on all of my instances – see the “*” in the instance column.

Autocluster just diagnosed the problem in my service, going over thousands of records, in an instant! It’s easy to see why I think this is awesome.

FYI, Autocluster can take in as input any column, even custom dimensions. Ping me in the comments if you have any questions about the usage.

# App Insights Analytics: Extracting data from traces

I wanna show two real-world examples (it really happened to me!) of extracting data from traces, and then using that data to get really great insights.

So a little context here – I have a service that reads and processes messages from an Azure Queue. This message processing can fail, causing the same message to be retried many times.

I We recently introduced a bug into the service (as usual.. ) which caused requests to fail on a null reference exception. I wanted to know exactly how many messages were affected by this bug, but it was kind of hard to tell because the retries cause a lot of my service metrics to be off.

Luckily I have a trace just as I am beginning to process a message that shows the message id :

Start handling message id: 0828ae20-ba09-4f83-bb46-69f4fe25b510, dequeue count: 1, message: …

So what I did is extract the message id from the trace using a simple regex, and was then able to count messages using dcount:

``````traces
| where timestamp > ago(1d)
| where message startswith "Start handling"
| extend messageid = tostring(extract("Start handling message id: ([^:\\/\\s]+), ", 1, message))
| summarize dcount(messageid)``````

And in order to count how many messages were affected by the exception, I did a double join – to the failed requests and to exceptions tables:

``````requests
| where timestamp > ago(1d)
| where success == "False"
| join (exceptions
| where timestamp > ago(1d)
| where type contains "NullRef"
) on operation_Id
| join (traces
| where timestamp > ago(1d)
| where message startswith "Start handling"
| extend messageid = tostring(extract("Start handling message id: ([^:\\/\\s]+), ", 1, message))
) on operation_Id
| summarize dcount(messageid)``````

Voila!

The second example is similar, but this time I extracted a measurement.

Again I started from a trace – I have a trace detailing exactly how late a message that came in the queue is. It looks like this:

Latency: 21 minutes.

I wanted to turn these traces into measurable data that I can slice and dice on. So I used the same extend+extract method as before + a todouble:

``````traces
| where timestamp > ago(1d)
| where message contains "Latency: "
| extend latency = todouble(extract("Latency: ([^:\\/\\s]+) minutes.", 1, message))
| summarize percentile(latency, 90)``````

AWESOME!

# Cool Azure Log Analytics: How many unique users were affected by 404’s?

Here’s another nifty little trick useful for counting how many unique users were impacted from a service issue.

In order to do this, I use the “dcount” aggregation. It counts how many unique values are in the column.

```requests | where timestamp > ago(7d) | where resultCode == "404" | summarize dcount(user_Id)```

# Cool AppInsights Analytics: Extracting url host with a regular expression

Another nice feature of Kusto / Application Insights Analytics is full on support for regular expressions using the extract keyword.

A very useful application of this is all matter of manipulations you can do over the “url” field in requests. A common ask is understanding how much traffic is generated by any of your different hosts.

Since Analytics only carries the full url field, we need to parse out the host out using a regex. I took a really really simple regex in this case, but obviously it can be much more complex.

``````pageViews
| where timestamp > ago(1d)
| extend urlhost=extract('^(http://|https://)([^:\\/\\s]+)', 2, url)
| summarize count() by urlhost
| render piechart``````

### Update:

There is now a simpler method to extract all url parts – parseurl.

``````requests
| take 5
| extend urlParts = parseurl(url)
| project url, urlParts, urlParts.Scheme, urlParts.Host, urlParts.Path, urlParts.Port``````

# Cool Azure Log Analytics: Joining requests and dependencies

Another cool thing you can do with App Insights Analytics is join different data types to get a good understanding of what’s happening in your app.

A great example are remote dependencies – this is an out-of-the-box feature in App Insights that logs all remote dependency calls such as SQL, Azure, http etc. If you’ve got that data flowing, you can get amazing insights with just a few small queries.

Here’s a small example – Lets’ try and find out which resources are real time-hogs in my service. The query I spun out is – per http request, get the average duration spent calling each dependency type.

``````requests
| where timestamp > ago(1d)
| project timestamp, operation_Id
| join (dependencies
| where timestamp > ago(1d)
| summarize sum(duration) by operation_Id, type
) on operation_Id
| summarize avg_duration_by_type=avg(sum_duration) by type, bin(timestamp, 20m)
| render barchart``````