Tweaking Proactive Alerts

I’ve already talked about how cool proactive alerts are, but one thing I missed is that you can actually tweak these alerts through the Azure portal.

Go to the “Alerts” blade, and there you should find a single “Proactive Diagnostics” alert.

nrt-alert

If you click it and go to the alert configuration, you can set email recipients, set up a webhook, and enable or disable the alert.

One thing that is really useful is the “Receive detailed analysis” checkbox:

nrt-detailed

This will make sure you get the entire deep-dive analysis of the incident straight to your email inbox, without needing to go to the portal. This can save some very valuable minutes during a live-site incident!


App Analytics Machine Learning: Autocluster

Don’t freak out about the title. I’m going to show some powerful machine-learning algorithms that work behind the scenes – but they are also super-duper easy to use, and their results are easy to understand straight from Analytics queries.

I’ll start with Autocluster(). What this operator does is take all your data and classify it into clusters – basically bunching your data into groups. This is very useful in a few scenarios:

  1. Classify request failures – easily see whether all failures share a certain response code, occur on a certain role instance, belong to a certain operation, come from a specific country, etc.
  2. Classify exceptions.
  3. Classify failed dependencies.
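
For example, classifying exceptions (scenario 2) can be as simple as the following sketch – the column names come from the standard App Insights schema, and the one-hour time window is just an illustration:

exceptions
| where timestamp > ago(1h)
| project type, method, cloud_RoleInstance
| evaluate autocluster()

This bunches recent exceptions by type, throwing method, and instance, so a dominant cluster pops out immediately.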

This is actually the algorithm used by the Near Real-Time Proactive Alerts feature to classify the characteristics of a request-failure spike.

Let’s get to an example.

I just deployed my service, and checking the portal I see a huge spike in failed requests:

FRRSpike


So I know something went terribly wrong, I just don’t know what.

Now, ordinarily what I would do in a situation like this is take a random failed request and try to trace the reason it specifically failed. But this can be misleading – more than once I happened to pick a failed request that was completely unindicative of the real problem.

So this is where Autocluster() kicks in.

requests
| where success == "False"
| where timestamp > datetime("2016-06-09 14:00")
| where timestamp < datetime("2016-06-09 18:00")
| join (exceptions | project type, operation_Id) on operation_Id
| project name, cloud_RoleInstance, type
| evaluate autocluster(0.85)

This is basically a query for all the failed requests in the specific timeframe, joined to their exceptions. On top of this query I’m running the “evaluate autocluster()” operator.

The result I’m expecting is bunching all these records into several groups, which will help me diagnose the common characteristics of my failures.

The results look like this:

autocluster-results

!!!

So the autocluster algorithm went over all the data and found that:

  • 71% of the requests failed due to one specific exception.
  • That exception is found on all of my instances – see the “*” in the instance column.

Autocluster just diagnosed the problem in my service, going over thousands of records, in an instant! It’s easy to see why I think this is awesome.

FYI, Autocluster can take any column as input, even custom dimensions. Ping me in the comments if you have any questions about the usage.
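
As a sketch, clustering failures by a custom dimension could look like this – the “Environment” dimension name is purely hypothetical, so substitute one of your own:

requests
| where success == "False"
| extend environment = tostring(customDimensions["Environment"])
| project name, environment
| evaluate autocluster()

The tostring() is needed because custom dimension values come back as dynamic.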


Near Real-Time Proactive Alerts

Ok, so besides App Analytics obviously – one of the most bestest and awesomest new features to come out of App Insights recently has gotta be proactive alerts in near real-time.

It might be the best thing since custom dimensions.

The way it works: App Insights auto-magically scans your data and alerts you to anomalies that might be major service issues. The awesome part is:

  1. Absolutely no configuration required. App Insights studies the normal behavior of your service and finds anomalies from that baseline.
  2. This could really save your ass! The alert should come in within about 10 minutes of the problem starting, usually just in time for a quick fix.
  3. It does a root cause analysis for you! As you can see in the mail below, the proactive alert correlates exceptions, failed dependencies, traces, and every other piece of data in App Insights to try and get the root cause right in your face.


In the example below, App Insights finds and alerts on a critical problem in my service – and immediately identifies the culprit: a failing HTTP dependency:

NRT