Don’t freak out about the title. The features I’m going to show run some powerful machine-learning algorithms behind the scenes, but they are also super easy to use, and their output is easy to understand straight from the Analytics query results.
I’ll start with Autocluster(). This operator takes all your data and classifies it into clusters, so we’re basically bunching your data into groups that share common values. This is very useful in a few scenarios:
- Classify request failures – easily see if all failures have a certain response code, are on a certain role instance, a certain operation, or from a specific country etc.
- Classify exceptions.
- Classify failed dependencies.
This is actually what the Near Real-Time Proactive Alerts feature uses to classify the characteristics of a request failure spike.
Let’s get to an example.
I just deployed my service, and checking the portal I see a huge spike in failed requests:
So I know something went terribly wrong, I just don’t know what.
Now, ordinarily what I would do in a situation like this is take a random failed request and try to trace the reason that specific request failed. But this can be misleading: several times I happened to pick a failed request that was not at all indicative of the real problem.
So this is where Autocluster() kicks in.
    requests
    | where success == "False"
    | where timestamp > datetime("2016-06-09 14:00")
    | where timestamp < datetime("2016-06-09 18:00")
    | join (exceptions | project type, operation_Id) on operation_Id
    | project name, cloud_RoleInstance, type
    | evaluate autocluster(0.85)
This is basically a query of all the failed requests in the specific timeframe, joined to exceptions. On top of this query I’m running the “evaluate autocluster()” command.
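A quick note on the 0.85 I passed in. As far as I understand the Kusto documentation, this first argument is a size weight between 0 and 1 (default 0.5) that balances generic clusters with high coverage against more specific ones with many shared values; raising it usually gives fewer clusters, lowering it usually gives more. Treat that description as my reading of the docs rather than a guarantee. Here is a sketch of the same query using the default weight, with the time window pulled into variables (the names windowStart and windowEnd are mine):

```kusto
// Same failed-request query as above; only the time window handling
// and the autocluster argument differ.
let windowStart = datetime(2016-06-09 14:00);
let windowEnd = datetime(2016-06-09 18:00);
requests
| where success == "False"
| where timestamp > windowStart and timestamp < windowEnd
| join (exceptions | project type, operation_Id) on operation_Id
| project name, cloud_RoleInstance, type
// No argument: autocluster uses its default size weight
// (0.5, assuming the documented default).
| evaluate autocluster()
```

Running both versions side by side is an easy way to see how the weight changes the number and size of the clusters on your own data.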
What I expect back is these records bunched into a few groups, which will help me see the common characteristics of my failures.
The results look like this:
So the autocluster algorithm went over all the data, and found that:
- 71% of the requests failed due to one specific exception.
- The exception is found on all of my instances – see the “*” in the instance column.
Autocluster just diagnosed the problem in my service, going over thousands of records, in an instant! It’s easy to see why I think this is awesome.
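The same pattern should carry over to the other two scenarios I listed at the top, exceptions and failed dependencies. The queries below are a rough sketch, not something taken from the incident above: the four-hour window is arbitrary, the column names (problemId, operation_Name, target, resultCode) are the ones I know from the Application Insights schema and may differ in yours, and I’m assuming success is a string in dependencies just as it is in requests.

```kusto
// Sketch 1: cluster recent exceptions by type, problem and location.
exceptions
| where timestamp > ago(4h)
| project type, problemId, operation_Name, cloud_RoleInstance
| evaluate autocluster()

// Sketch 2: cluster recent failed dependency calls.
// Run this as a separate query (queries are separated by a blank line).
dependencies
| where timestamp > ago(4h)
| where success == "False"
| project type, target, name, resultCode, cloud_RoleInstance
| evaluate autocluster()
```

In both cases I project only the columns I want compared before calling autocluster, so that unique values such as IDs and timestamps don’t get in the way of finding common patterns.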
FYI, Autocluster can take any column as input, even custom dimensions. Ping me in the comments if you have any questions about the usage.
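For the custom dimensions case, here is a minimal sketch. The property name DeploymentRing is purely hypothetical; substitute whatever your telemetry actually sends. I pull the value out into its own string column first, on the assumption that a plain column is the safest thing to hand to autocluster.

```kusto
// Cluster failed requests, including one custom dimension.
requests
| where success == "False"
| where timestamp > ago(4h)
// DeploymentRing is a hypothetical custom property, used for illustration only.
| extend deploymentRing = tostring(customDimensions.DeploymentRing)
| project name, cloud_RoleInstance, deploymentRing
| evaluate autocluster()
```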