One of the major use cases for log analytics is root cause investigation. For this, many times you just want to look at all your data and find records that relate to a specific session, operation, or error. I already showed one way you can do this using ‘search’, but I want to show how you can do this using ‘union *‘, which is more versatile.
union *
| where timestamp > ago(1d)
| where operation_Id contains '7'
| project timestamp, operation_Id, name, message
Or, to search for a term across all columns:
union *
| where timestamp > ago(1d)
| where * contains 'error'
| project timestamp, operation_Id, name, message
This is really powerful, and can be used to basically do a full table scan across all your data.
But one thing that always annoyed me is that you never know which table the data came from. I just discovered a really easy way to get this – using the ‘withsource’ qualifier:
union withsource=sourceTable *
| where timestamp > ago(1d)
| where * contains 'error'
| project sourceTable, timestamp, operation_Id, name, message
App Insights Analytics just released Smart Diagnostics, and it is by far the best application of Machine Learning analytics in the service to date.
I’ve posted before about some ML features such as autocluster and smart alerting, but this one really takes the cake as the most powerful and useful yet:
It’s super-duper easy to use, despite the huge complexity of the Machine Learning algo behind the scenes.
It’s fast!
It can give you awesome answers that save you lots of investigation time and agony.
It works by analyzing spikes in charts, and giving you a pattern that explains the sudden change in the data.
So let’s give it a go!
Analyze spike in dependency duration
I run a service that has all kinds of remote dependencies – calls to Azure blobs, queues, http requests, etc.
In my devops hat, I run this simple query almost daily just to gauge the health of my service – a look at the 95th percentile for call duration by dependency type:
dependencies
| where timestamp > ago(1d)
| where operation_SyntheticSource == ""
| summarize percentile(duration, 95) by bin(timestamp,30m), type
| render timechart
The results look like this:
Right off the bat I can see something very funky going on in my http calls. I wanna know exactly what’s going on, but drilling in to the raw data can be a messy business.
If only there were a way to analyze that spike with just one click…!
Fortunately, there’s a small purple dot on that spike. It signifies that this spike is available for analysis with Machine Learning (aka Smart Diagnostics).
Once I click on it, the magic happens.
Smart Diagnostics just told me that the cause for the spike in call duration was:
It sounded a lot like a challenge to me, so I just couldn’t resist!!
Sankey diagrams, for those who don’t know, are an amazing tool for describing user flows AND are the basis for one of the most famous data visualizations of all time. But really, I had no idea how to create one. So I googled Sankey + PowerBI and came across this fantastic Amir Netz video:
So all you need to create the diagram is a table with 3 columns:
Source
Destination
Count
And PowerBI takes care of the rest.
We already know you can take any App Insights Analytics query and “export” the data to PowerBI.
So the only problem now is how do I transform my AppInsights custom events table into a Sankey events table?
Let’s go to Analytics!
First it’s best to have an idea of what we’re trying to do. Decide on 5-10 events that make sense for a user session flow. In my case, I want to see the flow of users through a new feature called “CTD”. So the events I chose are:
CTD_Available (feature is available for use)
CTD_Result (user used the feature and got a result)
CTD_DrillIn (user chose to further drill-in the results)
CTD_Feedback (user chose to give feedback)
In every step, I’m interested in seeing how many users I’m “losing”, and what they’re doing next.
Ok, let’s get to work!
In the first query, we’ll:
Filter out only relevant events
Sort by timestamp asc (don’t forget that in this chart, order is important!)
Summarize by session_id using makelist, to put all events that happened in that session in an ordered list. If you’re unfamiliar with makelist, all it does is take all the values of the column, and stuffs them into a list. The resulting lists are the ordered events that users triggered in each session.
customEvents
| where timestamp > ago(7d)
| where name=="CTD_Available" or name=="CTD_Result" or
   name=="CTD_DrillIn" or name=="CTD_Feedback"
| sort by timestamp asc
| summarize l=makelist(name) by session_Id
The next step is to add an “EndSession” event to each list, just to make sure my final diagram is symmetric. You might already have this event as part of your telemetry; I don’t. This is optional, and you can choose to remove this step.
To do this, I need to chop off the first item in the list and “zip” it (like a zipper) with the original list. In C# this is very easy – list.Zip(list.Skip(1)).
Amazingly, App Analytics has a zip command! Tragically, it doesn’t have a skip… :( This means we need to do some ugly regex work in order to chop off the first element.
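Putting it all together, here’s a sketch of the full transformation. The regexes assume the default JSON serialization of the makelist output, and “EndSession” is the synthetic terminator event mentioned above – adapt both to your own data:

```kusto
customEvents
| where timestamp > ago(7d)
| where name=="CTD_Available" or name=="CTD_Result" or
   name=="CTD_DrillIn" or name=="CTD_Feedback"
| sort by timestamp asc
| summarize l=makelist(name) by session_Id
// append a synthetic "EndSession" so the last real event still gets a destination
| extend l=todynamic(replace(@'\]$', @',"EndSession"]', tostring(l)))
// poor man's skip(1): regex away the first element of the serialized list
| extend l1=todynamic(strcat('[', extract(@'^\["[^"]*",(.*)\]$', 1, tostring(l)), ']'))
// pair each event with its successor, then expand the pairs into rows
| mvexpand pair=zip(l, l1) limit 100
| project Source=tostring(pair[0]), Destination=tostring(pair[1])
| summarize Count=count() by Source, Destination
```

The output is exactly the 3-column Source / Destination / Count table that the PowerBI Sankey visual expects.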
There’s a pretty nice operator in Kusto (or App Insights Analytics) called top-nested.
It basically allows you to do a hierarchical drill-down by dimensions. Sounds a bit much, but it’s much clearer when looking at an example!
So a simple use for it could be something like getting the top 5 result-codes, and then a drill down for each result code of top 3 request names for each RC.
requests
| where timestamp > ago(1h)
| top-nested 5 of resultCode by count(),
top-nested 3 of name by count()
So I can easily see which operation names are generating the most 404’s for instance.
This is pretty cute, and can be handy for faceting.
But I actually find it more helpful in a couple of other scenarios.
First one is getting a chart of only the top N values. For instance, if I chart my app usage by country, I get a gazillion series for all the different countries. How can I easily filter the chart to show just my top countries? Well, one way is to do the queries separately and add a bunch of where filters to the chart…
But top nested can save me all that work:
let top_countries = view()
{
customEvents
| where timestamp > ago(3d)
| top-nested 5 of client_CountryOrRegion by count()
};
top_countries
| join kind= inner
(customEvents
| where timestamp >= ago(3d)
) on client_CountryOrRegion
| summarize count() by bin(timestamp, 1h), client_CountryOrRegion
| render timechart
A beautiful view of just my top 5 countries…
I’ve actually used the same technique for a host of different dimensions (top countries, top pages, top errors etc.), and it can also be useful for filtering OUT top values (such as top users skewing the numbers) by changing the join kind to leftanti.
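Here’s a sketch of that filter-OUT variant. The user_Id dimension is just an illustration – swap in whatever dimension is skewing your numbers:

```kusto
let top_users = view()
{
    customEvents
    | where timestamp > ago(3d)
    | top-nested 5 of user_Id by count()
};
customEvents
| where timestamp >= ago(3d)
// leftanti keeps only the events whose user_Id is NOT in the top 5
| join kind=leftanti (top_users) on user_Id
| summarize count() by bin(timestamp, 1h)
| render timechart
```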
The second neat scenario is calculating percentages of a whole. For instance – how do you calculate the percentage of traffic per RC daily?
Yeah, you can do this using a summarize and the (newly-added) areachart stacked100 chart kind:
requests
| where timestamp >= ago(3d)
| where isnotempty(resultCode)
| summarize count() by bin(timestamp, 1h), resultCode
| render areachart kind=stacked100
But this only partially solves my problem.
Because ideally, I don’t want to look at all these 200’s crowding my chart. I would like to look at only the 40X’s and 500’s, but still as a percentage of ALL my traffic.
I could do this by adding a bunch of countif(resultCode == 403)/count(), countif(resultCode == 404)/count()… ad nauseam, but this is tiresome, plus you don’t always know all the possible values when creating a query.
Here’s where top-nested comes in. Because it shows the aggregated value for each level, creating the percentages becomes super-easy. The trick is simply doing the first top-nested by timestamp:
requests
| where timestamp > ago(14d)
| top-nested 14 of bin(timestamp, 1d) by count() ,
top-nested 20 of resultCode by count()
| where resultCode !startswith "20"
| where resultCode !startswith "30"
| project pct=aggregated_resultCode * 1.0 / aggregated_timestamp,
timestamp, resultCode
| render timechart
First, the “let” keyword – it basically allows you to bind a name to an expression or to a scalar. This of course is really useful if you plan to re-use the expression.
I’ll give an example that I use in real life – a basic investigative query into failed requests. I’m joining exceptions and failed dependencies (similar to NRT proactive detection), and using the let keyword to easily modify the time range of my query.
Here it is, enjoy!
let investigationStartTime = datetime("2016-09-07");
let investigationEndTime = investigationStartTime + 1d;
requests
| where timestamp > investigationStartTime
| where timestamp < investigationEndTime
| where success == "False"
| join kind=leftouter(exceptions
| where timestamp > investigationStartTime
| where timestamp < investigationEndTime
| project exception=type , operation_Id ) on operation_Id
| join kind=leftouter (dependencies
| where timestamp > investigationStartTime
| where timestamp < investigationEndTime
| where success == "False"
| project failed_dependency=name, operation_Id ) on operation_Id
| project timestamp, operation_Id , resultCode, exception, failed_dependency
I’ve already talked about how cool proactive alerts are, but one thing I missed is that you can actually tweak these alerts through the Azure portal.
Go to the “Alerts” blade, and there you should find a single “Proactive Diagnostics” alert.
If you click it and go to the alert configuration, you can set email recipients, setup a webhook, and enable/disable.
One thing that is really useful is the “Receive detailed analysis” checkbox:
This will make sure you get the entire deep-dive analysis of the incident straight to your email inbox, without needing to go to the portal. This can save some very valuable minutes during a live-site incident!
I wanna show two real-world examples (it really happened to me!) of extracting data from traces, and then using that data to get really great insights.
So a little context here – I have a service that reads and processes messages from an Azure Queue. This message processing can fail, causing the same message to be retried many times.
We recently introduced a bug into the service (as usual…), which caused requests to fail on a null reference exception. I wanted to know exactly how many messages were affected by this bug, but it was kind of hard to tell because the retries caused a lot of my service metrics to be off.
Luckily, I have a trace just as I begin processing a message that shows the message id:
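From there, the query is straightforward – extract the id from the trace text and count distinct ids. This is a sketch: the trace wording and the regex are stand-ins for my actual telemetry format, so adjust both to match yours:

```kusto
traces
| where timestamp > ago(1d)
// hypothetical trace text: "Starting to process message id: <guid>"
| extend messageId = extract('message id: ([0-9a-f-]+)', 1, message)
| where isnotempty(messageId)
// attempts counts retries too; dcount gives the number of distinct messages hit
| summarize attempts = count(), affectedMessages = dcount(messageId)
```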
Another nice feature of Kusto / Application Insights Analytics is full-on support for regular expressions via the extract keyword.
A very useful application of this is all manner of manipulations you can do over the “url” field in requests. A common ask is understanding how much traffic is generated by each of your hosts.
Since Analytics only carries the full url field, we need to parse the host out using a regex. I took a really simple regex in this case, but obviously it can be much more complex.
pageViews
| where timestamp > ago(1d)
| extend urlhost=extract('^(http://|https://)([^:\\/\\s]+)', 2, url)
| summarize count() by urlhost
| render piechart
Update:
There is now a simpler method to extract all url parts – parseurl.
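A sketch of the same host breakdown using parseurl, which returns a dynamic object with properties such as Host, Scheme and Path:

```kusto
pageViews
| where timestamp > ago(1d)
// no regex needed - parseurl splits the url into its parts
| extend urlhost = tostring(parseurl(url).Host)
| summarize count() by urlhost
| render piechart
```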
Another cool thing you can do with App Insights Analytics is join different data types to get a good understanding of what’s happening in your app.
A great example are remote dependencies – this is an out-of-the-box feature in App Insights that logs all remote dependency calls such as SQL, Azure, http etc. If you’ve got that data flowing, you can get amazing insights with just a few small queries.
Here’s a small example – let’s try and find out which resources are real time-hogs in my service. The query I spun out is: per http request, get the average duration spent calling each dependency type.
requests
| where timestamp > ago(1d)
| project timestamp, operation_Id
| join (dependencies
| where timestamp > ago(1d)
| summarize sum(duration) by operation_Id, type
) on operation_Id
| summarize avg_duration_by_type=avg(sum_duration) by type, bin(timestamp, 20m)
| render barchart