I wanna show two real-world examples (it really happened to me!) of extracting data from traces, and then using that data to get really great insights.
So a little context here – I have a service that reads and processes messages from an Azure Queue. This message processing can fail, causing the same message to be retried many times.
I We recently introduced a bug into the service (as usual.. ) which caused requests to fail on a null reference exception. I wanted to know exactly how many messages were affected by this bug, but it was kind of hard to tell because the retries cause a lot of my service metrics to be off.
Luckily I have a trace just as I am beginning to process a message that shows the message id :
Start handling message id: 0828ae20-ba09-4f83-bb46-69f4fe25b510, dequeue count: 1, message: …
So what I did is extract the message id from the trace using a simple regex, and was then able to count messages using dcount:
traces
| where timestamp > ago(1d)
| where message startswith "Start handling"
| extend messageid = tostring(extract("Start handling message id: ([^:\\/\\s]+), ", 1, message))
| summarize dcount(messageid)
And in order to count how many messages were affected by the exception, I did a double join – to the failed requests and to exceptions tables:
requests
| where timestamp > ago(1d)
| where success == "False"
| join (exceptions
| where timestamp > ago(1d)
| where type contains "NullRef"
) on operation_Id
| join (traces
| where timestamp > ago(1d)
| where message startswith "Start handling"
| extend messageid = tostring(extract("Start handling message id: ([^:\\/\\s]+), ", 1, message))
) on operation_Id
| summarize dcount(messageid)
Voila!
The second example is similar, but this time I extracted a measurement.
Again I started from a trace – I have a trace detailing exactly how late a message that came in the queue is. It looks like this:
Latency: 21 minutes.
I wanted to turn these traces into measurable data that I can slice and dice on. So I used the same extend+extract method as before + a todouble:
traces
| where timestamp > ago(1d)
| where message contains "Latency: "
| extend latency = todouble(extract("Latency: ([^:\\/\\s]+) minutes.", 1, message))
| summarize percentile(latency, 90)
AWESOME!
3 thoughts on “App Insights Analytics: Extracting data from traces”