How to Split Your Data Into a Fixed Number of Buckets

Okay, another question from Twitter (original content will have to wait till I get some more free time!)

Here’s the challenge: split query results into a fixed number of time buckets, without knowing the time range of the data in advance.

So what we need to do here is somehow infer the time range of the query, and then create a fixed set of time bins according to that range.

I think the only way to do that is by performing two queries – one to get the time range and convert it into a fixed interval, and a second query with the actual logic.

To convert the result of the first query into a ‘variable’ we can use in the second query, I’ll use the ‘toscalar’ function.

Here we go:

let numberOfBuckets = 24;
let interval = toscalar(requests
    | summarize interval = (max(timestamp) - min(timestamp)) / numberOfBuckets
    | project floor(interval, 1m));
requests
| summarize count() by bin(timestamp, interval)

I use ‘floor’ here just to round the interval and make the results a bit more readable.
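
One thing to keep in mind: bin() aligns buckets to a fixed origin rather than to the start of your data, so with an arbitrary interval you can end up with one extra, partially filled bucket. If that matters, here is a variation (just a sketch) that uses bin_at to anchor the buckets to the actual start of the data:

let numberOfBuckets = 24;
let start = toscalar(requests | summarize min(timestamp));
let interval = toscalar(requests
    | summarize interval = (max(timestamp) - min(timestamp)) / numberOfBuckets
    | project floor(interval, 1m));
requests
| summarize count() by bin_at(timestamp, interval, start)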

Back-fill Missing Dates With Zeros in a Time Chart

A common ask I’ve heard from several users is the ability to fill gaps in your data in Kusto/App Analytics/Data Explorer (lots of names these days!):

If your data has gaps in time, the default behavior for App Analytics is to “connect the dots” rather than reflect that there was no data at those times. In lots of cases we’d like to fill these missing dates with zeros.

The way to handle this is to use the “make-series” operator. This operator exists to enable advanced time-series analysis on your data, but we’ll just use it for the simple use case of adding missing dates with a “0” value.

Some added sophistication is converting the series back into regular rows (as if they came from a plain summarize) using “mvexpand”, so we can continue to transform the data as usual.

Here’s the query (thanks Tom for helping refine it!):

let start = floor(ago(3d), 1d);
let end = floor(now(), 1d);
let interval = 5m;
requests
| where timestamp > start
| make-series counter=count() default=0
              on timestamp in range(start, end, interval)
| mvexpand timestamp, counter
| project timestamp = todatetime(timestamp), counter = toint(counter)
| render timechart


Monitoring and Scaling Azure Functions

Everybody loves Azure Functions.

My team recently deployed a production service using Azure Functions as its back-end backbone. I’d like to share some lessons and tips we learned along the way.

We’re using Azure Functions on the Consumption plan – which basically means the platform scales in and out as required, without our intervention. But that doesn’t mean you can just forget about scaling.

Monitor! Monitor! Monitor!

Azure Functions has really great integration with App Insights. It makes it really easy to get near real-time data on what’s going on in your app.

Coupled with Log Analytics, this is extremely valuable to get going right from the beginning. Skip this step at your own peril…

Here’s a little taster of what you can get – a very useful query that’ll give you a feel for your app’s performance – 95th percentile request duration by request name:

requests
| where timestamp > ago(7d)
| summarize percentile(duration, 95) by name, bin(timestamp, 1h)
| render timechart

You Gotta Have Context

We’re using App Insights as our complete monitoring platform – meaning we’re calling App Insights from the Function code itself to trace logs, events and dependencies.

So if all your application monitoring data is in App Insights, it’s super-duper useful to be able to correlate all the telemetry from one request (request, traces, dependencies, events) under one context.

The Azure Functions App Insights integration already sets the operation_Id field on all requests to the context invocation Id. What we did is set that operation Id on *all* telemetry items. You can’t really use a telemetry initializer, because you don’t control the telemetry client instance. Here’s what we did instead – store the invocation Id, and then put it in every telemetry item:

using System;
using System.Configuration;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public class ApplicationInsightsTracer
{
    // A single, lazily created TelemetryClient shared by all invocations
    private static readonly Lazy<TelemetryClient> TelemetryClient =
        new Lazy<TelemetryClient>(InitTelemetryClient);

    public string OperationId { get; set; }

    private static TelemetryClient InitTelemetryClient()
    {
        var telemetryClient = new TelemetryClient(TelemetryConfiguration.Active)
        {
            InstrumentationKey =
                ConfigurationManager.AppSettings["APPINSIGHTS_INSTRUMENTATIONKEY"]
        };
        return telemetryClient;
    }

    public ApplicationInsightsTracer(Guid contextInvocationId)
    {
        this.OperationId = contextInvocationId.ToString();
    }

    public void TrackEvent(string name)
    {
        var eventTelemetry = new EventTelemetry(name);
        // Stamp the invocation Id on the telemetry so it correlates with the request
        eventTelemetry.Context.Operation.Id = OperationId;
        TelemetryClient.Value.TrackEvent(eventTelemetry);
    }
}

Then, in the function code:

[FunctionName("MyFunc")]
public static async Task Run(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "My")] HttpRequestMessage req,
    TraceWriter log,
    ExecutionContext context)
{
    var tracer = new ApplicationInsightsTracer(context.InvocationId);
    ...
}

Also – make sure you *don’t* call Flush() in your function code. In our tests it added about 200ms to every function invocation, and flushes happen periodically on their own anyway.

Roles Matter

Our service has several different roles in it:

  • A high-usage HTTP API that is called with very high concurrency.
  • A job-scheduling HTTP API that gets called about once an hour.
  • A Service Bus queue-based worker role that does long, heavy data crunching.

At first, when we were just getting started with Azure Functions, we shoved all of these functions into one Azure Functions resource. Wrong!

When you put them all together, they scale together! So whenever the long-running processing scaled out to more instances, the HTTP roles scaled out too, and their performance suffered.

Different roles, with different scaling requirements, should be separated into separate Azure Functions resources.

If you’ve got the App Insights integration set up, here is a query that we used a lot to help us understand what exactly is scaling in our service – a distinct count of role instances per hour in our deployment:

requests
| where timestamp > ago(7d)
| summarize dcount(cloud_RoleInstance) 
            by bin(timestamp, 1h), cloud_RoleName
| render timechart
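
Here’s a related sketch – to check whether scale-out events actually line up with latency, you can put the instance count and the 95th percentile duration for a single role on the same timeline (the role name below is just a placeholder for one of your own):

requests
| where timestamp > ago(7d)
| where cloud_RoleName == "my-http-api" // hypothetical role name
| summarize instances = dcount(cloud_RoleInstance),
            p95duration = percentile(duration, 95)
            by bin(timestamp, 1h)
| render timechart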


Different roles should also have different properties – things like the client affinity cookie should be enabled or disabled on a per-role basis.


Searching all Tables with Union, Searching all Fields with ‘*’

One of the major use cases for log analytics is root cause investigation. For this, many times you just want to look at all your data and find records that relate to a specific session, operation, or error. I already showed one way you can do this using ‘search’, but I want to show how you can do this using ‘union *’, which is more versatile.

union *
| where timestamp > ago(1d)
| where operation_Id contains '7'
| project timestamp, operation_Id, name, message

In fact I already used ‘union *’ when I wanted to count users across all tables.
Another useful tool is searching across all fields – you can do this with ‘where *‘:

union *
| where timestamp > ago(1d)
| where * contains 'error'
| project timestamp, operation_Id, name, message

This is really powerful, and can be used to basically do a full table scan across all your data.
But one thing that always annoyed me is that you never know which table the data came from. I just discovered a really easy way to get this – using the ‘withsource’ qualifier:

union withsource=sourceTable *
| where timestamp > ago(1d)
| where * contains 'error'
| project sourceTable, timestamp, operation_Id, name, message

Cross App Queries in Azure Log Analytics

I’ll keep it short and simple this time. Here’s a great way to debug your app across multiple App Insights instances.

So, I have two Azure Functions services running, one serving as an API and the other as a back-end processing engine. Both report telemetry to App Insights (different apps), and I am passing a context along from one to the other – so I can correlate exceptions and bugs.

Wouldn’t it be great to be able to see what happened in a single session across the 2 apps?

It’s possible – using ‘app’ – just plug in the name of the App Insights resource you want to query, and add a simple ‘union’.

Here you go:

let session="reReYiRu";
union app('FE-prod').traces, app('BE-prod').traces
| where session_Id == session 
| project timestamp, session_Id, appName, message
| order by timestamp asc 


Don’t forget –

  1. You can use the field ‘appName‘ to see which app this particular trace is coming from.
  2. Different machines have different clocks… don’t count on the timestamp ordering to always be correct (see the sketch below).
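
If the ordering really matters, one option (just a sketch, assuming the operation_Id is propagated between the two apps as part of the context described earlier) is to group the telemetry by operation and app, rather than relying on the clock:

let session="reReYiRu";
union app('FE-prod').traces, app('BE-prod').traces
| where session_Id == session
| summarize traces = makelist(message) by operation_Id, appName
| order by appName asc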

A Simple Way to Extract Data From Traces – ‘Parse’

There is a nifty little operator in Azure Log Analytics that has really simplified how I work with regular expressions – it’s called “parse”, and I’ll explain it through a little example.

Let’s say you have a service that emits traces like:

traces
| where message contains "Error"
| project message

11:07 Error-failed to connect to DB(code: 100)
12:02 Error-failed to connect to DB(code: 100)
12:05 Error-query failed on syntax(code: 355)
12:06 Error-query failed on timeout(code: 567)

I’d like to count how many errors I get for each code, and then put the whole thing on a timechart that I can add to my dashboard, in order to monitor errors in my service.

Obviously I’d like to extract the error code from the trace, so I need a regular expression.

Well, if you’re anything like me, the first thing you’ll do is start feverishly googling regular expressions to try to remember how the heck to do it… and then flail for like an hour until you get it right.
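
For comparison, here is roughly what the regex route looks like using the extract() function (a sketch – the exact pattern depends on your trace format):

traces
| where message contains "Error"
| project errorCode = extract(@"\(code: (\d+)\)", 1, message)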

Well, using parse, things are much, much easier:

traces
| where message contains "Error"
| parse message with * "(code: " errorCode ")" *
| project errorCode

100
100
355
567

And from here summarizing is just a breeze:

traces
| where message contains "Error"
| parse message with * "(code: " errorCode ")" *
| summarize count() by errorCode, bin(timestamp, 1h)
| render areachart kind=stacked

Happy parsing!

Using Azure Log Analytics to Calculate User Engagement Metrics

Engagement/usage metrics are some of the most commonly used, yet trickiest to calculate, metrics out there. I myself have seen just about 17 different ways to calculate stickiness, churn, etc. in analytics – each with its own drawbacks, all of them complex and hard to understand.

I’ve touched on this subject before when I offered a query for stickiness, but

  1. It was complex and convoluted (yes, I’ll admit it!)
  2. Hyper-log-log (hll) has known limitations in precision, especially when dealing with small numbers (see the quick check after this list).
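
If you want a feel for how large that estimation error is on your own data, here is a quick check (just a sketch) comparing dcount() at its least and most accurate settings – accuracy levels range from 0 (fastest, least accurate) to 4 (slowest, most accurate):

union *
| where timestamp > ago(1d)
| summarize estimateLow = dcount(user_Id, 0), estimateHigh = dcount(user_Id, 4)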

I’m really glad to showcase some new capabilities in Azure Log Analytics that super-simplify everything about these metrics. These are the new operators:

evaluate activity_engagement(...)
evaluate activity_metrics(...)

I really won’t babble too much here; there’s official documentation for that. But the basic concept is so easy you should really just try it out for yourself.

First, stickiness (rolling DAU/MAU). So, so simple:

union *
| where timestamp > ago(90d)
| evaluate activity_engagement(user_Id, timestamp, 1d, 28d)
| project timestamp, Dau_Mau=activity_ratio*100 
| where timestamp > ago(62d) // drop the first 28 days, where the MAU window is only partially covered
| render timechart 

Churn + Retention rate (week over week):

union *
| where timestamp > ago(90d)
| evaluate activity_metrics(user_Id, timestamp, 7d)
| project timestamp, retention_rate, churn_rate
| where retention_rate > 0 and 
  timestamp < ago(7d) and timestamp > ago(83d) // remove partial data in tail and head
| render timechart

Even cooler – you can add dimensions to slice your usage data accordingly. Here is a chart of my apps’ retention rates for different versions of the Chrome browser:

union *
| where timestamp > ago(90d)
| where client_Browser startswith "chrome" 
| evaluate activity_metrics(user_Id, timestamp, 7d, client_Browser)
| where dcount_values > 3
| project timestamp , retention_rate, client_Browser 
| where retention_rate > 0 and 
  timestamp < ago(7d) and timestamp > ago(83d) // remove partial data in tail and head
| render timechart

[Timechart: retention rate by Chrome browser version]