Monitoring and Scaling Azure Functions

Everybody loves Azure Functions.

My team recently deployed a production service using Azure Functions as the backbone of our back end. I’d like to share some lessons and tips we picked up along the way.

We’re using Azure Functions on the Consumption plan – which basically means the platform scales out and in as required, without our intervention. But that doesn’t mean you can just forget about scaling.

Monitor! Monitor! Monitor!

Azure Functions has really great integration with App Insights. It makes it really easy to get near real-time data on what’s going on in your app.

Coupled with Log Analytics, this is extremely valuable to get going right from the beginning. Skip this step at your own peril…

Here’s a little taster of what you can get – a very useful query that’ll give you a feel for your app’s performance – 95th percentile request duration, by request name:

requests
| where timestamp > ago(7d)
| summarize percentile(duration, 95) by name, bin(timestamp, 1h)
| render timechart

You Gotta Have Context

We’re using App Insights as our complete monitoring platform – meaning we call App Insights from the function code itself to trace logs, events and dependencies.

So if all your application monitoring data is in App Insights, it’s super-duper useful to be able to correlate all the telemetry from one request (request, traces, dependencies, events) under one context.

The Azure Functions App Insights integration already sets the operation_Id field of each request to the invocation Id from the execution context. What we did is set that operation Id on *all* telemetry items. You can’t really use a telemetry initializer here, because you don’t control the telemetry client instance. Here’s what we did instead – store the context, and then put it in every telemetry item:

using System;
using System.Configuration;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public class ApplicationInsightsTracer
{
    // A single, lazily-initialized client shared across invocations
    private static readonly Lazy<TelemetryClient> TelemetryClient =
        new Lazy<TelemetryClient>(InitTelemetryClient);

    public string OperationId { get; set; }

    private static TelemetryClient InitTelemetryClient()
    {
        var telemetryClient = new TelemetryClient(TelemetryConfiguration.Active)
        {
            InstrumentationKey = ConfigurationManager.AppSettings["APPINSIGHTS_INSTRUMENTATIONKEY"]
        };
        return telemetryClient;
    }

    public ApplicationInsightsTracer(Guid contextInvocationId)
    {
        this.OperationId = contextInvocationId.ToString();
    }

    public void TrackEvent(string name)
    {
        var eventTelemetry = new EventTelemetry(name);
        // Correlate this item with the function invocation (the request)
        eventTelemetry.Context.Operation.Id = OperationId;
        TelemetryClient.Value.TrackEvent(eventTelemetry);
    }
}

Then, in the function code:

[FunctionName("MyFunc")]
public static async Task Run(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "My")] HttpRequestMessage req,
    TraceWriter log,
    ExecutionContext context)
{
    var tracer = new ApplicationInsightsTracer(context.InvocationId);
    ...
}

Also – make sure you *don’t* Flush in your function code. In our tests it added about 200ms to every function invocation. Flushes happen periodically on their own.
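To make this concrete, here’s a rough sketch of how a function body can use the tracer above without ever flushing explicitly. The event names and the OK response are illustrative placeholders, not part of our actual service:

[FunctionName("MyFunc")]
public static async Task<HttpResponseMessage> Run(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "My")] HttpRequestMessage req,
    TraceWriter log,
    ExecutionContext context)
{
    var tracer = new ApplicationInsightsTracer(context.InvocationId);
    tracer.TrackEvent("MyFunc.Started");    // hypothetical event name

    // ... do the actual (async) work of the function here ...

    tracer.TrackEvent("MyFunc.Completed");  // hypothetical event name

    // Note there is no TelemetryClient.Flush() anywhere – the SDK's channel
    // flushes periodically on its own, and forcing it added ~200ms per call.
    return req.CreateResponse(HttpStatusCode.OK);
}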

Roles Matter

Our service has several different roles in it:

  • A high-usage HTTP API that is called with very high concurrency.
  • A job-scheduling HTTP API that gets called about once an hour.
  • A Service Bus queue-based worker role that does long, heavy data crunching.

At first, when we were just getting started with Azure Functions, we shoved all of these functions into one Azure Functions resource. Wrong!

When you put them all together, they scale together! So whenever the long-running processing scaled out to more instances, the HTTP roles scaled with it, and their performance suffered.

Different roles, with different scaling requirements, should be separated into separate Azure Functions resources.

If you’ve got the App Insights integration set up, here’s a query we used a lot to understand what exactly was scaling in our service – a distinct count of role instances per hour in our deployment:

requests
| where timestamp > ago(7d)
| summarize dcount(cloud_RoleInstance) 
            by bin(timestamp, 1h), cloud_RoleName
| render timechart

 

Different roles should also get different configuration – settings like the client affinity cookie should be enabled or disabled on a per-role basis.

 

Welcoming OMS Log Analytics Users

So this piece of news is really awesome!

All OMS Log Analytics customers can now upgrade to the (newly dubbed) Azure Log Analytics, which gives them access to the same amazing set of capabilities that only Application Insights users have had until now.

Also check out the new and improved documentation site. Really nice stuff.

If you’re an OMS user and are just getting into analytics, here are some of my previous posts showcasing just how fantastic this analytics tool can be:

  1. Searching in App Analytics
  2. App Insights Analytics: Extracting data from traces
  3. Using Azure Log Analytics to Calculate User Engagement Metrics
  4. Diagnose Metric Anomalies with Single-Click Machine-Learning Analytics Magic
  5. Cool uses for the top-nested operator

 

App Analytics Machine Learning: Autocluster

Don’t freak out about the title. I’m going to show some features with powerful machine-learning algorithms behind the scenes – but they are also super-duper easy to use, and their results are easy to understand straight from the analytics query output.

I’ll start with Autocluster(). What this operator does is take all your data and classify it into clusters – basically bunching your records into groups that share common characteristics. This is very useful in a few scenarios:

  1. Classify request failures – easily see if all failures have a certain response code, are on a certain role instance, a certain operation, or from a specific country etc.
  2. Classify exceptions.
  3. Classify failed dependencies.

This is actually what the Near Real-Time Proactive Alerts feature uses to classify the characteristics of a request failure spike.

Let’s get to an example.

I just deployed my service, and checking the portal I see a huge spike in failed requests:

[Image: spike in failed requests shown in the portal]

 

So I know something went terribly wrong, I just don’t know what.

Now, ordinarily what I would do in a situation like this is take a random failed request and try to trace the reason it specifically failed. But that can mislead – several times I happened to pick a failed request that was completely unrepresentative of the real problem.

So this is where Autocluster() kicks in.

requests
| where success == "False"
| where timestamp > datetime("2016-06-09 14:00")
| where timestamp < datetime("2016-06-09 18:00")
| join (exceptions | project type, operation_Id ) on operation_Id
| project name, cloud_RoleInstance, type
| evaluate autocluster(0.85)

This is basically a query of all the failed requests in the specific timeframe, joined to exceptions. On top of this query I’m running the “evaluate autocluster()” command.

The result I’m expecting is bunching all these records into several groups, which will help me diagnose the common characteristics of my failures.

The results look like this:

[Image: autocluster results]

!!!

So the autocluster algorithm went over all the data, and found that

  • 71% of the requests failed due to one specific exception.
  • The exception is found on all of my instances – see the “*” in the instance column.

Autocluster just diagnosed the problem in my service, going over thousands of records, in an instant! It’s easy to see why I think this is awesome.

FYI, Autocluster can take in as input any column, even custom dimensions. Ping me in the comments if you have any questions about the usage.
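For instance, here’s a minimal sketch that feeds a custom dimension into the clustering. The “Region” property is hypothetical – substitute whatever custom dimensions your app actually sends:

requests
| where timestamp > ago(1d)
| where success == "False"
| extend region = tostring(customDimensions["Region"])
| project name, resultCode, region
| evaluate autocluster()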

 

 

App Insights Analytics: Extracting data from traces

I wanna show two real-world examples (they really happened to me!) of extracting data from traces, and then using that data to get some really great insights.

So a little context here – I have a service that reads and processes messages from an Azure Queue. This message processing can fail, causing the same message to be retried many times.

We recently introduced a bug into the service (as usual…) which caused requests to fail on a null reference exception. I wanted to know exactly how many messages were affected by this bug, but it was kind of hard to tell because the retries throw a lot of my service metrics off.

Luckily, I emit a trace just as I begin processing a message, and it shows the message id:

Start handling message id: 0828ae20-ba09-4f83-bb46-69f4fe25b510, dequeue count: 1, message: …

So what I did is extract the message id from the trace using a simple regex, and was then able to count messages using dcount:

traces
 | where timestamp > ago(1d)
 | where message startswith "Start handling"
 | extend messageid = tostring(extract("Start handling message id: ([^:\\/\\s]+), ", 1, message))
 | summarize dcount(messageid)

And in order to count how many messages were affected by the exception, I did a double join – from the failed requests to both the exceptions and the traces tables:

requests 
| where timestamp > ago(1d)
| where success == "False"
| join (exceptions
   | where timestamp > ago(1d)
   | where type contains "NullRef"
   ) on operation_Id
| join (traces
   | where timestamp > ago(1d)
   | where message startswith "Start handling"
   | extend messageid = tostring(extract("Start handling message id: ([^:\\/\\s]+), ", 1, message))
   ) on operation_Id
| summarize dcount(messageid)

Voila!

The second example is similar, but this time I extracted a measurement.

Again I started from a trace – this one details exactly how late a message that arrived in the queue was. It looks like this:

Latency: 21 minutes.

I wanted to turn these traces into measurable data that I can slice and dice. So I used the same extend+extract method as before, plus a todouble:

traces
| where timestamp > ago(1d)
| where message contains "Latency: "
| extend latency = todouble(extract("Latency: ([^:\\/\\s]+) minutes.", 1, message))
| summarize percentile(latency, 90)

AWESOME!
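And once the latency is a number, it’s easy to slice it further. For example, a small sketch along the same lines that charts the 90th percentile over time:

traces
| where timestamp > ago(1d)
| where message contains "Latency: "
| extend latency = todouble(extract("Latency: ([^:\\/\\s]+) minutes.", 1, message))
| summarize percentile(latency, 90) by bin(timestamp, 1h)
| render timechart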

Cool AppInsights Analytics: Extracting url host with a regular expression

Another nice feature of Kusto / Application Insights Analytics is full-on support for regular expressions via the extract keyword.

A very useful application of this is all manner of manipulations you can do over the “url” field in requests. A common ask is understanding how much traffic is generated by each of your different hosts.

Since Analytics only carries the full url field, we need to parse the host out using a regex. I used a really simple regex in this case, but obviously it can be much more complex.

pageViews
| where timestamp > ago(1d)
| extend urlhost=extract('^(http://|https://)([^:\\/\\s]+)', 2, url)
| summarize count() by urlhost
| render piechart

Update:

There is now a simpler method to extract all url parts – parseurl.

requests
| take 5
| extend urlParts = parseurl(url)
| project url, urlParts, urlParts.Scheme, urlParts.Host, urlParts.Path, urlParts.Port

[Image: query results showing the parsed url parts]
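Putting the two together, here’s a sketch of the same traffic-by-host breakdown from before, this time using parseurl instead of a regex:

pageViews
| where timestamp > ago(1d)
| extend urlParts = parseurl(url)
| extend urlhost = tostring(urlParts.Host)
| summarize count() by urlhost
| render piechart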

Cool Azure Log Analytics: Joining requests and dependencies

Another cool thing you can do with App Insights Analytics is join different data types to get a good understanding of what’s happening in your app.

A great example is remote dependencies – an out-of-the-box feature in App Insights that logs all remote dependency calls, such as SQL, Azure, HTTP etc. If you’ve got that data flowing, you can get amazing insights with just a few small queries.

Here’s a small example – let’s try to find out which resources are the real time-hogs in my service. The query I spun up is: per HTTP request, get the average duration spent calling each dependency type.

requests
| where timestamp > ago(1d)
| project timestamp, operation_Id
| join (dependencies
        | where timestamp > ago(1d)
        | summarize sum(duration) by operation_Id, type 
        ) on operation_Id
| summarize avg_duration_by_type=avg(sum_duration) by type, bin(timestamp, 20m)
| render barchart

[Image: average dependency duration by type, rendered as a barchart]

Cool AppInsights Analytics: Counting sampled data

If you’re doing the stuff you’re supposed to be doing in Analytics – like slicing and dicing requests, counting page views, etc. – then you should probably make sure you’re counting correctly.

Two big pitfalls here are:

  1. If you’re sampling your data with the App Insights 2.0 SDK, then you should obviously account for that when counting.
  2. If you’ve got a bunch of tests set up, then you probably don’t want to count those as page views.

For #1, you need to make sure you are always summing items – do sum(itemCount) instead of a simple count().

For #2, remember to add a where clause on the synthetic source field.

Here’s an example:

requests
| where timestamp > ago(1d)
| where operation_SyntheticSource == ""
| summarize sum(itemCount) by performanceBucket
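A quick way to see how much sampling is affecting you (a small sketch, nothing service-specific) is to compare the raw row count with the itemCount sum side by side:

requests
| where timestamp > ago(1d)
| summarize sampled_rows = count(), actual_requests = sum(itemCount)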

Cool AppInsights Analytics: Charting request failure rate

Here is a really cool App Analytics query over App Insights that shows the request failure ratio of your app over the last week.

I use “extend” with the “iff” function to create a successes field I can count, and then use “extend” again to compute the failure ratio.

requests
| where timestamp > ago(7d)
| extend isSuccess = iff(success == "True", 1, 0)
| summarize failures = sum(1 - isSuccess), successes = sum(isSuccess)
            by bin(timestamp, 20m)
| extend ratio = todouble(failures) / todouble(failures + successes)
| project timestamp, failure_Percent = ratio * 100
| render timechart

[Image: request failure percentage over time]
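By the way, the same chart can be written a bit more compactly with countif – a sketch that should be equivalent to the query above:

requests
| where timestamp > ago(7d)
| summarize failure_Percent = 100.0 * countif(success == "False") / count() by bin(timestamp, 20m)
| render timechart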