PaaS

Azure Service Bus Monitoring and Alerting using Azure Function and Application Insights


Having designed and architected solutions for our clients on Azure for many years, we know that Service Bus plays an integral part in most application architectures that involve a messaging layer. At the same time, we also know there is no straightforward answer when customers ask us about the native monitoring and alerting capabilities of Service Bus. For visual dashboards, you need to drill down into the overview section of the queue blade.

For diagnostics, only operational logs are available natively.

There are a few third-party products on the market that have built a good monitoring and alerting story around Azure Service Bus, but they come at an additional cost.

In the quest to answer our customers' question about how to get monitoring and alerting capabilities for Azure Service Bus, I figured out that the answer lies within Azure itself. This blog post illustrates a proof-of-concept solution built as part of one of our customer engagements. The PoC solution uses native Azure services, including:

  • Service Bus
  • Functions
  • Application Insights
  • Application Insights Analytics
  • Application Insights Alerts
  • Dashboard

The only service that would add cost to your monthly Azure bill is Functions (assuming Application Insights is already part of your application architecture). You would need to weigh the cost of purchasing a third-party monitoring product against the cost of running the function.

Let’s dive into the actual solution.

Step 1: Create an Azure Service Bus Queue

This is of course a prerequisite, since we will be monitoring and alerting on this queue. For the PoC, I created a queue (named queue2) under a Service Bus namespace with the root manage shared access key. I then filled up the queue using one of my favorite tools, Service Bus Explorer.
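If you prefer to seed the queue from code instead of Service Bus Explorer, below is a minimal sketch using the same classic Microsoft.ServiceBus.Messaging SDK that the monitoring function relies on. The "ServiceBusConnectionString" setting name and the queue2 queue name are the ones assumed in this PoC; adjust them to your environment.

using System;
using Microsoft.ServiceBus.Messaging;

class QueueSeeder
{
    static void Main()
    {
        // Namespace-level connection string (e.g. the RootManageSharedAccessKey policy).
        var connectionString = Environment.GetEnvironmentVariable("ServiceBusConnectionString");

        var client = QueueClient.CreateFromConnectionString(connectionString, "queue2");
        for (int i = 0; i < 20; i++)
        {
            // BrokeredMessage is the message type used by the classic Service Bus SDK.
            client.Send(new BrokeredMessage($"test message {i}"));
        }
        client.Close();
    }
}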

Step 2: Create an Azure Function

The next step is to create a function. The function’s logic is to:

  1. Query the Service Bus namespace to fetch all the queues and topics under it.
  2. Get the count of active and dead-letter messages.
  3. Create custom telemetry metrics.
  4. Finally, log the metrics to Application Insights.

I chose C#, but other languages are available. I also configured the function to trigger every 5 seconds, so the telemetry is near real time.
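If you use the C# script (.csx) model, as in the sample below, the 5-second schedule is expressed as a six-field CRON expression in the function’s function.json. Here is a minimal sketch; the binding name matches the myTimer parameter of the sample function:

{
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "*/5 * * * * *"
    }
  ],
  "disabled": false
}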

Step 3: Add Application Insights to the Function

Application Insights will be used by the function to log the Service Bus telemetry. Create or reuse an Application Insights instance and use its instrumentation key in the C# code. I have pasted the function code used in my PoC below. The logging part of the code relies on the custom metrics concept of Application Insights. For the PoC, I created two custom metrics: “Active Message Count” and “Dead Letter Count”.

Sample Function:

#r "Microsoft.ServiceBus"

using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using static System.Environment;

public static async Task Run(TimerInfo myTimer, TraceWriter log)
{
    // Connect to the Service Bus namespace using the connection string
    // stored in the function app settings.
    var namespaceManager = NamespaceManager.CreateFromConnectionString(
        Env("ServiceBusConnectionString"));

    // Log message counts for every subscription of every topic.
    foreach (var topic in await namespaceManager.GetTopicsAsync())
    {
        foreach (var subscription in await namespaceManager.GetSubscriptionsAsync(topic.Path))
        {
            await LogMessageCountsAsync(
                $"{Escape(topic.Path)}.{Escape(subscription.Name)}",
                subscription.MessageCountDetails, log);
        }
    }

    // Log message counts for every queue.
    foreach (var queue in await namespaceManager.GetQueuesAsync())
    {
        await LogMessageCountsAsync(Escape(queue.Path),
            queue.MessageCountDetails, log);
    }
}

private static Task LogMessageCountsAsync(string entityName,
    MessageCountDetails details, TraceWriter log)
{
    var telemetryClient = new TelemetryClient();
    telemetryClient.InstrumentationKey = "YOUR INSTRUMENTATION KEY";

    // Attach the counts to a trace so they show up under custom data...
    var telemetry = new TraceTelemetry(entityName);
    telemetry.Properties.Add("Active Message Count", details.ActiveMessageCount.ToString());
    telemetry.Properties.Add("Dead Letter Count", details.DeadLetterMessageCount.ToString());

    // ...and also track them as custom metrics so they can be charted and alerted on.
    telemetryClient.TrackMetric(new MetricTelemetry("Active Message Count", details.ActiveMessageCount));
    telemetryClient.TrackMetric(new MetricTelemetry("Dead Letter Count", details.DeadLetterMessageCount));
    telemetryClient.TrackTrace(telemetry);

    return Task.CompletedTask;
}

private static string Escape(string input) => Regex.Replace(input, @"[^A-Za-z0-9]+", "_");

private static string Env(string name) => GetEnvironmentVariable(name, EnvironmentVariableTarget.Process);

Step 4: Test your function

The next step is to test your function by running it. If everything is set up right, you should start seeing telemetry in Application Insights. When you select one of the traces, you should be able to view the “Active Message Count” and “Dead Letter Count” under custom data. In the screenshot below, my queue2 has 17 active messages and 0 dead-letter messages.

Step 5: Add an Application Insights Analytics Query

The next step is to use Application Insights Analytics to render a Service Bus chart for monitoring. From the Application Insights blade, click the Analytics icon. Analytics opens a separate portal with a query window, where you write a query that renders a time chart for a queue based on those custom metrics. You can use the sample query below as a start.

Sample Query:

traces
| where message has 'queue2'
| extend activemessagecount = todouble(customDimensions["Active Message Count"])
| summarize avg(activemessagecount) by bin(timestamp, 5m)
| order by timestamp asc
| render timechart

Step 6: Publish the Chart to a Dashboard

The Analytics chart can be published (via the pin icon) to an Azure dashboard, which lets monitoring users actively monitor the Service Bus metrics as soon as they log in to the Azure portal. This removes the need to drill down into the Service Bus blade.

Refer to the Azure documentation to learn more about creating and publishing charts to dashboards.

Step 7: Add Alerts on the Custom Metrics

The last step is to create Application Insights alerts. For the PoC, I created two alerts, on “Active Message Count” and “Dead Letter Count”, each with a threshold. These alert monitoring users by email if the message count exceeds the threshold limit. You can also send these alerts to external monitoring tools via a webhook.

Attached is a sample email from an Azure Application Insights alert:

I hope these steps give you an idea of how the above custom solution, built from native Azure services, can provide basic monitoring and alerting capabilities for Service Bus, and for other Azure services as well. The key is to define the custom metrics you want to monitor and then set up the solution around them.

Claims-to-Windows Identity Translation Solutions and "Considerations" when using AD Application Proxy

Problem Statement:

At one of my consulting engagements this year, my team was unable to communicate from a claims-aware Azure web application, via the client browser, to an on-premises, Windows-authenticated SOAP endpoint. To overcome the identity mismatch between the endpoints (claims vs. Windows), we were using the Azure AD Application Proxy to perform the identity translation. Our problems primarily occurred in communicating between the browser and the Application Proxy. We found multiple potential solutions to this problem, but each one has a fatal flaw. We do not need every solution to work; we only need one.

AAD Proxy Problem statement

Solution 1:

Hide an iframe in the page that authenticates to the proxy by hitting a proxy endpoint and performing the redirect dance. Because the user must first log in to the application, the iframe can reuse those credentials.

AAD Solution 1

 

Process Flow Description:

  1. The iframe makes a request to the proxy endpoint (without authentication).
  2. The proxy returns a 302 redirect.
  3. The iframe is redirected to the AAD login page. Login cookies are submitted to AAD because the application requires authentication.
  4. Login succeeds, returning a token.
  5. The iframe sends the token to the proxy.
  6. The proxy returns a cookie that is valid for the proxy.
  7. Any future calls to the proxy can use the proxy cookie and succeed.

This solution works for the majority of cases except…

Fatal flaw: During step 3, if the user has multiple logins to Azure AD, the user cannot be logged in automatically because AAD returns an HTML page to the hidden iframe asking which account to use for login.

AAD Solution 1 Multiple login

Potential fixes:

  • Enable home realm discovery (domain_hint) for the Application Proxy
    • When domain hints are enabled, step 2 returns an updated redirect URL that includes an extra parameter, ‘&domain_hint=fmi.com’. With this extra information in step 3, the AAD login page can automatically determine which user to log in as. The iframe can then log in successfully and subsequent requests succeed.
    • Blocker: this feature is not yet available for the Application Proxy.
  • Use a Smart Link

Solution 2:

Use ADAL.js to retrieve a bearer token for authentication to the Application Proxy endpoint.

AAD Solution 2

Process Flow Description:

  1. ADAL.js calls acquireToken to request a bearer token for the Application Proxy endpoint.
  2. AAD returns an authentication token.
  3. We make JavaScript calls adding the header “Authorization: Bearer [token]” so we are properly authenticated to the endpoint.

This solution works in Internet Explorer, but it fails in every other browser.

Fatal flaw: When making requests in step 3 with the Authorization header, the browser sends a CORS preflight request. The proxy does not handle the OPTIONS request properly and returns a 302.

Potential Fixes:

  • Enable CORS on the Application Proxy so that preflight requests are handled gracefully.
    • Blocker: this feature is not yet available for the Application Proxy.

 

Summary:

We communicated these shortcomings of the AAD Application Proxy to Microsoft and hope they will prioritize these features in an upcoming release. I hope you can adapt your design keeping the above solutions and their shortcomings in mind.

App Service Plan – Outbound Network Connection Limit

A few days back I ran into a problem where our production Azure web apps were throwing the error below:

[SocketException (0x271d): An attempt was made to access a socket in a way forbidden by its access permissions x.x.x.x:80]
   System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) +208
   System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception) +464

We opened a case with Microsoft, and upon investigation they told us that our App Service Plan (running Standard S1 with 2 instances) was hitting the outbound connection limit. What? How the heck would we know that? As of this writing, below were the connection limits given by Microsoft.

App Service Plan                     Connection Limit
Free F1                              250
Shared D1                            250
Basic B1 (1 instance)                1920
Basic B2 (1 instance)                3968
Basic B3 (1 instance)                8064
Standard S1 (1 instance)             1920
Standard S1 (2 instances)            1920 per instance
Standard S2 (1 instance)             3968
Standard S3 (1 instance)             8064
Premium P1 (1 instance, Preview)     1920

 

On further request, Microsoft gave us a table of the apps under the App Service Plan and their open socket connection counts. It clearly indicates that the App1 worker process (w3wp.exe) was not reusing its connection pool and was creating new connections, hitting the overall limit of the App Service Plan.

Web App Name   Process Name           Open Socket Count
App2           <WebJob>.WebJob.exe    4
App2           <WebJob>.WebJob.exe    4
App2           w3wp.exe               2
App1           <WebJob>.WebJob.exe    8
App1           <WebJob>.WebJob.exe    4
App1           <WebJob>.WebJob.exe    6
App1           w3wp.exe               2
App1           w3wp.exe               1870
App1           <WebJob>.WebJob.exe    6
App1           w3wp.exe               2
App1           <WebJob>.WebJob.exe    6
App3           w3wp.exe               4
App3           w3wp.exe               2
Total                                 1920

 

With the above data from Microsoft, you can at least tell where the problem lies and review the app accordingly.

For your own web apps, you can at least review the code that handles connections to external entities (and make sure this doesn't happen to your Azure web apps). Some of the common external dependencies in the modern cloud world are listed below; a sketch of the connection-reuse pattern follows the list:

  1. SQL - https://azure.microsoft.com/en-us/documentation/articles/sql-database-develop-dotnet-simple/
  2. Redis - https://azure.microsoft.com/en-us/documentation/articles/cache-dotnet-how-to-use-azure-redis-cache/
  3. Service Bus - https://azure.microsoft.com/en-us/documentation/articles/service-bus-performance-improvements/
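As a reference, here is a minimal sketch (not from the original incident) of the reuse pattern that avoids this kind of socket exhaustion: share one client or connection object per worker process instead of creating a new one per request. The Redis part assumes the StackExchange.Redis package and a hypothetical RedisConnectionString app setting.

using System;
using System.Net.Http;
using StackExchange.Redis; // assumes the StackExchange.Redis NuGet package

public static class SharedConnections
{
    // One HttpClient for the lifetime of the worker process (w3wp.exe or the WebJob).
    // Creating a new HttpClient per request leaves sockets behind and eventually
    // exhausts the App Service Plan's outbound connection limit.
    public static readonly HttpClient Http = new HttpClient();

    // One Redis connection multiplexer, created lazily and shared by all callers.
    private static readonly Lazy<ConnectionMultiplexer> redis =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(
                Environment.GetEnvironmentVariable("RedisConnectionString")));

    public static ConnectionMultiplexer Redis => redis.Value;
}

For SQL, ADO.NET connection pooling already reuses connections as long as you open and close SqlConnection objects around each use and keep the connection string identical.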

Thanks to the blog post http://www.freekpaans.nl/2015/08/starving-outgoing-connections-on-windows-azure-web-sites/, which explains the same problem.

However, the fact is that with no monitoring tool available that tracks open socket counts, you will never know the number of open socket connections for your App Service Plan unless you request it from Microsoft.

Troubleshooting automatic restart of Azure Web Jobs

Lately I was working on a production issue where web jobs hosted on Azure were restarting by themselves or moving into a stopped state. There were no signs of a user manually restarting them (you can check for that via the Activity Logs on your website). So why was it happening? The answer lies in the web job logs. To make it easier to understand, I have laid out most of the reasons for automatic restarts along with sample logs (you can get the logs from either the Kudu web job dashboard or directly from the storage account).

Reason 1: Due to website shutdown/restart

[10/24/2016 06:53:36 > b3e7a2: SYS INFO] WebJob is stopping due to website shutting down
[10/24/2016 06:53:36 > b3e7a2: SYS INFO] Status changed to Stopping
[10/24/2016 06:53:39 > b3e7a2: INFO] Job host stopped
[10/24/2016 06:53:41 > b3e7a2: ERR ] Thread was being aborted.
[10/24/2016 06:53:42 > b3e7a2: SYS INFO] WebJob process was aborted
[10/24/2016 06:53:42 > b3e7a2: SYS INFO] Status changed to Stopped
[10/24/2016 06:59:17 > 521cd3: SYS INFO] Status changed to Starting
[10/24/2016 06:59:20 > 521cd3: SYS INFO] Run script '<YourWebJobName>.WebJob.exe' with script host - 'WindowsScriptHost'
[10/24/2016 06:59:20 > 521cd3: SYS INFO] Status changed to Running

Reason 2: Due to changes to the files, or file contents, in the Azure web job directory (D:\home\site\wwwroot\app_data\jobs\continuous\<WebJobName>)

[10/29/2016 00:00:43 > 521cd3: SYS INFO] Detected WebJob file/s were updated, refreshing WebJob
[10/29/2016 00:00:43 > 521cd3: SYS INFO] Status changed to Stopping
[10/29/2016 00:00:47 > 8d7eea: SYS INFO] Detected WebJob file/s were updated, refreshing WebJob
[10/29/2016 00:00:47 > 8d7eea: SYS INFO] Status changed to Stopping
[10/29/2016 00:00:48 > 521cd3: INFO] Job host stopped
[10/29/2016 00:00:49 > 521cd3: ERR ] Thread was being aborted.
[10/29/2016 00:00:51 > 521cd3: SYS INFO] Status changed to Stopped
[10/29/2016 00:00:51 > 521cd3: SYS INFO] Status changed to Starting
[10/29/2016 00:00:52 > 521cd3: SYS INFO] WebJob process was aborted
[10/29/2016 00:00:52 > 521cd3: SYS INFO] Status changed to Stopped
[10/29/2016 00:00:52 > 521cd3: SYS INFO] Job directory change detected: Job file 'ApplicationInsights.config' timestamp differs between source and working directories.
[10/29/2016 00:00:51 > 8d7eea: INFO] Job host stopped
[10/29/2016 00:00:52 > 8d7eea: ERR ] Thread was being aborted.
[10/29/2016 00:00:52 > 8d7eea: SYS INFO] WebJob process was aborted
[10/29/2016 00:00:52 > 8d7eea: SYS INFO] Status changed to Stopped
[10/29/2016 00:00:53 > 8d7eea: SYS INFO] Status changed to Starting
[10/29/2016 00:00:54 > 8d7eea: SYS INFO] Job directory change detected: Job file 'ApplicationInsights.config' timestamp differs between source and working directories.
[10/29/2016 00:01:01 > 521cd3: SYS INFO] Run script '<YourWebJobName>.WebJob.exe' with script host - 'WindowsScriptHost'
[10/29/2016 00:01:01 > 521cd3: SYS INFO] Status changed to Running

Reason 3: Due to a web app and/or web job deployment

Easy to understand, since a deployment will trigger Reason 2.

Reason 4: Due to an azure outage or maintenance

Assuming you can rule out reasons 1, 3 and 4 (which are straightforward) using standard Azure web app monitoring tools (like Failure History and others), reason 2 still requires further analysis.

Basically, Reason 2 indicates that if there is a change in the web job directory, i.e. a new file or folder is added or removed, automatically or manually, the web job will restart. Likewise, if the content of a file within the directory changes, that will also trigger a web job restart.

In that case, you need to dig further into what triggered the directory and/or file content changes within the web job directory. Ask yourself the questions below:

  1. Were there any changes to the web app settings via the Azure portal?
  2. Were there any runtime SDK upgrades, like upgrading Application Insights or installing an extension?
  3. Are you creating any temporary files at runtime within the web job directory?

You might want to review the application design if question 3 is true; a sketch of a safer approach follows.
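If question 3 applies, a simple fix is to write scratch files outside the web job directory so the directory watcher never sees them. Below is a minimal sketch; the helper name is mine, not from the original issue.

using System.IO;

public static class ScratchFiles
{
    public static string WriteScratchFile(string contents)
    {
        // Path.GetTempPath() resolves to a location outside
        // D:\home\site\wwwroot\app_data\jobs\continuous\<WebJobName>,
        // so writing here does not trigger the "Job directory change detected" restart.
        var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        File.WriteAllText(path, contents);
        return path;
    }
}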

This helped me fix my production issue. I hope it does the same for you.