Azure Stack Danny McDermott

Azure Stack User Defined Routing Problem

AzS-UDR-Header.png

I'm writing this post to highlight a problem I've encountered with User Defined Routes (UDRs)/Route Tables and their implementation on Azure Stack. I won't go into detail on when to use UDRs; the official documentation does a great job of that: https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview#custom-routes

Typically you would use a UDR within an Azure Stack tenant's VNet when you want to direct internet-bound traffic via a third-party network virtual appliance (e.g. a firewall).

I'll qualify what I mean by 'Internet'. For Public Azure, Microsoft refer to any destination that doesn't have a route entry defined within the SDN as 'Internet' (typically, route entries will exist for the virtual network and its subnets, plus BGP-advertised routes such as peered VNets, ExpressRoute networks, etc.). Traffic for these unknown/external networks is forwarded to the default gateway and out via the Azure fabric onto the internet. If an Azure Stack based VNet encounters an unknown/external network, it will forward the traffic via the SDN to the upstream border switch, which in turn is connected to the corporate network. For disconnected deployments, there may not be a route out to the internet, but there will be one to the customer network.

Here are a couple of comments I have about this:

  • Microsoft appear to use the same code base for UDRs on Azure and Azure Stack. There is no differentiation between the two platforms, which can cause a problem, as will be highlighted shortly.

  • For Azure Stack, the term 'Internet' is incorrect in my opinion. Think of 'internet' within Azure Stack as any network external to the appliance.

Now that I've described that, here's a scenario and an issue I have encountered.

An Azure Stack tenant deploys a virtual network and wants to control outbound access from the VMs via a firewall appliance. They want to be able to perform remote admin for the VMs via a jump server from a secure admin workstation (SAW). They connect to the jump server via a Public IP associated with the jump server NIC. Nothing exotic is being suggested here; it is a perfectly normal deployment scenario.

In theory, a UDR is associated with the subnet that the VMs are connected to (10.0.0.128/25), for address prefix 0.0.0.0/0 with the next hop as the firewall IP address (10.0.0.4). In order not to cause an issue with routing or spoofing protection within the firewall, another route is added, with the address prefix being 172.16.100.24/32 and the next hop being 'Internet' (remember, Internet should be thought of as any network external to the Azure Stack appliance!). Because longest prefix matching rules are applied, the more specific /32 route takes precedence over the /0 route, allowing the firewall to be bypassed and the SAW to connect via the Public IP.
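To make the longest prefix matching concrete, here is a minimal Python sketch of how the two routes in this scenario would be evaluated. It only illustrates the matching rule; the `ROUTES` list and the `select_next_hop` function are names made up for this example, not anything from Azure Stack itself.

```python
import ipaddress

# Route table from the scenario above: (address prefix, next hop).
# The structure and names are illustrative only.
ROUTES = [
    ("0.0.0.0/0", "10.0.0.4"),         # default route via the firewall
    ("172.16.100.24/32", "Internet"),  # the SAW bypasses the firewall
]


def select_next_hop(destination, routes=ROUTES):
    """Return the next hop of the most specific route matching destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest prefix wins: pick the match with the largest prefix length.
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


print(select_next_hop("172.16.100.24"))  # the SAW matches the /32 route
print(select_next_hop("172.16.100.25"))  # anything else falls back to the /0 route
```

Traffic to the SAW address matches both routes, but the /32 is more specific, so only that one destination skips the firewall.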

That's the theory, but here's the problem:

The corporate network is using an address space that's in the IANA private-use IP ranges (https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-special-registry.xhtml), which is common practice for the majority of enterprise networks. The problem is that I am unable to assign a UDR using an address prefix in any of these private ranges with the next hop as 'Internet', as the validation for the route expects the prefix to be in public address space.

Exhibit A:

udr1.png
udr2.png

I have tested for the other Private ranges too and get the same outcome.
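I don't know how the validation is actually implemented, so the following Python is only a guess at the kind of check that would produce this behaviour: a route with next hop 'Internet' is rejected when its prefix falls inside one of the RFC 1918 private ranges. The function names are hypothetical.

```python
import ipaddress

# The three RFC 1918 private-use ranges.
RFC1918_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_prefix(prefix):
    """True if the prefix sits entirely inside an RFC 1918 range."""
    network = ipaddress.ip_network(prefix)
    return any(network.subnet_of(private) for private in RFC1918_RANGES)


def validate_internet_route(prefix):
    """Hypothetical validation: 'Internet' next hop requires a public prefix."""
    if is_private_prefix(prefix):
        raise ValueError(f"{prefix} is private address space; next hop 'Internet' rejected")
    return prefix


for candidate in ("172.16.100.24/32", "10.20.0.0/16", "192.168.1.0/24", "0.0.0.0/0"):
    try:
        validate_internet_route(candidate)
        print(candidate, "accepted")
    except ValueError as error:
        print(error)
```

A check like this is reasonable for Public Azure, but on Azure Stack it rules out exactly the addresses a corporate network is most likely to be using.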

Obviously for Public Azure this is not an issue and the validation will do its job, but that's the problem with sharing the code with Azure Stack: you don't know what state the customer network is in or what address space they have in use.

I've raised a support case as it will clearly be a problem for some customers, but at this time I don't have any workaround. This is the case for all Azure Stack versions at the time of writing (up to 1901).

Now, I'm guessing that removing the validation step would fix the problem. It would be nice to rename the next hop type to 'External Network', or something more appropriate, whilst they're at it :)

Azure Stack Danny McDermott

Adding Public IP Pools to Azure Stack

OMS-Wire-Data.png

Azure Stack offers the ability to add Public IP Pools should the one you provided when the installation took place not be sufficient for your needs going forward.  Typically this will be the case when an operator starts to receive alerts in the Admin portal like this:

PublicIPWarning.png

OK, so this may be an intermittent warning, happening once every so often.  If so, I suggest there's no need to take any action.  However, if you get an alert warning of 90% utilization across all pools, it's time to take action, and that is to look into adding an extra pool.

Reading the remediation steps makes it sound straightforward, and the parts it lists are, but in reality it takes a good deal of planning and configuration to implement.

The instructions listed here allude to the fact that the Azure Stack OEM is required to carry out some configuration on the Top of Rack switches.  Why?

Well, as part of the initial installation of Azure Stack, all the configuration of the switches is automated and is then locked down to prevent tampering and compromising the platform. This is achieved at the switch level by applying ACLs, controlling what traffic is allowed to ingress/egress from specific address ranges. The OEM has to add additional ACLs for the new Public IP range to ensure the integrity of the configuration and that your appliance acts as you would expect; e.g. external traffic trying to access services that have a Public IP address in the new pool is allowed, not dropped at the switch.

Something else to be considered is whether your network service provider uses static routing, rather than BGP to advertise routing changes.  If they use static routing, then they must add in the specific routes to forward traffic to the Top of Rack Switch transit networks.  They will have had to do some similar configuration when Azure Stack was deployed, so they should already have the pertinent details.

Here are my more comprehensive steps that need to be carried out:

1. Acquire another block of IP addresses from your network services provider.  They need to make sure that they will be routable and do not overlap with existing addresses within the WAN.

2. Contact the Azure Stack OEM and arrange with them to configure the Top Of Rack Switches to add the new Public IP address range(s).

3. (Optional) If your network service provider uses static routing, rather than BGP to advertise routing changes, they must add in the specific routes to forward traffic to the Top of Rack Switch transit networks.

4. An Azure Stack Operator should sign into the admin portal

PIP_AdminPortal.png

5. Open the Network Resource Provider blade and select Public IP pool usage

PIP_PIP_PoolUsage.png

6. Click Add IP Pool and add the new Public Address range in CIDR format

PIP_AddIPPool.png

7. Make sure the details look correct and click OK to apply *.

PIP_AddIPPoolconfig.png

* A word of warning: make sure you enter the details correctly, as adding a new address pool via the portal is not reversible! If you do make a mistake, a call to Microsoft Support would be needed.

In the future, this process might be automated, but my advice is that at the planning stage, you supply a /22 address range (1022 IP addresses) to save yourself (and your tenants) the hassle later.
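As a quick sanity check when planning a pool, the following Python sketch shows where the 1022 figure for a /22 comes from and how to confirm that a new range does not overlap an existing one (step 1 above). The address ranges are hypothetical examples taken from the 198.18.0.0/15 benchmarking block, not real pools.

```python
import ipaddress

# Hypothetical ranges for illustration only.
existing_pool = ipaddress.ip_network("198.18.0.0/22")
new_pool = ipaddress.ip_network("198.18.4.0/22")


def pool_summary(network):
    """Return (total addresses, addresses excluding network and broadcast)."""
    total = network.num_addresses
    return total, total - 2


def overlaps_any(candidate, existing):
    """True if the candidate range overlaps any range already in use."""
    return any(candidate.overlaps(network) for network in existing)


total, usable = pool_summary(new_pool)
print(f"{new_pool}: {total} addresses, {usable} excluding network/broadcast")
print("Overlaps existing pool:", overlaps_any(new_pool, [existing_pool]))
```

A /22 contains 1024 addresses; subtracting the network and broadcast addresses gives the 1022 quoted above.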

AD FS, Azure Stack Danny McDermott

Azure Stack update 1811 - my favorite feature in this release

azure-stack-operator1.jpg

Microsoft have just released Azure Stack Update 1.1811.0.101, and having read the release notes on the new capabilities, it is one I am looking forward to implementing. For some, the headline feature is the introduction of Extension Host, which simplifies access to the portals and management endpoints over SSL (it acts as a reverse proxy). This has been known about for some months, as Microsoft have been warning operators of the additional certificate requirements so that they can be ready for it: https://docs.microsoft.com/en-us/azure/azure-stack/azure-stack-extension-host-prepare. This is good, as it means fewer firewall rules are required, and I'm all for simplification, but it's not the most exciting introduction for me - that's the support for service principals using client secrets.

Why's that?

I've been working with Azure Stack with AD FS as the identity provider for many months, and previously the only way to provision Service Principals (for use by automation or applications) was to use X509 certs for authentication. Setting up the certs is pretty cumbersome: they have to be generated and imported to the systems that you want to run the automation on, the thumbprint has to be grabbed, and PEM files with the private key have to be generated for use with Azure CLI. For me, that leaves too many areas where stuff might not work (e.g. the certificate may not be present in the local computer store where the automation is running, so an error is thrown).

Using X509 certs to authenticate worked for some scenarios, but not for others. For instance, a number of third-party solutions/tools (and first-party!) couldn't be used, as they were written to be compatible with Azure AD Service Principals (which primarily use secrets). One example is the Terraform provider; prior to this update, it could only be used for Azure AD implementations, but in theory it's now open to AD FS as well. This release also opens up the possibility of deploying the Kubernetes ARM template that is currently in preview. The template requires a Service Principal ClientID and Client Secret, so deployment to disconnected systems was previously blocked.

I haven't had the chance to apply the update yet, but I will do it ASAP and look forward to testing whether client secrets for AD FS work as I expect.
Azure Stack Danny McDermott

Adding Additional Nodes to Azure Stack

lego.jpg

Last week, I had the opportunity to add some extra capacity to a four-node appliance that I look after. Luckily, I got to double the capacity, making it an eight-node scale unit. This post documents my experience and fills in the gaps in the official documentation 😊 From a high-level perspective, these are the activities that take place:

  • Additional nodes/servers are racked and stacked in the same rack as the existing scale unit.

  • Each node's BMC interface is configured with an IP address and gateway within the management network, along with the correct username/password (the same as the existing nodes in the cluster).

  • The Azure Stack OEM configures the BMC and Top of Rack switch configurations to enable the additional switch ports for the additional nodes.

  • Azure Stack Operator adds each additional node to the Scale Unit (one at a time)

    • Compute resource becomes available first

    • S2D rebalances the cluster; once that has completed, the additional storage is made available

Easy, huh?

Here’s a bit more detail on each of the steps:

Each of the additional nodes being added to the scale unit must be identical to the existing servers. This includes CPU, memory, storage capacity and hardware versions. The hardware must be installed as prescribed by the OEM and connected to the correct ports on the BMC and TOR switches. It is unclear whether this is the responsibility of the OEM or the operator, so check beforehand when you purchase your additional nodes.

For Azure Stack to be able to add the additional nodes into the scale unit, each node's BMC interface must be configured with the correct IP address, subnet and gateway, as well as a username and password that match the existing nodes in the scale unit. This is critical, as any misconfiguration will stop the node being added. As an example, assume that the management network is set to 10.0.0.0/27 and we have 4 existing nodes in our scale unit. 10.0.0.1 would be our gateway address and 10.0.0.3 – 10.0.0.6 would be the IP addresses of nodes 1 – 4, so for our first additional node we would use 10.0.0.7, incrementing from there up to a maximum of 16 nodes (10.0.0.18).
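The addressing in that example can be worked out with a few lines of Python. This is only a sketch of the arithmetic in the example above; the function name, the starting offset and the 16-node limit are taken from or assumed for this example, and your OEM's actual addressing plan may differ.

```python
import ipaddress


def bmc_address(node_number, management_network="10.0.0.0/27",
                first_node_offset=3, max_nodes=16):
    """Return the BMC IP for a node, assuming node 1 sits at network + 3."""
    if not 1 <= node_number <= max_nodes:
        raise ValueError(f"node number must be between 1 and {max_nodes}")
    network = ipaddress.ip_network(management_network)
    address = network.network_address + first_node_offset + (node_number - 1)
    # Make sure the result is a usable address inside the management network.
    if address not in network or address == network.broadcast_address:
        raise ValueError(f"{address} is not usable within {network}")
    return str(address)


# Existing nodes 1 to 4, then the four nodes being added.
for node in range(1, 9):
    print(f"Node {node}: {bmc_address(node)}")
```

With these assumptions, node 5 (the first additional node) gets 10.0.0.7 and node 16 gets 10.0.0.18, matching the figures above.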

The network switches must be configured by the OEM. There is currently no provision for the additional configuration to be carried out via automation, and opening the network switches to an operator would break the principle of Azure Stack being a black-box appliance. The switches need reconfiguring to enable the additional ports on the BMC and the two Top of Rack switches. Unused ports are purposely not enabled, to keep the configuration as secure as possible.

Before attempting to add the additional nodes, check in the Administrator Portal whether any existing FRU operations are taking place (e.g. a rebuild of an existing node due to a hardware issue).

OK, so once all of the above has been carried out, the Azure Stack operator can start to add the additional nodes. From the Administrator portal:

Select Dashboard -> Region Management

Select Scale Units to open the blade:

We currently have a nice and healthy four-node cluster 😊. Select Add Node to open the configuration blade.

With the current release, as there is only one region and scale unit, there is only one option that we can select for the first two drop downs.

Enter the BMC IP address of the additional node and select OK.

My first attempt didn't work as the BMC was incorrectly configured (the gateway address for the BMC adapter was not set).

This is the error you will see if there is a problem:

Correcting the gateway address solved the problem. Here is what you’ll see when the scale unit is expanding:

After a few minutes, you will see the additional node listed as a member of the s-cluster scale unit, albeit listed as Stopped.

Clicking on the new node from the blade will show its status as ‘Adding’.

If you prefer, you can check the status via PowerShell. From a system that has the Azure Stack PowerShell modules installed, connect to the Admin Endpoint environment and run:

 
# Retrieve the status of the scale unit
Get-AzsScaleUnit | Select-Object Name, State

# Retrieve the status of each scale unit node
Get-AzsScaleUnitNode | Select-Object Name, ScaleUnitNodeStatus, PowerState

Whilst the expansion was taking place, a critical alert fired. It's safe to ignore this.

A successfully completed node addition will show the power status as Running, plus the additional cores and memory available to the cluster:

It takes a little shy of 3 hours to complete the addition of a single cluster node:

Note: you can only add one extra node at a time. If you try to add more than one, an error will be thrown, as below:

I found that the first node added without a hitch, but subsequent nodes had some issues. I got error messages stating ‘Device not found’ on a couple of occasions. In hindsight, I guess that in the background the cluster was performing some S2D operations, which caused some clashes for the newly added node. To fix this, I had to perform a ‘Repair’ on the new node. This invariably fixed the problem on the first attempt. If there were more information on what is actually happening under the hood, I could give a more qualified answer.

Eventually, all nodes were added 😊

Adding those additional nodes does not add additional storage to the Scale Unit until the S2D cluster has rebalanced. The only way you know that a rebalance is taking place is that the status of the scale unit shows as ‘expanding’, and will do for a long time after adding the additional node(s)!

Here’s what the Infrastructure File Shares blade looks like whilst expansion is taking place:

Once expansion has completed, then the additional infrastructure file shares are created:

Unfortunately, there is no way to check the progress of the rebalance operation either in the portal or via PowerShell. The Privileged Endpoint does include the Get-StorageJob cmdlet, but this is useless unless the support session is unlocked. If it is unlocked, the following script could be used to check:


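# Name of the storage cluster (the scale unit)
$ClusterName = "s-cluster"

# Fetch the storage jobs currently running on the cluster
$jobs = Get-StorageSubSystem -CimSession $ClusterName -FriendlyName Clus* | Get-StorageJob -CimSession $ClusterName

# Report progress every 10 seconds until no storage jobs remain
while ($jobs) {
    $count          = ($jobs | Measure-Object).Count
    $BytesTotal     = ($jobs | Measure-Object BytesTotal -Sum).Sum
    $BytesProcessed = ($jobs | Measure-Object BytesProcessed -Sum).Sum

    # Overall percentage across all jobs (avoids dividing by zero)
    $percent = if ($BytesTotal) { [math]::Round(100 * $BytesProcessed / $BytesTotal, 1) } else { 0 }
    Write-Output ("{0} Storage Job(s) Running. GBytes Processed: {1:N2} GBytes Total: {2:N2} Percent: {3}%" -f $count, ($BytesProcessed / 1GB), ($BytesTotal / 1GB), $percent)

    Start-Sleep -Seconds 10

    # Refresh the job list for the next pass
    $jobs = Get-StorageSubSystem -CimSession $ClusterName -FriendlyName Clus* | Get-StorageJob -CimSession $ClusterName
}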

AD FS, Azure Stack Danny McDermott

Azure Stack portal bug

portalgraphic.png

As I’m mainly working with Azure Stack deployments that use AD FS as the identity provider, I’m coming across some differences and bugs compared to where Azure AD is used. One such bug is the following:

A user is a member of a global AD group that is assigned the Contributor role on a Tenant Subscription. They aren’t added directly as a user to the subscription.

When that user connects to the portal, they will be presented with the following if they click on the subscription:

If they try and create a resource within the subscription, they get the following:

By connecting as this same user via PowerShell or Azure CLI, they can create a resource group and resources and do everything expected of a Contributor.

I logged a support case with Microsoft and they have confirmed this is a bug in the portal and that it will be fixed in an imminent release (potentially 1811).

In the meantime, the workaround is to assign users directly to the role rather than via a global group, or to use the API / PowerShell / Azure CLI to manage resources.
