Saving money in the cloud?


One of the cloud’s big selling points is the promise of lower costs, but more often than not customers who move servers to the cloud end up paying more for the same workload.  Have we all been duped?  Is the promise a lie? Over the past several years the ACE team (the group of experts behind the AzureFieldNotes blog) has helped a number of customers on their Azure journey, many of whom were motivated by the economic benefits of moving to the cloud.  Few take the time to truly understand the business value as it applies to their unique technology estate and develop plans to achieve and measure the benefits.  Most simply assume that running workloads in the cloud will result in lower costs - the more they move, the more they will save.  As a result, management establishes a "Cloud First" initiative and IT scrambles to find workloads that are low risk, low complexity candidates.  Inevitably, these end up being existing virtual machines or physical servers which can be easily migrated to Azure.  And here is where the problems begin.

When customers view Azure as simply another datacenter (which just happens to be in the cloud) they apply their existing datacenter thinking to Azure workloads and they negate any cost benefit.  To realize the savings from cloud computing customers need to shift into consumption-based models and this goes far beyond simply migrating virtual machines to Azure.  When server instances are deployed just like those in the old datacenter and left running 24x7, the same workload will most likely end up costing more in Azure.  In addition, if instances aren't decommissioned when no longer needed it leads to sprawl, environment complexity, and costs that quickly get out of control.
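
To make the 24x7 point concrete, here is a rough sketch of the arithmetic. The hourly rate and the usage schedule are made-up illustration values (not Azure pricing), and `Get-MonthlyComputeCost` is a hypothetical helper, so treat the numbers as a shape rather than a quote.

[powershell]
# Hypothetical helper: monthly cost of one instance billed per running hour.
function Get-MonthlyComputeCost {
    param(
        [Parameter(Mandatory)][double]$HourlyRate,   # assumed price per hour, NOT real Azure pricing
        [Parameter(Mandatory)][double]$HoursPerDay,
        [Parameter(Mandatory)][int]$DaysPerMonth
    )
    return $HourlyRate * $HoursPerDay * $DaysPerMonth
}

$rate = 0.50   # illustration value only

# Datacenter thinking: the instance is left running around the clock.
$alwaysOn = Get-MonthlyComputeCost -HourlyRate $rate -HoursPerDay 24 -DaysPerMonth 30

# Consumption thinking: the instance runs 10 hours a day on 22 working days.
$onDemand = Get-MonthlyComputeCost -HourlyRate $rate -HoursPerDay 10 -DaysPerMonth 22

"Always on : {0:N2}" -f $alwaysOn
"On demand : {0:N2}" -f $onDemand
"Cost avoided: {0:P0}" -f (1 - ($onDemand / $alwaysOn))
[/powershell]

With these assumed values the on-demand schedule runs 220 hours instead of 720, which is where the saving comes from. The same instance left running all month saves nothing.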

Taking it a step further, customers must also consider which services should continue to be built and maintained in-house, and which should simply be consumed as a service.  These decisions will shape the technical cloud foundations for the enterprise.  Unfortunately, many of these decisions are made based on early applications deployed to Azure.  We call this the "first mover" issue.  Decisions made to support the first app in the cloud may not be the right decisions for subsequent apps or for the enterprise as a whole, leading to redundant and perhaps incompatible architecture, poor performance, higher complexity, and ultimately higher cost.  Take identity as an example:  existing identity solutions deployed in-house are often sacred cows because of the historical investment and specialized skills required to maintain the platform.  Previously, these investments were necessary because the only way to deliver this function was to build your own.  But (with limited exception) identity doesn't differentiate your core business and customers don't pay more or buy more product because of your beloved identity solution.  With the introduction of cloud-based identity, such as Azure Active Directory, companies can now choose to consume identity as a service, eliminate the complexity and specialized skills required to support in-house solutions, and focus talent and resources on higher value services which can truly differentiate the business.

Breaking it down, there are a handful of critical elements that must be addressed for any customer to realize value in the cloud:

  • Business Case:  understand what is valuable to your business, how you measure those things, and how you will achieve the value.  The answers to these questions will be different for every customer, but the need to answer them is universal.  Assuming the cloud will bring value - whether you view value as speed to market, cost reduction, evergreen, simplification, etc. - without understanding how you achieve and measure that goal is a recipe for failure.
  • Cloud Foundations:  infrastructure components that will be shared across all services need to be designed for the Enterprise, and not driven by the first mover.  It's not unusual for Azure environments to quickly evolve from early Proof of Concept deployments to running production workloads, but the foundations (such as subscription model, network, storage, compute, backup, security, identity, etc.) were never designed for production - you need to spend the time early to get these right or your ability to realize results from Azure will be negatively impacted.
  • Ruthless automation:  standardization and automation underpin virtually every element of the cloud's value proposition and you must embrace them to realize maximum benefit from the cloud.  This goes beyond systems admins having scripts to automate build processes (although that is a start).  It means build and configuration become part of the software development practice, including version control, testing, and design patterns.  In other words, you write code to provision and manage cloud resources and the underlying infrastructure is treated just like software:  infrastructure as code.
  • Operating Model: workloads running in the cloud are different from those in your datacenter and supporting these instances will require changes to the traditional operating model.  As you move higher into the as-a-Service stack (IaaS -> PaaS -> SaaS -> BPaaS etc.) the management layer shifts more and more to the cloud provider.  Introduce DevOps in the equation and the impact to traditional operating models is even greater.  When there is an issue, how is the root cause determined when you don't have a single party responsible for the full stack?  Who is responsible for resolution of service and how will hand-offs work between the cloud provider and your in-house support teams?  What tools are involved, what skills are required, and how is information tracked and communicated?  In the end, much of the savings from cloud can come from transformation within the operating model.
  • Governance and Controls:  If you thought keeping a handle on systems running in your datacenter was a challenge, the cloud can make it exponentially worse.  Self-service and near instantaneous access to resources is the perfect storm for introducing server sprawl without proper governance and controls.  In addition, since cloud resources aren't sitting within the datacenter where IT has full control of the entire stack, how can you be sure data is secure, systems are protected, and the company is not exposed to regulatory or legal risk?
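
As a small illustration of the "infrastructure as code" and governance points above, the sketch below treats resource definitions as plain data that can live in version control and be tested before anything is deployed. The definitions, the `Owner` and `ExpiresOn` fields, and the `Test-ResourceDefinition` function are all hypothetical names chosen for this example. It does not call any Azure API.

[powershell]
# Hypothetical resource definitions; in practice these would be files in version control.
$definitions = @(
    @{ Name = 'web01';  Owner = 'team-a'; ExpiresOn = [datetime]'2030-01-31' },
    @{ Name = 'test07'; Owner = '';       ExpiresOn = $null }
)

# Hypothetical check: every resource needs an owner and an expiry date,
# so that nothing is left running with nobody accountable for it.
function Test-ResourceDefinition {
    param([Parameter(Mandatory)][hashtable]$Definition)

    $problems = @()
    if ([string]::IsNullOrWhiteSpace($Definition.Owner)) { $problems += 'missing owner' }
    if ($null -eq $Definition.ExpiresOn)                 { $problems += 'missing expiry date' }

    [pscustomobject]@{
        Name     = $Definition.Name
        Valid    = ($problems.Count -eq 0)
        Problems = $problems
    }
}

# Validate every definition and show the result.
$report = $definitions | ForEach-Object { Test-ResourceDefinition -Definition $_ }
$report | Format-Table -AutoSize
[/powershell]

A check like this could run as part of a build, so a definition without an owner or an expiry date is rejected before it ever becomes a running (and billed) resource.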

In future posts I'll cover each one of these in more detail to help frame how you can maximize the value of Azure (and how Azure Stack can play an important role) in your cloud journey.




Azure Stack TP1 POC Stable Install notes


I thought about writing yet another detailed step-by-step guide for installing TP1, but figured there are enough of those out there. If you need one, you can google for it. At Azure Field Notes, we're about sharing things we’ve learned through our experience in the field, so I decided to hit the high points based on the notes from one of our most stable POC installs to date. Now, this is a fully supported POC, meaning it's running on the supported hardware and we’re not modifying any of the install scripts here. We will post other articles soon that cover some of those tricks and tweaks. With that, let's get going:



  • Dell PowerEdge R630
  • Dual Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 12 cores each (24 cores total)
  • 384GB DDR4 Registered ECC RAM
  • Network card: Emulex OCm14104-U1-D 10Gb rNDC (supported for the POC install, not supported for Storage Spaces Direct)
  • Physical disks: PERC H730 with 1GB cache and six 300GB SAS disks, configured as 2 disks in a RAID 1 mirror for the OS and the other 4 disks as pass-through

(Note this box is supported for the POC, but likely won’t match any of the production supported hardware, so don’t go buy a fleet of these expecting to run this stuff in prod.) Production supported hardware will only be sold as a preconfigured, pre-integrated system per this blog post from Mike Neil.


Next, the process:

  • Perform a complete firmware update of all components
  • Configure the array with a single RAID 1 mirror and 4 non-RAID disks set to pass-through mode
  • Set the BIOS to boot in UEFI mode
  • Ensure the time zone and time are set correctly in the BIOS
  • Load a servicing OS with boot-from-VHD support (2016 TP5 works well). Ensure you install to the mirror
  • Set up a static IP on the NIC
  • Copy the stack bits locally
  • Copy any drivers needed for the machine locally
  • Copy the TP4 VHD from the Stack bits to the root
  • Use BCDEdit to change the boot VHD to the TP4 Stack VHD
  • Boot the machine, copy any drivers from the local drive to win\inf and reboot
  • Enable RDP and set the NIC IP
  • Reboot and install Chrome or some other non-Edge browser and ensure it is the default (Edge doesn't like the built-in admin account by default. You can obviously choose to change its settings, or use IE, but Chrome comes in handy).
  • Install update KB3124262 if needed
  • Disable all except the primary NIC port and ensure you have no more than 4 raw disks in Disk Management (should match the 4 SAS disks that are passed through)
  • Disable Windows Update
  • Disable Defender
  • Double-check that the system time zone and time are correct. (If you later get an AAD auth error about verifying the message, your time zone is off; check it again.)
  • Install Stack. Use an AAD account, set the NAT VM static IP to an unused address on the same subnet as your host, set the static gateway to your gateway, and set the admin password, etc.


[powershell]
# Placeholder credentials: replace with your own values
$secpasswd = ConvertTo-SecureString "urpassword" -AsPlainText -Force
$adminpwd = ConvertTo-SecureString "uradminpassword" -AsPlainText -Force
$mycreds = New-Object System.Management.Automation.PSCredential ("uraadaccount@yourtenant.com", $secpasswd)
# Supply your own values after -NATVMStaticIP and -NATVMStaticGateway (none are shown here)
.\DeployAzureStack.ps1 -Verbose -NATVMStaticIP -NATVMStaticGateway -adminpassword $adminpwd -AADCredential $mycreds
[/powershell]

  • Wait a bit and check for errors. I had none after I did all of the above. If you do run into errors, I highly recommend blowing away the TP4 VHD and starting over. Rerunning the installer appears to work, but we’ve had instability later with random failures when it doesn't complete in one pass.
  • Log in to the client VM, install Chrome, and set it as the default browser (see above)
  • Go through and disable Windows Update and Defender on all Stack VMs (Look for a post on how to do this in an unsupported way as part of the install; some boxes aren't domain joined, so a simple GPO won't do it. Update: Matt's post is live.)
  • Turn off the TIP tests (from the client VM) once you think things are working properly. (If it's around 12am the TIP tests will be running; wait until they are complete and have cleaned up everything.)

[powershell] Disable-ScheduledTask -TaskName AzureStackSystemvalidationTask [/powershell]
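
A couple of the pre-install checks from the list above (NAT VM IP on the host's subnet, time zone set correctly) are easy to script before running the installer. The following is only a sketch: `Test-SameSubnet` is a hypothetical helper, and the addresses and prefix length are example values you would replace with your own. It assumes IPv4.

[powershell]
# Hypothetical helper: returns $true when two IPv4 addresses share the first $PrefixLength bits.
function Test-SameSubnet {
    param(
        [Parameter(Mandatory)][string]$HostIP,
        [Parameter(Mandatory)][string]$CandidateIP,
        [Parameter(Mandatory)][ValidateRange(0, 32)][int]$PrefixLength
    )
    $hostBytes = ([System.Net.IPAddress]::Parse($HostIP)).GetAddressBytes()
    $candBytes = ([System.Net.IPAddress]::Parse($CandidateIP)).GetAddressBytes()

    # Walk the address one byte at a time, comparing only the bits covered by the prefix.
    $bitsLeft = $PrefixLength
    for ($i = 0; $i -lt 4; $i++) {
        if ($bitsLeft -le 0) { break }
        $take = [Math]::Min(8, $bitsLeft)
        $mask = (0xFF -shl (8 - $take)) -band 0xFF
        if (($hostBytes[$i] -band $mask) -ne ($candBytes[$i] -band $mask)) { return $false }
        $bitsLeft -= $take
    }
    return $true
}

# Example values only: replace with your host IP, the NAT VM IP you plan to use, and your prefix length.
$hostIP  = '192.168.10.20'
$natVmIP = '192.168.10.50'
$prefix  = 24

if (-not (Test-SameSubnet -HostIP $hostIP -CandidateIP $natVmIP -PrefixLength $prefix)) {
    Write-Warning "NAT VM IP $natVmIP is not on the same subnet as the host ($hostIP/$prefix)."
}

# Time zone and clock problems show up later as AAD auth errors, so print them for a quick look.
"Time zone: {0}  Local time: {1}" -f ([System.TimeZoneInfo]::Local.Id), (Get-Date)
[/powershell]

On the host itself, `Test-Connection -ComputerName $natVmIP -Count 1 -Quiet` gives a hint whether the address is already taken (a reply means it is in use; no reply is not proof that it is free), and `Get-Disk | Where-Object PartitionStyle -eq 'RAW'` lists the raw disks so you can confirm there are no more than 4.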

Probably not required, but seems to speed up demos and such:

  • Shutdown the environment
  • Raise the memory on all VMs to a minimum of 16GB and the CPU to 100%, and raise the core count to 4, 8, or 12 depending on the original setting.
  • Start it back up, wait about 10 minutes and then run the validation script to make sure all the services are online properly.


Some additional notes:

  • The MUXVM and BGPVM seem to hang occasionally when connected via the Hyper-V console. This appears to be a Hyper-V issue on TP4. During these hangs they also seem to stop responding to the network.
  • Sometimes rebooting a VM causes the host to bluescreen and reboot (also appears to be a TP4 issue)
  • Once you're up and running, screens in Stack that show a user picker seem to sit and spin for a while, especially on group pickers. Other things are simply not done (new buttons on some of the resource RPs), or need some time to JIT the first time they’re accessed (Quota menus when creating a plan, for example).
  • If doing scripted testing, it seems creating and tearing down an empty resource group through PowerShell is pretty repeatable. I’d start with deploying a template containing nothing but a resource group if trying to test toolchains. Next least impactful seems to be storage accounts. Compute/Network seems to be the heaviest, and also the least consistent about whether it will succeed on any given attempt.
  • You’ll notice we’re not installing any additional Resource Providers. We’ve had significant stability problems with the current builds and so are waiting until new builds are available before loading them in anything but our most bleeding edge environments.
  • Finally, remember folks, it's TP1, and according to Mike Neil’s blog post, we’re a year away.
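
For the scripted testing mentioned in the notes above, a small harness that repeats an action and records each outcome makes it easier to see how repeatable something really is. This is a sketch only: `Invoke-RepeatedTest` is a hypothetical helper, and the example run uses a harmless placeholder action with a simulated failure rather than a real deployment.

[powershell]
# Hypothetical helper: run a script block several times and record how each attempt went.
function Invoke-RepeatedTest {
    param(
        [Parameter(Mandatory)][scriptblock]$Action,
        [int]$Count = 5
    )
    foreach ($i in 1..$Count) {
        $started = Get-Date
        try {
            # The attempt number is passed to the action as its first argument.
            & $Action $i | Out-Null
            $ok = $true
            $message = ''
        } catch {
            # Only terminating errors land here; use -ErrorAction Stop on cmdlets inside the action.
            $ok = $false
            $message = $_.Exception.Message
        }
        [pscustomobject]@{
            Attempt = $i
            Success = $ok
            Seconds = [math]::Round(((Get-Date) - $started).TotalSeconds, 1)
            Error   = $message
        }
    }
}

# Placeholder action: attempt 2 fails on purpose to show what a failure looks like.
# For the empty resource group test, the action might instead look something like this
# (assumes an already authenticated AzureRM session; 'local' as the location is an assumption):
#   { param($n)
#     New-AzureRmResourceGroup -Name "rgtest$n" -Location 'local' -ErrorAction Stop
#     Remove-AzureRmResourceGroup -Name "rgtest$n" -Force -ErrorAction Stop }
$results = Invoke-RepeatedTest -Count 3 -Action {
    param($n)
    if ($n -eq 2) { throw "simulated failure" }
}
$results | Format-Table -AutoSize
[/powershell]

Running the same harness against storage accounts and then compute/network deployments gives a rough success rate per resource type, which is a more useful number than a single pass or fail.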