Windows as a Service in the Enterprise Overview Part 1

[Windows as a Service Table of Contents – this link contains a list of blogs covering the solution.] Coming Soon!

Windows 10 brings several new challenges to the Enterprise. One of the major ones is deploying Windows 10 and then keeping up to date with the Feature Update releases (e.g. 1709, 1803). Although Microsoft has done a great job of making things easier (like a non-destructive in-place upgrade), there are still several technical hurdles to overcome. Businesses just want things to work with the least possible impact, and this is where things start to get tricky. There is a much larger payload that needs to be moved around the network, competing with the traffic that the business uses (as opposed to systems management traffic). There is also the time it takes to perform the upgrade. Several factors determine how long this takes, but the goal should be to not interrupt business operations.

If you are reading this and thinking that you are alone, you are not – we all (Enterprise IT) are facing the same challenges. At my place of business, we knew that we had to do something in order to survive the cadence and volume of Windows 10 upgrades that we need to do. Luckily, my colleagues and I have been doing OS deployments for a very long time. We put our heads together and came up with a process that we call Windows as a Service (WaaS) in the Enterprise, and recently presented it in two sessions (part 1 and part 2), both repeated, at MMSMOA. We want to make this available to the community so that people can implement parts of it in their environment or just spawn new ideas on how to make Windows 10 deployments easier. Our goal was simple – Minimize Risk, Maximize Velocity.

In Part 1 of the session we covered the ‘What’ and ‘Why’. We needed to explain ‘why’ we had to do something and then get into ‘what’ we are doing to solve the problem. Most of this came out of our first feature update, 1511 to 1607, which exposed shortcomings not only in the tools but also in our processes for a relatively new concept – the OS in-place upgrade. Our areas of improvement were the following:

  • Be proactive rather than reactive
  • Better leverage existing technology and tools
  • Give the user better control
  • Don’t just report it, remediate (fix) it
  • Positive hand off between the teams involved

The Windows 10 In-place Upgrades are handled using a Configuration Manager Task Sequence. We realized that once the Task Sequence begins, the user gives up control for the duration of the process. There are things that we do to make this duration as short as possible; however, if there are any errors in the Task Sequence, it fails and we have to start all over again. This experience was extremely frustrating for end users. The lesson learned was to do as much as possible before the end user even knows that anything is happening.

Applications continue to be the major hurdle for companies keeping up with Windows 10. Software vendors are not on board with the cadence of Windows 10, and an in-place upgrade of the OS is a foreign concept to most of them. They think that if their app installs fine net-new on a new release of Windows 10 then all is good. I doubt that many of them actually test their applications after the in-place upgrade of the OS. Needless to say, we have run into a few that cease to work after the upgrade. Third party security and disk encryption products are also a major pain point. Not only does it take them months after a Windows 10 release to be ready, but most of these products are deeply rooted in the OS and really mess with the in-place upgrade process. My recommendation is to drop them like a hot potato – it will save you time, money and frustration and you are likely to be more secure without them.

Disk space and cache management were a big problem. That great idea to buy systems with 120 GB SSDs isn’t turning out to be such a great idea today, especially since Windows 10 x64 wants 20 GB of free disk space for the upgrade. Cowboy management of disk space also made matters worse – randomly deleting the cache without regard for proper cache cleaning methods (like using the COM object) leaves things in a mess.

We talked about leveraging existing technology and tools. Windows 10 setup has a special switch that allows you to perform a compatibility scan on a system before attempting the upgrade. The disadvantage of having to download the Windows 10 OS Upgrade package is actually an advantage – it allows us to pre-cache the content that we are going to need for the upgrade ahead of time. A double bonus is that it can be run completely silently without the user ever knowing. This lets you know with much greater certainty that the upgrade is going to complete without failures. If there are failures or blockers, you will find out about them before ever disrupting the end user.
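To make the compatibility scan idea a bit more concrete, here is a minimal sketch of how the result of a scan might be interpreted. The switch in question is `/Compat ScanOnly`, and the return codes below are the ones Microsoft documents for Windows Setup. The function name and the Pass/Fail labels are my own assumptions for illustration, not something taken from our production Task Sequence.

```python
from typing import Tuple

# Return codes documented by Microsoft for Windows Setup compatibility scans.
COMPAT_SCAN_CODES = {
    0xC1900210: ("Pass", "No compatibility issues found"),
    0xC1900208: ("Fail", "Actionable compatibility issue, such as an incompatible app"),
    0xC1900204: ("Fail", "Selected migration choice is not available"),
    0xC1900200: ("Fail", "Computer is not eligible for Windows 10"),
    0xC190020E: ("Fail", "Not enough free disk space"),
}


def interpret_compat_scan(return_code: int) -> Tuple[str, str]:
    """Map a Setup return code to a (status, reason) pair (hypothetical helper)."""
    # Some launchers report the code as a signed 32-bit integer,
    # so normalize to the unsigned form before the lookup.
    code = return_code & 0xFFFFFFFF
    default = ("Fail", "Unrecognized return code 0x%08X" % code)
    return COMPAT_SCAN_CODES.get(code, default)
```

Note that a "successful" scan is not a zero return code: `interpret_compat_scan(0xC1900210)` is the passing case, and the same value often shows up in logs in its signed decimal form, -1047526896.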

End user experience – this is something that we wanted to do better. During the previous in-place upgrade, a deferral method was used. The problem with this approach is that it can be misleading based on the execution time. A deferral to one person might mean a 24-hour deferral; however, depending on the execution schedule, it could mean something very different. I personally do not like this approach and wanted to use something more native to Configuration Manager – give the users the ability to opt in a week or two prior to their deadline.

Improve – don’t just report it, fix it! If something fails for a specific reason, then we want to fix it automatically and have the process continue once things are remediated. This also allows us to route specific issues to the responsible teams in a more automated fashion. We wanted to develop a clear process with defined entry and exit points, also one that was going to be repeatable since we would be doing this frequently.

Lastly, some of our other requirements involved gathering better metrics. We wanted to know what our first time success rates were going to be, as we didn’t have any of this information before. Other things of interest were runtimes – how long is this taking? Are the end users opting into the process or waiting for the deadline? Are they on the corporate network or VPN? We want to collect all this data so that we can use it to help define success to our management, to possibly diagnose problems, and to further improve our processes.
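As a rough illustration of the kind of numbers we were after, the sketch below computes a first time success rate, opt-in rate, VPN rate and average runtime from a list of per-system records. The record layout and field names are hypothetical and only for illustration; they are not the schema we actually report from.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UpgradeRecord:
    # Hypothetical per-system record; field names are assumptions.
    attempts: int           # how many times the upgrade ran on this system
    succeeded: bool         # final outcome
    opted_in: bool          # True if the user started it before the deadline
    on_vpn: bool            # True if the system was on VPN when it ran
    runtime_minutes: float  # runtime of the final attempt


def summarize(records: List[UpgradeRecord]) -> Dict[str, float]:
    """Return simple rates and averages across all records."""
    total = len(records)
    if total == 0:
        return {}
    # "First time success" here means it succeeded and only needed one attempt.
    first_time = sum(1 for r in records if r.succeeded and r.attempts == 1)
    return {
        "first_time_success_rate": first_time / total,
        "opt_in_rate": sum(r.opted_in for r in records) / total,
        "vpn_rate": sum(r.on_vpn for r in records) / total,
        "average_runtime_minutes": sum(r.runtime_minutes for r in records) / total,
    }
```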

What we came up with was a gated process to maximize success and improve the end user experience. We only wanted to start the process (the part that the end user is aware of) if everything was ready to go, content was cached and we were 99% sure the upgrade was going to be successful. We also wanted to minimize the in-place upgrade time for the end users that would be experiencing it on their own time. Systems need to be patched and ready to go after the upgrade; we didn’t want them to be upgraded and then have to sit through another 30-60 minute cumulative update.

Servicing vs Task Sequences. Task Sequences make this all possible in complex environments, and provide a depth in reporting not obtainable using servicing.
Newer functionality in recent CM releases has made things easier: Run TS from TS (nested Task Sequences), Persist in Cache / Preserve in Cache Variables, and Pre-Deploy Content / Download Package Content.
All of this and the WaaS in the Enterprise concept started to take shape and look like this:

 

This turned into a multi-phase, gated approach:

Import Wizard:
The Import Phase is the part where systems enter the WaaS process. For most organizations, this might not need to be split up, as all Windows 10 workstations are probably managed by the same team within the company. For our company, we segment workstation clients in approximately four different ways, so we needed to account for this when the deployment teams submit systems into the process. Since they are already scoped to only see what they have rights to see, the Import Wizard needed to be able to access these collections under the technician’s credentials. They could add systems directly to these Ready for Pre-assessment collections in the console, but the goal of the Import Wizard was to make it easier. Plus, with code optimization, we are able to add 5000 systems to a collection in about 60 seconds.

Ready for Pre-assessment:
The Ready for Pre-assessment collections are just placeholder collections for systems that are entering the WaaS process and are scoped for the various workstation clients. A backend job processes systems in these collections and moves them into the Pre-assessment collection.

Pre-assessment:
The goal of the pre-assessment phase is to prevent systems that have known issues based on inventory data from proceeding in the process. These could be issues that would create a hard blocker for Windows 10 setup or applications that are known not to survive the in-place upgrade of the OS. This is split up into three categories:
General (Pass/Fail)

  • OS
  • OS Architecture
  • OS Build
  • Last HW Inv
  • Last MP Client Registration
  • Last Heartbeat
  • CCMCache size

Hardware (Pass/Fail)

  • Free disk space
  • Memory
  • Models (descoped for v1)

Software (Pass/Fail/NA)

  • CM Client
  • 3rd party disk encryption (earlier version is a known blocker)
  • 3rd party anti-virus (earlier version is a known blocker)
  • Earlier app versions that either do not work with 1709 or do not survive the in-place upgrade
  • 16 checks and growing
  • Will probably add more from Compat Scan data.

Currently this is done using SMA Automation and systems that fail any tests remain in the Pre-assessment collection. Systems that pass all checks are moved at night to the next phase – Pre-cache/Compat Scan.
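The Pass/Fail/NA idea behind these checks can be sketched in a few lines. This is only an illustration under stated assumptions: the 20 GB free disk threshold comes from earlier in this post, while the memory threshold, the inventory age limit, the minimum encryption version and all of the names are hypothetical. Our actual checks run as automation jobs against inventory data, not as this code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

MIN_FREE_DISK_GB = 20                   # from the post: Windows 10 x64 wants 20 GB free
MIN_MEMORY_GB = 4                       # hypothetical threshold
MAX_INVENTORY_AGE = timedelta(days=14)  # hypothetical threshold
MIN_ENCRYPTION_VERSION = (7, 2)         # hypothetical minimum supported version


@dataclass
class Inventory:
    # Hypothetical subset of inventory data for one system.
    free_disk_gb: float
    memory_gb: float
    last_hw_inventory: datetime
    encryption_version: Optional[Tuple[int, int]]  # None when not installed


def pre_assess(inv: Inventory, now: datetime) -> Dict[str, str]:
    """Return one Pass/Fail/NA result per check."""
    fresh = now - inv.last_hw_inventory <= MAX_INVENTORY_AGE
    results = {
        "Free disk space": "Pass" if inv.free_disk_gb >= MIN_FREE_DISK_GB else "Fail",
        "Memory": "Pass" if inv.memory_gb >= MIN_MEMORY_GB else "Fail",
        "Last HW Inv": "Pass" if fresh else "Fail",
    }
    if inv.encryption_version is None:
        results["3rd party disk encryption"] = "NA"    # not installed, nothing to block
    elif inv.encryption_version >= MIN_ENCRYPTION_VERSION:
        results["3rd party disk encryption"] = "Pass"
    else:
        results["3rd party disk encryption"] = "Fail"  # earlier version is a known blocker
    return results


def passes_gate(results: Dict[str, str]) -> bool:
    """A system moves on only if no check failed (NA does not block)."""
    return "Fail" not in results.values()
```

Keeping NA separate from Pass matters for reporting: it shows how many systems never had the product installed, as opposed to how many had a supported version.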

Pre-cache/Compat Scan:
The goal of the Pre-cache/Compat Scan phase is to prevent systems that have known issues from proceeding in the process. This is done by using a Task Sequence and it accomplishes the following:

  • Pre-cache content ahead of the scheduled deployment
  • Run Windows 10 Setup with the Compatibility Scan Option
  • Collect the results for evaluation and metrics
  • Discover previously unknown blockers (and add them into the Pre-assessment checks)
  • Deployment is configured to ‘Download contents before starting’
  • Driver packages are dynamically downloaded during the Task Sequence
  • Metrics are written to the registry

Running the Compat Scan as part of the actual upgrade is possible, but by then the end user has already been disrupted and will likely be frustrated if the Compat Scan fails and the upgrade does not happen. For this reason (along with pre-caching) we have split this out as a separate phase. Systems that fail remain in the Pre-cache/Compat Scan collection and the deployment reruns daily. Systems that pass are moved to the next phase – Ready for Scheduling. Another note is that the Upgrade Operating System Task Sequence step places the client in provisioning mode (see UserVoice item here). There are a few options for preventing clients from getting stuck in provisioning mode, and these will be covered in later blog posts.

This is the first phase where we start writing some key metrics to the registry on the target system. This will enable us to collect key metrics, such as how many times Compat Scan has run, how long it took to run, the return code and return status, along with a few other data points and an overall WaaS Stage progress. This information is used for troubleshooting and reporting metrics.
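Here is a small sketch of how these Compat Scan metrics might be accumulated between runs. The key path and the `CS_SetupEngineReturn` value name are the ones quoted in a reader comment on this post; every other value name, the status strings and the stage values are hypothetical. The function only builds the values in memory; actually persisting them to the registry (for example with Python's `winreg` module on Windows) is left out so that the logic can be shown on its own.

```python
from typing import Dict, Union

# Key path as quoted in a reader comment on this post.
WAAS_KEY = r"HKLM\Software\WaaS\1709"

MetricValue = Union[int, str]


def record_compat_scan(existing: Dict[str, MetricValue], return_code: int,
                       runtime_seconds: int, status: str) -> Dict[str, MetricValue]:
    """Return a new set of values for WAAS_KEY after one Compat Scan run."""
    updated = dict(existing)  # do not modify the caller's copy
    # Hypothetical value names, apart from CS_SetupEngineReturn.
    updated["CS_Attempts"] = int(existing.get("CS_Attempts", 0)) + 1
    updated["CS_SetupEngineReturn"] = return_code
    updated["CS_RunTimeSeconds"] = runtime_seconds
    updated["CS_ReturnStatus"] = status
    # Overall stage: a system only advances once the scan has succeeded.
    if status == "Success":
        updated["WaaS_Stage"] = "Ready for Scheduling"
    else:
        updated["WaaS_Stage"] = "Pre-cache/Compat Scan"
    return updated
```

Because the attempt counter is carried forward from the existing values, a system that fails on day one and passes on day two ends up with an attempt count of 2, which is exactly the kind of data point that feeds first time success reporting.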

Ready for Scheduling:
The Ready for Scheduling collections are just placeholder collections for systems that have completed the Pre-cache/Compat Scan phase and are now ready to be scheduled for the actual in-place upgrade. The deployment technicians are scoped to see these collections and they are used by the Scheduling Wizard.

Scheduling Wizard:
The Scheduling Wizard facilitates the scheduling of systems for the in-place upgrade. It is based on a monthly cycle. End users have the ability to opt in and run the in-place upgrade before the scheduled date. The Task Sequence pop-up notification is enabled for certain systems (like laptop users). There are 31 collections, one for each day of the month, each with a corresponding deployment and maintenance window. This makes it easier to track when systems are scheduled and makes daily reporting simpler.
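The one-collection-per-day layout makes the mapping from a scheduled date to a collection trivial, as the sketch below shows. The collection naming convention is hypothetical and only there to illustrate the idea.

```python
from datetime import date
from typing import List

COLLECTION_PREFIX = "WaaS - Scheduled Day"  # hypothetical naming convention


def collection_for(scheduled: date) -> str:
    """Return the name of the daily collection for a scheduled date."""
    return "%s %02d" % (COLLECTION_PREFIX, scheduled.day)


def all_daily_collections() -> List[str]:
    """Return the names of all 31 daily collections."""
    return ["%s %02d" % (COLLECTION_PREFIX, day) for day in range(1, 32)]
```

Because the collections repeat every month, the same 31 deployments and maintenance windows can be reused for each cycle.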

Pre-flight:
The Pre-flight phase runs prior to the in-place upgrade to double check the readiness rules that are defined in the Pre-assessment phase and make sure nothing has changed. In addition, it checks a few execution rules. The execution rules that are currently defined are: AC/Battery check, MP connectivity, VPN status, Pending reboot, and most importantly – the Kill Switch. The Kill Switch is an extra safety measure (along with the maintenance windows) to prevent the in-place upgrade from running in the case of an emergency. Pre-flight metrics are also written to the registry: the number of Pre-flight attempts, when it was last run, the return code and status, and the version of the Pre-flight script that was run.

The original plan was to run this as a Package/Program before running the Task Sequence. I don’t like stopping the Task Sequence once it starts, for example to prompt the user to plug into power with a countdown, as this messes with runtimes for maintenance windows and screws up Task Sequence runtime metrics. However, we discovered that the Package/Program does not adhere to maintenance windows. Systems that were off during the required time would power up the next morning. Instead of waiting for the next available maintenance window, the Pre-flight job would go ahead and execute (and possibly prompt the user to plug into power). Therefore, we decided to move this back into the Task Sequence at the very start.
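The execution rules in the Pre-flight phase can be sketched as a simple ordered check. This is an illustration only: the field names are hypothetical, and treating an active VPN connection as a failure is an assumption on my part for the example, since the rule itself is just "VPN status".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeviceState:
    # Hypothetical snapshot of a system taken at the start of the Task Sequence.
    kill_switch_on: bool
    on_ac_power: bool
    mp_reachable: bool
    on_vpn: bool
    pending_reboot: bool


def pre_flight(state: DeviceState) -> Tuple[bool, List[str]]:
    """Return (ok_to_run, failed_rules) for the execution rules."""
    # The Kill Switch wins over everything else, so check it first.
    if state.kill_switch_on:
        return False, ["Kill Switch"]
    failures = []
    if not state.on_ac_power:
        failures.append("AC/Battery")
    if not state.mp_reachable:
        failures.append("MP connectivity")
    if state.on_vpn:  # assumption: upgrades are not started over VPN
        failures.append("VPN status")
    if state.pending_reboot:
        failures.append("Pending reboot")
    return len(failures) == 0, failures
```

Returning every failed rule, rather than stopping at the first one, gives the technician the full picture from a single Pre-flight attempt.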

In-place Upgrade:
This is where the main In-place Upgrade Task Sequence runs. Additional metrics are collected as part of this process as well. Things like – In-place Upgrade runtime, return code and return status, if a user is logged on and also if it was kicked off by the user (opt-in vs. running at the required time).

The initial design called for auto-rescheduling of certain events – failed Pre-flight, No Status, Accepted, Waiting. Running (i.e. hung) or Failed states would remain in the collection for 7 days to allow time for investigation. This will give the end user or technician the ability to re-run the Task Sequence in the case that the system did not experience a hard down failure. Lastly, systems that Succeeded would be pulled from the collection and removed from the process. However, we made a tweak to leave all systems in the collections for 7 days in order to simplify reporting and minimize collection churn. After 7 days, devices that have been successfully upgraded are removed from the WaaS process. Those systems in certain states go into a Needs Remediation collection so that a technician can have a closer look and see exactly why they are having issues. Systems that have not tried to run (that qualify) go into a Mandatory collection that does not have any restrictions. This means if a system was powered off each night during the scheduled deployment period, on the 8th day it would get a mandatory upgrade during the day.
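The 7 day rule can be expressed as a small routing function. The state names come from the description above, but which states land in which collection is partly my assumption for this sketch: systems that never attempted the upgrade (No Status, Accepted, Waiting) go to Mandatory, and anything else that did not succeed goes to Needs Remediation. The name used for the scheduled collection is a placeholder.

```python
from typing import Optional

HOLD_DAYS = 7  # from the post: all systems stay in the collection for 7 days

# Assumption: these states mean the upgrade never actually tried to run.
NOT_ATTEMPTED = {"No Status", "Accepted", "Waiting"}


def next_collection(state: str, days_in_collection: int) -> Optional[str]:
    """Return where a system goes next, or None if it leaves the WaaS process."""
    if days_in_collection < HOLD_DAYS:
        return "Scheduled"          # placeholder name: stay put during the hold period
    if state == "Succeeded":
        return None                 # upgraded systems are removed from the process
    if state in NOT_ATTEMPTED:
        return "Mandatory"          # no restrictions, runs on the next opportunity
    return "Needs Remediation"      # Failed, Running (hung) or anything unexpected
```

Defaulting unknown states to Needs Remediation is a deliberate choice in this sketch: it is safer for a technician to look at an unexpected state than for the system to be silently forced through a mandatory upgrade.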

Exclusions/Tech Led:
There are always those scenarios where a certain group of systems needs to be excluded from the process. In order to handle this situation, exclusion collections were set up so that a system (or group of systems) could be completely excluded from the WaaS process. In addition, a separate On-demand Task Sequence and deployment was created for the following reasons: manual tech led deployments (e.g. executives) and one-off testing. This Task Sequence uses nested Task Sequences and has both the Pre-cache/Compat Scan and In-place Upgrade Task Sequences. The Deployment is configured as ‘Available’.

Hopefully this helped define ‘why’ we needed to do something and ‘what’ we are doing to solve the problem. Not all organizations will need to get this detailed – adopting just a Pre-cache/Compat Scan strategy with some extra registry information might be good enough for some. The goal is also to spawn other ideas on how to handle upgrades in your own environment. In Windows as a Service in the Enterprise Part 2, we will go into the ‘how’ and talk about the technical details behind the solution and how all of it fits together.

Originally posted on https://miketerrill.net/

4 thoughts on “Windows as a Service in the Enterprise Overview Part 1”

  1. Great writeup. Is this at any moment going to lead to some sort of upgrade guide so others can implement part of this solution into their own workplace? I have a hard time translating the tips to actual upgrade task sequence steps.

  2. Great stuff, really incredible. I’m trying to implement all of this, between you and Gary’s site. I’m not good at WQL queries, and am stuck at trying to create a collection that will capture the value at HKLM\Software\WaaS\1709\CS_SetupEngineReturn (my hardware class is already extended), so I can target the successful ones for upgrade. Could I get you to share your collection query on that? Thank you.

    • Hi Matt, our internal WaaS process uses all direct membership rules in which we have WaaS automation jobs that move machines through the process based on success of the current phase. This way we can have complete control over a system and move it back in (or out of) the process without having to update a registry key on the machine, run hardware inventory and then wait for colleval to run. However, I think Gary did start doing something like that in his lab but he is OOO right now. Also, Chris Buck (@SCCMF12TWICE) has a blog on how to do IPU collections based on WQL queries that you can find here: https://sccmf12twice.com/2018/09/how-to-ipu-collections/. Hopefully our automation gurus that wrote the WaaS jobs from the design spec will start blogging about them as well. I hope this helps and if you still need some assistance, ping me on Twitter.
      Thanks,
      Mike
