Troubleshooting for IT Pros: Resolve Office 365 Service Issues Fast

Troubleshoot tough IT problems, understand native tool gaps, assess your current strategy, and identify a clear, actionable picture of the state of your cloud services.

In this article, find out how IT pros troubleshoot tough IT problems, understand native tool gaps, assess current strategies, and identify effective, actionable ways to monitor and manage mission-critical Microsoft collaboration systems.

Content Summary

Summary
Troubleshooting Tough IT Problems
Troubleshooting Faster
Common Office 365 Scenarios IT Faces
Scenario 1: Understanding the causes of Office 365 outages or degradations
Scenario 2: Preventing large gaps in delivering communication to users when service is degraded
Scenario 3: Quantifying the experience of end-users connecting from outside the network
Gaps in Native Tools
Service Health Dashboard
Azure AD Connect Health
Assessing Your Current Strategy
Conclusion

Summary

Microsoft 365 is one of the most popular cloud services for building, managing, and deploying applications. The Microsoft cloud hosts all types of business-critical IaaS, PaaS, and SaaS workloads for many enterprises worldwide. A shift to Office 365 drives digital workplace maturity, but hidden availability problems, outages, and performance challenges affect both the service and the end-user experience.

While the promise of the cloud is only getting stronger, the challenge of Office 365 performance management continues to be a concern for IT teams. In every stage of cloud adoption – testing, migration, optimization, adoption, etc. – it is important to ensure that performance does not degrade. Whether cloud-only or hybrid cloud infrastructure is used, holistic performance visibility is required to ensure successful service delivery and achieve high efficiency.

This article is intended to help you troubleshoot tough IT problems, understand native tool gaps, assess your current strategy, and build a clear, actionable picture of the state of your cloud services.

Troubleshooting Tough IT Problems

One of the most difficult challenges an IT infrastructure team faces is pinning down exactly what is broken when the technology a business depends on starts to go awry. A day that begins smoothly can descend quickly into chaos as reports from various departments and locations tell a tale of an application that is not behaving well. Underperforming applications are a classic complaint. Email is slow, or is not sending or receiving. Teams call quality is poor or, even worse, calls drop and do not reconnect.

The help desk team escalates to the sysadmin team. The sysadmin team escalates to the application engineering team. The engineers point fingers at each other. Management hauls everyone into a conference room to facilitate resolution. All the while, the application continues to misbehave, angering and alienating the customer base while the IT engineering team sits around the conference room table, poking furiously at laptops in search of a cause.

Troubleshooting Faster

Troubleshooting is a fact of life for IT teams. Even if you do everything in your power to monitor infrastructure and applications closely and address issues proactively, problems arise. With the cost and impact of outages increasing, decreasing mean time to resolution (MTTR) has never been more important. Unfortunately, the native tools IT teams turn to when problems occur have not kept pace with the needs of modern application and infrastructure management. When it comes to resolving issues quickly, you face three main challenges.

  • Increasing complexity: Your IT environment is more complex and more dynamic than ever. Your current operations may include hybrid cloud and multi-cloud environments, diverse infrastructure, and traditional and cloud-native applications. A single application can depend on dozens of underlying resources.
  • Inevitable finger-pointing: Given the level of complexity, serious issues can lead to finger-pointing. Even before hardware and software vendors are involved, it is common for application, network, storage, and virtualization teams to deflect responsibility, which delays the search for the actual cause.
  • Essential expertise: Troubleshooting a difficult problem can require significant expertise. Application, database, network, storage, and virtualization experts may all get pulled into the effort. If you are one of those experts, you must ask yourself whether troubleshooting is the most productive way to spend your day. If you are responsible for hiring people with expertise, you have an entirely different concern. It is impossible to be certain that people with the necessary expertise will be available when a problem occurs.

Why is it suddenly so difficult to monitor performance and identify the root cause of a problem? One reason is that when your application is distributed, you end up with a hodgepodge of performance data, such as SQL logs, error logs, and app crash analytics. With data scattered across different environments and tools, you cannot trace what happened at each step of the way. Managing the performance of the application ends up being a largely manual effort, with significant blind spots and gaps along the way.

Most IT teams use a diagnostic approach similar to that of an emergency room doctor evaluating a patient. The doctor relies on a variety of tests, progressing from simple to complex (temperature, blood pressure, pulse, blood tests, x-rays, MRIs, and so on), and draws on experience to synthesize the available information into a diagnosis.

IT teams do the same thing by relying on separate server, storage, network, application, and other metrics and then attempting to synthesize all that information into an actionable “diagnosis.” This method does not pinpoint problems or uncover the root cause. Without a single view of all the elements that make up your IT environment, it becomes difficult to troubleshoot problems quickly and effectively without finger-pointing.

Control and Accountability

The complexity lies in not having visibility into the supporting infrastructure, let alone an understanding of how components relate to one another. As such, even if you had access to all of Microsoft’s Office 365 metrics, the information would not be useful. And even if this information were available, should you care? Or is this solely Microsoft’s responsibility as part of the service agreement? The answer is simple: some metrics are relevant to Microsoft alone. When you purchase a service, how Microsoft delivers that service, or how much load Microsoft’s systems are currently under, does not really matter. That is, at least, as long as the delivered service and functionality continue to live up to expectations and SLAs!

The massive scale of services like Office 365, with users distributed across several data centers and hundreds of thousands of servers, makes it nearly impossible to maintain the same monitoring paradigm as in years past. Because of this, monitoring a single service from a single location no longer represents how applications are used in the real world.

Within a sea of information, you cannot distinguish what information is relevant to you and what is not, in part because you do not have complete information about all the components and in part because of the scale and complexity of the environment. As users roam between various locations and connect from both inside and outside the boundaries of the corporate network, an organization needs to understand whether service issues are confined to a single location or whether an outage is affecting its operations at a larger scale, potentially even service-wide. This problem is worsened by the fact that a cloud service provider such as Microsoft has little information about your network.

Without proper monitoring, and by relying solely on built-in capabilities, administrators are often left to wait for feedback from their users to understand when something is wrong. This is far from ideal. Being able to pick up on early warnings of issues in any of the components within the infrastructure, including Office 365, is vital.

Common Office 365 Scenarios IT Faces

Scenario 1: Understanding the causes of Office 365 outages or degradations

Situation

  • We need to be able to identify the cause of Office 365 outages and know if the problem is on our side or with Microsoft.
  • The Microsoft Service Health Dashboard is a trailing indicator and does not provide good quantifiable information on outages as they happen.
  • Twitter is often a better source of information, as Microsoft is more aggressive in posting updates there.

Impact

  • When there is a problem, it is difficult to find the right information quickly.
  • This can lead to an increase in help desk calls, as frustrated users who cannot access their services try to find a solution.
  • To keep confidence in the administrative team, we need to know about a problem before it reaches the end-users.

Scenario Example: Microsoft Teams Outage

  • Teams went down for 2 hours for users in Europe, right as companies were ramping up for workers to connect remotely.
  • The Microsoft Service Health Dashboard did not provide updated details for most users until several hours later.
  • Twitter immediately came alive with concerned users asking for updates, and Microsoft responded.
  • Help desks were flooded with calls from users who could not connect.

Scenario 2: Preventing large gaps in delivering communication to users when service is degraded

Situation

  • Microsoft’s native tools and the Service Health Dashboard do not provide timely updates on Office 365 issues.
  • In reality, an incident may not reach the threshold of impacted users that Microsoft’s SLA requires, meaning that Microsoft will not provide an update for that particular customer.
  • The Microsoft Service Health Dashboard is not tenant-specific, so it may show an update that does not apply to a given tenant, which causes confusion.

Impact

  • Users can be down for several hours without an updated response from Microsoft.
  • Not knowing where to start looking for the cause of an issue can cost millions of dollars in lost productivity.
  • Users and management can lose confidence in the collaboration team if the team is repeatedly unaware of issues.

Scenario Example: ADFS Authentication Error

  • Administrators need to understand quickly whether the problem is on their side (a “me problem”) or on Microsoft’s side. Calling Microsoft about a “me problem” wastes time and resources.
  • If users are not able to authenticate through ADFS, that can look like a problem with Microsoft’s service but needs to be investigated internally.
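
The triage described in this example can be sketched as a simple decision function. This is a minimal illustration only: the `ProbeResult` structure, its three fields, and the wording of the verdicts are assumptions made for the example, and the sketch presumes you already have some synthetic sign-in check that can report each step separately.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one synthetic sign-in check (hypothetical structure)."""
    adfs_reachable: bool      # could the probe reach the on-premises AD FS farm?
    adfs_token_issued: bool   # did AD FS issue a token for the test account?
    office365_accepted: bool  # did Office 365 accept that token?


def classify_auth_failure(result: ProbeResult) -> str:
    """Return a rough first guess at where a sign-in failure originates."""
    if not result.adfs_reachable:
        # The request never reached AD FS, so start with your own network.
        return "me problem: network path to AD FS"
    if not result.adfs_token_issued:
        # AD FS was reachable but did not authenticate the test account.
        return "me problem: AD FS service or configuration"
    if not result.office365_accepted:
        # The internal steps passed, so the remaining suspects are the
        # federation trust and the Microsoft side of the sign-in.
        return "check federation trust, then Microsoft"
    return "healthy"


print(classify_auth_failure(ProbeResult(True, False, False)))
```

Checking the internal steps first reflects the point above: a failure that looks like a Microsoft outage is ruled in or out on your own side before anyone opens a ticket.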

Scenario 3: Quantifying the experience of end-users connecting from outside the network

Situation

  • The modern workplace is evolving. An increasing number of workers can perform their jobs remotely, and as such, the Digital Workplace Team needs to understand complex problems that can occur outside of the main network.
  • It is extremely rare to see a full Office 365 outage. Service degradations and issues that affect a subset of users are far more common.
  • It is difficult to triangulate the source of the problem, and even harder to understand the impact on users connecting from remote locations.

Impact

  • If the help desk receives a high volume of support requests from end-users, it does not always have the tools necessary to troubleshoot.
  • The Microsoft Service Health Dashboard does not quantify the experience of the end-user community for your specific tenant.
  • There could be a significant impact on the remote workforce, or on a site where many users work, and we would not have any visibility into it, which costs time and money.

Scenario Example: SharePoint failure for one site

  • Remote users rely on their ability to connect to their resources from wherever they are.
  • If there is a degradation of service in one location, the Microsoft Service Health Dashboard cannot give detailed information on the experience of those users.
  • In this scenario, the SharePoint test is failing at only one site, and following the virtual breadcrumb trail into this indicator shows that users at this site cannot access SharePoint resources. Because the other sites are showing green, the problem is local to that site, and there is no reason to look for the cause in the wider Office 365 service.
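
The site-by-site comparison in this example can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular monitoring product: the function name `classify_scope`, the site names, and the idea of one pass/fail result per site are all assumptions made for the example.

```python
from typing import Dict


def classify_scope(site_status: Dict[str, bool]) -> str:
    """Summarize whether failures are local or widespread.

    site_status maps a site name to True (test passed) or False (test failed).
    """
    failing = sorted(site for site, ok in site_status.items() if not ok)
    if not failing:
        return "healthy"
    if len(failing) == len(site_status):
        # Every vantage point fails, so the service itself is a suspect.
        return "all sites failing: possible service-wide issue"
    # Only some sites fail, so look at those sites first.
    return "localized to: " + ", ".join(failing)


# Hypothetical results from the same SharePoint test run at three sites.
print(classify_scope({"Boston": True, "London": False, "Remote-VPN": True}))
# prints "localized to: London"
```

The value of running the same test from several vantage points is exactly this comparison: a single probe location cannot tell a local failure apart from a service-wide one.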

Gaps in Native Tools

Microsoft’s internal monitoring team performs numerous monitoring tests on the Office 365 service. However, not all the tools they use internally are made available to administrators. In this section, we provide an overview of the two main tools available to administrators: the Service Health Dashboard and Azure AD Connect Health.

Service Health Dashboard

Today, Microsoft exposes health information about Office 365 services through the Service Health Dashboard (SHD). The information that is provided through the dashboard is only of limited use, as it focuses primarily on the overall service health instead of tenant-specific or user-specific problems. One of the limitations of the SHD is that it only gives you part of the end-to-end service view; it provides information on components that Microsoft is responsible for but fails to monitor and report on outages caused by other components, such as your local network, Internet connection or hybrid infrastructures such as Directory Synchronization, federation health, mail flow or AD FS.

Due to the massive scale of Office 365, the dashboard almost always reports some type of issue in one of its services as, logically, there is always a problem going on somewhere. A warning in the dashboard does not necessarily mean that your tenant is affected or that some of your users are experiencing problems. Often, service issues are accompanied by vague descriptions of who might be affected, leaving the customer wondering whether an issue impacts them or not. This creates a new challenge for the administrator, as they are left with the question of whether an issue is relevant and, if it is, to what extent.

The SHD also does not send automated alert notifications. One must purposefully log on to the SHD to view the latest health information. As a matter of fact, Microsoft uses the number of users reading an SHD alert to help determine the scope of impact for that issue. In the past, some issues in Office 365 were directly related to an outage in Azure Active Directory, which prevented access to the SHD. This also creates a Catch-22 situation for the Service Health Dashboard: the inability to authenticate to Azure AD prevents users from getting up-to-date Service Health Information. What good is a health dashboard if you risk being locked out of it?

Microsoft has become better and faster at posting outages to the SHD, in contrast to a 2015 outage, when it took nearly eight hours to acknowledge the issue in the SHD. However, it remains extremely hard for Microsoft to close the gap with external monitoring solutions, as Microsoft cannot simply post messages to the SHD before assessing the issue and making sure that 1) customers are affected, and 2) the appropriate message is sent, so as not to create unnecessary confusion. The time between the outage happening and Microsoft being able to assess the issue confidently and post an appropriate message to the SHD creates a void for many customers. During this time, customers are left to wonder whether there was an outage and, if so, whether it was affecting them. One of the most common complaints we see on Twitter is, “All my users are affected. Why isn’t this in the Service Health Dashboard?”

Azure AD Connect Health

Another useful tool Microsoft provides to help you monitor your hybrid Office 365 deployment is Azure AD Connect Health. This tool is available if you have an Azure AD Premium subscription. It is an agent-based monitoring solution that helps you gain visibility into Azure AD Connect synchronization, AD FS, and on-premises Active Directory.

It supports AD FS 2.0 on Windows Server 2008 R2, Windows Server 2012, Windows Server 2012 R2, and Windows Server 2016. It also supports monitoring the AD FS proxy or web application proxy servers.

The main benefits are the usage reports and the fact that it will notify you if the directory sync engine stops working or users are unable to authenticate to AD FS. The information is presented in the Azure AD Connect Health portal. The installation is simple and does not take much time.

Although Azure AD Connect Health covers some of the components that are not included in the SHD, some limitations will leave gaps in your monitoring strategy. First, it is not integrated with the SHD, so an admin needs permission to access the Azure Portal to view the monitoring results. This also means you do not have a single location to look at all the areas that can affect your users. It also creates a dependency on Azure AD, which, in case of an outage, can render access to the portal impossible. Second, while it does provide the ability to alert on events, it does not offer any capabilities to integrate with third-party monitoring systems, which is a typical requirement for enterprise companies.

Azure AD Connect Health provides only limited insights into the components. While it does perform synthetic transactions, it does not monitor the network, the Office 365 service, hybrid server health, or connectivity and functionality from remote locations. This makes it very difficult for an administrator to figure out what caused an outage and whether it is something they can fix.

Imagine thinking the issue is on Microsoft’s side and waiting for a fix that will never come. The lack of visibility hinders an organization’s ability to respond appropriately to reported issues. The ability to detect where an outage stems from is crucial, as it ultimately allows you to drive down the mean time to resolution (MTTR) of reported issues. For instance, if a network issue prevented a remote location from accessing the AD FS servers, you would likely hear from users complaining they could not access the Office 365 service, but what would you do next? The SHD would not show a service outage, and Azure AD Connect Health would not have visibility into the users at the remote location. The ultimate resolution would be to resolve the network issue, which would be your responsibility.

The way you handle an outage is obviously different when the issue occurs within Microsoft’s datacenters. In these cases, the challenge shifts to establishing quickly and conclusively that the issue is on Microsoft’s side. Doing this as soon as possible is crucial because you do not want to learn about an issue from your management or users. Once you have confirmed it, you can open a ticket with Microsoft and keep your users up to date as to when the issue will be resolved.
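
Since MTTR is the figure this section aims to drive down, it helps to be precise about how it is computed: the total time from report to resolution across incidents, divided by the number of incidents. The sketch below assumes incidents are recorded as (reported, resolved) timestamp pairs; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(
    incidents: List[Tuple[datetime, datetime]]
) -> timedelta:
    """Average the (resolved - reported) duration over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - reported for reported, resolved in incidents),
        timedelta(),  # start value so that sum() adds timedeltas
    )
    return total / len(incidents)


# Two hypothetical incidents lasting 2 hours and 4 hours.
sample = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 11, 0)),
    (datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 18, 0)),
]
print(mean_time_to_resolution(sample))  # prints "3:00:00"
```

Time spent waiting on the wrong party counts toward every one of those durations, which is why identifying the origin of an outage early has such a direct effect on the average.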

Assessing Your Current Strategy

It is important to plan and develop an effective cloud monitoring strategy. The strategy must be growth-oriented, defined minimally and then refined iteratively, and centered on the organization’s ability to proactively monitor the complex distributed applications the business depends on. Without a solid strategy in place, IT teams and administrators noticeably struggle to manage, maintain, and deliver the outcomes that the business (and the IT organization) expects from the critical services IT is charged with delivering.

Monitoring is core to managing infrastructure and the business, with a focus on measuring quality of service, workplace productivity, and customer experience. The best strategy is one that monitors several systems and workloads and can go deep to support one of your largest investments, ultimately delivering value and ROI to your organization. Moving to Microsoft Office 365 is not a time to relax; it requires even more diligence than a standalone on-premises environment. It is important to understand that Microsoft is responsible only for its cloud service and has very little incentive to diligently monitor its own systems, let alone yours. This creates a hefty number of blind spots. Relying on native tools alone as your strategy falls short of meeting critical needs.

The best monitoring solutions do not depend on a single source of data to make assessments. Logically, the more information that is available, the clearer the picture and the better the decision about what to do should an outage occur. In addition, monitoring solutions that collect data on an ongoing basis can report historical availability and outage trends by month, quarter, year, or longer. Analysis of this data can identify weaknesses in a configuration and allow administrators to rectify issues before they become a real problem.
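
As a rough illustration of the historical reporting described above, the following sketch turns a stream of probe results into a per-month availability percentage. It assumes each probe is stored as a (timestamp, succeeded) pair; the function name and the sample data are hypothetical, and a real monitoring solution would draw on far more data sources than this.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def monthly_availability(
    probes: Iterable[Tuple[datetime, bool]]
) -> Dict[str, float]:
    """Return the percentage of successful probes for each 'YYYY-MM' month."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for timestamp, ok in probes:
        month = timestamp.strftime("%Y-%m")
        totals[month] += 1
        if ok:
            successes[month] += 1
    return {
        month: 100.0 * successes[month] / totals[month]
        for month in sorted(totals)
    }


# Hypothetical probe history: one failure out of four checks in January.
history = [
    (datetime(2024, 1, 5), True),
    (datetime(2024, 1, 12), False),
    (datetime(2024, 1, 19), True),
    (datetime(2024, 1, 26), True),
    (datetime(2024, 2, 2), True),
    (datetime(2024, 2, 9), True),
]
print(monthly_availability(history))
# prints {'2024-01': 75.0, '2024-02': 100.0}
```

Grouping by quarter or year only requires a different key in place of the `strftime("%Y-%m")` call.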

Conclusion

Microsoft continues to invest huge amounts of money in growing its reach and expanding the size of Office 365. It is adding new services and capabilities, expanding into new regions, and ramping up its marketing efforts to capture more cloud market share. Alongside all these activities, it continues to slowly enhance its included monitoring tools. However, the nature of cloud services means that customers will never get the complete visibility they need from Microsoft’s own tools, which guarantees that IT teams will keep having to troubleshoot problems. Customer-centered monitoring of service availability and quality is therefore critical to detecting and resolving problems, no matter their origin.

We hope you found the insights in this document helpful. With the right strategy and the right monitoring solutions, you can confidently and quickly identify a clear, actionable picture of the state of your cloud services at any given time.