Breaking New Ground with Relationship-Based Observability

In recent years, the concept of Observability has emerged to address the persistent risk to a company’s digital experiences and business applications as IT environments become more complex and more dynamic. Relationship-Based Observability breaks new ground by adding three new capabilities that help companies detect, prevent, and rapidly resolve incidents.

Read this article to learn about what’s missing in your observability solutions and how you can close the gaps.

Table of contents

The Challenges Created by Digital Transformation
The Consequences of Poor Digital Experience
The Current State of Monitoring and Incident Management
Enter Relationship-Based Observability
The Value of Relationship-Based Observability

What is Relationship-Based Observability, and how does it dramatically improve upon existing Observability solutions and incident management processes? These are the questions this article addresses.

The Challenges Created by Digital Transformation

Enterprises globally are implementing their core customer, prospect, partner, supplier, and employee-facing processes in software. This means that every enterprise is becoming a software company that must build software, test it, operate it in production, and, most importantly, continuously enhance it to keep it competitive and to respond to its users’ requirements.

This worldwide demand for software engineering has far exceeded the supply of available software engineers. The shortfall has driven a set of innovations designed to speed the development of software and its delivery into production:

  • Agile Development broke development into smaller chunks (sprints), allowing for continuous incremental progress in software development.
  • DevOps made supporting software in production more effective and efficient.
  • CI/CD automated much of the process of delivering software into production.
  • New languages and runtimes such as Python, PHP, Node.js, and Go were created to speed the development of the classes of applications for which they are appropriate.
  • New runtimes and platforms like Spring Boot, OpenShift, and Pivotal took progressively more of the burden of building and running infrastructure software off the shoulders of developers.
  • Docker allowed applications to be isolated in containers along with their supporting libraries, which smoothed the process of testing and integrating new releases into production.
  • Kubernetes automated the process of orchestrating groups of containers as needed to respond to changes in demand and other conditions.
  • Databases proliferated to meet the needs for various data models, scalability needs, performance needs, data types and redundancy requirements.

In summary, the response to Digital Transformation has created the following unprecedented dynamics:

  • A very high pace of innovation
  • A high rate of updates of application software in production
  • A very diverse set of languages
  • A very diverse set of runtimes, supporting software and databases
  • A dynamic execution environment in which services are rapidly scaled up and down

The Consequences of Poor Digital Experience

All of this software must be reliable and provide excellent end-user experiences or else the enterprise suffers the following adverse consequences:

  • Reductions in online revenue: 63% of consumers say they often abandon a brand when the online experience is poor
  • Impacts on online company reputation: 72% of consumers say they are loyal – until they have a bad experience
  • Impacts on customer experience: 54% of U.S. consumers say customer experience at most companies needs improvement

Furthermore, time is wasted internally on addressing issues instead of enhancing online functionality and business performance. Each month, companies experience 5 or more outages, each lasting an average of 200 minutes and involving 25 or more employees to resolve.
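
To put those figures in perspective, the short calculation below multiplies them out at their stated minimums. It is only a rough sketch: it assumes, purely for illustration, that every involved employee is engaged for the full duration of each outage, which the figures above do not claim.

```python
# Figures quoted above, taken at their stated minimums.
outages_per_month = 5
avg_outage_minutes = 200
staff_per_outage = 25

# Total downtime per month, in hours.
downtime_hours = outages_per_month * avg_outage_minutes / 60

# Assumption (not from the source data): all staff are engaged for the
# whole outage, so staff time is downtime multiplied by headcount.
staff_hours = downtime_hours * staff_per_outage

print(f"Downtime per month: {downtime_hours:.1f} hours")
print(f"Staff time per month: {staff_hours:.0f} person-hours")
```

Under that assumption, roughly 17 hours of monthly downtime translate into more than 400 person-hours spent on incidents rather than on improving the product.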

The Current State of Monitoring and Incident Management

To combat the negative impacts of poor digital experiences, modern enterprises have turned to monitoring tools to help identify and prevent incidents. In other words, if you want it to work, you have to monitor it. You have to monitor each layer or component in your stack to ensure the following:

  • That it is working (up and available to do work)
  • That it is performing the work required in the required interval of time (latency or response time)
  • That it is performing the amount of work required per interval of time (throughput)
  • That it is not experiencing errors while performing the required work
  • That it is not experiencing contention for resources (CPU, memory, network I/O and storage I/O) with other software running in the environment
  • That the number of resources available is sufficient to meet the needs of the component (capacity)
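
The six checks above can be sketched as a simple evaluation over one sample of a component's metrics. This is a minimal illustration, not how any particular monitoring tool works: the field names, the limits, and the use of CPU steal as a proxy for contention are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ComponentSample:
    """One observation window for a single component (all fields hypothetical)."""
    name: str
    is_up: bool
    p95_latency_ms: float
    requests_per_sec: float
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    cpu_steal_pct: float       # assumed proxy for resource contention
    capacity_used_pct: float   # share of provisioned resources in use


@dataclass
class Limits:
    """Illustrative limits; real values depend on the service."""
    max_p95_latency_ms: float = 500.0
    min_requests_per_sec: float = 50.0
    max_error_rate: float = 0.01
    max_cpu_steal_pct: float = 10.0
    max_capacity_used_pct: float = 80.0


def failed_checks(sample: ComponentSample, limits: Limits) -> list[str]:
    """Return the names of the checks that the sample fails."""
    checks = {
        "availability": sample.is_up,
        "latency": sample.p95_latency_ms <= limits.max_p95_latency_ms,
        "throughput": sample.requests_per_sec >= limits.min_requests_per_sec,
        "errors": sample.error_rate <= limits.max_error_rate,
        "contention": sample.cpu_steal_pct <= limits.max_cpu_steal_pct,
        "capacity": sample.capacity_used_pct <= limits.max_capacity_used_pct,
    }
    return [name for name, ok in checks.items() if not ok]


# A hypothetical service that is up and error-free but slow and nearly full.
sample = ComponentSample("checkout-api", True, 820.0, 120.0, 0.002, 3.0, 91.0)
print(failed_checks(sample, Limits()))  # ['latency', 'capacity']
```

Note that every check here depends on a fixed limit, which is exactly where the threshold-tuning problems described later in this article come from.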

There are two basic approaches to monitoring business-critical applications and services in production. The first is to monitor the infrastructure for the applications. In the world before virtualization and the cloud, this meant monitoring the physical servers, network devices, and storage devices that comprised the IT environment. In the modern virtualized and cloud world, infrastructure means all of the software running in the virtualization or cloud environment (the operating systems, application frameworks, databases, containers, and orchestration systems) that supports the applications.

The second approach is to monitor the operation of the applications themselves. This typically involves injecting an agent into the application or its runtime and then measuring the performance, throughput, and error rate of the actual transactions executed by the users of the application. APM (Application Performance Management) tools are widely used to monitor applications in production, as are a variety of open-source alternatives to commercial APM tools, such as OpenTelemetry and Prometheus.

Each of these tools feeds its alerts and incidents into some form of event management, incident management, or alert notification system.

This existing monitoring and incident management process leads to the following issues for enterprises:

  • If monitoring thresholds are set aggressively, the support teams drown in alerts, many of them false alarms or false positives.
  • If monitoring thresholds are set conservatively, many issues get missed or are only reported via an alarm after the severity of the incident has impacted online revenue, reputation and customer experience.
  • The number of incidents and the time it takes to solve them are generally unacceptable to the business.
  • Incidents do not contain the information needed to resolve the issues, so teams must drill down extensively into various tools, often in physical or virtual war-room meetings.
  • The entire existing drill-down and war room process consumes valuable time on the part of support teams and engineering teams that could be better spent on improving the quality and performance of the online offerings.
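
The threshold trade-off described in the first two points can be illustrated with a small sketch. The latency readings, the two threshold values, and the set of minutes treated as the real incident are all made up for the example.

```python
# Hypothetical p95 latency readings in ms, one per minute.
latency_ms = [210, 230, 520, 240, 250, 560, 610, 900, 1400, 1600]
# Minutes during which users were actually affected (assumed ground truth).
incident_minutes = {7, 8, 9}


def alerts(readings, threshold):
    """Return the minute indexes whose reading exceeds the threshold."""
    return [i for i, value in enumerate(readings) if value > threshold]


def summarize(readings, threshold, incident):
    """Count alerts, false alarms, and minutes of delay before detection."""
    fired = alerts(readings, threshold)
    false_alarms = [i for i in fired if i not in incident]
    first_hit = next((i for i in fired if i in incident), None)
    delay = None if first_hit is None else first_hit - min(incident)
    return {
        "alerts": len(fired),
        "false_alarms": len(false_alarms),
        "detection_delay_min": delay,
    }


aggressive = summarize(latency_ms, 500, incident_minutes)
conservative = summarize(latency_ms, 1000, incident_minutes)
print(aggressive)    # {'alerts': 6, 'false_alarms': 3, 'detection_delay_min': 0}
print(conservative)  # {'alerts': 2, 'false_alarms': 0, 'detection_delay_min': 1}
```

With this sample data, the low threshold catches the incident immediately but half of its alerts are noise, while the high threshold is quiet but reports the incident only after users have already been affected.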

Enter Relationship-Based Observability

In recent years, the concept of Observability has emerged to address the persistent risk to a company’s digital experiences and business applications as IT environments become more complex and more dynamic. The traditional focus of Observability is on metrics, logs, and traces. Countless vendors who support some combination of metrics, logs, and traces now claim to be Observability vendors. However, most of these vendors supported metrics, logs, and traces before the term Observability became fashionable, so for them Observability is just a new way of talking about existing capabilities.

Relationship-Based Observability breaks new ground by adding the following unique capabilities to an Observability platform or solution:

  • Relationship change analysis between transactions, microservices, service meshes, applications, containers, Kubernetes Pods, Kubernetes Nodes, virtualization and cloud platforms, and all of the resources (compute, memory, networking, and storage) that support each application, captured in real time and tracked over time. This means knowing exactly what these relationships are and how they changed before, during, and after each incident.
  • Changes in the configuration state of the application and its entire supporting infrastructure, captured in real time and tracked over time. This means knowing exactly how the configuration state changed before, during, and after each incident.
  • Root cause analysis based upon both AI and deterministic relationships. Relationship-Based Observability means that the AI knows with certainty that a set of objects is related and that they all support and affect the transaction of interest. This avoids the false alarms (false positives) that result from the AI confusing correlation with causation.
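
The three capabilities above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not StackState's implementation: the component names, timestamps, snapshot format, and function names are all hypothetical. Each snapshot records which components depend on which at a point in time, and configuration changes are only considered as candidate causes if they sit on a component the affected transaction is known to depend on.

```python
# Each snapshot maps a component to the components it directly depends on.
# Component names and timestamps are hypothetical.
snapshots = {
    "10:00": {
        "checkout-txn": {"cart-service"},
        "cart-service": {"pod-a"},
        "pod-a": {"node-1"},
        "node-1": set(),
    },
    "10:05": {
        "checkout-txn": {"cart-service"},
        "cart-service": {"pod-b"},
        "pod-b": {"node-2"},
        "node-2": set(),
    },
}

# Configuration changes recorded per component, as (time, description).
config_changes = {
    "node-2": [("10:04", "kernel parameter updated")],
    "billing-db": [("10:03", "index rebuilt")],  # not in the checkout path
}


def dependencies(topology, root):
    """Return every component reachable from root, including root itself."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(topology.get(node, ()))
    return seen


def relationship_changes(before, after):
    """Return the edges added and removed between two snapshots."""
    def edges(topology):
        return {(src, dst) for src, dsts in topology.items() for dst in dsts}
    return edges(after) - edges(before), edges(before) - edges(after)


def candidate_causes(topology, root, changes):
    """Keep only config changes on components that root depends on."""
    related = dependencies(topology, root)
    return {name: log for name, log in changes.items() if name in related}


added, removed = relationship_changes(snapshots["10:00"], snapshots["10:05"])
suspects = candidate_causes(snapshots["10:05"], "checkout-txn", config_changes)
print(sorted(added))   # [('cart-service', 'pod-b'), ('pod-b', 'node-2')]
print(suspects)        # {'node-2': [('10:04', 'kernel parameter updated')]}
```

In this example the change on billing-db happened at about the same time as the incident, but it is excluded because the topology shows that the checkout transaction does not depend on it. That filtering step is what separates a known relationship from a coincidence in timing.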

The Value of Relationship-Based Observability

Relationship-Based Observability automatically adds the information needed to resolve issues into the existing incident management system. This dramatically improves the existing incident management process without disrupting it, because neither the underlying monitoring tools nor the incident management tools need to be replaced. These improvements in the incident management process lead to:

  • 60% reduction in the time to address incidents
  • 20% fewer staff needed to fix incidents
  • 65% decrease in the number of incidents per month
  • 30% reduction in the cost per incident

Together, these gains lead to dramatic improvements in online revenue, online reputation, and online customer experience.

The deterministic nature of the relationships and configuration changes over time is also a necessary precondition for automating problem resolution: to take automated actions accurately, it is essential to know what impacted what and what changed.

Source: StackState