Redefine IT Observability to Gain Control in Fast-Moving IT Landscapes

The ever increasing complexity in the IT landscape is diminishing the company’s productivity. Are you still using only metrics, logs & traces to monitor your IT infrastructure?

Redefine IT Observability to Gain Control in Fast-Moving IT Landscapes
Redefine IT Observability to Gain Control in Fast-Moving IT Landscapes. Photo by Christina @ wocintechchat.com on Unsplash

If so, then you are missing 3 simple steps to improve observability. Read this article to learn 3 key steps to improve IT infrastructure enabling you to:

  • Prevent Outages: Avoid up to 70% of outages by identifying anomalies across the IT environment before they lead to service interruptions.
  • Reduce Noise: Eliminate 90%+ of alerts by correlating event streams, reducing signal noise, and eliminating non-actionable events.
  • Decrease MTTR: Reduce resolution times by 80% by pinpointing root cause instantly in Time-Traveling Topology.

Table of contents

Introduction
Observability
Still Too Siloed
Redefining Observability
Summary

Introduction

Market Challenges

For many organizations, the ever-increasing complexity of IT landscapes has diminished productivity. As companies move towards more dynamic architectures—like hybrid clouds, containers, and microservices—IT monitoring requirements change drastically. The onslaught of data these new infrastructures create, combined with a real shortage of skilled people and time, makes it increasingly difficult to maintain the pace. In response, many teams use ‘observability’ to gain some control. This article defines the conventional view of observability and explains why it’s essential, but not enough. It also outlines a simple 3-step approach to dramatically improve observability to prevent outages, crush mean-time-to-repair (MTTR), and maximize productivity.

Observability

Current Market Definition

Current market definitions differ to some extent, but most IT experts concur: observability is about bringing different types of data together on a single pane. But which types of data? Most experts state that observability should correlate three: logs, metrics, and traces.

  • (Event) logs: An event log is an immutable, time-stamped record of discrete events that happened over time. Event logs, in general, come in three forms but are fundamentally the same: a time-stamp and a payload in some context.
  • Metrics: Metrics are a numeric representation of data measured over time. Metrics can harness the power of mathematical modeling and prediction to derive knowledge of the system’s behavior over intervals in the present and future.
  • Traces: A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system. Traces are a representation of logs and the data structure looks similar. A single trace can provide visibility into both the path and structure of a request.

Many observability solutions are available in today’s market monitor logs, metrics, and traces. They ingest these three types of data to provide an overall picture of an IT landscape’s health and performance. When an incident strikes, it automatically alerts a team to the problem by sending a notification. Here’s an example of the business value of this kind of observability in action:

Example Case: An IT service manager at a major financial institution experiences an outage
Example Case: An IT service manager at a major financial institution experiences an outage

This outage resulted in:

  • 2 hours of downtime
  • $85k sales loss
  • and -0.3% net promoter loss

Observability of logs, metrics, and traces was in place at the bank to assess the performance and health of the IT environment. However, since the cause and the impact needed to be correlated manually with different team members, precious time and resources were wasted.

Still Too Siloed

In the past, different data types were kept in separate siloes, which caused extremely long MTTRs when incidents occurred. Removing the silos and bringing metrics, logs, and traces together was the birth, and the very definition of, observability. Now, monitoring these three types of data allows you to reduce the damage of a significant outage. However, it’s not enough to cope with the increasingly faster and continuously changing IT landscapes of today.

Migrating workloads to the cloud and implementing continuous delivery to speed up software development is one solution, but it can strain IT teams. The fast, short release cycles and dynamic cloud and container environments make it difficult to keep track of the application and infrastructure landscape. A small change deep down in the infrastructure may seem innocuous but can have a significant impact over time. IT teams must understand how their infrastructure and applications are interrelated and how changes can affect the business.

That’s why observability is still too siloed. Traditional observability tools do not integrate existing metrics, logs, and traces stored in other tools. And the landscape is too complex to bring everything into one place, especially for medium and large enterprises.

Redefining Observability

Three essential ingredients need to be added to observability to understand how applications relate to infrastructure: real-time dependencies, changes, and artificial intelligence (AI).

Step 1: Adding Real-Time Topology

For the metrics, logs, and traces data to be more comprehensible and actionable, a context must be placed around all the data ingested—cloud, automation, service registries, CMDB, virtualization, networking, and deployment tooling. A topological overview that automatically merges the variety of existing data sources in real-time provides the perfect context and improves the quality and accuracy of the data collected.

Step 1: Adding Real-Time Topology
Step 1: Adding Real-Time Topology

Step 2: Tracking All Real-Time Changes

Placing the siloed data into context via real-time topology provides an understanding of real-time dependencies. Now it is essential to observe every change that occurs in the fast-moving IT landscape.

  • Health changes: monitored IT components that change to a new health state, e.g., from healthy to critical.
  • Version changes: deployed upgrades of service versions, e.g., upgrading the payment service from 3.0 to 4.0.
  • Topological changes: New components that appear and disappear in the IT landscape and affect dependencies between existing running components.
  • Component property changes: changing labels and tags of components, for example.
Step 2: Tracking All Real-Time Changes
Step 2: Tracking All Real-Time Changes

Step 3: Adding Autonomous AI

A natural evolution of IT operations analytics (ITOA) is the application of AI and machine learning (ML) techniques. Continuous machine learning allows you to detect anomalous behavior across your environments proactively. Data that is automatically put in context is the best fuel for AI and ML techniques, as the model does not need computation power to cut down the noise; it just knows. Early warning signals get your operations teams out in front of upcoming issues, enabling them to prevent the problems from impacting your business. At the very least, the time gained cuts down remediation time.

Step 3: Adding Autonomous AI
Step 3: Adding Autonomous AI

Let’s go back to the same example case as mentioned before of the IT service manager at a major financial institution experiencing an outage. But now the 3-steps of real-time topology, real-time changes, and autonomous AI are added to the IT environment:

Example Case: An IT Service Manager at a major financial institution experiencing an outage uses the 3-step approach
Example Case: An IT Service Manager at a major financial institution experiencing an outage uses the 3-step approach

Compared to the first example this outage resulted in:

  • 15 minutes of downtime
  • $10k sales loss
  • -0.08% net promoter loss

Instead of…

  • 2 hours of downtime
  • $85k sales loss
  • -0.3% net promoter loss

Because the cause and the impact of the outage don’t have to be correlated manually with different team members anymore, precious time and resources are saved.

Redefined Observability: Driving Actionable Insights

There is great value in knowing what, when, and why something happened, and its impact on other IT environments. The addition of topology and real-time changes to traditional observability provides instant visibility of the cause and effect of any change or failure across the silos. This knowledge allows you to accelerate your mean-time-to-repair (MTTR), quickly solve costly incidents, and avoid expensive meetings. The 3-step approach outlined in this article enables:

  • Real-Time Observability: Detect and alert incidents across your dynamic environment with automatic checks and self-driving anomaly detection.
  • Automatic Root Cause Analysis: Speed up troubleshooting with automated discovery and dependency mapping to pinpoint the root cause across multiple teams.
  • Business Impact Analysis: Instantly see the impact of new deployments on dynamic IT landscapes and business and automatically act.
  • Unified Insights: Utilize current monitoring and IT investments by streaming your data into one scalable open observability platform.

Summary

The adoption of fast-moving IT landscapes has created new challenges for the market, and observability is the current response. However, limited to logs, metrics, and traces, traditional observability falls short. For one, it’s too siloed. DevOps teams across the organization must understand how their infrastructure and applications interrelate and how changes affect the business.

“StackState makes enterprise-scale observability possible without relying only on a single tool.” Lodewijk Bogaards, CTO StackState

A 3-step approach to improve observability adds real-time topology, real-time tracking over time, and artificial intelligence:

  1. Real-time topology places your logs, metrics, and traces from different teams in context.
  2. Real-time tracking over time shows how deployments and other changes affect the IT environment’s structure and health.
  3. AI detects anomalous behavior and can leverage the contextualized metrics, logs, traces, and changes over time.

Source: StackState: Next-Gen Monitoring and AIOps

Thomas Apel Published by Thomas Apel

, a dynamic and self-motivated information technology architect, with a thorough knowledge of all facets pertaining to system and network infrastructure design, implementation and administration. I enjoy the technical writing process and answering readers' comments included.