CIONET Trailblazer: The Role of AI in Resolving IT Disruptions

The Role of AI in Resolving IT Disruptions

As digital environments become increasingly interconnected, the nature of 'IT disruption' is fundamentally changing. We are witnessing a shift in how organisations maintain stability, moving away from traditional manual oversight towards more automated, data-led systems. Today, we are joined by Farlei Kothe, Stefanini North America, Europe and Asia Pacific to explore the complexities of this transition. Our goal is to examine the current pressures facing enterprise IT, the inherent challenges of integrating AI into operational workflows, and what a realistic pathway forward looks like for a resilient organisation in 2026.

In this month’s Trailblazer article, we speak with Farlei Kothe to explore the critical questions facing enterprise leaders as they navigate the growing complexity of AI-driven IT operations.

Farlei, As enterprise stacks become more 'mesh-like', at what point does it become physically impossible for a human team to monitor and intervene fast enough to prevent a business impact? Have we reached a 'complexity ceiling'?

In many ways, we already have.

Modern enterprise environments are no longer linear systems that can be monitored component by component. They are ecosystems, spanning on-premise infrastructure, multiple clouds, edge environments, APIs, legacy systems and SaaS platforms. The number of interactions between these components grows exponentially.

A human team simply cannot correlate thousands of signals per second across that landscape and detect patterns quickly enough to prevent impact.
This is what I would call the complexity ceiling. Not because humans are incapable, but because the speed and scale of modern digital systems exceed human reaction time.
The only viable response is to augment operations with systems that can observe the environment continuously, identify anomalies in real time, and trigger preventive actions before users feel the disruption.

That is why the future of infrastructure operations is increasingly AI-assisted and predictive, rather than purely reactive.

How has the definition of a 'critical incident' changed recently? Is it still primarily about systems being 'up or down', or are we now facing more subtle, 'grey' forms of environmental instability?

The definition has definitely evolved.

Ten years ago, a critical incident typically meant something very visible: a server down, an application unavailable, a network outage.

Today, the most dangerous incidents are often not binary failures. They are degradations.

Examples include:

Performance is slowly deteriorating across a microservices chain
Latency increases between cloud regions
Intermittent authentication failures
Memory leaks that only surface under peak loads

Individually, these signals might seem minor. But together they create systemic instability that eventually impacts the business.

So the modern challenge is not simply detecting outages. It is identifying patterns of instability early enough to prevent them from escalating into full incidents.

This shift is exactly why observability and AIOps have become so critical.

Beyond the obvious financial cost of downtime, what are the secondary impacts of constant IT disruptions on a company’s ability to innovate or retain its best technical talent?

The financial cost of outages is easy to quantify. The hidden cost is cultural.
When teams spend most of their time firefighting incidents, two things happen.
First, innovation slows down. Engineers who should be building new capabilities are instead focused on maintaining stability.

Second, talent frustration increases. Skilled professionals want to solve meaningful problems. If their daily work consists of reacting to alerts and repetitive operational tasks, they eventually disengage or move elsewhere.

Over time, this creates a vicious cycle: operational instability reduces innovation capacity, which in turn makes systems harder to modernise.
Breaking that cycle requires shifting operations from reactive maintenance to proactive resilience.
Automation and AI are essential in making that transition possible.

There is significant discussion regarding the move from 'Monitoring' to 'Observability'. How would you define the practical difference between these two philosophies in a complex, real-world IT environment?

Monitoring answers the question: Is something broken?

Observability answers the deeper question: Why is the system behaving the way it is?

Traditional monitoring relies on predefined metrics and thresholds. It works well for stable, predictable architectures.

But in modern distributed systems—where services are dynamic, ephemeral and interconnected—failures rarely follow predefined patterns.

Observability collects and correlates logs, metrics, traces and contextual signals across the entire environment. It allows teams and algorithms to explore unknown behaviours rather than only detect known failures.

In practice, observability enables two critical capabilities:

identifying hidden dependencies in complex environments
detecting anomalies before they escalate into incidents

It moves operations from reactive alert management to systemic understanding.

When we speak about AI 'resolving' disruptions, what does that look like in practice? To what degree is the ultimate goal total autonomy, versus a more 'augmented' human workforce?

In practice, AI resolution usually happens in layers.

The first layer is noise reduction. AI correlates thousands of alerts into a small number of meaningful incidents.

The second layer is recommendation. The system identifies probable root causes and suggests remediation actions.

The third layer is automation. For well-understood scenarios, such as restarting services, scaling resources or rerouting traffic, the system can trigger automated responses.
Full autonomy is not the primary goal today.

The real objective is augmented operations, where AI handles the high-speed data analysis and repetitive actions, while humans focus on strategy, architecture and complex decision-making.

Think of it less as replacing operators and more as expanding their operational reach.

It is often said that AI is only as effective as its data. What are the common characteristics of an IT environment that is actually 'AI-ready', and what are the warning signs of one that is not?

AI-ready environments typically share three characteristics.

First, high-quality telemetry. Systems generate consistent logs, metrics and traces across the infrastructure.

Second, integration across tools and platforms. Data flows across environments instead of being trapped in isolated monitoring systems.

Third, operational discipline. Incidents, changes and resolutions are documented in ways that AI can learn from historical patterns.

The warning signs are the opposite.

Fragmented tooling, incomplete data, inconsistent incident records, and a lack of standardised processes.

In those environments, the first step is not deploying AI; it is building the operational foundation that AI can learn from.

If an AI system identifies and fixes a disruption autonomously, how do we maintain meaningful oversight? How does an organisation balance the requirement for speed with the need for a clear audit trail?

Governance is essential.

Autonomous remediation must operate within clearly defined guardrails. Every automated action should be traceable, logged, and auditable.

In practice, this means:

predefined automation policies
approval thresholds for high-risk actions
full observability of AI decisions and actions

Modern AIOps platforms already support this approach by maintaining detailed records of every automated response.

Speed and accountability are not mutually exclusive. With the right architecture, automation can actually increase transparency rather than reduce it.

How must the role of the IT professional evolve as these systems take over more of the 'detect and repair' cycle? What happens to the traditional career path within IT operations?

The role of IT professionals is evolving from system operators to system orchestrators.

Instead of manually managing infrastructure components, engineers increasingly focus on:

Defining automation frameworks
Designing resilient architectures
Analysing system behaviour at a strategic level

This evolution actually expands the profession.

Operational knowledge remains valuable, but it is complemented by skills in data analysis, automation design and platform engineering.

The career path shifts from running systems to engineering how systems run themselves.

What are the risks of 'over-automation'? Are there specific types of disruptions where human intuition and 'tribal knowledge' remain fundamentally superior to any algorithmic response?

Automation works best in repeatable scenarios with clear patterns.

But there are situations where human experience remains critical, particularly when disruptions involve:

complex business logic
multi-system cascading failures
unusual edge cases that historical data has never captured

Human intuition is especially valuable in understanding context, something algorithms still struggle with.

The goal should never be blind automation. It should be intelligent automation with human oversight.

Looking towards 2028, what do you anticipate will be the most significant shift in how a global enterprise structures its IT support and resilience teams?

The most significant shift will be from operations teams to resilience engineering teams.

Instead of large groups focused on reactive support, organisations will operate smaller, highly skilled teams supported by automation and AI-driven platforms.

These teams will focus on:

system reliability engineering
predictive capacity planning
automated incident prevention

The role of infrastructure teams will move closer to the business, because reliability will become a strategic capability rather than a technical function.

Is the ultimate objective simply to fix disruptions faster, or is it to create an environment where 'disruption' becomes an obsolete concept? How close are we to that reality?

Fixing incidents faster is only the first step.

The real objective is preventing incidents from happening in the first place.
With predictive analytics, automated remediation, and self-healing architectures, we are moving toward environments where many disruptions are detected and resolved before users ever notice them.

We are not yet in a world without incidents, but we are clearly moving toward systems that are increasingly self-correcting.

That shift is already visible in highly mature digital platforms.

Finally, what will distinguish a 'resilient' company from a 'fragile' one by the end of this decade? Is technology the primary differentiator, or does it come down to organisational culture?

Technology will be an important factor, but culture will be the real differentiator.

Resilient companies adopt three cultural principles:

They treat infrastructure as a strategic asset, not just a cost centre
They invest in automation and continuous improvement
They encourage collaboration between operations, development and business teams

Fragile organisations, by contrast, treat IT as something that simply needs to “keep the lights on.”

By the end of the decade, resilience will be defined not just by technology choices but by how organisations think about reliability, innovation, and change.

CIONET

CIONET Trailblazer: The Role of AI in Resolving IT Disruptions

You May Also Like

CIONET Trailblazer: Managed Secured Networks in Critical Infrastructure

An Inconvenient Truth

CIONET Awards 2026 - The Gala

Subscribe by Email