CIONET Trailblazer: Building for Zero Downtime: Why ‘Always On’ Starts with a Mindset Shift

How can CIOs ensure that their critical systems never go dark, even amid failures, updates, or entire cloud region outages? For Haytham Elkhoja, Principal Architect and Principal SRE at Kyndryl, the answer lies well beyond lift-and-shift strategies or generic cloud SLAs. In this CIONET Trailblazer interview, Haytham shares insights from his report “Cloud Adoption for Mission-Critical Workloads – Principles for Always On Applications,” offering a new blueprint for CIOs navigating the high-stakes transition of core systems to the cloud.

From multi-active architectures and chaos engineering to experience-level governance, Haytham outlines what it truly takes to architect Always On services in today’s disruption-prone digital economy. This interview is essential reading for IT leaders seeking to rethink reliability not as a technical afterthought but as a cross-enterprise priority.

What is the main business challenge your report seeks to address?

The report, “Cloud Adoption for Mission-Critical Workloads – Principles for Always On Applications,” tackles the challenges organisations face in reliably migrating mission-critical workloads to the cloud and modernising them there, where traditional approaches often fall short.

The core issue is that 80% of enterprise workloads remain on-premises, and organisations often struggle to migrate mission-critical applications to the cloud because of reliability concerns.

Despite the cloud's compelling value proposition, only 32% of enterprises entrust certain critical workloads to public clouds, according to the Uptime Institute.

The fundamental issue lies in reconciling the cloud's innovation potential with the uncompromising reliability requirements of mission-critical services that cannot afford downtime. Traditional lift-and-shift approaches fail because they assume that on-premise and cloud architectures are similar, leading to unsuccessful migrations. This reinforces scepticism about cloud reliability for mission-critical operations.

How do you define ‘Always On’ and why does it matter today?

In today’s hyper-connected, service-driven economy, expectations for uninterrupted digital access are non-negotiable; any failure can mean lost revenue, regulatory penalties, or lasting damage to brand trust.

‘Always On’ is the ability of digital services to remain continuously available, autonomously overcoming both planned and unplanned disruptions with zero perceived downtime for end users.

Unlike traditional resiliency approaches that focus on recovery time, Always On prioritises preventing service interruption altogether, embracing failures and proactively designing for them at every layer of the stack. This ensures the system can withstand all types of disruptions, catastrophic events, and application releases in a non-disruptive, transparent manner.

This is especially crucial today, as customers no longer tolerate outages, whether planned or unplanned. According to ISG, over 60% of outages now cost more than $100,000 (up from 39% in 2019), and 15% exceed $1 million. As a result, organisations are increasingly recognising that they cannot afford prolonged recovery times. The stakes are particularly high for industries like healthcare, finance, and transportation, where downtime can impact human safety and regulatory compliance.

What are the main reasons cloud adoption often fails for critical systems?

Cloud adoption for core workloads fails when organisations simply “lift and shift” legacy architectures, assuming cloud infrastructure alone ensures reliability. There is a common industry misconception that merely moving applications to a "resilient cloud" guarantees reliability.

Unfortunately, this overlooks the nuances of what cloud platforms' SLAs actually give customers.

Why are cloud provider SLAs insufficient for mission-critical workloads?

Cloud SLAs typically guarantee the uptime of infrastructure or managed services, but not the actual end-to-end business service as experienced by the customer.

Mission-critical applications require a holistic approach to reliability, addressing all layers of the stack and real user transactions, or what I call the “transactional critical path.” SLAs typically do not cover this, leaving organisations exposed.

The key distinction is that an Always On strategy focuses on end-to-end business service availability rather than underlying infrastructure metrics. For example, an outage of a major airline's booking system not only affects sales but also triggers negative publicity. Backend system failures can ground planes, require passenger compensation, and invite regulatory scrutiny – impacts far beyond what an infrastructure SLA will cover.
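To illustrate why per-service SLAs say little about the end-to-end service, here is a minimal sketch of how availability compounds along a serial transaction path. The component names and SLA figures are hypothetical, not taken from the report or any provider, and the calculation assumes failures are independent, which real systems only approximate.

```python
from math import prod


def end_to_end_availability(component_availabilities):
    """Availability of a serial path: the transaction succeeds only if
    every component on the path is up (assumes independent failures)."""
    return prod(component_availabilities)


# Hypothetical SLA figures for the components on one transaction path.
path = {
    "load_balancer": 0.9999,
    "compute": 0.9995,
    "database": 0.9999,
    "message_queue": 0.999,
}

e2e = end_to_end_availability(path.values())
print(f"Weakest single component: {min(path.values()):.4%}")
print(f"End-to-end availability:  {e2e:.4%}")
```

Under these assumed figures, the end-to-end result is lower than the weakest individual SLA, which is the gap between infrastructure uptime and what the customer experiences.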

How feasible is it to achieve 99.999% availability for core business services?

Achieving that level of reliability means working across four key domains: Multi-Active Cell-based Architectures across geographically distributed autonomous location scopes, Application Reliability Patterns, Site Reliability Engineering practices, and Chaos Engineering.

The report explains each of these topics, why it matters, and the role it plays in achieving reliable, zero-downtime architectures.
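For a sense of scale, the arithmetic behind 99.999% can be sketched as follows. This is a simplified model, not the report's method: it assumes cells fail independently, that any single surviving cell can carry the full load, and it ignores routing and shared dependencies. The 99.9% per-cell figure is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability):
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def multi_active_availability(cell_availability, cells):
    """Availability when the service is up as long as at least one of
    `cells` independent, identical cells is up."""
    return 1 - (1 - cell_availability) ** cells


# Five nines leaves roughly 5.26 minutes of downtime per year.
print(f"{allowed_downtime_minutes(0.99999):.2f} minutes/year")

# Hypothetical cells at 99.9% each, run in a multi-active setup.
for n in (1, 2, 3):
    print(n, f"{multi_active_availability(0.999, n):.7%}")
```

The point of the sketch is that a five-nines target is hard to reach with a single location, while redundancy across autonomous locations raises the theoretical ceiling, provided the independence assumption holds in practice.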

What is the most important architectural shift for supporting Always On systems?

The shift from recovery to reliability is what I focus on in this report.

Other important architectural shifts revolve around a “designing for failure” mindset that assumes every single component is inherently fragile. Architects and developers must understand how things break because, eventually, everything fails.

How should governance evolve to support mission-critical cloud adoption?

Governance must evolve toward a user-centric model, moving beyond infrastructure-focused approaches. This means establishing a robust Site Reliability Engineering (SRE) framework that brings together business stakeholders, architects, development, and operations teams around shared End-to-End Service Level Objectives (SLOs) or what some refer to as Experience Level Objectives (XLOs).

To make this shift effective, organisations must align funding and roles across the enterprise. Reliability becomes a shared responsibility, not just for IT but for business units too, ensuring joint accountability and a culture of continuous improvement and proactive risk management.
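One common SRE way to make a shared SLO actionable is an error budget: the number of failed user transactions the objective tolerates in a given window. The sketch below is an illustration of that general practice, not something prescribed in the report; the class name, fields, and figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    target: float         # e.g. 0.999 means 99.9% of transactions must succeed
    total_requests: int   # end-to-end user transactions in the window
    failed_requests: int  # transactions that failed from the user's view

    @property
    def error_budget(self) -> float:
        """Number of failed transactions the SLO tolerates in this window."""
        return (1 - self.target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        if self.error_budget == 0:
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1 - self.failed_requests / self.error_budget


# Hypothetical month: one million transactions, 250 of them failed.
window = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Budget: {window.error_budget:.0f} failures")
print(f"Remaining: {window.budget_remaining:.0%}")
```

Because the budget is expressed in user transactions rather than server uptime, business and IT stakeholders can read the same number and decide together whether to spend it on releases or on reliability work.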

What role does chaos engineering play in ensuring system reliability?

Chaos Engineering institutionalises resilience by regularly testing systems with real-world fault injection, such as failures, latency spikes, or outages, to uncover hidden weaknesses before they cause incidents.

It provides the crucial “evidence, then confidence” that Always On services are in fact “always on” and reliable. This is done by proactively testing reliability through controlled failure injection, from node failures and network partitions to full cloud region outages.

Despite the name, Chaos Engineering is anything but chaotic. It follows a rigorous methodology: understanding end-to-end systems, formulating hypotheses, designing contained experiments, and measuring results against expectations to identify and remediate weaknesses before they impact production.

Think of it as the new, continuous form of disaster recovery testing: not something done once every six months, but integrated into every application release.
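The methodology described above (hypothesis, contained experiment, measurement against expectations) can be sketched with a toy in-memory simulation. Everything here is hypothetical: the `Cell` and `Router` classes, the region names, and the 99.9% success threshold are illustrative stand-ins, not a real chaos tooling API.

```python
import random


class Cell:
    """A toy stand-in for one autonomous deployment of the service."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Router:
    """Tries cells in random order and fails over when one errors."""

    def __init__(self, cells, rng):
        self.cells = cells
        self.rng = rng

    def handle(self, request):
        for cell in self.rng.sample(self.cells, len(self.cells)):
            try:
                return cell.handle(request)
            except ConnectionError:
                continue  # fail over to the next cell
        raise ConnectionError("all cells down")


def run_experiment(router, inject_fault, requests=1000, hypothesis=0.999):
    """Inject a fault, send traffic, and compare the measured success
    rate with the hypothesised minimum."""
    inject_fault()
    ok = 0
    for i in range(requests):
        try:
            router.handle(i)
            ok += 1
        except ConnectionError:
            pass
    rate = ok / requests
    return {"success_rate": rate, "hypothesis_held": rate >= hypothesis}


cells = [Cell("region-a"), Cell("region-b")]
router = Router(cells, random.Random(42))

# Hypothesis: losing one region does not affect users.
result = run_experiment(router, lambda: setattr(cells[0], "healthy", False))
print(result)
```

In a release pipeline, a failed hypothesis would block the rollout, which is one way to read "integrated into every application release"; a real experiment would also limit its blast radius and restore the injected fault afterwards.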

What mindset should CIOs adopt to lead a successful cloud transformation for critical workloads?

CIOs must champion an Always On mindset: prioritising reliability as a shared, organisation-wide goal that shapes architectural, cultural, and investment decisions.

This requires fostering cross-functional collaboration, breaking down silos, demanding accountability for user experience, and embedding resilience into both technical design and business operations, rather than treating it as an afterthought.

As organisations push deeper into digital transformation, the difference between success and failure will increasingly hinge on the ability to deliver uninterrupted, seamless services, especially when it matters most. Haytham Elkhoja challenges CIOs to go beyond traditional resiliency and embrace an Always On mindset: one that designs for failure, aligns business and IT around shared reliability goals, and treats resilience as a product, not a project.

CIONET
