Delta's outage: agile at the surface but sclerotic at the core

Many IT departments have addressed the urgent technology demands from the business by leaving the management of their legacy IT intact and by adding a second agile mode where experimentation and rapid iterations are possible. Web front ends and mobile apps are often developed in this agile mode of operation.

Nevertheless, leaving the old sclerotic IT core unchanged is like putting lipstick on a pig. High costs, increasing operational risks and a continuous drag on fundamental innovation are often the consequence of this deceptively straightforward approach. Organisations need to put serious effort into renovating their IT core if they wish to compete with companies that do not face such constraints. The Delta Air Lines data centre disruption this summer provides some vivid insights.

Bimodal is an approach to managing IT that Gartner introduced in 2014. It was developed for organisations with legacy IT that need to meet urgent (technology) demands and increase enterprise agility. Gartner defines bimodal* as the practice of managing two separate but coherent styles of work: one focused on situations of greater predictability, the other where exploration is required.

Over the last few years, many organisations operating extensive legacy IT systems have interpreted and implemented bimodal IT practices in a deceptively straightforward manner:

  • Leave legacy IT as little changed as possible (often with the notable exception of stripping costs from it as much as possible). This is mode 1.
  • Experiment with new developments (typically with the front end systems such as websites and mobile apps). This is mode 2.

Many of these organisations also believe they have achieved enterprise agility simply by adding a second, agile mode. This risky belief recently prompted Gartner to publish a paper intended to bust the most widespread “myths” that have emerged and clouded the key messages. In “Busting Bimodal Myths”, Gartner rightly emphasises that “Mode 1 is neither ‘business as usual’ nor an excuse to avoid tackling the troublesome issue of renovating the IT core” and that “Both modes will play a crucial role in innovation and the digital transformation.”

* Other advisory firms discuss the same concept but name it differently; for example McKinsey calls it "two speed IT".

We experienced the negative effects of such a misguided approach to bimodal IT this past summer while travelling in the USA. On the 8th of August, Delta Air Lines was forced to ground its flights worldwide after a large-scale data centre disruption. The disruption not only affected thousands of flights on Monday and Tuesday but also had a knock-on effect on the airline's flights for the rest of the week.

As a consequence of the delays we missed our connecting flight and were forced to change our schedule. Thousands of travellers faced similar issues, leading to congested help desks and queues at the airports. In our case, the original return flight had been booked via KLM with “economy plus” seats. Delta, KLM's US partner, could not see in its systems that KLM had booked such seats. We asked whether we could upgrade our seats on the spot and pay for the upgrade, but we were told that Delta’s IT systems did not allow such upgrades once tickets had been purchased. We were also assured that KLM would later refund the difference. Three months later, this has not happened yet. Such refunds are dealt with by Air France, the third partner, and apparently aligning administrative systems to reconcile records between these collaborating companies is anything but straightforward.

There are two interesting issues in this story:

  1. How could such a major disruption take place?

A few days after the Delta disruption, articles appeared in the press providing background information. It appears that Delta Air Lines had chosen to run all its systems in a single data centre. It is beyond the scope of this post to provide details about the sequence of events (read here for an interesting reconstruction), but we wish to highlight one point from this article:

“Delta Airlines computer systems responsible for online check-in, kiosks, flight dispatching, crew scheduling, airport-departure information displays, ticket sales, frequent-flier programs and flight info displays are all located in a single datacenter located in Atlanta, Georgia. Most likely for cost reasons Delta Airlines decided not to operate a twin data center concept. Atlanta in the past has not been hit by any serious earthquakes nor floodings. In the past Atlanta area had hurricanes and tornados (like in 2008) but not at a scale which can damage a datacenter. So probably the financial responsible management of Delta believed a single datacenter was the best option.”

Using a single data centre must have saved Delta money, but it was a risky bet to concentrate such vital systems on one site (even if on-site backup and recovery capabilities were in place). Delta now estimates that the costs of this outage amount to $150 million.

  2. How come the IT systems sabotage the user experience?

For anyone flying regularly, it is obvious that airlines are very actively engaged in developing modern IT tools. Frequent flyer loyalty schemes, ticketless travel and extended partnerships with other airlines are all supported by a variety of websites and mobile apps. All these developments are being made in mode 2: continuous experimentation, quick iterations and visible added value to customers. Despite all these efforts, the key processes and business rules are still baked into a patchwork of old legacy applications that have not changed for decades. The result is a degraded user experience at the moments that matter most.

As USA Today points out: 

“Where once software was used primarily to book flights and issue tickets, today it's a matrix of overlapping, often disjointed systems that interact with mobile apps, track loyalty awards and help the airline industry bring in billions of dollars through the sale of perks like extra leg room. That growing complexity makes for hiccups, and they are difficult to avoid. Some of the systems, such as Delta's, are built on top of systems that are decades old.  For instance, Delta’s reservation and passenger service system is multilayered, built on a 52-year-old program called Deltamatic.”

In other words, under the Mode 2 veneer of the latest mobile app grind the cogs of an old, inflexible and brittle machine. This old machinery is ultimately responsible for high (and inelastic) operational costs, fragile operations, rigid and broken processes and, in the end, disappointed customers.

IT under Mode 2 helped companies like Delta Airlines to apply a fresh coat of paint over a sclerotic back-end IT infrastructure

The advice to develop and operate in a bimodal (or two-speed) manner recently met significant pushback from Forrester and BCG (competitors to Gartner and McKinsey respectively). They argue strongly that organisations should go mode 2 only: bimodal is too complex to implement correctly and it is the wrong compromise in times when speed and agility are everything. Interestingly enough, when Forrester published its critique in April, it used Delta Air Lines' investments in its back-end systems to illustrate the point.

Should organisations run everything in mode 2, as Forrester and BCG suggest? In our view, the concept of bimodal itself is not the problem; the way it is being interpreted and implemented is. Even the most modern IT architecture, based on loosely coupled components and microservices, contains components where change needs to happen in a predictable and more conservative manner, while other components are best handled in the second mode, which allows for exploration and experimentation.

Changing the way you have been operating for many years is difficult but necessary. It is tempting to limit change, add a second mode and use it only to develop new systems. It is worth noting that some companies admired for their agility, like Amazon and Netflix, have successfully renovated their monolithic cores in the past.

We close this note by quoting Gartner:

“Mode 1 is neither “business as usual” nor an excuse to avoid tackling the troublesome issue of renovating the IT core. For the digital strategy to succeed, renovating the IT core and opening it up with a service-oriented architecture (SOA) and application programmable interface (API) strategy is a very real need, and it is a central part of the bimodal approach. If Mode 1 doesn’t renovate the IT core then any Mode 2 capability will prove to be a niche capability. Therefore, renovation activities conducted largely in a Mode 1 style are as essential to the digital transformation as those activities undertaken in a Mode 2 style.”