Mean time between failures

“Normal people should not know names like ‘US-East-1’,” someone said, more or less, on Mastodon this week. “US-East-1” is the Amazon Web Services region that went down last month to widely disruptive effect. What this poster was getting at, while contemplating this week’s Cloudflare outage, is that the recent series of Internet outages has turned network nodes previously known only to technical specialists into household names.

For the history-minded, there was a moment like this in 1988, when a badly-written worm put the Internet on newspapers’ front pages for the first time. The Internet was then so little known that every story had to explain what it was – primarily, then, a network connecting government, academic, and corporate scientific research institutions. Now, stories are explaining network architecture. I guess that’s progress?

Much less detailed knowledge was needed to understand what happened on Tuesday, when Cloudflare went down, taking with it access to Spotify, Uber, Grindr, Ikea, Microsoft Copilot, Politico, and even, in London, its own VPN service (says Wikipedia). Cloudflare offers content delivery and protection against distributed denial of service attacks, and as such it interposes itself into all sorts of Internet interactions. I often see it demanding action to prove I’m not a robot; in that mode it’s hard to miss. That said, many sites really do need the protection it offers against large-scale attacks. Attacks at scale require defense at scale.

Ironically, one of the sites lost in the Cloudflare outage was DownDetector, which tells you whether the site you can’t reach is down for everyone or just you – one of several such debugging tools for figuring out who needs to fix what.

So, Cloudflare was Tuesday. Amazon’s outage was just about a month ago, on October 20. Microsoft Azure’s, another DNS error, came just a week after that. All three had effects across large parts of the network.

Is this a trend or just a random coincidental cluster in a sea of possibilities?

One thing that’s dispiriting about these outages is that so often the causes are traceable to issues that have been well-understood for years. With Amazon it was a DNS error. Microsoft also had a DNS issue “following an inadvertent configuration change”. Cloudflare’s issue may have been less predictable; The Verge reports its problem was a software crash caused by a “feature file” used by its bot management system abruptly doubling in size, taking it above the size the software was designed to handle.
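Cloudflare hasn’t published its code, but The Verge’s description suggests a familiar class of bug: software that assumes an input file stays under a fixed size, with no graceful path for when it doesn’t. A minimal sketch of the defensive alternative, in Python – the limit, file name, and function are invented for illustration, not Cloudflare’s actual design:

```python
import os

# Hypothetical limit, standing in for whatever size the software assumed.
MAX_FEATURE_FILE_BYTES = 1_000_000

def load_feature_file(path, limit=MAX_FEATURE_FILE_BYTES):
    """Load a feature/config file, refusing oversized input up front
    instead of crashing on it later."""
    size = os.path.getsize(path)
    if size > limit:
        # Fail loudly but safely: the caller can keep using the previous
        # good file and raise an alert, rather than taking the process down.
        raise ValueError(
            f"feature file {path!r} is {size} bytes, over the {limit}-byte limit"
        )
    with open(path, "rb") as f:
        return f.read()
```

The point isn’t the specific limit but that the failure mode is chosen in advance: an oversized file becomes an alert, not a crash.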

Also at The Verge, Emma Roth thinks it’s enough of a trend that website owners need to start thinking about backup – that is, failover – plans. Correctly, she says the widespread impact of these outages shows how concentrated infrastructure service provision has become. She cites Signal CEO Meredith Whittaker: the encrypted messaging service can’t find an alternative to using one of the three or four major cloud providers.

At Krebs on Security, Brian Krebs warns that sites that managed to pivot their domains away from Cloudflare to stay available during the outage need to examine their logs for signs of the attacks Cloudflare normally protects them from, and put effort into fixing the common vulnerabilities they find. And then also: consider spreading the load so there isn’t a single point of failure. As I understand it, Netflix did this after the 2017 AWS outage.

For any single one of these giant providers, significant outages are not common. This was, Jon Brodkin says at Ars Technica, Cloudflare’s worst outage since 2019. That one was due to a badly written firewall rule. But increasing size also brings increasing complexity, and, as these outages have shown, even the largest network can be disrupted at scale by a very small mistake.

Elsewhere, a software engineer friend and I have been talking about “mean time between failures”, a measure normally applied to hard drives, servers, and other components. There it’s much more easily measured: run a load of drives, note when each fails, take an average. With the Internet, so much depends on your individual setup. But beyond that: what counts as failure? My friend suggested setting thresholds based on impact: number of people affected, length of time, extent of cascading failures. Being able to quantify outages might help give a better sense of whether this is a trend or a random cluster. The bottom line, though, is clear already: increasing centralization means that when outages occur they are farther-reaching and disruptive in unpredictable ways. That trend can only continue, even if the outages themselves become rarer.
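My friend’s threshold idea is easy to sketch: count only outages above some impact level, then average the gaps between them. The dates, impact numbers, and threshold below are invented for illustration; nothing here is a standard definition of “failure”:

```python
from datetime import datetime

# Hypothetical outage log: (start date, rough number of people affected).
outages = [
    (datetime(2025, 7, 19), 8_500_000),
    (datetime(2025, 10, 20), 2_000_000),
    (datetime(2025, 10, 29), 500_000),   # below threshold: won't count
    (datetime(2025, 11, 18), 3_000_000),
]

IMPACT_THRESHOLD = 1_000_000  # only count outages affecting > 1M people

def mean_time_between_failures(events, threshold):
    """Average gap, in days, between successive outages over the threshold."""
    times = sorted(t for t, impact in events if impact > threshold)
    if len(times) < 2:
        return None  # can't compute an interval from fewer than two events
    gaps = [(b - a).days for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

print(mean_time_between_failures(outages, IMPACT_THRESHOLD))  # 61.0 days
```

Raise the threshold and the smaller incidents vanish from the statistic entirely – which is exactly the judgment call about what “counts” that makes the measure hard to apply to the Internet.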

Most of us have no control over the infrastructure decisions sites and services make, or even any real way to know what they are. We can counter this to some extent by diversifying our own dependencies.

In the first decade or two of the Internet, we could always revert to older ways of doing things. Increasingly, this is impossible because either those older methods have been turned off or because technology has taken us places where the old ways didn’t go. We need to focus a lot more on making the new systems robust, or face a future as hostages.

Illustrations: Traffic jam in New York’s Herald Square, 1973 (via Wikimedia).

Also this week
– At the Plutopia podcast, we interview Jennifer Granick, the Surveillance and Cybersecurity counsel at the ACLU, about the expansion of government and corporate surveillance and the increasing threat to civil liberties.
– At Skeptical Inquirer, I interview juggling mathematician Colin Wright about spreading enthusiasm for mathematics.

Wendy M. Grossman is an award-winning journalist. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon or Bluesky.

It’s always DNS…

Years ago, someone in tech support at Telewest, then the cable supplier for southwest London, told me that if my broadband went out I should hope its television service went down too: the volume of complaints would get it fixed much faster. You could see this in action some years later, in 2017, when Amazon Web Services went down, taking with it Netflix. Until that moment few had realized that Netflix built its streaming service on Amazon’s cloud computing platform to take advantage of its flexibility in up- and down-sizing infrastructure. The source – an engineer’s typing error – was quickly traced and fixed, and later I was told the incident led Netflix to diversify its suppliers. You would think!

Even so, Netflix was one of the companies affected on Monday, when a DNS error took out a chunk of AWS, and people from gamers on Roblox to governments with mission-critical dependencies were affected. On the list of the affected are both the expected (Alexa and Ring) and the unexpected (Apple TV, Snapchat, Hulu, Google, Fortnite, Lyft, T-Mobile, Verizon, Venmo, Zoom, and the New York Times). To that add the UK government. At the Guardian, Simon Goodley says the UK government has awarded AWS £1.7 billion in contracts across 35 public sector authorities, despite warnings from the Treasury, the Financial Conduct Authority, and the Prudential Regulation Authority. Among the AWS-dependent: the Home Office, the Department of Work and Pensions, HM Revenue and Customs, and the Cabinet Office.

First, to explain the mistake – one so common that experts say “It’s always DNS” and so old that early Internet pioneers say “We shouldn’t be having DNS errors any more”. The Domain Name System, conceived in 1983 by Paul Mockapetris, is a core piece of how the Internet routes traffic. When you type or click on a domain name such as “pelicancrossing.net”, behind the scenes a computer translates that name into the series of dotted numbers – the IP address – that identifies the request’s destination. An error in those numbers, no matter how small, means the message – data, search request, email, whatever – can’t reach its destination, just as you can’t reach the recipient you want if you get a telephone number wrong. The upshot is that DNS errors snarl traffic. In the AWS case, the error affected just one of its 30 regions, which is why Monday’s outages were patchy.
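That translation step is easy to watch from any programming language. Here’s a minimal Python sketch using the standard library’s resolver – the domain names are just examples, and the addresses returned will vary by network:

```python
import socket

def resolve(name):
    """Ask the system's DNS resolver for the IPv4 addresses behind a name."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
        # Each result's last element is (address, port); keep unique addresses.
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as err:
        # This is what a DNS failure looks like to an application: the name
        # simply cannot be turned into a destination.
        return f"DNS lookup failed: {err}"

print(resolve("pelicancrossing.net"))    # a list of dotted-number addresses
print(resolve("no-such-host.invalid"))   # a resolution failure
```

When a provider’s DNS records go wrong, every application making a call like this gets the second outcome at once – which is why the damage spreads so fast.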

As Dan Milmo and Graeme Wearden write at the Guardian, the outage has focused many minds on the need to diversify cloud computing. Amazon (30%), Microsoft Azure (20%), and Google (13%) jointly control 63% of the market worldwide. There have been many such warnings.

At The Register, Carly Page reports on the individual level: smart homes turned dumb. Eight Sleep beds got stuck in an upright position and lost their temperature controls. App-controlled litter boxes stopped communicating. “Smart” light bulbs stayed dark. The Internet of Other People’s Things at its finest.

Also at The Register, Corey Quinn suggests the DNS error was ultimately attributable to an ongoing exodus of senior AWS engineers who took with them essential institutional memory. Once you’ve reached a certain level of scale, Quinn writes, every problem is complex and being able to remember that a similar issue on a previous occasion was traced improbably to a different system in a corner somewhere can be crucial. As departures continue, Quinn believes failures like these will become more common.

If that global picture is dispiriting, consider also the question of dependence within organizations: if your country relies on a single company’s infrastructure to power mission-critical systems, diversity in the rest of the world won’t help you when that single company goes down. In the UK, Sam Trendall reports at Public Technology, the government activated incident-response mechanisms. Notable among the failures, as prime minister Keir Starmer pushes for a mandatory digital ID: the government’s new One Login, as well as some UK banks. This outage provides evidence for the digital sovereignty many have been advocating.

I admit to mixed feelings. I agree with the many who believe the public sector should embrace digital sovereignty…but I also know that the UK government has a terrible record of failed IT projects, no matter who builds them. In 2010, fixing that was part of the motivation for setting up the Government Digital Service, as first GDS leader Mike Bracken writes at Public Digital. Yet the failures keep coming; see also the Post Office Horizon scandal. Bracken believes the solution is to invest in public sector capacity and digital expertise in order to end this litany of expensive failures.

At TechRadar, Benedict Collins rounds up further expert commentary, largely in agreement about the lessons we should learn. But will we? We should have learned in 2017.

Still, it would be a mistake to focus solely on Amazon. It is just one of many centralized points of failure. The Internet Archive is dangerously important as a unique resource for archived web pages. And the UK is not the only government running at high risk. Consider South Korea, where a few weeks ago a data center fire may have consumed 85TB of government data – with no backups. It seems we never really learn.



Crowdstricken

This time two weeks ago the media were filled with images from airports clogged with travelers unable to depart because of…a software failure. Not a cyberattack, and not, as in 2017, a failure limited to a single airline’s IT systems.

The outage wasn’t just in airports: NHS hospitals couldn’t book appointments, the London Stock Exchange news service and UK TV channel Sky News stopped functioning, and much more. It was the biggest computer system outage not caused by an attack to date, a watershed moment like 1988’s Internet worm.

Experienced technology observers quickly predicted: “bungled software update”. There are prior examples aplenty. In February, an AT&T outage lasted more than 12 hours, spanned all 50 US states, Puerto Rico, and the US Virgin Islands, and blocked an estimated 25,000 attempted calls to the 911 emergency service. Last week, the Federal Communications Commission attributed the outage to an employee’s addition of a “misconfigured network element” to expand capacity without following the established procedure of peer review. The resulting cascade of failures was an automated response designed to prevent a misconfigured device from propagating. AT&T has put new preventative controls in place, and FCC chair Jessica Rosenworcel said the agency is considering how to increase accountability for failing to follow best practice.

Much of this history is recorded in Peter G. Neumann’s ongoing RISKS Forum mailing list. In 2014, an update Apple issued to fix a flaw in a health app blocked users of its then-new iPhone 6 from connecting. In 2004, a failed modem upgrade knocked Cox Communications subscribers offline. My first direct experience was in the 1990s, when for a day CompuServe UK subscribers had to dial Germany to pick up our email.

In these previous cases, though, the individuals affected had a direct relationship with the company that screwed up. What’s exceptional about Crowdstrike is that its directly affected “users” were its 29,000 huge customer businesses. It was those companies’ resulting failures that turned millions of us into hostages to technological misfortune.

What’s more, in those earlier outages only one company and its direct customers were involved, and understanding the problem was relatively simple. In the case of Crowdstrike, the source of the problem was hard to pinpoint at first because the direct effects were scattered (only Windows PCs awake to receive Crowdstrike updates) and the indirect effects were widespread.

The technical explanation of what happened, simplified, goes like this: Crowdstrike issued an update to its Falcon security software to block malware it spotted exploiting a vulnerability in Windows. The updated Falcon software sparked system crashes as PCs reacted to protect themselves against potential low-level damage (like a circuit breaker in your house tripping to protect your wiring from overload). Crowdstrike realized the error and pushed out a corrected update 79 minutes later. That fixed machines that hadn’t yet installed the faulty update. Machines that had updated during those 79 minutes, however, were stuck in a doom loop, crashing every time they restarted. Hence the need for manual intervention to remove the faulty update files before those machines could reboot successfully.

Microsoft initially estimated that 8.5 million PCs were affected – but that’s probably a wild underestimate as the only machines it could count were those that had crash reporting turned on.

The root cause is still unclear. Crowdstrike has said it found a hole in its Content Validator Tool, which should have caught the flaw. Microsoft is complaining that a 2009 interoperability agreement forced on it by the EU required it to allow Crowdstrike’s software to operate at the very low level on Windows machines that pushed the systems to crash. It’s wrong, however, to blame companies for enabling automated updates; security protection has to respond to new threats in real time.

The first financial estimates are emerging. Delta Air Lines estimates the outage, which borked its crew tracking system for a week, cost it $500 million. CEO Ed Bastian told CNN, “They haven’t offered us anything.” Delta has hired lawyer David Boies, whose high-profile history includes leading the successful 1990s US government prosecution of Microsoft, to file its lawsuit.

Delta will need to take a number. Massachusetts-based Plymouth County Retirement Association has already filed a class action suit on behalf of Crowdstrike shareholders in Texas federal court, where Crowdstrike is headquartered, for misrepresenting its software and its capabilities. Crowdstrike says the case lacks merit.

Lawsuits are likely the only way companies will get recompense unless they have insurance to cover supplier-caused system failures. Like all software manufacturers, Crowdstrike has disclaimed all liability in its terms of use.

In a social media post, Federal Trade Commission chair Lina Khan said, “These incidents reveal how concentration can create fragile systems.”

Well, yes. Technology experts have long warned of the dangers of monocultures that make our world more brittle. The thing is, we’re stuck with them because of scale. There were good reasons why the dozens of early network and operating systems consolidated: it’s simpler and cheaper for hiring, maintenance, and even security. Making our world less brittle will require holding companies – especially those that become significant points of failure – to meet higher standards of professionalism, including product liability for software, and requiring their customers to boost their resilience.

As for Crowdstrike, it is doomed to become that worst of all things for a company: a case study at business schools everywhere.

Illustrations: XKCD’s Dependency comic, altered by Mary Branscombe to reflect Crowdstrike’s reality.

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon.