Crowdstricken

This time two weeks ago the media were filled with images from airports clogged with travelers unable to depart because of…a software failure. Not a cyberattack, and not, as in 2017, limited to a single airline’s IT systems failure.

The outage wasn’t just in airports: NHS hospitals couldn’t book appointments, the London Stock Exchange news service and UK TV channel Sky News stopped functioning, and much more. It was the biggest computer system outage not caused by an attack to date, a watershed moment like 1988’s Internet worm.

Experienced technology observers quickly predicted: “bungled software update”. There are prior examples aplenty. In February, an AT&T outage lasted more than 12 hours, spanned 50 US states, Puerto Rico, and the US Virgin Islands, and blocked an estimated 25,000 attempted calls to the 911 emergency service. Last week, the Federal Communications Commission attributed the cause to an employee’s addition of a “misconfigured network element” to expand capacity without following the established procedure of peer review. The resulting cascade of failures was an automated response designed to prevent a misconfigured device from propagating. AT&T has put new preventative controls in place, and FCC chair Jessica Rosenworcel said the agency is considering how to increase accountabiliy for failing to follow best practice.

Much of this history is recorded in Peter G. Neumann’s ongoing RISKS Forum mailing list. In 2014, an update Apple issued to fix a flaw in a health app blocked users of its then-new iPhone 6 from connecting. In 2004, a failed modem upgrade knocked Cox Communications subscribers offline. My first direct experience was in the 1990s, when for a day CompuServe UK subsccribers had to dial Germany to pick up our email.

In these previous cases, though, the individuals affected had a direct relationship with the screw-up company. What’s exceptional about Crowdstrike is that the directly affected “users” were its 29,000 huge customer businesses. It was those companies’ resulting failures that turned millions of us into hostages to technological misfortune.

What’s more, in those earlier outages only one company and their direct customers were involved, and understanding the problem was relatively simple. In the case of Crowdstrike, it was hard to pinpoint the source of the problem at first because the direct effects were scattered (only Windows PCs awake to receive Crowdstrike updates) and the indirect effects were widespread.

The technical explanation of what happened, simplified, goes like this: Crowdstrike issued an update to its Falcon security software to block malware it spotted exploiting a vulnerability in Windows. The updated Falcon software sparked system crashes as PCs reacted to protect themselves against potential low-level damage (like a circuit breaker in your house tripping to protect your wiring from overload). Crowdstrike realized the error and pushed out a corrected update 79 minutes later. That fixed machines that hadn’t yet installed the faulty update. The machines that had updated in those 79 minutes, however, were stuck in a doom loop, crashing every time they restarted. Hence the need for manual intervention to remove those files in order to reboot successfully.

Microsoft initially estimated that 8.5 million PCs were affected – but that’s probably a wild underestimate as the only machines it could count were those that had crash reporting turned on.

The root cause is still unclear. Crowdstrike has said it found a hole in its Content Validator Tool, which should have caught the flaw. Microsoft is complaining that a 2009 interoperability agreement forced on it by the EU required it to allow Crowdstrike’s software to operate at the very low level on Windows machines that pushed the systems to crash. It’s wrong, however, to blame companies for enabling automated updates; security protection has to respond to new threats in real time.

The first financial estimates are emerging. Delta Airlines estimates the outage, which borked its crew tracking system for a week, cost it $500 million. CEO Ed Bastian told CNN, “They haven’t offered us anything.” Delta has hired lawyer David Boies, whose high-profile history began with leading the successful 1990s US government prosecution of Microsoft, to file its lawsuit.

Delta will need to take a number. Massachusetts-based Plymouth County Retirement Association has already filed a class action suit on behalf of Crowdstrike shareholders in Texas federal court, where Crowdstrike is headquartered, for misrepresenting its software and its capabilities. Crowdstrike says the case lacks merit.

Lawsuits are likely the only way companies will get recompense unless they have insurance to cover supplier-caused system failures. Like all software manufacturers, Crowdstrike has disclaimed all liability in its terms of use.

In a social media post, Federal Trade Commission chair Lina Khan said that, “These incidents reveal how concentration can create fragile systems.”

Well, yes. Technology experts have long warned of the dangers of monocultures that make our world more brittle. The thing is, we’re stuck with them because of scale. There were good reasons why the dozens of early network and operating systems consolidated: it’s simpler and cheaper for hiring, maintenance, and even security. Making our world less brittle will require holding companies – especially those that become significant points of failure – to meet higher standards of professionalism, including product liability for software, and requiring their customers to boost their resilience.

As for Crowdstrike, it is doomed to become that worst of all things for a company: a case study at business schools everywhere.

Illustrations: XKCD’s Dependency comic, altered by Mary Branscombe to reflect Crowdstrike’s reality.

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon.

Trust busted

It’s hard not to be agog at the ongoing troubles of Boeing. The covid-19 pandemic was the first break in our faith in the ready availability of travel; this is the second. As one of only two global manufacturers of commercial airplanes, Boeing’s problems are the industry’s problems.

I’ve often heard cybersecurity researchers talk with envy of the aviation industry. Usually, it’s because access to data is a perennial problem; many companies don’t want to admit they’ve been hacked or talk about it when they do. By contrast, they’ve said, the aviation industry recognized early that convincing people flying was safe was crucial to everyone’s success and every crash hurt everyone’s prospects, not just those of the competitor whose plane went down. The result was industry-wide adoption of strategies designed to maximize collaboration across the industry to improve safety: data sharing, no-fault reporting, and so on. That hard-won public trust has, we now see, allowed modern industry players to coast on their past reputations. With this added irony: while airlines and governments everywhere have focused on deterring terrorists, the risks are coming from within the industry.

I’m the right age to have rarely worried about aviation safety – young enough to have missed the crashes of the early years, old enough that my first flights were taken in childhood. Isaac Asimov, born in 1920, who said he refused to fly because passengers didn’t have a “sporting” chance of survival in a crash, was actually wrong; the survival rate for airplane crashes in over 90%. Many people feel safer when they feel in control. Yet, as Bruce Schneier has frequently said, you’re at greater risk on the drive to the airport than you are on the plane.

In fact, it’s an extraordinary privilege that most of us worry more about delays, lost luggage, bad food, and cramped seating than whether our flight will land safely. The 2018 crash of a Boeing 737 MAX 8 did little to dislodge this general sense of safety, even though 189 people died, and the same was true following the 2019 crash of the same plane, which killed another 156 people. Boeing tried to sell the idea that it was inadequately trained pilots working for substandard (read: not American or European) airlines, but the reality quickly became plain: the company had skimped on testing and training and its famed safety-first engineering-led culture had disintegrated under pressure to reward shareholders and executives.

We were able to tell ourselves that it was one model plane, and that changes followed, as Bloomberg investigative reporter Peter Robison documents in Flying Blind: The 737 MAX Tragedy and the Fall of Boeing. In particular, the US Congress undid the 2020 legal change that had let Boeing self-certify and restored the Federal Aviation Administration’s obligation of direct oversight, some executives were replaced, and a test pilot went to jail. However, Robison wrote for publication in 2021, many inside the industry, not just at Boeing, thought the FAA’s 20-month grounding of the MAX was “an overreaction”. You might think – as I did – that the airlines themselves would be strongly motivated not to fly planes that could put their safety record at risk, but Robison’s reporting is not comforting about that: the MAX, he writes, is “a moneymaker” for the airlines in that it saves 15% on fuel costs per flight.

Still, the problem seemed to be confined to one model of plane. Until, on January 5, the door plug blew out of a 737 MAX 9. A day later, the FAA grounded all planes of that model for safety inspections.

On January 13, a crack was found in a cockpit window of a 737-800 in Japan. On January 19, a cargo 747-8 caught fire leaving Miami. On January 24, Alaska Airlines reported finding many loose bolts during its fleetwide inspection of 737 Max 9s. Then on January 24, the nose wheel fell off a 757 departing Atlanta. Near-simultaneously, the Seattle Times reported that Boeing itself installed the door plug that blew out, not its supplier, Spirit Aerosystems. The online booking agent and price comparison site Kayak announced that increasing use of its aircraft-specific filter had led it to add separate options to avoid 737 MAX 8s and 9s.

The consensus that formed about the source of the troubles that led to the 2018-2019 crashes is holding: blame focuses on the change in company culture brought by the 1997 merger with McDonnell Douglas, valuing profits and shareholder payouts over engineering. Boeing is in for a period of self-reinvention in which its output will be greatly slowed. As airlines’ current fleets age, this will have to mean reduced capacity; there are only two major aircraft manufacturers in the world, and the other one – Airbus – is fully booked.

As Cory Doctorow writes, that’s only one constraint going forward, at least in the US: there aren’t enough pilots, air traffic controllers, or engine manufacturers. Anti-monopolist Matt Stoller proposes to nationalize and then break up Boeing, arguing that its size and importance mean only the state can backstop its failures. Ten years ago, when the US’s four big legacy airlines consolidated to three, it was easy to think passengers would pay in fees and lost comfort; now we know safety was on the line, too.

Illustrations: The Wright Brothers’ first heavier-than-air flight, in 1903 (via Wikimedia).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon