That didn’t take long. Since last week’s fret about AI startups ignoring the robots.txt convention, Thomas Claburn has reported at The Register that Cloudflare has developed a scraping prevention tool that identifies and blocks “content extraction” bots attempting to crawl sites at scale.

It’s a stopgap, not a solution. As Cloudflare’s announcement makes clear, the company knows there will be pushback; given these companies’ lack of interest in following existing norms, blocking tools versus scraping bots is basically the latest arms race (previously on this plotline: spam). Also, obviously, the tool only works on sites that are Cloudflare customers. Although these include many of the web’s largest sites, there are hundreds of millions more that won’t, don’t, or can’t pay for its services. If we want to return control to site owners, we’re going to need a more permanent and accessible solution.

In his 1999 book Code and Other Laws of Cyberspace, Lawrence Lessig finds four forms of regulation: norms, law, markets, and architecture. Norms are failing. Markets will just mean prolonged arms races. We’re going to need law and architecture.


We appear to be reaching peak “AI” hype, defined by (as in the peak of app hype) the increasing absurdity of things venture capitalists seem willing to fund. I recall reading the comment that at the peak of app silliness a lot of startups were really just putting a technological gloss on services that young men would previously have had supplied by their mothers. The AI bubble seems to be even less productive of long-term value, calling things “AI” that are not at all novel, and proposing “AI” to patch problems that call for real change.

As an example of the first of those, my new washing machine has a setting called “AI patterns”. The manual explains: it reorders the preset programs on the machine’s dial so the ones you use most appear first. It’s not stupid (although I’ve turned it off anyway, along with the wifi and “smart” features I would rather not pay for), but let’s call it what it is: customizing a menu.

As an example of the second…at Gizmodo, Maxwell Zeff reports that Softbank is claiming to have developed an “emotion canceling” AI that “alters angry voices into calm ones”. The use Softbank envisages is to lessen the stress for call center employees by softening the voices of angry customers without changing their actual words. There are, as people pointed out on Mastodon after the article was posted there, a lot of smarter ways to reduce those individuals’ stress. Like giving them better employment conditions, or – and here’s a really radical thought – designing your services and products so your customers aren’t so frustrated and angry. What this software does is just falsify the sound. My guess is that if there is a result it will be to make customers even more angry and frustrated. More anger in the world. Great.


Oh! Sarcasm, even if only slight! At the Guardian, Ned Carter Miles reports on “emotional AI” (can we say “oxymoron”?). Among his examples is a team at the University of Groningen that is teaching an AI to recognize sarcasm using scenes from US sitcoms such as Friends and The Big Bang Theory. Even absurd-sounding research can be a good thing. I’m still not sure how good a guide sitcoms are for identifying emotions in real-world context even apart from the usual issues of algorithmic bias. After all, actors are given carefully crafted words and work harder to communicate their emotional content than ordinary people normally do.


Finally, again in the category of peak-AI-hype is this: at the New York Times Cade Metz is reporting that Ilya Sutskever, a co-founder and former chief scientist at OpenAI, has a new startup whose goal is to create a “safe superintelligence”.

Even if you, unlike me, believe that a “superintelligence” is an imminent possibility, what does “safe” mean, especially in an industry that still treats security and accessibility as add-ons? “Safe” is, like “secure”, meaningless without context and a threat model. Safe from what? Safe for what? To do what? Operated by whom? Owned by whom? With what motives? For how long? We create new intelligent humans all the time. Do we have any ability to ensure they’re “safe” technology? If an AGI is going to be smarter than a human, how can anyone possibly promise it will be, in the industry parlance, “aligned” with our goals? And for what value of “our”? Beware the people who want to build the Torment Nexus!

It’s nonsense. Safety can’t be programmed into a superintelligence any more than Isaac Asimov’s Laws of Robotics.

Sutskever’s own comments are equivocal. In a video clip at the Guardian, Sutskever confusingly says both that “AI will solve all our problems” and that it will make fake news, cyber attacks, and weapons much worse and “has the potential to create infinitely stable dictatorships”. Then he adds, “I feel that technology is a force of nature.” Which is exactly the opposite of what technology is…but it suits the industry to push the inevitability narrative that technological progress cannot be stopped.

Cue Douglas Adams: “This is obviously some strange use of the word ‘safe’ I wasn’t previously aware of.”

Illustrations: The Big Bang Theory’s Leonard (Johnny Galecki) teaching Sheldon (Jim Parsons) about sarcasm (Season 1, episode 2, “The Big Bran Hypothesis”).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon.

Changing the faith

The governance of Britain and the governance of the Internet have this in common: the ultimate authority in both cases is to a large extent a “gentleman’s agreement”. For the same reason: both were devised by a relatively small, homogeneous group of people who trusted each other. In the case of Britain, inertia means that even without a written constitution the country goes on electing governments and passing laws as if.

Most people have no reason to know that the Internet’s technical underpinnings are defined by a series of documents known as RFCs, for Request(s) for Comments. RFC1 was defined in April 1969; the most recent, RFC9598, is dated just last month. While the Internet Engineering Task Force oversees RFCs’ development and administration, it has no power to force anyone to adopt them. Throughout, RFC standards have been created collaboratively by volunteers and adopted on merit.

A fair number of RFCs promote good “Internet citizenship”. There are, for example, email addresses (chiefly, webmaster and postmaster) that anyone running a website is supposed to maintain in order to make it easy for a third party to report problems. Today, probably millions of website owners don’t even know this expectation exists. For Internet curmudgeons over a certain age, however, seeing email to those addresses bounce is frustrating.

Still, many of these good-citizen practices persist. One such is the Robots Exclusion Protocol, updated in 2022 as RFC 9309, which defines a file, “robots.txt”, that website owners can put in place to tell automated web crawlers which parts of the site they may access and copy. This may have mattered less in recent years than it did in 1994, when it was devised. As David Pierce recounts at The Verge, at that time an explosion of new bots was beginning to crawl the web to build directories and indexes (no Google until 1998!). Many of those early websites were hosted on very small systems based in people’s homes or small businesses, and could be overwhelmed by unrestrained crawlers. Robots.txt, devised by a small group of administrators and developers, managed this problem.
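For readers who have never seen one, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using the parser in Python’s standard library. The rules, the bot names (“ExampleBot”, “OtherBot”), and the example.com URLs are all invented for illustration; they are not taken from any real site. Note that nothing in it enforces anything: the crawler has to choose to ask.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out entirely,
# everyone else is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from memory; a real crawler would download
    # the site's /robots.txt first.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/index.html"),
        ("OtherBot", "https://example.com/index.html"),
        ("OtherBot", "https://example.com/private/notes.html"),
    ]:
        print(agent, url, allowed(agent, url))
```

A crawler that skips the check, or ignores its answer, meets no technical obstacle at all, which is why the protocol depends on good faith.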

Even without a legal requirement to adopt it, early Internet companies largely saw being good Internet citizens as benefiting them. They, too, were small at the time, and needed good will to bring them the users and customers that have since made them into giants. It served everyone’s interests to comply.

Until more or less now. This week, Katie Paul is reporting at Reuters that “AI” companies are blowing up this arrangement by ignoring robots.txt and scraping whatever they want. This news follows reporting by Randall Lane at Forbes that Perplexity is using its software to generate stories and podcasts using news sites’ work without credit. At Wired, Dhruv Mehrotra and Tim Marchman report a similar story: Perplexity is ignoring robots.txt and scraping areas of sites that owners want left alone. At 404 Media, Emmanuel Maiberg reports that Perplexity also has a dubious history of using fake accounts to scrape Twitter data.

Let’s not just pick on Perplexity; this is the latest in a growing trend. Previously, hiQ Labs tried scraping data from LinkedIn in order to build services to sell employers; the courts finally ruled in 2019 that hiQ violated LinkedIn’s terms and conditions. More controversially, in the last few years Clearview AI has been responding to widespread criticism by claiming that any photograph published on the Internet is “public” and therefore within its rights to grab for its database and use to identify individuals online and offline. The result has been myriad legal actions under data protection law in the EU and UK, and, in the US, a sheaf of lawsuits. Last week, Kashmir Hill reported at the New York Times that because Clearview lacks the funds to settle a class action lawsuit it has offered a 23% stake to Americans whose faces are in its database.

As Pierce (The Verge) writes, robots.txt used to represent a fair-enough trade: website owners got search engine visibility in return for their data, and the owners of the crawlers got the data but in return sent traffic.

But AI startups ingesting data to build models don’t offer any benefit in return. Where search engines have traditionally helped people find things on other sites, the owners of AI chatbots want to keep the traffic for themselves. Perplexity bills itself as an “answer engine”. A second key difference is this: none of these businesses are small. As Vladan Joler pointed out last month at CPDP, “AI comes pre-monopolized.” Getting into this area requires billions in funding; by contrast many early Internet businesses started with just a few hundred dollars.

This all feels like a watershed moment for the Internet. For most of its history, as Charles Arthur writes at The Overspill, every advance has exposed another area where the Internet operates on the basis of good faith. Typically, the result is some form of closure – spam, for example, led the operators of mail servers to close to all but authenticated users. It’s not clear to a non-technical person what stronger measure other than copyright law could replace the genteel agreement of robots.txt, but the alternative will likely be closing open access to large parts of the web – a loss to all of us.

Illustrations: Vladan Joler at CPDP 2024, showing his map of the extractive industries required to underpin “AI”.
