Hallucinations

It makes obvious sense that the people most personally affected by a crime should have the right to present their views in court. Last week, in Arizona, Stacey Wales, the sister of Chris Pelkey, who was killed in a road rage shooting in 2021, delivered her victim impact statement, which offered forgiveness, through an artificially generated video likeness of Pelkey. According to Cy Neff at the Guardian, the judge praised this use of AI and said he felt the forgiveness was “genuine”. It is unknown whether it affected the sentence he handed down.

It feels instinctively wrong to use a synthesized likeness of the deceased this way as a mouthpiece for living relatives, who could have written any script they chose – even, had they so desired, one presenting this reportedly peaceful religious man’s views as a fierce desire for vengeance. *Of course* seeing it acted out by a movie-like AI simulation of the deceased victim packs emotional punch. But that doesn’t make it *true* or, as Wales calls it in the YouTube video linked above, “his own impact statement”. It remains the thoughts of his family and friends, culled from their possibly imperfect memories of things Pelkey said during his lifetime, and if it’s going to be presented in a court, it ought to be presented by the people who wrote the script.

This is especially true because humans are so susceptible to forming relationships with *anything*, whether it’s a volleyball that becomes your only companion, as in the 2000 movie Cast Away, or a chatbot that appears to answer your questions, as in 1966’s ELIZA or today’s ChatGPT.

There is a lot of that about. Recently, Miles Klee reported at Rolling Stone that numerous individuals are losing loved ones to “spiritual fantasies” engendered by intensive and deepening interaction with chatbots. It’s reminiscent of Ouija boards, which seem to respond to people’s questions but in reality react to small muscle movements in the operators’ hands.

Ouija boards “lie” because their operators unconsciously guide them to spell out words via the ideomotor effect. Those small, unnoticed muscle movements are also, more impressively, responsible for table tilting. The operators add to the illusion by interpreting the meaning of whatever the Ouija board spells out.

Chatbots “hallucinate” because the underlying large language models, based on math and statistics, predict the most likely next words and phrases with no understanding of meaning. But a conundrum is developing: as the large language models underlying chatbots improve, the bots are becoming *more*, not less, prone to deliver untruths.
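
To make “predicting the most likely next words” concrete, here is a deliberately tiny sketch in Python. It is not how any real chatbot is built (production models use neural networks trained on billions of tokens), but the principle it illustrates is the same: emit whatever continuation is statistically common, with no notion of whether the result is true.

```python
from collections import Counter, defaultdict

# Toy illustration only: real models use neural networks over billions of
# tokens, but the principle (emit a statistically likely continuation,
# with no model of truth) is the same.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count which word follows which in the "training data".
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def continue_text(word: str, length: int = 6) -> str:
    """Repeatedly emit the most frequent next word seen in the corpus."""
    out = [word]
    for _ in range(length):
        if word not in following:
            break
        word = following[word].most_common(1)[0][0]
        out.append(word)
    return " ".join(out)

print(continue_text("the"))  # fluent-looking, but there is no meaning behind it
```

Fluency comes from frequency, not understanding, which is why scaling the same idea up produces confident nonsense as readily as it produces facts.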

At The Register, Thomas Claburn reports that researchers at Carnegie Mellon, the University of Michigan, and the Allen Institute for AI find that AI models will “lie” in order to meet the goals set for them. In the example in their paper, a chatbot instructed to sell a new painkiller that the company knows is more addictive than its predecessor will deny its addictiveness in the interests of making the sale. This is where it becomes crucial who owns the technology and sets its parameters.

This result shouldn’t be too surprising. In her 2019 book, You Look Like a Thing and I Love You, Janelle Shane highlighted AIs’ tendency to come up with “short-cuts” that defy human expectations and limitations to achieve the goals set for them. No one has yet reported that a chatbot has been intentionally programmed to lead its users from simple scheduling to a belief that they are talking to a god – or are one themselves, as Klee reports. This seems more like operator error, as unconscious as the ideomotor effect.

OpenAI reported at the end of April that it was rolling back GPT-4o to an earlier version because the chatbot had become too “sycophantic”. The chatbot’s tendency to flatter its users apparently derived from the company’s attempt to make it “feel more intuitive”.

It’s less clear why Elon Musk’s Grok has been shoehorning rants alleging white genocide in South Africa into every answer it gives to every question, no matter how unrelated, as Kyle Orland reports at Ars Technica.

Meanwhile, at the New York Times, Cade Metz and Karen Weise find that AI hallucinations are getting worse as the bots become more powerful. They give examples, but we all have our own: irrelevant search results, flat-out wrong information, made-up legal citations. Metz and Weise say “it’s not entirely clear why”, but note that the reasoning systems DeepSeek so explosively introduced in February are more prone to errors, and that those errors compound the more time the systems spend stepping through a problem. That seems logical, just as a tiny error in an early step can completely derail a mathematical proof.

This all being the case, it would be nice if people would pause to rethink how they use this technology. At Lawfare, Cullen O’Keefe and Ketan Ramakrishnan are already warning about the next stage, agentic AI, which is being touted as a way to automate law enforcement. Lacking fear of punishment, AIs don’t have the motivations humans do to follow the law (nor can a mistargeted individual reason with them). Therefore, they must be instructed to follow the law, with all the problems of translating human legal code into binary code that implies.

I miss so much the days when you could chat online with a machine and know that really underneath it was just a human playing pranks.

Illustrations: “Mystic Tray” Ouija board (via Wikimedia).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon or Bluesky.

The data grab

It’s been a good week for those who like mocking flawed technology.

Numerous outlets have reported, for example, that “AI is getting dumber at math”. The source is a study conducted by researchers at Stanford and the University of California Berkeley comparing GPT-3.5’s and GPT-4’s output in March and June 2023. The researchers found that, among other things, GPT-4’s success rate at identifying prime numbers dropped from 84% to 51%. In other words, in June 2023 ChatGPT-4 did little better than chance at identifying prime numbers. That’s psychic level.

The researchers blame “drift”, the problem that improving one part of a model may have unhelpful knock-on effects in other parts of the model. At Ars Technica, Benj Edwards is less sure, citing qualified critics who question the study’s methodology. It’s equally possible, he suggests, that as the novelty fades, people’s attempts to do real work surface problems that were there all along. With no access to the algorithm itself and limited knowledge of the training data, we can only conduct such studies by controlling inputs and observing the outputs, much like diagnosing allergies by giving a child a series of foods in turn and waiting to see which ones make them sick. Edwards advocates greater openness on the part of the companies, especially as software developers begin building products on top of their generative engines.
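
Since that kind of black-box testing is essentially all an outside researcher can do, the shape of such a study is easy to sketch. The Python fragment below is purely illustrative, not the Stanford/Berkeley team’s actual protocol: `ask_model` is a hypothetical stand-in for whichever chatbot API is under test (here it just flips a coin so the sketch stays runnable), and the idea is simply to feed in controlled inputs whose correct answers can be computed independently, observe the outputs, and score them.

```python
import random

def is_prime(n: int) -> bool:
    """Ground truth, computed independently of the model under test."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def ask_model(n: int) -> bool:
    """Hypothetical stand-in for a call to the chatbot being evaluated.
    Here it just guesses, so the sketch stays self-contained and runnable."""
    return random.choice([True, False])

def black_box_accuracy(trials: int = 500) -> float:
    """Controlled inputs go in; observed outputs get scored against ground truth."""
    numbers = random.sample(range(2, 100_000), trials)
    correct = sum(ask_model(n) == is_prime(n) for n in numbers)
    return correct / trials

print(f"accuracy: {black_box_accuracy():.0%}")  # a coin-flip stand-in hovers around 50%
```

Without access to the training data or the model weights, that outer loop is about all anyone outside the company can inspect, which is Edwards’ point.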

Unrelated, the New Zealand discount supermarket chain Pak’nSave offered an “AI” meal planner that, set loose, promptly began turning out recipes for “poison bread sandwiches”, “Oreo vegetable stir-fry”, and “aromatic water mix” – which turned out to be a recipe for highly dangerous chlorine gas.

The reason is human-computer interaction: humans, told to provide a list of available ingredients, predictably became creative. As for the computer…anyone who’s read Janelle Shane’s 2019 book, You Look Like a Thing and I Love You, or her Twitter reports on AI-generated recipes could have predicted this outcome. Computers have no real-world experience against which to judge their output!

Meanwhile, the San Francisco Chronicle reports, Waymo and Cruise driverless taxis are making trouble at an accelerating rate. The cars have gotten stuck in low-hanging wires after thunderstorms, driven through caution tape, blocked emergency vehicles and emergency responders, and behaved erratically enough to endanger cyclists, pedestrians, and other vehicles. If they were driven by humans they’d have lost their licenses by now.

In an interesting side note that reminds us of the cars’ potential as a surveillance network, Axios reports that in a ten-day study in May, Waymo’s driverless cars found that human drivers in San Francisco speed 33% of the time. A similar exercise in Phoenix, Arizona observed human drivers speeding 47% of the time on roads with a 35mph speed limit. These statistics of course bolster the company’s main argument for adoption: improving road safety.

The study should – but probably won’t – be taken as a warning of the potential for the cars’ data collection to become embedded in both law enforcement and their owners’ business models. The frenzy surrounding ChatGPT-* is fueling an industry-wide data grab as everyone tries to beef up their products with “AI” (see also previous such exercises with “meta”, “nano”, and “e”), consequences to be determined.

Among the newly-discovered data grabbers is Intel, whose graphics processing unit (GPU) drivers are collecting telemetry data, including how you use your computer, the kinds of websites you visit, and other data points. You can opt out, assuming you a) realize what’s happening and b) are paying attention at the right moment during installation.

Google announced recently that it would scrape everything people post online to use as training data. Again, an opt-out can be had if you have the knowledge and access to follow the 30-year-old robots.txt protocol. In practical terms, I can configure my own site, pelicancrossing.net, to block Google’s data grabber, but I can’t stop it from scraping comments I leave on other people’s blogs or anything I post on social media sites or that’s professionally published (though those sites may block Google themselves). This data repurposing feels like it ought to be illegal under data protection and copyright law.
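
For those who do have the knowledge and access, the mechanism itself is simple. Here is a minimal sketch using Python’s standard-library robots.txt parser; the crawler name is an invented placeholder, since which user-agent string (if any) a given company’s training crawler respects is entirely up to that company.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt that turns away one (hypothetical) AI-training crawler
# while leaving everything else alone. "ExampleAIBot" is an invented name.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# The protocol is purely advisory: it tells a well-behaved crawler what it may
# fetch, but it can't stop one that simply ignores the file.
print(parser.can_fetch("ExampleAIBot", "https://pelicancrossing.net/netwars/"))  # False
print(parser.can_fetch("SomeOtherBot", "https://pelicancrossing.net/netwars/"))  # True
```

Whether any given crawler honours the file is, as ever, a matter of trust rather than enforcement.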

In Australia, Gizmodo reports that the company has asked the Australian government to relax copyright laws to facilitate AI training.

Soon after Google’s announcement, the law firm Clarkson filed a class action lawsuit against Google to join its existing action against OpenAI. The suit accuses Google of “stealing” copyrighted works and personal data.

“Google does not own the Internet,” Clarkson wrote in its press release. Will you tell it, or shall I?

Whatever has been going on until now with data slurping in the interests of bombarding us with microtargeted ads is small stuff compared to the accelerating acquisition for the purpose of feeding AI models. Arguably, AI could be a public good in the long term as it improves, and therefore allowing these companies to access all available data for training is in the public interest. But if that’s true, then the *public* should own the models, not the companies. Why should we consent to the use of our data so they can sell it back to us and keep the proceeds for their shareholders?

It’s all yet another example of why we should pay attention to the harms that are clear and present, not the theoretical harm that someday AI will be general enough to pose an existential threat.

Illustrations: IBM Watson, Jeopardy champion.

Wendy M. Grossman is the 2013 winner of the Enigma Award and contributing editor for the Plutopia News Network podcast. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Follow on Mastodon.

Performing intelligence

“Oh, great,” I thought when news broke of the release of GPT-4. “Higher-quality deception.”

Most of the Internet disagreed; having gone mad only a few weeks ago over ChatGPT, everyone’s now agog over this latest model. It passed all these tests!

One exception was the journalist Paris Marx, who commented on Twitter: “It’s so funny to me that the AI people think it’s impressive when their programs pass a test after being trained on all the answers.”

Agreed. It’s also so funny to me that they call that “AI” and don’t like it when researchers like computational linguist Emily Bender call it a “stochastic parrot”. On Marx’s Tech Won’t Save Us podcast, Goldsmiths professor Dan McQuillan, author of Resisting AI: An Anti-fascist Approach to Artificial Intelligence, calls it a “bullshit engine” whose developers’ sole goal is plausibility – plausibility that, as Bender has said, allows us imaginative humans to think we detect a mind behind it, and the result is to risk devaluing humans.

Let’s walk back to an earlier type of system that has been widely deployed: benefits scoring systems. A couple of weeks ago, Lighthouse Reports and Wired magazine teamed up on an investigation of these systems, calling them “suspicion machines”.

Their work focuses on the welfare benefits system in use in Rotterdam between 2017 and 2021, which used 315 variables to risk-score benefits recipients according to the likelihood that their claims were fraudulent. In detailed, worked case analyses, they find systemic discrimination: you lose points for being female, for being female and having children (males aren’t asked about children), for being non-white, and for ethnicity (knowing Dutch is a requirement for welfare recipients). Other variables include missed meetings, age, and “lacks organizing skills”, which was just one of 54 variables based on caseworkers’ subjective assessments. Any comment a caseworker adds translates to a 1 added to the risk score, even if it’s positive. The top-scoring 10% are flagged for further investigation.
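
To see how mechanical this kind of scoring is, here is a deliberately tiny sketch in Python. The variables and weights are invented for illustration (the real Rotterdam model used 315 variables and is not reproduced here); the point is only the shape of the thing: subjective notes become numbers, numbers become a score, and the top decile becomes a suspect list.

```python
from dataclasses import dataclass

@dataclass
class Recipient:
    name: str
    missed_meetings: int
    caseworker_comments: int   # every comment adds a point, even a positive one
    flagged_traits: int        # crude stand-in for the demographic variables

def risk_score(r: Recipient) -> float:
    # Invented weights, for illustration only.
    return 2.0 * r.missed_meetings + 1.0 * r.caseworker_comments + 1.5 * r.flagged_traits

def flag_top_decile(recipients: list) -> list:
    """Return the highest-scoring 10%, who get investigated."""
    ranked = sorted(recipients, key=risk_score, reverse=True)
    cutoff = max(1, len(ranked) // 10)
    return ranked[:cutoff]

# Fifty made-up cases; in Rotterdam this was tens of thousands of real people.
people = [Recipient(f"case-{i}", i % 4, i % 3, i % 5) for i in range(50)]
for suspect in flag_top_decile(people):
    print(suspect.name, risk_score(suspect))
```

Nothing in that pipeline knows or cares what the numbers mean, which is rather the point.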

This is the system that Accenture, the city’s technology partner on the early versions, said at its unveiling in 2018 was an “ethical solution” and promised “unbiased citizen outcomes”. Instead, Wired says, the algorithm “fails the city’s own test of fairness”.

The project’s point wasn’t to pick on Rotterdam; of the dozens of cities they contacted, it just happened to be the only one willing to share the code behind the algorithm, along with the list of variables, prior evaluations, and the data scientists’ handbook. After being threatened with court action under freedom of information laws, it even shared the mathematical model itself.

The overall conclusion: the system was so inaccurate it was little better than random sampling “according to some metrics”.

What strikes me, aside from the details of this design, is the initial choice of scoring benefits recipients for risk of fraud. Why not score them for risk of missing out on help they’re entitled to? The UK government’s figures on benefits fraud indicate that in 2021-2022 overpayment (including error as well as fraud) amounted to 4% of total expenditure, and *underpayment* to 1.2%. Underpayment is a lot less, but it’s still substantial (£2.6 billion). Yes, I know, the point of the scoring system is to save money, but the point of the *benefits* system is to help people who need it. The suspicion was always there, but the technology has altered the balance.

This was the point the writer Ellen Ullman noted in her 1997 book Close to the Machine: the hard-edged nature of these systems, and their ability to surveil people in new ways, “infect” their owners with suspicion even of people they’ve long trusted, even when the system itself was intended to be helpful. On a societal scale, these “suspicion machines” embed increased division in our infrastructure; in his book, McQuillan warns us to watch for “functionality that contributes to violent separations of ‘us and them’.”

Along those lines, it’s disturbing that OpenAI, the owner of ChatGPT and GPT-4 (and several other generative AI gewgaws), has now decided to keep secret the details of its large language models. That is, we have no sight into what data was used in training, what software and hardware methods were used, or how energy-intensive the models are. If there’s a machine loose in the world’s computer systems pretending to be human, shouldn’t we understand how it works? It would also help damp down our tendency to imagine we see a mind in there.

The company’s argument appears to be that because these models could become harmful it’s bad to publish how they work because then bad actors will use them to create harm. In the cybersecurity field we call this “security by obscurity” and there is a general consensus that it does not work as a protection.

In a lengthy article at New York magazine, Elizabeth Weil quotes Daniel Dennett’s assessment of these machines: “counterfeit people” that should be seen as the same sort of danger to society as counterfeit money. Bender suggests that rather than trying to make fake people we should be focusing on making tools to help people.

What makes me tie these systems to the large language models behind GPT is that in both cases it’s all about mining our shared cultural history, with all its flaws and misjudgments, in response to a prompt, and pretending the results have meaning and create new knowledge. And *that’s* what’s being embedded into the world’s infrastructure. Have we learned nothing from Clever Hans?

Illustrations: Clever Hans, performing in Leipzig in 1912 (by Karl Krall, via Wikimedia).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Follow on Mastodon or Twitter.