The data grab

It’s been a good week for those who like mocking flawed technology.

Numerous outlets have reported, for example, that “AI is getting dumber at math”. The source is a study by researchers at Stanford and the University of California, Berkeley comparing GPT-3.5’s and GPT-4’s output in March and June 2023. Among other things, the researchers found that GPT-4’s success rate at identifying prime numbers dropped from 84% to 51%. In other words, in June 2023 GPT-4 did little better than chance at identifying prime numbers. That’s psychic level.
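The study’s approach amounts to a black-box probe: pose yes/no questions with known answers and score the replies against ground truth. A toy sketch in Python, with a hypothetical random-guess stand-in where a call to the real model’s API would go:

```python
import random

def is_prime(n: int) -> bool:
    """Ground truth: deterministic trial-division primality check."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def coin_flip_model(n: int) -> bool:
    """Hypothetical stand-in for the model under test: answers at random."""
    return random.choice([True, False])

def success_rate(model, numbers) -> float:
    """Score a yes/no oracle against ground truth, as the study did."""
    correct = sum(model(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)

random.seed(0)
rate = success_rate(coin_flip_model, range(2, 1002))
print(rate)  # chance-level accuracy, close to 0.5
```

A success rate near 0.5 on a balanced yes/no task is exactly the coin-flip baseline the 51% figure is being compared against.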

The researchers blame “drift”, the problem that improving one part of a model may have unhelpful knock-on effects elsewhere in it. At Ars Technica, Benj Edwards is less sure, citing qualified critics who question the study’s methodology. It’s equally possible, he suggests, that as the novelty fades, people’s attempts to do real work are surfacing problems that were there all along. With no access to the algorithm itself and limited knowledge of the training data, outsiders can only conduct such studies by controlling the inputs and observing the outputs, much like diagnosing allergies by feeding a child a series of foods in turn and waiting to see which ones make them sick. Edwards advocates greater openness on the companies’ part, especially as software developers begin building products on top of their generative engines.

Unrelated, the New Zealand discount supermarket chain Pak’nSave offered an “AI” meal planner that, set loose, promptly began turning out recipes for “poison bread sandwiches”, “Oreo vegetable stir-fry”, and “aromatic water mix” – which turned out to be a recipe for highly dangerous chlorine gas.

The reason is human-computer interaction: humans, told to provide a list of available ingredients, predictably became creative. As for the computer…anyone who’s read Janelle Shane’s 2019 book, You Look Like a Thing and I Love You, or her Twitter reports on AI-generated recipes could predict this outcome. Computers have no real-world experience against which to judge their output!

Meanwhile, the San Francisco Chronicle reports, Waymo and Cruise driverless taxis are making trouble at an accelerating rate. The cars have gotten stuck in low-hanging wires after thunderstorms, driven through caution tape, blocked emergency vehicles and emergency responders, and behaved erratically enough to endanger cyclists, pedestrians, and other vehicles. If they were driven by humans they’d have lost their licenses by now.

In an interesting side note that reminds us of the cars’ potential as a surveillance network, Axios reports that in a ten-day study in May, Waymo’s driverless cars found that human drivers in San Francisco speed 33% of the time. A similar exercise in Phoenix, Arizona observed human drivers speeding 47% of the time on roads with a 35mph speed limit. These statistics of course bolster the company’s main argument for adoption: improving road safety.

The study should – but probably won’t – be taken as a warning of the potential for the cars’ data collection to become embedded in both law enforcement and their owners’ business models. The frenzy surrounding ChatGPT-* is fueling an industry-wide data grab as everyone tries to beef up their products with “AI” (see also previous such exercises with “meta”, “nano”, and “e”), consequences to be determined.

Among the newly-discovered data grabbers is Intel, whose graphics processing unit (GPU) drivers are collecting telemetry data, including how you use your computer, the kinds of websites you visit, and other data points. You can opt out, assuming you a) realize what’s happening and b) are paying attention at the right moment during installation.

Google announced recently that it would scrape everything people post online to use as training data. Again, an opt-out can be had if you have the knowledge and access to follow the 30-year-old robots.txt protocol. In practical terms, I can configure my own site, pelicancrossing.net, to block Google’s data grabber, but I can’t stop it from scraping comments I leave on other people’s blogs or anything I post on social media sites or that’s professionally published (though those sites may block Google themselves). This data repurposing feels like it ought to be illegal under data protection and copyright law.
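For what it’s worth, that opt-out operates through a site’s robots.txt file. A minimal sketch, assuming Google honors a training-specific crawler token (Google-Extended is the token Google has documented for this purpose; check its current crawler list before relying on it):

```text
# robots.txt, served from the site root
# (e.g. pelicancrossing.net/robots.txt)

# Opt the whole site out of AI-training crawls while leaving ordinary
# search indexing alone. Blocking Googlebot itself would also remove
# the site from Google Search.
User-agent: Google-Extended
Disallow: /
```

Note that robots.txt is purely advisory: it only works if the crawler chooses to respect it.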

In Australia, Gizmodo reports that Google has asked the Australian government to relax copyright laws to facilitate AI training.

Soon after Google’s announcement, the law firm Clarkson filed a class action lawsuit against Google to accompany its existing action against OpenAI. The suit accuses Google of “stealing” copyrighted works and personal data.

“Google does not own the Internet,” Clarkson wrote in its press release. Will you tell it, or shall I?

Whatever has been going on until now with data slurping in the interests of bombarding us with microtargeted ads is small stuff compared to the accelerating acquisition for the purpose of feeding AI models. Arguably, AI could be a public good in the long term as it improves, and therefore allowing these companies to access all available data for training is in the public interest. But if that’s true, then the *public* should own the models, not the companies. Why should we consent to the use of our data so they can sell it back to us and keep the proceeds for their shareholders?

It’s all yet another example of why we should pay attention to the harms that are clear and present, not the theoretical harm that someday AI will be general enough to pose an existential threat.

Illustrations: IBM Watson, Jeopardy champion.

Wendy M. Grossman is the 2013 winner of the Enigma Award and contributing editor for the Plutopia News Network podcast. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Follow on Mastodon.

Own goals

There’s no point in saying I told you so when the people you’re saying it to got the result they intended.

At the Guardian, Peter Walker reports the Electoral Commission’s finding that at least 14,000 people were turned away from polling stations in May’s local elections because they didn’t have the right ID as required under the new voter ID law. The Commission thinks that’s a huge underestimate; 4% of people who didn’t vote said it was because of voter ID – which Walker suggests could mean 400,000 were deterred. Three-quarters of those lacked the right documents; the rest opposed the policy. The demographics of this will be studied more closely in a report due in September, but early indications are that the policy disproportionately deterred people with disabilities, people from certain ethnic groups, and people who are unemployed.

The fact that the Conservatives, who brought in this policy, lost big time in those elections doesn’t change its wrongness. But it did lead the MP Jacob Rees-Mogg (Con-North East Somerset) to admit that this was an attempt to gerrymander the vote that backfired because older voters, who are more likely to vote Conservative, also disproportionately don’t have the necessary ID.

***

One of the more obscure sub-industries is the business of supplying ad services to websites. One such little-known company is Criteo, which provides interactive banner ads that are generated based on the user’s browsing history and behavior using a technique known as “behavioral retargeting”. In 2018, Criteo was one of seven companies listed in a complaint Privacy International and noyb filed with three data protection authorities – the UK, Ireland, and France. In 2020, the French data protection authority, CNIL, launched an investigation.

This week, CNIL issued Criteo with a €40 million fine over failings in how it gathers user consent, a ruling noyb calls a major blow to Criteo’s business model.

It’s good to see the legal actions and fines beginning to reach down into adtech’s underbelly. It’s also worth noting that the CNIL was willing to fine a *French* company to this extent. It makes it harder for the US tech giants to claim that the fines they’re attracting are just anti-US protectionism.

***

Also this week, the US Federal Trade Commission announced it’s suing Amazon, claiming the company enrolled millions of US consumers into its Prime subscription service through deceptive design and sabotaged their efforts to cancel.

“Amazon used manipulative, coercive, or deceptive user-interface designs known as ‘dark patterns’ to trick consumers into enrolling in automatically-renewing Prime subscriptions,” the FTC writes.

I’m guessing this is one area where data protection laws have worked. In my UK-based ultra-brief Prime outings to watch the US Open tennis, canceling has taken at most two clicks. I don’t recognize the tortuous process Business Insider documented in 2022.

***

It has long been no secret that the secret behind AI is human labor. In 2019, Mary L. Gray and Siddharth Suri documented this in their book Ghost Work. Platform workers label images and other content, annotate text, and solve CAPTCHAs to help train AI models.

At MIT Technology Review, Rhiannon Williams reports that platform workers are using ChatGPT to speed up their work and earn more. A study (PDF) by a team of researchers from the Swiss Federal Institute of Technology found that between 33% and 46% of the 44 workers they tested used AI models to complete the task they were set: summarizing 16 extracts from medical research papers.

It’s hard not to feel a little gleeful that today’s “AI” is already eating itself via a closed feedback loop. It’s not good news for platform workers, though, because the most likely consequence will be increased monitoring to force them to show their work.

But this is yet another case in which computer people could have learned from their own history. In 2008, researchers at Google published a paper suggesting that Google search data could be used to spot flu outbreaks. Sick people searching for information about their symptoms could provide real-time warnings ten days earlier than the Centers for Disease Control could.

This actually worked, some of the time. However, as Kaiser Fung reported at Harvard Business Review in 2014, as early as 2009 Google Flu Trends missed the swine flu pandemic; in 2012, researchers found that it had overestimated the prevalence of flu for 100 out of the previous 108 weeks. More data is not necessarily better, Fung concluded.

In 2013, GFT missed by 140% (without explaining what that means), as David Lazer and Ryan Kennedy reported at Wired in 2015 in discussing their investigation into the idea’s failure. Lazer and Kennedy found that Google’s algorithm was vulnerable to poisoning by unrelated seasonal search terms and by search terms that correlated purely by chance, and that it failed to take into account changing user behavior, as when Google introduced autosuggest and added suggested health-related searches. The “availability” cognitive bias also played a role: when flu is in the news, searches go up whether or not people are sick.

While the parallels aren’t exact, large language modelers could have drawn the lesson that users can poison their models. ChatGPT’s arrival for widespread use will inevitably thin out the proportion of text that is human-written – and taint the well from which LLMs drink. Everyone imagines the next generation’s increased power. But it’s equally possible that the next generation will degrade as the percentage of AI-generated data rises.

Illustrations: Drunk parrot seen in a Putney garden (by Simon Bisson).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Follow on Mastodon.