Small data

Shortly before this gets posted, Jon Crowcroft and I will have presented this year’s offering at Gikii, the weird little conference that crosses law, media, technology, and pop culture. This is what we will possibly may have said, as I understand it, with some added explanation for the slightly less technical audience I imagine will read this.

Two years ago, a team of four researchers – Timnit Gebru, Emily Bender, Margaret Mitchell (writing as Shmargaret Shmitchell), and Angelina McMillan-Major – wrote a now-famous paper called On the Dangers of Stochastic Parrots (PDF) calling into question the usefulness of the large language models (LLMs) that have caused so much ruckus this year. The “Stochastic Four” argued instead for small models built on carefully curated data: less prone to error, less exploitive of people’s data, less damaging to the planet. Gebru got fired over this paper; Google also fired Mitchell soon afterwards. Two years later, neural networks pioneer Geoff Hinton quit Google in order to voice similar concerns.

Despite the hype, LLMs have many problems. They are fundamentally an extractive technology and are resource-intensive. Building LLMs requires massive amounts of training data; so far, the companies have been unwilling to acknowledge their sources, perhaps because (as is happening already) they fear copyright suits.

More important from a technical standpoint is the issue of model collapse; that is, models degrade when they begin to ingest synthetic AI-generated data instead of human input. We’ve seen this before with Google Flu Trends, which degraded rapidly as incoming new search data included many searches on flu-like symptoms that weren’t actually flu, and others that simply reflected the frequency of local news coverage. “Data pollution”, as LLM-generated data fills the web, will mean that the web will be an increasingly useless source of training data for future generations of generative AI. Lots more noise, drowning out the signal (in the photo above, the signal would be the parrot).
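For the more technical reader, here is a toy sketch of that degradation. It is not how any real LLM is trained; it assumes the crudest possible "model", one that can only reproduce items it has already seen, so each generation is just a resampling of the previous one. The names `next_generation` and `simulate` are made up for illustration.

```python
import random

def next_generation(corpus, rng):
    """One train-and-generate step, reduced to its simplest form:
    the new 'model' can only emit items present in its training data."""
    return [rng.choice(corpus) for _ in corpus]

def simulate(generations=50, size=200, seed=0):
    """Return the number of distinct items in the corpus at each generation."""
    rng = random.Random(seed)
    corpus = list(range(size))  # generation 0: every item is distinct "human" data
    diversity = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        diversity.append(len(set(corpus)))
    return diversity

if __name__ == "__main__":
    counts = simulate()
    # Distinct-item counts at the start, after 10 generations, and at the end.
    print(counts[0], counts[10], counts[-1])
```

Because each generation draws only from the one before, the count of distinct items can never go up, and in practice it falls steadily: rare items vanish first and never come back.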

Instead, if we follow the lead of the Stochastic Four, the more productive approach is small data – small, carefully curated datasets that train models to match specific goals. Far less resource-intensive, far fewer issues with copyright, appropriation, and extraction.

We know what the LLM future looks like in outline: big, centralized services, because no one else will be able to amass enough data. In that future, surveillance capitalism is an essential part of data gathering. Small language model (SLM) futures could look quite different: decentralized, with realigned incentives. At one point, we wanted to suggest that small data could bring the end of surveillance capitalism; that’s probably an overstatement. But small data could certainly create the ecosystem in which the case for mass data collection would be less compelling.

Jon and I imagined four primary alternative futures: federation, personalization, some combination of those two, and paradigm shift.

Precursors to a federated small data future already exist; these include customer service chatbots and predictive text assistants. In this future, we could imagine personalized LLM servers designed to serve specific needs.

An individualized future might look something like I suggested here in March: a model that fits in your pocket that is constantly updated with material of your own choosing. Such a device might be the closest yet to Vannevar Bush’s 1945 idea of the Memex (PDF), updated for the modern era by automating the dozens of secretary-curators he imagined doing the grunt work of labeling and selection. That future again has precursors in techniques for sharing the computation but not the data, a design we see proposed for health care, where the data is too sensitive to share unless there’s a significant public interest (as in pandemics or very rare illnesses), or in other data analysis designs intended to protect privacy.

In 2007, the science fiction writer Charles Stross suggested something like this, though he imagined it as a comprehensive life log, which he described as a “google for real life”. So this alternative future would look something like Stross’s pocket $10 life log with enhanced statistics-based data analytics.
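The idea mentioned above of sharing the computation but not the data can be sketched in a few lines. This is a minimal illustration under invented assumptions: three hypothetical clinics, made-up readings, and a coordinator that only ever receives a total and a count from each site. The names `Summary`, `local_summary`, and `combine` are illustrative and don't come from any real system.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    total: float
    count: int

def local_summary(readings):
    """Runs at each site; only the resulting Summary is sent onward."""
    return Summary(total=sum(readings), count=len(readings))

def combine(summaries):
    """Runs at the coordinator, which sees totals and counts, never raw readings."""
    total = sum(s.total for s in summaries)
    count = sum(s.count for s in summaries)
    return total / count

# Hypothetical per-site data; in a real deployment these lists stay on-site.
sites = {
    "clinic_a": [5.0, 7.0, 6.0],
    "clinic_b": [8.0, 4.0],
    "clinic_c": [6.0],
}
summaries = [local_summary(readings) for readings in sites.values()]
print(combine(summaries))  # same mean as pooling all six readings
```

Even summaries can leak information when a site holds very few records (the third clinic's single reading is fully exposed here), which is why real designs typically add further protections on top of this basic pattern.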

Imagining what a paradigm shift might look like is much harder. That’s the kind of thing science fiction writers do; it’s 16 years since Stross gave that life log talk. However, in his 2016 history of advertising, The Attention Merchants, Columbia professor Tim Wu argued that industrialization was the vector that made advertising and its grab for our attention part of commerce. A hundred and fifty-odd years later, the centralizing effects of industrialization are being challenged, starting with energy, via renewables and local power generation, and social media, via the fediverse. Might language models also play their part in bringing about a new, more collaborative and cooperative society?

It is, in other words, just possible that the hot new technology of 2023 is simply a dead end bringing little real change. It’s happened before. There have been, as Wu recounts, counter-moves and movements before, but they didn’t have the technological affordances of our era.

In the Q&A that followed, Miranda Mowbray pointed out that companies are trying to implement the individualized model, but that it’s impossible to do unless there are standardized data formats, and even then hard to do at scale.

Illustrations: Spot the parrot seen in a neighbor’s tree.

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast.

The safe place

For a long time, the fear was that technical decisions – new domain names, cooption of open standards or software, laws mandating data localization – would splinter the Internet. “Balkanize” was heard a lot.

A panel at the UK Internet Governance Forum a couple of weeks ago focused on this exact topic, and was mostly self-congratulatory. Which is when it occurred to me that the Internet may not *be* fragmented, but it *feels* fragmented. Almost every day I encounter some site I can’t reach: email goes into someone’s spam folder, the site or its content is off-limits because it’s been geofenced to conform with copyright or data protection laws, or the site mysteriously doesn’t load, with no explanation. The most likely explanation for the last of these is censorship built into the Internet feed by the ISP or the establishment whose connection I’m using, but they don’t actually *say* that.

The ongoing attrition at Twitter is exacerbating this feeling, as the users I’ve followed for years continue to migrate elsewhere. At the moment, it takes accounts on several other services to keep track of everyone: definite fragmentation.

Here in the UK, this sense of fragmentation may be about to get a lot worse, as the long-heralded Online Safety bill – written and expanded until it’s become a “Frankenstein bill”, as Mark Scott and Annabelle Dickson report at Politico – hurtles toward passage. This week saw fruitless debates on amendments in the House of Lords, and it will presumably be back in the Commons shortly thereafter, where it could be passed into law by this fall.

A number of companies have warned that the bill, particularly if it passes with its provisions undermining end-to-end encryption intact, will drive them out of the country. I’m not sure British politicians are taking them seriously; so often such threats are idle. But in this case, I think they’re real, not least because post-Brexit Britain carries so much less global and commercial weight, a reality some politicians are in denial about. WhatsApp, Signal, and Apple have all said openly that they will not compromise the privacy of their masses of users elsewhere to suit the UK. Wikipedia has warned that including it in the requirement to age-verify its users will force it to withdraw rather than violate its principles about collecting as little information about users as possible. The irony is that the UK government itself runs on WhatsApp.

As Ian McRae, the director of market intelligence for prospective online safety regulator Ofcom, showed in a presentation at UKIGF, Wikipedia would be just one of the estimated 150,000 sites within the scope of the bill. Ofcom is ramping up to deal with the workload, an effort the agency expects to cost £169 million between now and 2025.

In a legal opinion commissioned by the Open Rights Group, barristers at Matrix Chambers find that clause 9(2) of the bill is unlawful. This, as Thomas Macaulay explains at The Next Web, is the clause that requires platforms to proactively remove illegal or “harmful” user-generated content. In fact: prior restraint. As ORG goes on to say, there is no requirement to tell users why their content has been blocked.

Until now, the impact of most badly-formulated British legislative proposals has been sort of abstract. Data retention, for example: you know that pervasive mass surveillance is a bad thing, but most of us don’t really expect to feel the impact personally. This is different. Some of my non-UK friends will only use Signal to communicate, and I doubt a day goes by that I don’t look something up on Wikipedia. I could use a VPN for that, but if the only way to use Signal is to have a non-UK phone? I can feel those losses already.

And if people think they dislike those ubiquitous cookie banners and consent clickthroughs, wait until they have to age-verify all over the place. Worst case: this bill will be an act of self-harm that one day will be as inexplicable to future generations as Brexit.

The UK is not the only one pursuing this path. Age verification in particular is catching on. The US states of Virginia, Mississippi, Louisiana, Arkansas, Texas, Montana, and Utah have all passed legislation requiring it; Pornhub now blocks users in Mississippi and Virginia. The likelihood is that many more countries will try to copy some or all of the UK bill’s provisions, just as Australia’s law requiring the big social media platforms to negotiate with news publishers is spawning copies in Canada and California.

This is where the real threat of the “splinternet” lies. Think of requiring 150,000 websites to implement age verification and proactively police content. Many of those sites, as the law firm Mishcon de Reya writes, may not even be based in the UK.

This means that any site located outside the UK – and perhaps even some that are based here – will be asking, “Is it worth it?” For a lot of them, it won’t be. Which means that however much the Internet retains its integrity, the British user experience will be the Internet as a sea of holes.

Illustrations: Drunk parrot in a Putney garden (by Simon Bisson; used by permission).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Follow on Mastodon.