Small data

Shortly before this gets posted, Jon Crowcroft and I will have presented this year’s offering at Gikii, the weird little conference that crosses law, media, technology, and pop culture. This is, as I understand it, more or less what we will have said, with some added explanation for the slightly less technical audience I imagine will read this.

Two years ago, a team of four researchers – Timnit Gebru, Emily Bender, Margaret Mitchell (writing as Shmargaret Shmitchell), and Angelina McMillan-Major – wrote a now-famous paper called On the Dangers of Stochastic Parrots (PDF) calling into question the usefulness of the large language models (LLMs) that have caused so much ruckus this year. The “Stochastic Four” argued instead for small models built on carefully curated data: less prone to error, less exploitative of people’s data, less damaging to the planet. Gebru got fired over this paper; Google also fired Mitchell soon afterwards. Two years later, neural networks pioneer Geoff Hinton quit Google in order to voice similar concerns.

Despite the hype, LLMs have many problems. They are fundamentally an extractive technology and are resource-intensive. Building LLMs requires massive amounts of training data; so far, the companies have been unwilling to acknowledge their sources, perhaps because (as is happening already) they fear copyright suits.

More important from a technical standpoint is the issue of model collapse: models degrade when they begin to ingest synthetic AI-generated data instead of human input. We’ve seen this before with Google Flu Trends, which degraded rapidly as incoming search data came to include many searches on flu-like symptoms that weren’t actually flu, and others that simply reflected the frequency of local news coverage. “Data pollution”, as LLM-generated data fills the web, will make the web an increasingly useless source of training data for future generations of generative AI: lots more noise, drowning out the signal (in the photo above, the signal would be the parrot).
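The dynamic is easy to demonstrate in miniature. The toy sketch below is my own illustration, nothing like a real LLM: it fits a one-dimensional Gaussian “model” to data, then trains each new generation only on samples drawn from the previous model, and the distribution’s diversity steadily collapses.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(0.0, 1.0, 50)
initial_std = data.std()

for generation in range(300):
    # Fit a simple model (mean and spread) to the current data...
    mu, sigma = data.mean(), data.std()
    # ...then train the next generation purely on synthetic samples.
    data = rng.normal(mu, sigma, 50)

final_std = data.std()
# The spread shrinks across generations: estimation error compounds
# and the distribution's tails (its "diversity") disappear.
```

Each generation’s small estimation error compounds, which is the statistical heart of the model collapse argument: once models feed on their own output, the rare and surprising cases are the first casualties.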

Instead, if we follow the lead of the Stochastic Four, the more productive approach is small data – small, carefully curated datasets that train models to match specific goals. Far less resource-intensive, far fewer issues with copyright, appropriation, and extraction.

We know what the LLM future looks like in outline: big, centralized services, because no one else will be able to amass enough data. In that future, surveillance capitalism is an essential part of data gathering. Small language model (SLM) futures could look quite different: decentralized, with realigned incentives. At one point, we wanted to suggest that small data could bring the end of surveillance capitalism; that’s probably an overstatement. But small data could certainly create an ecosystem in which the case for mass data collection would be less compelling.

Jon and I imagined four primary alternative futures: federation, personalization, some combination of those two, and paradigm shift.

Precursors to a federated small data future already exist; these include customer service chatbots and predictive text assistants. In this future, we could imagine personalized LLM servers designed to serve specific needs.

An individualized future might look something like I suggested here in March: a model that fits in your pocket and is constantly updated with material of your own choosing. Such a device might be the closest yet to Vannevar Bush’s 1945 idea of the Memex (PDF), updated for the modern era by automating the dozens of secretary-curators he imagined doing the grunt work of labeling and selection. That future again has precursors in techniques, such as federated learning, for sharing the computation but not the data, a design we see proposed for health care, where the data is too sensitive to share unless there’s a significant public interest (as in pandemics or very rare illnesses), and in other data analysis designs intended to protect privacy.
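Those computation-sharing designs have a simple core, which federated averaging illustrates. The sketch below is my own toy example with invented numbers, not any real health-care deployment: each “client” trains a tiny model on private data that never leaves it, and a server sees and averages only the model parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five "clients", each holding private data for the same underlying
# relationship y = w * x with true w = 3.0. The raw (x, y) pairs
# never leave the client.
TRUE_W = 3.0
clients = []
for _ in range(5):
    x = rng.uniform(-1, 1, 100)
    y = TRUE_W * x + rng.normal(0, 0.1, 100)
    clients.append((x, y))

w = 0.0  # shared global model parameter
for round_num in range(50):
    local_ws = []
    for x, y in clients:
        w_local = w
        for _ in range(10):  # local gradient steps on private data
            grad = 2 * np.mean((w_local * x - y) * x)
            w_local -= 0.1 * grad
        local_ws.append(w_local)
    # The server averages parameters only; it never sees the data.
    w = float(np.mean(local_ws))
```

After a few dozen rounds the shared parameter converges close to the true value, even though no single dataset was ever pooled; that separation of computation from data is what makes the approach attractive for sensitive domains.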

In 2007, the science fiction writer Charles Stross suggested something like this, though he imagined it as a comprehensive life log, which he described as a “google for real life”. So this alternative future would look something like Stross’s $10 pocket life log, with enhanced statistics-based data analytics.

Imagining what a paradigm shift might look like is much harder. That’s the kind of thing science fiction writers do; it’s 16 years since Stross gave that life log talk. However, in his 2016 history of advertising, The Attention Merchants, Columbia professor Tim Wu argued that industrialization was the vector that made advertising and its grab for our attention part of commerce. A hundred and fifty-odd years later, the centralizing effects of industrialization are being challenged, starting with energy (renewables and local power generation) and social media (the fediverse). Might language models also play their part in bringing about a new, more collaborative and cooperative society?

It is, in other words, just possible that the hot new technology of 2023 is simply a dead end bringing little real change. It’s happened before. There have been, as Wu recounts, counter-moves and movements before, but they didn’t have the technological affordances of our era.

In the Q&A that followed, Miranda Mowbray pointed out that companies are trying to implement the individualized model, but that it’s impossible to do unless there are standardized data formats, and even then hard to do at scale.

Illustrations: Spot the parrot seen in a neighbor’s tree.

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast.

Review: Making a Metaverse That Matters

Making a Metaverse That Matters: From Snow Crash and Second Life to A Virtual World Worth Fighting For
By Wagner James Au
Publisher: Wiley
ISBN: 978-1-394-15581-1

A couple of years ago, when “the metaverse” was the hype-of-the-month, I kept wondering why people didn’t just join 20-year-old Second Life, or a game world. Even then the idea wasn’t new: the first graphical virtual world, Habitat, launched in 1988. And even *that* was preceded by text-based MUDs that despite their limitations afforded their users the chance to explore a virtual world and experiment with personal identity.

I never really took to Second Life. The initial steps – download the software, install it, choose a user name and password, and then an avatar – aren’t difficult. The trouble begins after that: what do I do now? Fly to an island, and then…what?

I *did*, once, have a commission to interview a technology company executive, who dressed his avatar in a suit and tie to give a lecture in a virtual auditorium and then joined me in the now-empty auditorium to talk, his avatar now changed into jeans, T-shirt, and baseball cap.

In his new book, Making a Metaverse That Matters, the freelance journalist Wagner James Au argues that this sort of image consciousness derives from allowing humanoid avatars; they lead us to bring the constraints of our human societies into the virtual world, where instead we could free ourselves. Humanoid form leads people to observe the personal space common in their culture, apply existing prejudices, and so on. Au favors blocking markers such as gender and skin color that are the subject of prejudice offline. I’m not convinced this will make much difference; even on text-based systems with numbers instead of names, disguising your real-life physical characteristics takes work.

Au spent Second Life’s heyday as its embedded reporter; his news and cultural reports eventually became his 2008 book, The Making of Second Life: Notes from a New World. Part of his new book reassesses that work and reports regrets. He wishes he had been a stronger critic back then instead of being swayed by his own love for the service. Second Life’s biggest mistake, he thinks, was persistently refusing to call itself a game or add game features. The result was a dedicated user base that stubbornly failed to grow beyond about 600,000 as most people joined and reacted the way I did: what now? But some of those 600,000 benefited handsomely, as Au documents: some remade their lives, and a few continue to operate million-dollar businesses built inside the service.

Au returns repeatedly to Snow Crash author Neal Stephenson’s original conception of the metaverse, a single pervasive platform. The metaverse of Au’s dreams has community as its core value, is accessible to all, is a game (because non-game virtual worlds have generally failed), and is collaborative for creators. In other words, pretty much the opposite of anything Meta is likely to build.


Whatever you’re starting to binge-watch, slow down. It’s going to be a long wait for fresh content out of Hollywood.

Yesterday, the actors union, SAG-AFTRA, went out on strike alongside the members of the Writers Guild of America, who have been walking picket lines since May 2. Like the writers, actors have seen their livelihoods shrink as US TV shows’ seasons shorten, “reruns” that pay residuals fade into the past, and DVD royalties dry up, while royalties from streaming remain tiny by comparison. At the Hollywood and Levine podcast, the veteran screenwriter Ken Levine gives the background to the WGA’s action. But think of it this way: the writers and cast of The Big Bang Theory may be the last to share fairly in the enormous profits their work continues to generate.

The even bigger threat? AI that makes it possible to capture the actor’s likeness and then reuse it ad infinitum in new work. This, as Malia Mendez writes at the LA Times, is the big fear. In a world where Harrison Ford at 80 is making movies in which he’s aged down to look 40 and James Earl Jones has agreed to clone his voice for reuse after his death, it’s arguably a rational big fear.

We’ve had this date for a long time. In the late 1990s I saw a demonstration of “vactors” – virtual actors that were created by scanning a human actor moving in various ways and building a library of movements that thereafter could be rendered at will. At the time, the state of the art was not much advanced from the liquid metal man in Terminator 2. Rendering film-quality characters was very slow, but that was then and this is now, and how long before rendering moving humans can be done in high-def in real-time at action speed?

The studios are already pushing actors into allowing synthesized reuse. California law grants public figures, including actors, publicity rights that prevent the commercial use of their name and likeness without consent. However, Mendez reports that current contracts already require actors to waive those rights to grant the studios digital simulation or digital creation rights. The effects are worst in reality television, where the line is blurred between the individual as a character on a TV show and the individual in their off-screen life. She quotes lawyer Ryan Schmidt: “We’re at this Napster 2001 moment…”

That moment is even closer for voice actors. Last year, Actors’ Equity announced a campaign to protect voice actors from their synthesized counterparts. This week, one of those synthesizers is providing commentary – more like captions, really – for video clips like this one at Wimbledon. As I said last year, while synthesized voices will be good enough for many applications such as railway announcements, there are lots of situations that will continue to require real humans. Sports commentary is one; commentators aren’t just there to provide information, they’re *also* there to sell the game. Their human excitement at the proceedings is an important part of that.

So SAG-AFTRA, like the Writers Guild of America, is seeking limitations on how studios may use AI, payment for such uses, and rules on protecting against misuse. In another LA Times story, Anoushka Sakoui reports that the studios’ offer included requiring “a performer’s consent for the creation and use of digital replicas or for digital alterations of a performance”. Like publishers “offering” all-rights-in perpetuity contracts to journalists and authors since the 1990s, the studios are trying to ensure they have all the rights they could possibly want.

“You cannot change the business model as much as it has changed and not expect the contract to change, too,” SAG-AFTRA president Fran Drescher said yesterday in a speech that has been widely circulated.

It was already clear this is going to be a long strike that will damage tens of thousands of industry workers and the economy of California. Earlier this week, Dominic Patten reported at Deadline that the Alliance of Motion Picture and Television Producers plans to delay resuming talks with the WGA until October. By then, producers told Patten, writers will be losing their homes and will be more amenable to accepting the AMPTP’s terms. The AMPTP officially denies this, saying it’s committed to reaching a deal. Nonetheless, there are no ongoing talks. As Ken Levine pointed out in a pair of blogposts written during the 2007 writers strike, management is always in control of timing.

But as Levine also says, in the “old days” a top studio mogul could simply say, “Let’s get this done” and everyone would get around the table and make a deal. The new presence of tech giants Netflix, Amazon, and Apple in the AMPTP membership makes this time different. At some point, the strike will be too expensive for legacy Hollywood studios. But for Apple, TV production is a way to sell services and hardware. For Amazon, it’s a perk that comes with subscribing to its Prime delivery service. Only Netflix needs a constant stream of new work – and it can commission it from creators across the globe. All three of them can wait. And the longer they drag this out, the more the traditional studios will lose money and weaken as competitors.

Legacy Hollywood doesn’t seem to realize it yet, but this strike is existential for them, too.

Illustrations: SAG-AFTRA president Fran Drescher, announcing the strike on Thursday.


The horns of a dilemma

It has always been possible to conceive a future for Mastodon and the Fediverse that goes like this: incomers join the biggest servers (“instances”). The growth of those instances, if they can afford it, accelerates. When the sysadmins of smaller instances burn out and withdraw, their users also move to the largest instances. Eventually, the Fediverse landscape is dominated by a handful of very large instances (which enshittify in the traditional way) with a long tail of small and smaller ones. The very large ones begin setting rules – mostly for good reasons like combating abuse, improving security, and offering new features – that the very small ones struggle to keep up with. Eventually, it becomes too hard for most small instances to function.

This is the history of email. In 2003, when I set up my own email server at home, almost every techie had one. By this year, when I decommissioned it in favor of hosted email, almost everyone had long since moved to Gmail or Hotmail. It’s still possible to run an independent server, but the world is increasingly hostile to them.

Another possible Fediverse future: the cultural norms that Mastodon and other users have painstakingly developed over time become swamped by a sudden influx of huge numbers of newcomers when a very large instance joins the federation. The newcomers, who know nothing of the communities they’re joining, overwhelm their history and culture. The newcomers are despised and mocked – but meanwhile, much of the previous organically grown culture is lost, and people wanting intelligent conversation leave to find it elsewhere.

This is the history of Usenet, which in 1994 struggled to absorb 1 million AOLers arriving via a new gateway and software whose design reflected AOL’s internal conventions rather than Usenet’s history and culture. The result was to greatly exacerbate Usenet’s existing problems of abuse.

A third possible Fediverse future: someone figures out how to make money out of it. Large and small instances continue to exist, but many become commercial enterprises, and small instances increasingly rely on large instances to provide services the small instances need to stay functional. While both profit from that division of labor, the difficulty of discovery means small servers stay small, and the large servers become increasingly monopolistic, exploitative, and unpleasant to use. This is the history of the web, with a few notable exceptions such as Wikipedia and the Internet Archive.

A fourth possible future: the Fediverse remains outside the mainstream, and admins continue to depend on donations to maintain their servers. Over time, the landscape of servers will shift as some burn out or run out of money and are replaced. This is roughly the history of IRC, which continues to serve its niche. Many current Mastodonians would be happy with this; as long as there’s no corporate owner no one can force anyone out of business for being insufficiently profitable.

These forking futures are suddenly topical as Mastodon administrators consider how to respond to the news that Facebook will launch a new app that will interoperate with Mastodon and any other network that uses the ActivityPub protocol. Early screenshots suggest a clone of Twitter, Meta’s stated target, and reports say that Facebook is talking to celebrities like Oprah Winfrey and the Dalai Lama as potential users. The plan is reportedly that users will access the new service via their Instagram IDs and passwords. Top-down and celebrity-driven is the opposite of the Fediverse.

It should not be much comfort to anyone that the competitor the company wants to kill with this initiative is Twitter, not Mastodon, because either way Meta doesn’t care about Mastodon and its culture. Mastodon is a rounding error even for Instagram alone. Twitter is also comparatively small (and, like Reddit, too text-based to grow much further), but Meta sees in it the opportunity to capture its influencers and build profits around them.

The Fediverse is a democracy in the sense that email and Usenet were: admins decide their server’s policy, and users can only accept it or reject it by moving their account (which generally loses their history). For admins, how to handle Meta is not an easy choice. Meta has approached the admins of some of the larger Mastodon instances for discussions; they must sign an NDA or give up the chance to influence developments. That decision is for the largest few, but potentially every Mastodon instance operator will have to decide the bigger question: do they federate with Meta or not? Refusal means their users can’t access Meta’s wider world, which will inevitably include many of their friends; acceptance means change and loss of control. As I’ve said here before, something that is “open” only to your concept of “good people” isn’t open at all; it’s closed.

At Chronicles of the Instantly Curious, Carey Lening deplores calls to shun Meta as elitist; the AOL comparison draws itself. Even so, the more imminent bad future for Mastodon is this fork that could split the Fediverse into two factions. Of course the point of being decentralized is to allow more choice over who you socially network with. But until now, none of those choices took on the religious overtones associated with the most heated cyberworld disputes. Fasten your seatbelts…

Illustrations: A mastodon by Heinrich Harder (public domain, via Wikimedia).


A world of lawsuits

In the US this week the Supreme Court heard arguments in two cases centered on Section 230, the US law that shields online platforms from liability for third-party content. In Paris, UNESCO convened Internet for Trust to bring together governments and civil society to contemplate global solutions to the persistent problems of Internet regulation. And in the business of cyberspace, in what looks like desperation to stay afloat, Twitter began barring non-paying users (that is, the 99.8% of its user base that *doesn’t* subscribe to Twitter Blue) from using two-factor authentication via SMS, and Meta announced plans for a Twitter Blue-like subscription service for its Facebook, Instagram, and WhatsApp platforms.

In other words, the above policy discussions are happening exactly at the moment when, for the first time in nearly two decades, two of the platforms whose influence everyone is most worried about may be beginning to implode. Twitter’s issues are well-known. Meta’s revenues are big enough that there’s a long way for them to fall…but the company is spending large fortunes on developing the Metaverse, which no one may want, and watching its ad sales shrink and data protection fines rise.

The SCOTUS hearings – Gonzalez v. Google, experts’ live blog, Twitter v. Taamneh – have been widely covered in detail. In most cases, writers note that trying to discern the court’s eventual ruling from the justices’ questions is about as accurate as reading tea leaves. Nonetheless, Columbia professor Tim Wu predicts that Gonzalez will lose but that Taamneh could be very close.

In Gonzalez, the parents of a 23-year-old student killed in a 2015 ISIS attack in Paris argue that YouTube should be liable for radicalizing individuals via videos found and recommended on its platform. In Taamneh, the family of a Jordanian citizen who died in a 2017 ISIS attack in Istanbul sued Twitter, Google, and Facebook for failing to control terrorist content on their sites under anti-terrorism laws. A ruling assigning liability in either case could be consequential for S230. At TechDirt, Mike Masnick has an excellent summary of the Gonzalez hearing, as well as a preview of both cases.

Taamneh, on the other hand, asks whether social media sites are “aiding and abetting” terrorism via their recommendations engines under Section 2333 of the Antiterrorism and Effective Death Penalty Act (1996). Under the Justice Against Sponsors of Terrorism Act (2016) any US national who is injured by an act of international terrorism can sue anyone who “aids and abets by knowingly providing substantial assistance” to anyone committing such an act. The case turns on how much Twitter knows about its individual users and what constitutes substantial assistance. There has been some concern, expressed in amicus briefs, that making online intermediaries liable for terrorist content will result in overzealous content moderation. Lawfare has a good summary of the cases and the amicus briefs they’ve attracted.

Contrary to what many people seem to think, while S230 allows content moderation, it’s not a law that disproportionately protects large platforms, which didn’t exist when it was enacted. As S230 historian Jeff Kosseff tells Gizmodo: without liability protection a local newspaper or personal blog could not risk publishing reader comments, and Wikipedia could not function. Justice Elena Kagan has been mocked for saying the justices are “not the nine greatest experts on the Internet”, but she grasped perfectly that undermining S230 could create “a world of lawsuits”.

For the last few years, both Democrats and Republicans have called for S230 reform, but for different reasons. Democrats fret about the proliferation of misinformation; Republicans complain that they (“conservative voices”) are being censored. The UNESCO event, operating at the global level, took a broader view in trying to draft a framework for self-regulation. While it wouldn’t be binding, there’s some value in having a multi-stakeholder-agreed standard against which individual governmental proposals can be evaluated. One of the big gaps in the UK’s Online Safety bill, for example, is its failure to tackle misinformation or disinformation campaigns. Neither reforming S230 nor a framework for self-regulation will solve that problem either: over the last few years too much of the most widely disseminated disinformation has been posted from official accounts belonging to world leaders.

One interesting aspect is how many new types of “content” have been created since S230’s passage in 1996, when the dominant web analogy was print publishing. It’s not just recommendation algorithms; are “likes” third-party content? Are the thumbnails YouTube’s algorithm selects to show each visitor on its front page to entice viewers presentation or publishing?

In his biography of S230, The Twenty-Six Words That Created the Internet, Jeff Kosseff notes that although similar provisions exist in other legislation across the world, S230 is unique in that only America privileges freedom of speech to such an extreme extent. Most other countries aim for more of a balance between freedom of expression and privacy. In 1997, it was easy to believe that S230 enabled the Internet to export the US’s First Amendment around the world like a stowaway. Today, it seems more like the first answer to an eternally-recurring debate. Despite its problems, like democracy itself, it may continue to be the least-worst option.

Illustrations: US senator and S230 co-author Ron Wyden (D-OR) in 2011 (by JS Lasica via Wikimedia).



The science fiction author Charles Stross had a moment of excitement on Mastodon this week: WRITER CHALLENGE!

Stross challenged writers to use the word “esquivalience” in their work. The basic idea: turn this Pinocchio word into a “real” word.

Esquivalience is the linguistic equivalent of a man-made lake. Its creator, the editor Christine Lindberg, invented it for the 2001 edition of the New Oxford American Dictionary and defined it as “the willful avoidance of one’s official responsibilities; the shirking of duties”. It was a trap to catch anyone republishing the dictionary rather than developing their own (a job I have actually done). This is a common tactic for protecting large compilations where it’s hard to prove copying – fake streets are added to maps, for example, and the people who rent out mailing lists add ringers whose use will alert them if the list is used outside the bounds of the contractual agreement.

There is, however, something peculiarly distasteful about fake entries in supposedly authoritative dictionaries, even though I agree with Lindberg that “esquivalience” is a pretty useful addition to the language. It’s perfect – perhaps in the obvious adjectival form “esquivalient” – for numerous contemporary politicians, though here be dragons: “willful” risks libel actions.

Probably most writers have wanted to make up words, and many have, from playwright and drama critic George S. Kaufman, often credited with coining, among other things, “underwhelmed”, to Anthony Burgess, who invented an entire futurist street language for A Clockwork Orange. Some have gone so far as to create enough words to publish dictionaries – such as the humorist Gelett Burgess, whose Burgess Unabridged (free ebook!) compiles “words you’ve always needed”. From that collection, I have always been particularly fond of Burgess’s “wox”, defined as “a state of placid enjoyment; sluggish satisfaction”. It seems particularly apt in the hours immediately following Thanksgiving dinner.

In these cases, though, the context lets you know the language is made up. The dictionary is supposed to be authoritative, admitting words only after they are well-established. The presence of fake words feels damaging in a way that a fake place on a map doesn’t. It’s comparatively easy to check whether a place exists by going there, but at some point down the echoing corridors of time *every* word was used for the first time. Pinpointing exactly when is hard unless someone ‘fesses up. I don’t like the idea that my dictionary is lying to me. Better if NOAD had planted two fake words and had them recursively point at each other for their definitions.

I had been avoiding the ChatGPT hoopla, but it seemed plausible to ask it: is “esquivalience” a real word? Its response started well enough: “‘Esquivalience’ is not recognized as a standard word in the English language. It is a made-up word…” And then cuckoo land arrived: “…that was created by a writer named Adam Jacot de Boinod for his book “The Meaning of Tingo”.” Pause to research. The book in question was written in 2006. The word “esquivalience” does not, from a quick text search, appear in it. Huh? I went on to suggest Christine Lindberg’s name to ChatGPT, and after a digression attributing the word to the singer-songwriter Christine Lavin, it appeared to find references to Lindberg’s “claim” in its corpus of data. But, it continued to warn, in every response, “it is still not recognized as a standard word in the English language”. It’s a bot. It’s not being stern. It doesn’t know what it’s saying. Getting it to agree on Christine Lindberg as the original source isn’t winning the argument. It’s just giving it a different prompt.

I ask if it has ever encountered the word “wox”. “As an AI language model, I have certainly come across the word ‘wox’.” A human reads lightly insulted pride into that. Resist. It’s a bot. It has no pride. The bot went on to speculate on possible origins (“it may be a neologism…”). I ask if it’s heard of Gelett Burgess. Oh, yes, followed by a short biography. Then, when told Burgess invented “wox”: “Gelett Burgess did indeed invent the word…” and goes on to cite the correct book…but then continues that Burgess defined it as “to make fun of, to poke fun at” which is absolutely not what Burgess says, and I know this because I have the original 1914 book right here, and the definition I cited above is right there on p112. The bot does “apologize” every time you point out a mistake, though.

This isn’t much of a sample, but based on it, I find ChatGPT quite alarming as an extraordinarily efficient way of undermining factual knowledge. The responses sound authoritative, but every point must be fact-checked. It could not be worse-suited for today’s world, where everyone wants fast answers. Coupled with search, it turns the algorithms that give us answers into even more obscure and less trustworthy black boxes. Wikipedia has many flaws, but its single biggest strength is its sourcing and curation; how every page has been changed and shaped over the years is open for inspection.

So when ChatGPT went on to say that Gelett Burgess is widely credited with coining the term “blurb”, Wikipedia is where I turned. Wikipedia agrees (asked, ChatGPT cites the Oxford English Dictionary). Burgess FTW.

Illustrations: Gelett Burgess’s 1914 Burgess Unabridged, a dictionary of made-up words.
