As language models get cheaper, it’s dawned on me what kind of “AI” I’d like to have: a fully personalized chat bot that has been trained on my 30-plus years of output plus all the material I’ve read, watched, listened to, and taken notes on all these years. A clone of my brain, basically, with more complete and accurate memory updated alongside my own. Then I could discuss with it: what’s interesting to write about for this week’s net.wars?
I was thinking of what’s happened with voice synthesis. In 2011, it took the Scottish company Cereproc months to build a text-to-speech synthesizer from recordings of Roger Ebert’s voice. Today, voice synthesizers are all over the place – not personalized like Ebert’s, but able to read a set text plausibly enough to scare voice actors.
I was also thinking of the Stochastic Parrots paper, whose first anniversary was celebrated last week by authors Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell. An important part of the paper advocates for smaller, better-curated language models: more is not always better. I can’t find a stream for the event, but here’s the reading list collected during the proceedings. There’s lots I’d rather eliminate from my personal assistant. Eliminating unwanted options upfront has long been a widspread Internet failure, from shopping sites (“never show me pet items”) to news sites (“never show me fashion trends”). But that sort of selective display is more difficult and expensive than including everything and offering only inclusion filters.
A computational linguistics expert tells me that we’re an unknown amount of time away from my dream of the wg-bot. Probably, if such a thing becomes possible it will be based on someone’s large language model and fine-tuned with my stuff. Not sure I entirely like this idea; it means the model will be trained on stuff I haven’t chosen or vetted and whose source material is unknown, unless we get a grip on forcing disclosure or the proposed BLOOM academic open source language model takes over the world.
I want to say that one advantage to training a chatbot on your own output is you don’t have to worry so much about copyright. However, the reality is that most working writers have sold all rights to most of their work to large publishers, which means that such a system is a new version of digital cholera. In my own case, by the time I’d been in this business for 15 years, more than half of the publications I’d written for were defunct. I was lucky enough to retain at least non-exclusive rights to my most interesting work, but after so many closures and sales I couldn’t begin to guess – or even know how to find out – who owns the rights to the rest of it. The question is moot in any case: unless I choose to put those group reviews of Lotus 1-2-3 books back online, probably no one else will, and if I do no one will care.
On Mastodon, the specter of the upcoming new! improved! version of the copyright wars launched by the arrival of the Internet: “The real generative AI copyright wars aren’t going to be these tiny skirmishes over artists and Stability AI. Its going to be a war that puts filesharing 2.0 and the link tax rolled into one in the shade.” Edwards is referring to this case, in which artists are demanding billions from the company behind the Stable Diffusion engine.
Edwards went on to cite a Wall Street Journal piece that discusses publishers’ alarmed response to what they perceive as new threats to their business. First: that the large piles of data used to train generative “AI” models are appropriated without compensation. This is the steroid-fueled analogue to the
The second is that users, satisfied with the answers they receive from these souped-up search services will no longer bother to visit the sources – especially since few, most notably Google, seem inclined to offer citations to back up any of the things they say.
The third is outright plagiarism without credit by the chatbot’s output, which is already happening.
The fourth point of contention is whether the results of generative AI should be themselves subject to copyright. So far, the consensus appears to be no, when it comes to artwork. But some publishers who have begun using generative chatbots to create “content” no doubt claim copyright in the results. It might make more sense to copyright the *prompt*. (And some bright corporate non-soul may yet try.)
At Walled Culture, Glyn Moody discovers that the EU has unexpectedly done something right by requiring positive opt-in to copyright protection against text and data mining. I’d like to see this as a ray of hope for avoiding the worst copyright conflicts, but given the transatlantic rhetoric around privacy laws and data flows, it seems much more likely to incite another trade conflict.
It now dawns on me that the system I outlined in the first paragraph is in fact Vannevar Bush’s Memex. Not the web, which was never sufficiently curated, but this, primed full of personal intellectual history. The “AI” represents those thousands of curating secretaries he thought the future would hold. As if.
Illustrations: Stable Diffusion rendering of “stochastic parrots”, as prompted by Jon Crowcroft.