Ex libris

So as previously discussed here three years ago and two years ago, on March 24 the US District Court for the Southern District of New York found that the Internet Archive’s controlled digital lending violates copyright law. Half of my social media feed on this subject filled immediately with people warning that publishers want to kill libraries and that this judgment is a dangerous step toward limiting access to information; the other half is going, “They’re stealing from authors. Copyright!” Both of these things can be true. And incomplete.

To recap: in 2006 the Internet Archive set up the Open Library to offer access to digitized books under “controlled digital lending”. The system allows each book to be “out” on “loan” to only one person at a time, with waiting lists for popular titles. In a white paper, lawyers David R. Hansen and Kyle K. Courtney call this “format shifting” and say that because the system replicates library lending it is fair use. Also germane: the Archive points to a 2007 California decision that it is in fact a library. Other countries may beg to differ.
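The mechanics of controlled digital lending are simple to picture. A minimal sketch, in Python (the class and method names here are mine for illustration, not anything the Open Library actually uses):

```python
from collections import deque

class ControlledDigitalLending:
    """Toy model of one-copy-per-title lending with waiting lists."""

    def __init__(self):
        self.on_loan = {}   # title -> borrower currently holding the copy
        self.waitlist = {}  # title -> queue of borrowers waiting their turn

    def borrow(self, title, borrower):
        # Only one copy may be "out" at a time; everyone else queues.
        if title not in self.on_loan:
            self.on_loan[title] = borrower
            return "lent"
        self.waitlist.setdefault(title, deque()).append(borrower)
        return "waitlisted"

    def return_book(self, title):
        # On return, the copy passes to the next person in line, if any.
        queue = self.waitlist.get(title)
        if queue:
            self.on_loan[title] = queue.popleft()
        else:
            del self.on_loan[title]

lib = ControlledDigitalLending()
lib.borrow("Moby-Dick", "alice")   # "lent"
lib.borrow("Moby-Dick", "bob")     # "waitlisted"
lib.return_book("Moby-Dick")       # copy passes to bob
```

The National Emergency Library, described below, in effect deleted the one-copy check and the waiting list entirely: any borrower, any title, any time.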

When public libraries closed at the beginning of the covid-19 pandemic, the Internet Archive announced the National Emergency Library, which suspended the one-copy-at-a-time rule and scrubbed the waiting lists so anyone could borrow any book at any time. The resulting publicity was the first time many people had heard of the Open Library, although authors had already complained. Hachette Book Group, Penguin Random House, HarperCollins, and Wiley filed suit. Shortly afterwards, the Archive shut down the National Emergency Library. The Open Library continues, and the Archive will appeal the judge’s ruling.

On the they’re-killing-the-libraries side: Mike Masnick and Fight for the Future. At Walled Culture, Glyn Moody argues that sharing ebooks helps sell paid copies. Many authors agree with the publishers that their living is at risk; a group of exceptions, including Neil Gaiman, Naomi Klein, and Cory Doctorow, has published an open letter defending the Archive.

At Vice, Claire Woodstock lays out some of the economics of library ebook licenses, which eat up budgets but leave libraries vulnerable and empty-shelved when a service is withdrawn. She also notes that the Internet Archive digitizes physical copies it buys or receives as donations, and does not pay for ebook licenses.

Brief digression back to 1996, when Pamela Samuelson warned in Wired of the coming copyright battles. Many of the key points in her piece have since been enshrined in law, such as the ban on circumventing copy protection; others, such as requiring Internet Service Providers to prevent users from uploading copyrighted material, remain in play today. Number three on her copyright maximalists’ wish list: eliminating first-sale rights for digitally transmitted documents. First sale is the doctrine that enables libraries to lend books.

It is therefore entirely plausible that commercial publishers believe every library loan is a missed sale. Outside the US, many countries have a public lending right that pays authors royalties on loans for exactly that reason. The Internet Archive doesn’t pay those, either.

It surely isn’t facing the headwinds public libraries are. In the UK, years of austerity have shrunk library budgets and therefore their numbers and opening hours. In the US, libraries are fighting against book bans; in Missouri, the Republican-controlled legislature voted to defund the state’s libraries entirely, apparently in retaliation.

At her blog, librarian and consultant Karen Coyle, who has thought for decades about the future of libraries, takes three postings to consider the case. First, she offers a backgrounder, agreeing that the Archive’s losing on appeal could bring consequences for other libraries’ digital lending. In the second, she teases out the differences between academic/research libraries and public libraries and between research and reading. While journals and research materials are generally available in electronic format, centuries of books are not, and scanned books (like those the Archive offers) are a poor reading experience compared to modern publisher-created ebooks. These distinctions are crucial to her third posting, which traces the origins of controlled digital lending.

As initially conceived by Michelle M. Wu in a 2011 paper for Law Library Journal, controlled digital lending was a suggestion that law libraries could, either singly or in groups, buy a hard copy for their holdings and then circulate a digitized copy, similar to an Inter-Library Loan. Law libraries serve limited communities, and their comparatively modest holdings have a known but limited market.

By contrast, the Archive gives global access to millions of books it has scanned. In court, it argued that the availability of popular commercial books on its site has not harmed publishers’ revenues. The judge disagreed: the “alleged benefits” of access could not outweigh the market harm to the four publishers who brought the suit. This view entirely devalues the societal role libraries play, and Coyle, like many others, is dismayed that the judge saw the case purely in terms of its effect on the commercial market.

The question I’m left with is this: is the Open Library a library or a disruptor? If these were businesses, it would obviously be the latter: it avoids many of the costs of local competitors, and asks forgiveness not permission. As things are, it seems to be both: it’s a library for users, but a disruptor to some publishers, some authors, and potentially the world’s libraries. The judge’s ruling captures none of this nuance.

Illustrations: 19th century rendering of the Great Library of Alexandria (via Wikimedia).

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Follow on Twitter.

Memex 2.0

As language models get cheaper, it’s dawned on me what kind of “AI” I’d like to have: a fully personalized chatbot trained on my 30-plus years of output plus all the material I’ve read, watched, listened to, and taken notes on over all those years. A clone of my brain, basically, with more complete and accurate memory, updated alongside my own. Then I could discuss with it: what’s interesting to write about for this week’s net.wars?

I was thinking of what’s happened with voice synthesis. In 2011, it took the Scottish company Cereproc months to build a text-to-speech synthesizer from recordings of Roger Ebert’s voice. Today, voice synthesizers are all over the place – not personalized like Ebert’s, but able to read a set text plausibly enough to scare voice actors.

I was also thinking of the Stochastic Parrots paper, whose first anniversary was celebrated last week by authors Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell. An important part of the paper advocates for smaller, better-curated language models: more is not always better. I can’t find a stream for the event, but here’s the reading list collected during the proceedings. There’s lots I’d rather eliminate from my personal assistant. Eliminating unwanted options upfront has long been a widespread Internet failure, from shopping sites (“never show me pet items”) to news sites (“never show me fashion trends”). But that sort of selective display is more difficult and expensive than including everything and offering only inclusion filters.

A computational linguistics expert tells me that we’re an unknown amount of time away from my dream of the wg-bot. Probably, if such a thing becomes possible, it will be based on someone else’s large language model and fine-tuned with my stuff. I’m not sure I entirely like this idea; it means the model will be trained on material I haven’t chosen or vetted and whose source is unknown, unless we get a grip on forcing disclosure or the proposed BLOOM academic open source language model takes over the world.

I want to say that one advantage to training a chatbot on your own output is you don’t have to worry so much about copyright. However, the reality is that most working writers have sold all rights to most of their work to large publishers, which means that such a system is a new version of digital cholera. In my own case, by the time I’d been in this business for 15 years, more than half of the publications I’d written for were defunct. I was lucky enough to retain at least non-exclusive rights to my most interesting work, but after so many closures and sales I couldn’t begin to guess – or even know how to find out – who owns the rights to the rest of it. The question is moot in any case: unless I choose to put those group reviews of Lotus 1-2-3 books back online, probably no one else will, and if I do no one will care.

On Mastodon, Edwards raised the specter of the upcoming new! improved! version of the copyright wars launched by the arrival of the Internet: “The real generative AI copyright wars aren’t going to be these tiny skirmishes over artists and Stability AI. It’s going to be a war that puts filesharing 2.0 and the link tax rolled into one in the shade.” Edwards is referring to this case, in which artists are demanding billions from the company behind the Stable Diffusion engine.

Edwards went on to cite a Wall Street Journal piece that discusses publishers’ alarmed response to what they perceive as new threats to their business. First: that the large piles of data used to train generative “AI” models are appropriated without compensation. This is the steroid-fueled analogue to the link tax, under which search engines in Australia pay newspapers (primarily the Murdoch press) for including them in news search results. A similar proposal is pending in Canada.

The second is that users, satisfied with the answers they receive from these souped-up search services, will no longer bother to visit the sources – especially since few of them, most notably Google, seem inclined to offer citations to back up any of the things they say.

The third is outright plagiarism, without credit, in the chatbots’ output, which is already happening.

The fourth point of contention is whether the results of generative AI should themselves be subject to copyright. So far, the consensus appears to be no, at least when it comes to artwork. But some publishers who have begun using generative chatbots to create “content” will no doubt claim copyright in the results. It might make more sense to copyright the *prompt*. (And some bright corporate non-soul may yet try.)

At Walled Culture, Glyn Moody discovers that the EU has unexpectedly done something right by requiring positive opt-in to copyright protection against text and data mining. I’d like to see this as a ray of hope for avoiding the worst copyright conflicts, but given the transatlantic rhetoric around privacy laws and data flows, it seems much more likely to incite another trade conflict.

It now dawns on me that the system I outlined in the first paragraph is in fact Vannevar Bush’s Memex. Not the web, which was never sufficiently curated, but this, primed full of personal intellectual history. The “AI” represents those thousands of curating secretaries he thought the future would hold. As if.

Illustrations: Stable Diffusion rendering of “stochastic parrots”, as prompted by Jon Crowcroft.
