Large Language Publishing, by Jeff Pooley
https://doi.org/10.54900/zg929-e9595
A very interesting analysis of open access in science, its limits (notably around licensing and usage rights), and the use of scientific articles to train artificial intelligence (#AI), raising the question of who captures the associated profits: scholarly publishers never pay the producers of scientific knowledge and, on the contrary, exploit universities and their public funding through library journal subscriptions or APCs (fees paid by scientists to be published in #open_access). The text raises interesting paradoxes around these issues, drawing a parallel with the lawsuit that major newspapers such as the New York Times have brought against OpenAI and other AI companies for training their AI software on the papers' archives without authorization and, above all, without payment.
Thus the two main sources of trustworthy knowledge, science and journalism, are poised to extract protection money—to otherwise exploit their vast pools of vetted text as “training data.” But there’s a key difference between the news and science: Journalists’ salaries, and the cost of reporting, are covered by the companies. Not so for scholarly publishing: Academics, of course, write and review for free, and much of our research is funded by taxpayers. The Times suit is marinated in complaints about the costly business of journalism. The likes of Taylor & Francis and Springer Nature won’t have that argument to make. It’s hard to call out free-riding when it’s your own business model.
[...]
Enter Elsevier and its oligopolistic peers. They guard (with paywalled vigilance) a large share of published scholarship, much of which is unscrapable. A growing proportion of their total output is, it’s true, open access, but a large share of that material carries a non-commercial license. Standard OA agreements tend to grant publishers blanket rights, so they have a claim—albeit one contested on fair-use grounds by OpenAI and the like—to exclusive exploitation. Even the balance of OA works that permit commercial re-use are corralled with the rest, on proprietary platforms like Elsevier’s ScienceDirect. Those platforms also track researcher behavior, like downloads and citations, that can be used to tune their models’ outputs.
[...]
As the Times lawsuit suggests, there’s a big legal question mark hovering over the big publishers’ AI prospects. The key issue, winding its way through the courts, is fair use: Can the likes of OpenAI scrape up copyrighted content into their models, without permission or compensation? The Silicon Valley tech companies think so; they’re fresh converts to fair-use maximalism, as revealed by their public comments filed with the US Copyright Office. The companies’ “overall message,” reported The Verge in a round-up, is that they “don’t think they should have to pay to train AI models on copyrighted work.” Artists and other content creators have begged to differ, filing a handful of high-profile lawsuits.
The publishers haven’t filed their own suits yet, but they’re certainly watching the cases carefully. Wiley, for one, told Nature that it was “closely monitoring industry reports and litigation claiming that generative AI models are harvesting protected material for training purposes while disregarding any existing restrictions on that information.” The firm has called for audits and regulatory oversight of AI models, to address the “potential for unauthorised use of restricted content as an input for model training.” Elsevier, for its part, has banned the use of “our content and data” for training; its sister company LexisNexis, likewise, recently emailed customers to “remind” them that feeding content to “large language models and generative AI” is forbidden.
[...]
One near-term consequence may be a shift in the big publishers’ approach to open access. The companies are already updating their licenses and terms to forbid commercial AI training—for anyone but them, of course. The companies could also pull back from OA altogether, to keep a larger share of exclusive content to mine. Esposito made the argument explicit in a recent Scholarly Kitchen post: “The unfortunate fact of the matter is that the OA movement and the people and organizations that support it have been co-opted by the tech world as it builds content-trained AI.” Publishers need “more copyright protection, not less,” he added. Esposito’s consulting firm, in its latest newsletter, called the liberal Creative Commons BY license a “mechanism to transfer value from scientific and scholarly publishers to the world’s wealthiest tech companies.” Perhaps, though I would preface the point: Commercial scholarly publishing is a mechanism to transfer value from scholars, taxpayers, and universities to the world’s most profitable companies.
[...]
There are a hundred and one reasons to worry about Elsevier mining our scholarship to maximize its profits. I want to linger on what is, arguably, the most important: the potential effects on knowledge itself. At the core of these tools—including a predictable avalanche of as-yet-unannounced products—is a series of verbs: to surface, to rank, to summarize, and to recommend. The object of each verb is us—our scholarship and our behavior. What’s at stake is the kind of knowledge that the models surface, and whose knowledge.
[...]
These dynamics of cumulative advantage have, in practice, served to amplify the knowledge system’s patterned inequalities—for example, in the case of gender and twentieth-century scholarship, aptly labeled the Matilda Effect by Margaret Rossiter.
The deployment of AI models in science, especially proprietary ones, may produce a Matthew Effect on the scale of Scopus, and with no paper trail. The problem is analogous to the well-documented bias-smuggling with existing generative models; image tools trained on, for example, mostly white and male photos reproduce the skew in their prompt-generated outputs. With our bias-laden scholarship as training data, the academic models may spit out results that, in effect, double down on inequality. What’s worse is that we won’t really know, due to the models’ black-boxed character. Thus the tools may act as laundering machines—context-erasing abstractions that disguise their probabilistic “reasoning.” Existing biases, like male academics’ propensity for self-citation, may win a fresh coat of algorithmic legitimacy. Or consider center-periphery dynamics along North-South and native-English-speaking lines: Gaps traceable to geopolitical history, including the legacy of European colonialism, may be buried still deeper. The models, in short, could serve as privilege multipliers.
[...]
So it’s an urgent task to push back now, and not wait until after the models are trained and deployed. What’s needed is a full-fledged campaign, leveraging activism and legislative pressure, to challenge the commercial publishers’ extractive agenda. One crucial framing step is to treat the impending AI avalanche as continuous with—as an extension of—the publishers’ in-progress mutation into surveillance-capitalist data businesses.
[...]
So far the publishers’ data-hoovering hasn’t galvanized scholars to protest. The main reason is that most academics are blithely unaware of the tracking—no surprise, given scholars’ too-busy-to-care ignorance of the publishing system itself. The library community is far more attuned to the unconsented pillage, though librarians—aside from SPARC—haven’t organized on the issue. There have been scattered notes of dissent, including a Stop Tracking Science petition, and an outcry from Dutch scholars over a 2020 read-and-publish agreement with Elsevier, largely because the company had baked its prediction products into the deal. In 2022 the German national research foundation, Deutsche Forschungsgemeinschaft (DFG), released its own report-cum-warning—“industrialization of knowledge through tracking,” in the report’s words. Sharp critiques from, among others, Bjorn Brembs, Leslie Chan, Renke Siems, Lai Ma, and Sarah Lamdan have appeared at regular intervals.
None of this has translated into much, not even awareness among the larger academic public. A coordinated campaign of advocacy and consciousness-raising should be paired with high-quality, in-depth studies of publisher data harvesting, modeled on SPARC’s recent ScienceDirect report. Any effort like this should be built on the premise that another scholarly-publishing world is possible. Our prevailing joint-custody arrangement—for-profit publishers and non-profit universities—is a recent and reversible development. There are lots of good reasons to restore custody to the academy. The latest is to stop our work from fueling the publishers’ AI profits.