Analysis: The success of Generative AI in the book sector is based on theft

28th September 2023

Generative, analytical and assistive informatics, sub-areas of so-called artificial “intelligence”, threaten numerous jobs and fields of labour in the book sector and will replace some professions with machines in the medium term, be it in writing, editing, proofreading, production, cover design, illustration, translation, the selection and editing of original and translated works, audiobook production, or the promotion and distribution of books.

Numerous criminal and damaging “AI business models” have already developed in the book sector – with fake authors, fake books and also fake readers. It has been proven that the foundations of large language models such as GPT, Meta’s LLaMA, StableLM and BERT were built from copyrighted books sourced from shadow libraries such as Library Genesis (LibGen), Z-Library (Bok), Sci-Hub and Bibliotik – piracy websites. Without legal regulation, generative technologies enable and accelerate the expansion of exploitation, the legitimisation of copyright infringement, climate harm, discrimination, the distortion of information and communication, identity theft, reputational damage, blacklisting, royalty fraud and fraud against collective licensing remuneration.

At the same time, a close look and a careful assessment are needed to categorise and regulate the individual aspects of advanced informatics, because not all smart software is “AI” and not every application is equally risky. We as a society, but especially as originators, as writers, need:

A clear legal position on the text and data mining (TDM) exception(s) in Art 3 and Art 4 of the 2019/790 CDSM Directive, to clarify whether machine learning is covered by TDM or not. At this time that is highly doubtful, and the consequence would be voluntary licensing for a new form of usage instead of opting out;
A secure way for us to decide whether writers’ and translators’ works may be used for scraping and as “training material” for machine learning and competing products, rather than having to accept previous illegal use or uncompensated terms;
A “clean slate”: The immediate shutdown of those generative AI applications that were developed on works used in violation of copyright and personal rights.

The success of Generative AI in the book sector is based on theft

The spreading, mostly uncritical enthusiasm for generative advanced informatics (“AI”), such as large-scale language, image or audio models that produce culture-like output in response to text prompts, lowers the appreciation of human creative labour. This enthusiasm is blind to the origins of these systems, as well as to their medium- and long-term consequences. This analysis draws attention to the seven sins of generative AI, which we consider a threat.

Assistive and analytical informatics must be distinguished from this, as they are mainly supporting software and are not meant to replace human creativity and labour.

The invisible side effects of generative advanced informatics (AI) in the book sector and its impact on writers

(1) Generative “AI” is based on exploitation of human labour.

If all participants were adequately remunerated, none of the big twelve generative text or image (re-)generators (such as StableLM, BERT, GPT, Midjourney) could realistically sustain their business. For years, and far BEFORE the TDM exception of the 2019/790 CDSM Directive, works by citizens[1], authors[2],[3],[4] and artists[5] have been stolen and used to train the software. This is the only way their existence today is possible. In order to categorise language, videos and images, “labellers” are also exploited, often for hourly wages of less than two euros. Eight percent[6] of all Americans do ghost work, the work of making so-called AI systems appear smart: data labelling, flagging, content filtering. More often, this repetitive work is outsourced from Silicon Valley for cost reasons[7],[8] to crowd and gig workers in Venezuela, Mexico, Bulgaria, India, Kenya, Syria or the Philippines, where there is neither a minimum wage nor trade unions.[9]

(2) AI harms human authors, their income and reputation through fake authors, fake books, fake readers – and identity theft:

(a) Uncontrolled AI output is being pushed into the bestseller lists with click farms: For months, the global self-publishing provider Amazon has been flooded with bogus books by fake authors whose text and visual content have been mashed together by (re-)generative text and image output software. AI bots from click farms “read” these nonsense works and push them into the bestseller lists[10]. This has led to a rapid decline in revenue for human authors in shared-revenue models such as Kindle KDP (a pot of revenue divided by pages read and number of authors, similar to Spotify). At peak times, 80 out of 100 Kindle KDP bestsellers were AI editions. Only in September 2023 did the retail giant add a new section focused on AI to its content guidelines[11] for KDP, which now include a definition of “AI-generated” so that AI output can be labelled.
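The dilution effect of such a shared pot can be illustrated with a little arithmetic. The following Python sketch assumes a deliberately simplified model in which a fixed monthly pot is divided purely by pages read. The pot size, the page counts and the function name `payout_per_page` are hypothetical illustrations, not Amazon’s actual figures or formula.

```python
def payout_per_page(pot_eur: float, pages_read: dict[str, int]) -> float:
    """Return the payout per page when a fixed pot is split by pages read."""
    total_pages = sum(pages_read.values())
    return pot_eur / total_pages


# Hypothetical fixed monthly pot (assumption, not a real figure)
pot = 10_000.0

# Hypothetical month without bot traffic
human_only = {"human_authors": 1_000_000}

# Same month, but click farms "read" an equal number of pages of AI editions
with_bots = {"human_authors": 1_000_000, "ai_editions": 1_000_000}

rate_before = payout_per_page(pot, human_only)
rate_after = payout_per_page(pot, with_bots)

# Income of human authors = their pages read * payout per page
human_income_before = rate_before * human_only["human_authors"]
human_income_after = rate_after * with_bots["human_authors"]

print(f"Human authors before bot traffic: {human_income_before:.2f} EUR")
print(f"Human authors after bot traffic:  {human_income_after:.2f} EUR")
```

Under these assumptions, human authors are read exactly as much as before, yet their share of the pot is halved, because the bot-read pages count towards the same total.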

(b) Identity theft and name deception: The world’s most important review platform Goodreads, like Amazon, is flooded with AI books published under the illegitimately used names of real human authors (or slightly altered spellings of well-known names). These books are listed as new releases in the authors’ profiles and entice readers to buy them. However, the income from these AI books flows to unknown sources. Human authors who are cheated out of their earnings must spend money on lawyers to defend themselves. So far, neither Goodreads nor Amazon has stopped this identity theft, which damages the reputation of human authors when a (low-quality) AI product is associated with their name.

(c) Unauthorised machine translations open up foreign-language markets and channel sales to unknown sources: We have observed cases of books being illegally translated by machine, without a licence, for example from the original English into Spanish and Portuguese, and published under a different name, usually via Amazon’s self-publishing service and often even equipped with an AI cover. The author names, in turn, deliberately resemble well-known names. The revenues flow to unknown sources.

(d) Publishing services only against payment by the author: Publishers are increasingly also using AI-generated covers. We know of cases where authors requested human graphic designers and were asked to pay for them. This practice is considered indecent. However, the authors, as the weaker contractual partners, hardly have the courage to refuse, out of a well-founded fear of being considered “difficult” or of being rejected by publishers for future cooperation. They are pressured into accepting a technology that harms their own profession at its core.

(e) Illegal remuneration claims to collective management organisations and media clients: It cannot be ruled out that automatically generated and machine-translated press articles, machine-translated books or even (re)generatively produced AI images already “enjoy” private copying remuneration from collective management organisations (CMOs), as there is no legal labelling obligation yet; or that generated texts, machine translations and generated images flow into the media on a royalty basis.

(f) Machine voices replace human narrators – and lead to the loss of licence fees for writers: DeepZen has been working on clone voices since 2013 and offers its repertoire to publishers to save on fees; numerous publishers, including renowned ones, have already used synthetic audiobook narration. The dislocation continues in the question of revenue distribution: when there is no audiobook narrator, to whom does the narrator’s calculated share go? The number of job calls for professional narrators is declining rapidly[12]. Producing a clone of a human voice in a professional studio costs less than 2,000 euros. It is even cheaper with programmes like Murf, Lobo, Respeacher, Voice.Ai or Overdub. After a few seconds of recording, a voice clone is generated with which you can make “Anyone” say “Anything”, no matter how immoral or fraudulent[13],[14].

In 2022, Google introduced its services for publishers in six countries; in early January 2023[15], Apple introduced a series of AI voices with names such as Madison and Jackson. Authors and publishers who sell their books through Apple Books are supposed to make use of these (and sign a confidentiality clause to this effect). The uses of clone voices or synthetic “voices” range from dubbing and audiobooks to trick calls by fraudsters and deep-fake interviews. Actors and audiobook narrators are increasingly confronted with having to agree to voice cloning in their contracts if they want to continue to be employed. This leads to the gradual elimination of voice actors and narrators. In addition, there are cases in which voice clones were created without the consent of the human speakers, or in which narrators are replaced by purely synthetic device voices (example: the “Tonie Box”, where synthetic robot voices read automatically generated bedtime stories to children[16]). AI dubbing also becomes relevant when e-books are read aloud by devices and voice clones although the author has neither granted a licence nor receives remuneration.

All in all, these new “AI business models” lead to the following paradox: those who made the existence of generative programmes possible in the first place are not remunerated, while those who use the software profit monetarily. This transfer of value as a form of exploitation cannot be intended by the legislator.

(3) “AI” is a high-risk communicator and unreliable source of information.

“Hallucinating” is the term currently used to describe generative text systems[17] that completely invent or incorrectly combine data, events[18], court decisions[19] or biographies, contradict themselves when asked questions, or need to be constantly corrected by users through reinforcement learning from human feedback (RLHF)[20]. In the process, users conveniently teach the system what its developers did not. At the same time, generative text software makes it easier for actors such as propaganda farms to spread disinformation and hate speech rapidly and cheaply; or it creates fake authors who flood social networking platforms or market players such as Amazon[21] with GPT output and artificial communication[22] or automated chatbot “reviews” of books[23]. Security checks that are missing or inadequate in order to save costs[24], and the lack of test and correction series prior to publication, mean that generative text applications must be assessed as fundamentally untruthful. At the same time, however, the “faith” in digital content and the lack of sensitivity towards it are so high among many of the over 100 million users that they do not recognise these “hallucinations” – or do not even suspect that the output is false. Basically, AI needs original, “fresh”, human texts in order not to go crazy, as Stanford University found out: if synthetic content (AI output) is used as training data[25], the system collapses.

(4) “AI” (re)produces bias and reinforces intersectional discrimination[26],[27].

Stable Diffusion, an image-generating (“text-to-image”) application, knows no black members of any national European parliament and no female doctors, and depicts cleaners almost exclusively as Asian women. Text generators reproduce sexist and gender stereotypes, as they draw on texts that come from a particular, more Western, male, white-oriented canon[28] or have “learnt” misogyny from the comment sections of social media. A bias can refer not only to gender or skin colour, but also to places, ages, social classes, professions, medical conditions, cultures or the classification of facts, of concepts such as “success” or “happiness”, or of political opinion.

Effect: Users of a generative AI adopt the bias[29] and reinforce it. As a result, people are pigeonholed even more quickly and, above all, unquestioningly. This can have an impact on social and professional access, education, housing, health care and credibility.

(5) “AI” companies fear the Brussels Effect[30] of the upcoming AI Act – with good reason.

Stanford University surveyed twelve AI companies[31] on 22 requirements of the proposed AI Act. Results: Few companies disclose information on the copyright status of training data; hardly any provided information on energy consumption and emissions reductions; NONE were able to report on safety audits and mitigation strategies for structural or systemic risks. Microsoft and OpenAI have been lobbying[32] for months against the planned AI regulation; they see their business models and previous billions in profits at risk, which are based on exploitation, theft of intellectual property, lack of transparency and disregard of risk.

It is therefore all the more important to take a clear stance and to insist on transparency, authorisation and remuneration in all regulations in an unambiguously understandable way.

(6) Lack of clarity as to whether the statutory permission for TDM within the 2019/790 CDSM Directive, Articles 3 and 4, allows the use of copyrighted works as “training data” for machine learning. If so, the opt-out provided is not an option.

Unclear legal situation: It is at least uncertain whether legal permissions for TDM (based in national legislation on Art 4 of the 2019/790 CDSM Directive) allow the use of copyrighted works as “training data” (cf. on this below, at Dictionary: TDM) for machine learning. Moreover, machine learning for generative informatics is considered an entirely new form of usage that must in any case be handled within a voluntary and remunerated licensing system.
In any case, however, the opt-out provided for TDM in the 2019/790 CDSM Directive is in no way practicable. This is not only because contractual routines in which authors could declare the opt-out when transferring rights of use are lacking everywhere in Europe, as there is no common practice of declaring whether writers or translators agree to TDM or not. None of the contracts concluded until 2022 include queries on TDM; and it can be assumed that this use does NOT fall under electronic use or under database storage.
No sector standard for metadata: There is no standard for making an opt-out machine-readable within works that are “available online”; nor, to be quite clear, has any AI development company so far asked about contractual terms. It is also unclear what “available online” means and where to draw the line.
No technical application for opt-out in sight: Even though the W3C group is working on solutions (see the July 2023 report[33]), currently only for URLs and the metadata of EPUB 3, authors remain unprotected for an indefinite period. Meanwhile, W3C developers are questioning the interpretation of the TDM exception and whether it covers machine learning. In addition, a new ISO standard (ISCC) is being tested for approval (previous standards in the book sector are ISBN, ISSN, ISNI, ISTC, DOI); opt-out declarations with this new identifier could be machine-read by special software – if AI developers were interested in rights clearance …

It is completely unclear how an opt-out can be declared for analogue works.
It is also an open question whether an opt-out applies to works that have already been used for TDM in the past. Equally open is how to deal with out-of-print works when they are digitised again by libraries or archives: who implements the opt-out there?
Unlawful scraping: In addition, there is ample evidence that even machine-readable txt opt-out statements on HTML websites are simply ignored by scrapers or unsupervised machine learning crawlers.
No chance to exercise one’s right: In practice, it is impossible for an author to exercise the opt-out option.
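To illustrate what a machine-readable txt opt-out looks like in the simplest case, the following Python sketch uses the standard library’s `urllib.robotparser` to evaluate a robots.txt file. It is only a sketch under stated assumptions: the crawler names `ExampleAIBot` and `SomeSearchBot`, the robots.txt content, the URL and the helper `may_fetch` are all hypothetical, and robots.txt is only one possible opt-out mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of an author's website: one (made-up) AI crawler
# is excluded from the whole site, all other user agents are allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on the author's website
page = "https://example.org/books/chapter-1.html"

ai_crawler_allowed = may_fetch("ExampleAIBot", page)
other_crawler_allowed = may_fetch("SomeSearchBot", page)

print("ExampleAIBot may fetch:", ai_crawler_allowed)
print("SomeSearchBot may fetch:", other_crawler_allowed)
```

The sketch also shows the weakness described above: the opt-out only has an effect if the crawler itself runs such a check and respects the answer. Nothing in the file technically prevents a scraper from ignoring it.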

AI companies have also been pulling copyrighted book works from BitTorrent piracy sites since 2013[34],[35]. The corpora Books3 and The Pile were proven to contain 190,000 titles; a further 1.2 million copyrighted titles are under investigation by volunteer research teams. At the end of September 2023, this led to a lawsuit by 17 US authors[36], including George R R Martin and Jodi Picoult, together with the US Authors Guild.

(7) Generative informatics (“AI”) is a climate threat[37]

According to a study[38] by the University of California, Riverside, training GPT-3 in data centres in the US consumed 3.5 million litres of water, and in Microsoft’s data centres in Asia 5 million litres. ChatGPT(3) consumes 500 ml of water per 20 questions. The carbon emissions analysis[39] conducted by the University of California, Berkeley concludes that training GPT-3 consumed 1,287 MWh and resulted in emissions of over 550 tons of carbon dioxide equivalent.
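These figures become easier to grasp when broken down per unit. The following Python sketch only rearranges the numbers quoted above; the variable names are illustrative, and because the text speaks of “over 550 tons”, the resulting emissions intensity is a lower bound.

```python
# Figures as quoted in the text above
water_ml_per_20_questions = 500
training_energy_mwh = 1_287
training_emissions_t_co2e = 550  # "over 550 tons", so treated as a lower bound

# Water use per single question
water_ml_per_question = water_ml_per_20_questions / 20

# Emissions intensity of the training run: kilograms CO2e per kilowatt-hour
# (1 ton = 1,000 kg and 1 MWh = 1,000 kWh)
kg_co2e_per_kwh = (training_emissions_t_co2e * 1_000) / (training_energy_mwh * 1_000)

print(f"{water_ml_per_question:.0f} ml of water per question")
print(f"at least {kg_co2e_per_kwh:.2f} kg CO2e per kWh of training energy")
```

On the quoted numbers, this comes to 25 ml of water per question and roughly 0.43 kg of carbon dioxide equivalent per kilowatt-hour used in training.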

The energy consumption of so-called AI will be higher than that of all human workers by 2025; by 2030, machine learning will account for 3.5% of global electricity consumption.

Conclusion

Both the indifference to intellectual property theft and the attitude that digitally available works should be available for free or absurdly cheaply are symptoms of the negation of the human authorship behind every work. What is disturbing is that big companies are now making billions in profits from theft, and this seems to outrage only a few political decision makers.

If the future of technology is to be sustainable, innovative and equitable, then systems that cause harm must be shut down and regulations based on authorisation, remuneration and transparency must be put in place for the development of future artificial communication. If this does not happen, the future of AI is built on coercion and plunder.

Learn more about the EWC Campaign agAInstWritoids

Resources:

[1] https://www.faz.net/aktuell/feuilleton/medien/open-ai-soll-fuer-chatgpt-300-millionen-woerter-aus-dem-internet-gestohlen-haben-19007444.html

[2] https://psmedia.asia/publishers-has-your-book-been-used-to-train-chat-gpt-without-your-permission/

[3] https://aicopyright.substack.com/p/the-books-used-to-train-llms

[4] https://arxiv.org/pdf/2305.00118.pdf

[5] https://urheber.info/media/pages/diskurs/ruf-nach-schutz-vor-generativer-ki/03e4ed0ae5-1681902659/finale-fassung_de_urheber-und-kunslter-fordern-schutz-vor-gki_final_19.4.2023_12-50.pdf

[6] https://marylgray.org/bio/on-demand/

[7] https://www.noemamag.com/the-exploited-labor-behind-artificial-intelligence/

[8] https://arxiv.org/pdf/2102.01265.pdf

[9] https://onezero.medium.com/the-a-i-industry-is-exploiting-gig-workers-around-the-world-sometimes-for-just-8-a-day-288dcce9c047

[10] https://www.vice.com/en/article/v7b774/ai-generated-books-of-nonsense-are-all-over-amazons-bestseller-lists

[11] https://www.theguardian.com/books/2023/sep/11/self-publishers-must-declare-if-content-sold-on-amazons-site-is-ai-generated

[12] https://www.voanews.com/a/7092661.html

[13] https://www.podcast.de/episode/609495902/deepfake-bei-anruf-klon

[14] https://www.deutschlandfunkkultur.de/audio-deepfakes-was-wenn-wir-unseren-ohren-nicht-mehr-100.html

[15] https://www.theguardian.com/technology/2023/jan/04/apple-artificial-intelligence-ai-audiobooks

[16] https://rp-online.de/nrw/staedte/duesseldorf/duesseldorf-tonies-testet-geschichten-mit-kuenstlicher-intelligenz_aid-90005417

[17] https://www.beamex.com/resources/for-a-safer-and-less-uncertain-world/generative-ai/

[18] https://www.nytimes.com/2023/05/01/business/ai-chatbots-hallucination.html

[19] https://www.morningbrew.com/daily/stories/2023/05/29/chatgpt-not-lawyer?mbcid=31642653.1628960&mblid=407edcf12ec0&mid=964088404848b7c2f4a8ea179e251bd1&utm_campaign=mb&utm_medium=newsletter&utm_source=morning_brew

[20] https://www.telusinternational.com/insights/ai-data/article/rlhf-advancing-large-language-models

[21] https://www.vice.com/en/article/v7b774/ai-generated-books-of-nonsense-are-all-over-amazons-bestseller-lists

[22] https://www.independent.co.uk/tech/ai-author-books-amazon-chatgpt-b2287111.html

[23] https://www.cnbc.com/2023/04/25/amazon-reviews-are-being-written-by-ai-chatbots.html

[24] https://www.nytimes.com/2023/04/07/technology/ai-chatbots-google-microsoft.html

[25] https://futurism.com/ai-trained-ai-generated-data

[26] https://fra.europa.eu/sites/default/files/fra_uploads/fra-2022-bias-in-algorithms_en.pdf

[27] https://www.bloomberg.com/graphics/2023-generative-ai-bias/

[28] https://crfm.stanford.edu/2023/06/15/eu-ai-act.html

[29] https://www.nyu.edu/about/news-publications/news/2022/july/gender-bias-in-search-algorithms-has-effect-on-users–new-study-.html

[30] https://uploads-ssl.webflow.com/614b70a71b9f71c9c240c7a7/630534b77182a3513398500f_Brussels_Effect_GovAI.pdf

[31] https://crfm.stanford.edu/2023/06/15/eu-ai-act.html?fbclid=IwAR2pW8d96Fwjor9LIeFXUJjei4l2hBs6LbjHJikO65VZHHnDavZvIMSxuR8

[32] https://time.com/6273694/ai-regulation-europe/

[33] https://www.w3.org/community/tdmrep/ and https://www.w3.org/2022/tdmrep/

[34] https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/?itid=lk_inline_manual_54

[35] https://www.theatlantic.com/technology/archive/2023/08/books3-ai-meta-llama-pirated-books/675063/

[36] https://apnews.com/article/openai-lawsuit-authors-grisham-george-rr-martin-37f9073ab67ab25b7e6b2975b2a63bfe

[37] https://www.theguardian.com/technology/2023/aug/01/techscape-environment-cost-ai-artificial-intelligence

[38] https://arxiv.org/pdf/2304.03271.pdf

[39] https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf

***

About the authors.

This analysis paper was originally researched and written in German for the Netzwerk Autorenrechte (Authors’ Rights Network) by Nina George (EWC Commissioner) and André Hansen (VdÜ, German Literary Translators Association) | Editors: Dorrit Bartel, Tamara Leonard | Provider research: Monika Pfundmeier (EWC Board Member, Syndikat Board Member), and examined by legal advisors.

The EWC was granted permission to translate, adapt and share it. If you wish to publish it in full on your website, please contact the authors first via the EWC Secretariat.

The Authors’ Rights Network (www.netzwerk-autorenrechte.de) represents 16 associations and 16,500 writers and translators from Germany, Austria and Switzerland. Contact: info@netzwerk-autorenrechte.de. Lobby-Register Nr. R005345


DICTIONARY ON ADVANCED INFORMATICS (AI)