EWC responded to the Multi-stakeholder Consultation FUTURE-PROOF AI ACT: TRUSTWORTHY GENERAL-PURPOSE AI with the proviso that the premise of this consultation needs formal clarification. The exceptions for text and data mining (Art. 3, Art. 4, Directive (EU) 2019/790) are inapplicable: the statutory language of the provisions, their conception, and the ratio of the exception, esp. Art. 4, indicate that they must not be applied to the training of GPAI and in particular of generative AI models. Hence, the training of generative AI models without the authors’ authorisation can be classified both as a copyright infringement and as a violation of duties under the AI Act.
EWC calls upon the AI Office to have the scope of the TDM exceptions in Art. 3 and Art. 4 clarified before the Code of Practice is drafted.
The European AI Office launched this multi-stakeholder consultation on trustworthy general-purpose AI models in the context of the AI Act on July 30th, 2024. Stakeholders with relevant expertise and perspectives, particularly from academia, independent experts, industry representatives such as general-purpose AI model providers or downstream providers integrating the general-purpose AI model into their AI system, civil society organisations, rightsholders organisations, and public authorities, were invited to respond until 18 September 2024.
EWC responded to the consultation and, together with CEATL, additionally submitted a joint free-text submission for Working Groups 1 and 4 of the Code of Practice Plenary.
In principle, the following aspects must be formally clarified prior to the drafting of the Code of Practice:
a) The scope of the TDM exceptions of the Directive (EU) 2019/790: recent studies (e.g., but not limited to, Dornis/Stober, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214 ) and litigation, as well as legislative developments worldwide (e.g. Australia), have shown that the procedures used within TDM are, technically and legally, not the same as the processes used to collect, prepare, reproduce, store and memorise works within (Gen)AI development. Please see the considerations below for more details.
b) Handling of works published before 7 June 2021; between 7 June 2021 and 1 August 2024; and from 1 August 2024 on (entry into force of the AI Act).
- Works published before the transposition deadline of the CDSM Directive (EU) 2019/790 on 7 June 2021 do not contain any machine-readable or other indication of a TDM and (Gen)AI rights reservation. However, they are still in circulation; in Europe, this applies to around 13.5 million book works in all formats (digital, audio digital, print, website, audiovisual). Subsequent incorporation of a reservation of rights would cost the book market hundreds of millions of euros. An explicit time wall would be desirable, such that all works published before 7 June 2021 are generally excluded from any scraping and future use for TDM, for AI and for GenAI development. AI developers must clean their datasets accordingly and may not use these works to build further lifecycles of their models.
- It is known that the foundations of large language models have not been developed only since 2021; the collection, reproduction, storage, memorisation, semantic exploitation and making available to the public through programmes, including beta testing, have in some cases been developed and made publicly available since 2015. The TDM exceptions, regardless of the misleading implication that they also cover GenAI, do not apply retroactively. Accordingly, data set builders and curators, as well as AI developers, should be urged to comply with EU law to date, to eliminate all data sets from before 7 June 2021 without replacement, and not to use these works for future lifecycles and development. There are currently around 68,000 known datasets that have been in circulation since the beginning of the millennium. The Pile and Books3 in particular were taken from, among others, piracy sites, and were also compiled before any TDM exception existed.
- In any case, it must be clarified what happens to works used since 7 June 2021 that do not contain a machine-readable rights reservation, or did not contain one at the time of the presumably illegal use for GenAI and only had one inserted later, for example in 2024. Background: there are no general standards for opt-outs in the book sector. Some large publishing groups include a rights reservation in the imprint of the e-book file in plain-text form; others work with TDMRep from the W3C (which is NOT a recognised ISO standard). Others try to accommodate the now hundreds of different crawlers, each of which “understands” an opt-out in robots.txt in a different “machine language” – but whether they respect it remains unproven.
- Conclusion: any transparency documentation must record the exact date of collection, and it must be made clear to AI developers that a reservation of rights can also be declared AFTER collection. Accordingly, before any re-use of the non-deleted and re-reproduced works, AI developers must clearly ensure that they do not violate the declared reservation of rights.
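To illustrate the fragmented opt-out landscape described above: a publisher’s website may currently have to combine several mechanisms at once. The crawler names and TDMRep fields below are illustrative examples only, not an authoritative or exhaustive list, and none of them guarantees that a scraper will comply:

```text
# robots.txt – per-crawler opt-outs; each AI provider uses its own agent name
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# TDMRep (a W3C Community Group report, NOT an ISO standard) works
# differently – either a meta tag on every HTML page:
#   <meta name="tdm-reservation" content="1">
# or a site-wide file at /.well-known/tdmrep.json, e.g.:
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```

Even where all of these are declared, nothing technically forces a crawler to read or honour them – which is precisely why traceable confirmation of opt-out compliance is demanded below.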
b.1) Handling of out-of-commerce works digitised by entitled cultural heritage institutions (CHI) under Art. 8 of the Directive (EU) 2019/790. Cultural heritage institutions that fall under the entitlement of Art. 8 of the Directive (EU) 2019/790 to digitise out-of-commerce works and make them publicly available in a controlled framework are actively trying, together with Europeana, to prevent the legitimate exercise of rights reservation by authors or, if applicable, other rightsholders. They either refuse to protect with opt-outs the digitised works that are made publicly available again and thus also made available to scrapers, or they assume that Art. 8 would allow this use. This is not the case. Authors whose works are out of commerce must be able to exercise their rights. Art. 8 does not contain any logical, semantic, or legal justification for the automatic application of Art. 3 and Art. 4 to out-of-commerce works, and certainly not without actively enabling opt-outs.
In principle, this applies to every state institution, including libraries, which are not permitted under any circumstances to make collections, whether print or e-book, available for TDM, for AI or for GenAI without consulting authors and rightsholders; this requires licences, remuneration and transparent usage documentation.
c) Reliable definitions of “legally accessed” and “publicly available”
The degree of transparency that dominant AI providers (e.g., but not limited to, OpenAI) offer – i.e., that they train on “publicly available” data – is insufficient for authors and rightsholders to determine whether their works have been used to train, for example, Sora. There is no list of datasets or narrative explanation that enables authors to exercise their rights.
As the more than 100 lawsuits worldwide show, authors and AI developers do not share the same understanding of what exactly “legally accessed” and “publicly available” mean. Sometimes AI developers work with data suppliers who in turn make use of works behind paywalls. It is unclear here whether opt-outs are respected, whether works were purchased, when they were purchased and reproduced, whether a reservation of rights in, for example, the legal notice was recognised and accepted, etc. Accordingly, the transparency chain must start from moment zero of crawling and also cover cases in which works have been collected under the Art. 3 TDM exception of the Directive (EU) 2019/790 at a research institute, but then passed on in a private partnership to commercial AI providers. Such a transfer is neither legal access nor public availability, and it severely damages academic authors among others, who must also be allowed to opt out when use shifts from one exception to the other. We also point to cases where, for example, videos have been transcribed into text and sold as a “dataset”; such cases likewise make a reservation of rights impossible, as well as the tracking needed to substantiate one’s rights, esp. the authors’ rights, in court proceedings.
In addition, reliable and fine-grained definitions are needed, also to give authors and rightsholders a secure option to enforce their rights, and to start the transparency chain already with the collectors and curators, who must confirm legal access and describe their methods of collection in detail.
Relevant link on the Report by the Danish Rights Alliance on copyright transparency:
d) Clarification of the primary right to opt out as an authors’ right: Art. 5 of the InfoSoc Directive 2001/29/EC and Art. 4 of the Directive (EU) 2019/790 are legal grounds designating original authors as the primary rightsholders, and therefore as the rightsholders also entitled to opt out. As authors have not granted in their contracts – except under new but highly diffuse CC licences – any rights to exploitation for TDM, (GP)AI or generative AI development, the rights reservation, and therefore the opt-out right, is part of the author’s non-waivable rights. In this regard, the AI Office must promote and enforce mechanisms both for the authors’ opt-out and for authors’ access to transparency documentation. We call on the Plenary to unequivocally acknowledge authors’ right to opt out.
Until clarification and, if applicable, compensation for the authors who have been deprived of their rights, the following must be enforceable to comply with EU copyright law and Recital 107 of the AI Act:
- As the right to opt out of commercial TDM under Art. 4 of the Directive (EU) 2019/790, and the right to license any AI or GenAI development, originates with the creation of a work, this genuine authors’ right, including the moral right, must be effective for writers and translators and all authors/artists/performers in the first place. Conclusion: authors foremost – but, in practical terms, also together with their respective contractual partners and further rightsholders – need an effective and manageable tool to declare a TDM opt-out and, if they make this decision fully informed, an “opt-in” – licensing – for the usage of AI and GenAI, with remuneration and transparent documentation of usage;
- If the interpretation persists that TDM and GenAI development are “the same thing”, despite the fact that these disruptive technologies already replace the very authors and artists from whom AI developers, providers and companies have served themselves without remuneration or consent: measures to identify and comply with the rights reservation from the text and data mining exception pursuant to Art. 4(3) of Directive (EU) 2019/790;
- Measures to obtain and confirm the authorisation from authors, where applicable;
- Measures to detect and remove collected copyright protected works, data and material for which rights reservation from the text and data mining exception has been expressed pursuant to Art. 4(3) of the Directive (EU) 2019/790 (“unlearning”).
The 3W (What-Where-When) must be the guideline for the transparency requirements by (GP/Gen)AI developers, providers, deployers, and collecting entities:
- What content was included in the training sets: e.g., every title and the copyright information, plus the work’s format. As authors and rightsholders are obliged by the Directive (EU) 2019/790 to declare an opt-out for EVERY work, the documentation of WHAT is used must likewise be title-/work-specific. In any other case, authors and rightsholders would be prevented from exercising their rights, which is a breach of Union law.
- Where the content was initially collected from (e.g., URL or name of platform/service; engaged entities, corpora builder, data set builder, private partnerships, licensees …)
- When the works, data and further content were initially collected (i.e., date and time)
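As a minimal sketch, the title-specific 3W documentation above could be captured as one record per work. All field names here are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class TransparencyRecord:
    """One record per work, following the 3W guideline (illustrative fields)."""
    # WHAT: title-/work-specific identification
    title: str
    author: str
    identifier: str          # e.g. an ISBN
    work_format: str         # e.g. "e-book", "audio", "print scan"
    # WHERE: initial source and engaged entities
    source: str              # URL or name of platform/service
    collected_by: str        # corpus/data-set builder, licensee, partner
    # WHEN: initial moment of collection
    collected_at: str        # ISO 8601 date and time
    # Opt-out status at (re)use time, since reservations can be added later
    rights_reservation: bool

record = TransparencyRecord(
    title="Example Novel",
    author="A. Author",
    identifier="978-0-00-000000-0",
    work_format="e-book",
    source="https://example.com/catalogue/example-novel",
    collected_by="ExampleCorpusBuilder Ltd.",
    collected_at="2024-09-16T10:00:00Z",
    rights_reservation=True,
)

# A work whose rights reservation is active must not be (re)used.
print(asdict(record)["collected_at"])  # → 2024-09-16T10:00:00Z
```

The point of the per-work granularity is exactly the one made above: only a record keyed to the individual title, with the collection date, lets an author verify whether a later-declared opt-out was honoured.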
Justifications:
e) Time of gathering of the work: as elaborated above, the TDM exception(s) of the Directive (EU) 2019/790 do not apply retroactively to any collection and use of data and works before 7 June 2021.
- Works and data gathered before 7 June 2021 were in any case not legally accessed; their use therefore constitutes an infringement regardless of the debate over whether TDM covers GenAI.
- Also, TDM rights reservations can be added after scraping / collection.
- In addition, works are known to be reused in a new edition of a model. Large language models (LLMs), for example, cannot simply be ‘supplemented’; all existing works are used AGAIN if, for example, there is a jump from GPT-3 to GPT-4. Each model needs to be built up from scratch each time and equipped with more parameters and capacities. In between, however, an opt-out – or a complaint procedure – may have taken place. Accordingly, AI providers must develop a monitoring system.
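The monitoring obligation described above would, in its simplest form, mean re-checking every work against the current opt-out status before each new training cycle. The function and data below are a hypothetical sketch of that check, not an existing system:

```python
def filter_for_new_lifecycle(dataset, optouts):
    """Keep only works whose rights are not reserved at retraining time.

    dataset: list of dicts, each with a work 'id' (e.g. an ISBN);
    optouts: set of ids whose authors have declared a rights reservation
    since the last model edition.
    """
    return [work for work in dataset if work["id"] not in optouts]

# Example: between two model editions, the author of "isbn-2" opted out.
corpus = [{"id": "isbn-1"}, {"id": "isbn-2"}, {"id": "isbn-3"}]
current_optouts = {"isbn-2"}

cleaned = filter_for_new_lifecycle(corpus, current_optouts)
print([w["id"] for w in cleaned])  # → ['isbn-1', 'isbn-3']
```

Because each model edition reprocesses the full corpus, such a filter would have to run before every retraining, not just once at initial collection – which is the substance of the monitoring demand made here.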
f) Sources & commissioned entities (URL; plus corpora providers, data set builders, incl. research institutions in private partnerships).
g) Information on compliance with “legally accessed”.
h) Licensing methods and remuneration schemes.
i) Traceable confirmation that the TDM opt-out is always monitored and accepted.
j) Information on the life cycle of the model, to ensure that authors/rightsholders who have applied a rights reservation in the meantime, or in another format/edition, can be sure that their works are not (re)used.
Relevant links: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214 and https://europeanwriterscouncil.eu/240425_cwos_jointstatement_ai-act/
To the EWC and CEATL joint free text submission to the AI Office multi-stakeholder consultation Future-Proof AI Act: Trustworthy general purpose AI: EWC_Ceatl JOINT SUBMISSION FINAL 240916