Numerous criminal and damaging “AI business models” have already developed in the book sector – with fake authors, fake books and also fake readers. It has been shown that the foundations of large language models such as GPT, Meta’s models, StableLM and BERT were built from copyrighted books sourced from shadow libraries such as Library Genesis (LibGen), Z-Library (Bok), Sci-Hub and Bibliotik – piracy websites.
Without legal regulation, generative technologies accelerate and enable the expansion of exploitation, legitimisation of copyright infringement, climate harm, discrimination, information and communication distortion, identity theft, reputational damage, blacklisting, royalty fraud and collective licensing remuneration fraud.
At the same time, a close look and assessment are needed to categorise and regulate the individual aspects of advanced informatics, because not all smart software is “AI” and not every application is equally risky. The EWC, together with the Netzwerk Autorenrechte and the Campaign Against Writoids, has classified the following three systems:
- Assistive Informatics – only occasionally considered as risky AI;
- Analysing Informatics – partly considered as risky;
- Generative Informatics – the only category clearly considered as high risk.
In addition, the following dictionary, which will grow in line with developments, introduces the technical contexts to help readers understand advanced informatics:
Algorithm | A firmly structured set of rules to examine existing information and come to a result. E.g. evaluation of statistics such as sales histories of books (If … then: If someone bought book X, then they will also like book Y and Z), evaluation of user preferences to “predict” the success of a book (or Netflix series) in the similar segment, or to create a Spotify playlist based on songs heard. Intersectoral, algorithm-based analyses combine text analysis (what sentiments does it contain, what emotions and “reading motives” does it serve) with reader-habit analysis and author branding analysis to “calculate” a book’s chances of success. The parameters and rules of the algorithms are mostly not transparent and highly flawed in emotion analyses. |
|
Foundation Model | Base model, “foundation”. Term coined in 2021 for “data foundations” built by algorithmic, self-learning deep learning without human monitoring, trained on a maximally broad database (“upstream”) – texts, images, sounds, customer data, etc. These “foundations” serve as the basis for apps, programmes and applications that are emerging both in the field of advanced computer science (generative “AI”) and in the assistive field (Photoshop, checking the style of a text). The GPT-3.5 and GPT-4 model families, for example, form the basis for ChatGPT, Bing Chat and Duolingo Max. Foundation model providers such as OpenAI, Microsoft, Google or Meta refuse to transparently disclose their data sources[1]. |
|
LLM | Large Language Models. Deep-learning algorithms with billions of parameters that recognise, summarise, translate, predict and generate linguistic content using very large text datasets, such as GPT (Generative Pretrained Transformer). The learning during training is not supervised by humans. The “training texts” are not labelled or given special assessments. Hundreds of gigabytes of texts in dozens of languages, such as Wikipedia articles, books, scientific articles, news texts, forum posts, social media posts or online comments, are used for the “basic training”. Errors, bias or racism present in the training data are adopted unchecked by the language models. Fine-tuning only takes place before the LLMs are put to specific tasks, as with ChatGPT: human labellers evaluate texts, mark them, filter out toxic content or even determine words that the chatbot should not use (“content control”). Due to the unsupervised self-learning and the lack of preparation of texts (lack of fact-checking; no filtering of e.g. sexist, racist, inciting content; no linguistic labelling), the application of NLG systems like ChatGPT and others results in erroneous output. Text AI always needs verification. |
|
ML | Machine learning, artificial experience |
|
(N)MT | Neural Machine Translation. NLU- and NLG-based conversion process. Major vendors include Google, Amazon, Microsoft and DeepL. Forms: rule-based machine translation (RBMT) based on linguistic rules (word-by-word translation); statistical machine translation (SMT) with algorithmically determined relationships between words, phrases and sentences. Most common today: neural machine translation (NMT), where the neural networks of an MT machine are responsible for encoding and decoding the original text and Natural Language Generation generates the output. With “human in the loop” NMT, the machine “learns” from the translator. The limitations of MT/NMT arise from the still deficient Natural Language Understanding (NLU, see below) and Natural Language Generation’s inability to check itself. |
|
CAT Tool | Computer-assisted translation. Local or cloud-based database storage of a translator’s (human) translation, e.g. to access previous work (“translation memory”), including glossaries or styles. |
|
NLP | Natural Language Processing, Linguistic Data Processing. Mathematical, algorithmic techniques and linguistic computational methods for natural language processing, including: |
|
NLU | Natural Language Understanding, text capture. Automated capture, “understanding” of human language (text). “Understanding”, however, does not mean meaningful, contextual or emotional and cognitive capacity. The NLU of an interactive chatbot, email spam filter or machine translation is based on the logic of pragmatic semantic rules and the attempt to identify the intention of what is written (“intent”). Only sentiment analysis of text input can enable NLU to “understand” not only the meaning of words, but also the intention (“intent”). |
|
Most NLU tools, even sophisticated ones such as DeepL, continue to fail at humour, irony, linguistic wit and style – in particular, NLU is not suitable for translating complex or narrative texts to a high standard. NLU / NLP is also used for grammar checking or text analysis. |
||
NLG | Natural Language Generation. Machine reproduction of human language, e.g. ChatGPT, DeepL. However, NLG can neither read itself, nor correct itself, nor capture the content of what is produced, as speech is converted into formulas and patterns. Pattern recognition and probability algorithms are used, as well as text modules. In some cases, AI generators memorise complete texts (from websites, books[2],[3]) and other artistic works and reproduce them (plagiarism, infringement of usage rights). |
|
Scraping | Scraping constitutes a copyright-relevant process: copying and storage. Specifically, the collected works and performances are stored in a database to be made available for training. They are not deleted afterwards.
|
TDM | Text and data mining. Algorithmically automated process of analysing large volumes of text or other data resources. While data mining is generally designed to extract patterns, repetitions, additional information or solutions to complex research questions from data, text mining is a structured, algorithmic analysis of selected document collections to locate information and discover hidden content relationships between texts and text fragments. It is used, for example, for social science analyses of comments in social networks and their changes in content, language or emotion over the decades, or for the analysis of role models and stereotypes in novels from the 18th to the 21st century, as well as for the examination of medical textbooks to determine, for example, that the treatment of women for heart disease is given little or no coverage in teaching literature. |
|
In contrast to machine learning, this “classical”, goal-oriented text mining requires a preparation of the texts to be examined, such as a precise computer-linguistic preparation of the documents (which self-learning, unsupervised large language models do not do) and a restriction of the texts to be examined for specific tasks. Often, documents deemed relevant – articles, books – are made available under licence or obtained under the EU-law TDM exemptions for scientific, non-commercial research. LLMs and foundation models, however, neither constrain nor structure nor prepare their datasets, and it is doubtful that unsupervised text mining also covers the domain of unsupervised machine learning for the production of commercial products: this goes far beyond the goals of text mining as an information extractor. Furthermore, text mining extends to citizens and individuals, their comments, reviews, and Google inputs. |
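To make the “If … then” rule from the Algorithm entry more concrete, the following is a minimal Python sketch of a co-purchase recommender. It is an illustration only: the purchase data, the function names (`build_co_purchase_counts`, `recommend`) and the ranking rule are assumptions made for this example and do not describe how any real retailer or platform works.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories: each set is one customer's basket.
PURCHASES = [
    {"Book X", "Book Y", "Book Z"},
    {"Book X", "Book Y"},
    {"Book X", "Book Z"},
    {"Book W", "Book Y"},
]


def build_co_purchase_counts(purchases):
    """For every title, count how often each other title was bought with it."""
    counts = {}
    for basket in purchases:
        for bought, other in permutations(sorted(basket), 2):
            counts.setdefault(bought, Counter())[other] += 1
    return counts


def recommend(title, purchases, top_n=2):
    """If someone bought `title`, then suggest the titles most often bought with it."""
    counts = build_co_purchase_counts(purchases)
    # Rank by co-purchase count (descending), then alphabetically for stable ties.
    ranked = sorted(counts.get(title, Counter()).items(),
                    key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("Book X", PURCHASES))
```

The rule only reflects what is in the sales history: a title nobody has bought yet yields no suggestion, and nothing in the code “knows” whether a reader will actually like the suggested books.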
[1] https://crfm.stanford.edu/2023/06/15/eu-ai-act.html
[2] https://www.theregister.com/2023/05/03/openai_chatgpt_copyright/
[3] https://docs.google.com/spreadsheets/d/1jW7EhsNjIGDMoK2JidyDD7UXH9N0NpEJfWFEj05_LC4/edit?pli=1#gid=0
***
Learn more about the EWC Campaign agAInstWritoids