Where do the texts used to train LLMs come from? Were they legally obtained?

Warning: IANAL (I Am Not A Lawyer)

much of the debate about training language models by exploiting books concerns whether these models make use of training texts in a way that is compatible with copyright exceptions

here the jurists are divided and, as I have written in previous posts and articles, there are good arguments in support of either thesis: that it is permissible to use pre-existing texts because the outputs of LLMs are not derivative works, or that it is not permissible to use these texts because the outputs of LLMs are derivative works that somehow compete in the same market

but there is one aspect that I don’t think anyone has dwelt on (at least, I haven’t seen any articles about it): how were the texts used to train the models obtained?

if they were obtained from an ebook, we need to remember that ebooks are a service subject to a license, and we need to understand whether that license gives permission to use the text to train a model. moreover, ebooks are often protected by DRM. breaking the DRM to extract the text does not seem legal to me, since various judgments have ruled that the publisher’s right to protection prevails over the user’s right to make private copies.

if, on the other hand, the texts were obtained from a paper book by scanning and OCR, then from what I understand the purpose of the scanning and OCR is relevant: if it was done for commercial exploitation, or otherwise for purposes other than personal copying, the activity seems to me to be unlawful.

if I am right in this reasoning, there is an upstream issue about the lawfulness of training LLMs, related to the lawfulness of the way the training texts were obtained

this holds at least where there is no fair use regime in place (like the one in the USA) that allows for broader uses.

am I wrong?

