Bruno Saraiva [master’s student in European Union Law and Digital Citizenship & Technological Sustainability (CitDig) scholarship holder]
I.
Henrietta Lacks is a relatively obscure name, but one that is representative of the extraordinary impact an individual can have on human achievement, regardless of the recognition they receive in life or after death. Her legacy is one of immortality, a unique form of it: books have been written about her, her story is widely discussed, and her very cells are studied daily. Fragments of her body remain alive and will likely persist as long as modern civilisation endures.
Henrietta Lacks died in 1951, at the age of 31, of an extremely aggressive form of cervical cancer. An African American woman, she was born and laboured on her family’s tobacco farm until the rising fortunes of post-war America carried her to Baltimore, where she would pass away, leaving behind her husband and five children. Neither she nor her loved ones would know the significance of her contribution to humanity. Glimpses would come only decades later, when her children’s lives were disrupted by researchers seeking medical data and tissue samples while steadfastly refusing to divulge the intention behind their actions. Only in 1975, during a chance dinner conversation, would the Lacks family realise Henrietta’s enduring importance.
In 1951, during her treatment for adenocarcinoma, a sample of Henrietta’s tumour was taken without her knowledge or consent by researchers at Johns Hopkins Hospital, as contemporary medical practice allowed. These cancer cells exhibited remarkable longevity and rapid reproduction, characteristics that made them indefinitely sustainable, in other words immortal – under laboratory conditions. The cancer tissue that killed Ms. Lacks would not die, a characteristic that made it invaluable for scientific research.
These cells, dubbed “HeLa” after their donor, became the first human tissue successfully cloned, in 1955. They played a critical role in developing the polio vaccine in 1954 and have since been used in countless studies on cancer, immune response, radiation, toxic substances, gene mapping, and the full gamut of biological experimentation of human interest. In this endeavour they have generated countless scientific and medical breakthroughs, but also billions of dollars’ worth of products, being themselves spread and sampled, sometimes at a price.
These developments were never known to Henrietta; some of the sampling occurred via autopsy after the initial discovery, likewise without familial consent. By the time her name was revealed in the 1970s, owing to the passing of the original researcher, HeLa cells had multiplied to such an extent that they could theoretically cover the Earth several times over, and would prove so resilient that they contaminated other cell lines and endangered critical, long-standing lines of research.
It was in this context that researchers sought access to her relatives’ genetic information. Callously treading over the issues of consent, racial sensibilities and the critical question of the monetisation of a person’s legacy and body, their lack of transparency and respect fuelled suspicion and highlighted the ethical complexities surrounding her legacy, in a matter that remains unresolved to this day.
Henrietta’s story might seem far removed from the subject matter that brings us here today, but Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models, adopted on 17 December 2024, intersects with the same broader issues of data protection and consent – more concretely, the ethics of profiting from and perpetuating the legacy of the source, as we will examine shortly.
The legal basis for this Opinion rests upon a request dated 4 September 2024, made by the Irish Supervisory Authority to the European Data Protection Board (“EDPB”) for an Opinion pursuant to Article 64(2) of the General Data Protection Regulation (“GDPR”), in relation to Artificial Intelligence (“AI”) models and the processing of personal data. The questions raised by this Supervisory Authority (“SA”) were judged admissible by the EDPB itself, as they fulfil a threefold criterion of admissibility, namely: i) general impact (based on how these questions immediately arise whenever a data controller trains – or intends to train – AI models using personal data, thereby involving a large number of both controllers and data subjects); ii) the likelihood of producing effects in more than one Member State (to be interpreted lato sensu, and thereby not restricted to legal effects); and iii) the fact that the EDPB has not, in any previous Opinion on the same matter, provided replies to the questions arising from the request.
As for the questions themselves, they can be best summarised by quoting the Opinion itself: “(1) when and how an AI model can be considered as ‘anonymous’; (2) how controllers can demonstrate the appropriateness of legitimate interest as a legal basis in the development and (3) deployment phases; and (4) what are the consequences of the unlawful processing of personal data in the development phase of an AI model on the subsequent processing or operation of the AI model.”
Although answers are given to these questions, they cannot be considered unequivocal, since the lasting leitmotif of this Opinion can only be described as indecision. In answering the questions forwarded by the Irish SA, the 35-page Opinion uses the term “case-by-case” 16 times,[1] usually in reference to how SAs might assess and interpret each case. It does this, however, while always allowing national SAs full rein over the proceedings – reminding them of their role as guarantors of data protection within their national spaces, even as its internal recommendations and suggestions seem to go against the grain of piecemeal approaches.[2]
It further clarifies that “this Opinion does not aim to be exhaustive, but rather to provide general considerations on the interpretation of the relevant provisions which competent SAs should take utmost account of when using their investigative powers.” The result is that the answers to most of the questions raised are to be found in previous EDPB opinions, GDPR recitals or the 23 May 2024 Report of the work undertaken by the ChatGPT Taskforce (from which this Opinion largely derives, in both thought and textual expression). A noticeable gap on the subject of web scraping – a particular concern for the protection of data subjects’ rights – also becomes apparent, although the news section of the EDPB’s website signals that work is being undertaken in that area.[3]
Considering the importance of the subject at hand, this preliminary insight will focus mainly on the second question, particularly as it concerns the operative methods of large language models.
II.
When it comes to the operative methods of large language models, and the ways in which they overlap with human intellectual activity, it is quite demanding – within this specific topic – to understand how AI models use data. This critical point is not explored in the Opinion, while proving too specific to find its way into the body of the AI Act. To answer these questions without diverging from the EU regulatory and legal ecosystem, we quote from page 4 of the “Report of the work undertaken by the ChatGPT Taskforce” of 23 May 2024:
“Large language models (LLMs) are deep learning models (a subset of machine learning models) that are pre-trained using vast amounts of data. Analysing these massive datasets enables the LLM to learn probability relationships and become proficient in the grammar and syntax of one or more languages. LLMs generate coherent and context-relevant language. To put it simply, LLMs respond to human language by producing coherent text that appears human-like. Most recent LLMs such as OpenAI’s GPT models are based on a neural network architecture called a transformer model.”
This concept – the way AI models use language, its proportions and relationships – proves essential to understanding how AI models, as they already exist, are used to commodify individuals’ language use through mimicry and transformation for commercial purposes.
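To make the quoted description slightly more concrete, the sketch below – a toy bigram model in Python, built on an invented two-sentence corpus – illustrates in miniature what “learning probability relationships” in language means: counting which tokens tend to follow which, and then generating text from those estimates. It is a deliberately simplified, assumption-laden illustration, not a description of how transformer-based models such as GPT are actually built.

# Minimal, self-contained sketch of the "probability relationships" described
# above: a toy bigram model that estimates which token tends to follow which,
# then generates text by sampling from those estimates. Real LLMs use
# transformer networks trained on vast corpora, but the underlying idea --
# predicting likely continuations from observed language -- is the same.
# The corpus below is an invented placeholder.
from collections import Counter, defaultdict
import random

corpus = (
    "the model learns which words follow other words . "
    "the model then produces text that appears human-like ."
).split()

# Count how often each token follows each preceding token.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def next_token_distribution(prev):
    """Relative frequencies of tokens observed after `prev`."""
    counts = transitions[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(start, length=8):
    """Sample a short continuation, token by token."""
    out = [start]
    for _ in range(length):
        dist = next_token_distribution(out[-1])
        if not dist:
            break
        tokens, weights = zip(*dist.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return " ".join(out)

print(next_token_distribution("the"))   # e.g. {'model': 1.0}
print(generate("the"))

Even at this toy scale, the model reproduces the patterns of the text it was fed, which is precisely what makes the scale of real training corpora, and the personal data within them, legally significant.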
It is here that the analogy with Ms. Lacks becomes clearer: rather than the physical use of biological material originating from a human being, the commodification happens via the processing of their creative output, expressed and extracted in the form of language – be it written, visual or in the form of social media expression.
If sufficient relevant information is fed, even to a generalised model, approximating or mimicking a particular data subject’s form of expression becomes possible.[4] This concern becomes even more salient when one considers the recent examples of AI-powered profiles on Instagram and Facebook by Meta, and the recognised effect that social media has in funnelling interests and personalities so as to reach the maximum audience possible. With this last element, the “cloning” of an individual is not necessary (or necessarily desirable, since it visibly draws attention to the violation of the rights of specific data subjects) – instead, the creation of an archetype, a gestalt digital creation made from the aggregation of individuals’ interests and expressions, becomes possible. In this way, individuals, who for the purposes of the GDPR are data subjects, have their output, opinions and expressions aggregated to create highly specialised models that cover the entire range of human intellectual production.
In this sense, and maintaining the analogy with Lacks, subsequent contact with the relatives of specific data subjects is unlikely to occur [though not unthinkable, if the reason that data subject is of interest relates to hereditary characteristics, such as vulnerability to certain content as recommended by AI models (an issue raised in point 80)]. Yet it is not utterly unfeasible that other data subjects with similar profiles, characteristics or interests are targeted, or that further data collection is steered towards particularly proficient data subjects in order to complete specialised models.
III.
Opinion 28/2024 enumerates a series of general considerations for SAs to take into account in their assessment of controllers’ claims of legitimate interest as a legal basis for the processing necessary for the development of AI models.
Recalling that the GDPR does not establish a hierarchy between the legal bases it provides for (point 60), the Opinion again clarifies that it is the controllers’ responsibility to determine whether, and which, legal basis exists for their processing activities. To this end, the Opinion recalls, under point 59, the three-step test for assessing the use of legitimate interest as a legal basis for data processing, i.e.: «(1) identifying the legitimate interest pursued by the controller or a third party; (2) analysing the necessity of the processing for the purposes of the legitimate interest(s) pursued (also referred to as “necessity test”); and (3) assessing that the legitimate interest(s) is (are) not overridden by the interests or fundamental rights and freedoms of the data subjects (also referred to as “balancing test”).»
Despite being presented in some circles as a “new” development of Opinion 28/2024 (and indeed as one of its highlights), the concept of a tripartite test for legitimate interest is not new, being rooted in Article 6(1)(f) of the GDPR. Any innovative value it has here stems from its application to AI, reinforcing the notion that the data protection discipline continues to apply to such endeavours.
Indeed, the criterion for this legal basis is not even a novelty introduced by the GDPR: the current guidelines on the matter, “Guidelines 1/2024 on processing of personal data based on Article 6(1)(f) GDPR”, are deeply informed by the Article 29 Data Protection Working Party’s Opinion 06/2014 on the notion of legitimate interests of the data controller under Article 7(f) of Directive 95/46/EC. On the judicial front, confirmation of this approach has been given by the Court of Justice of the European Union’s (“CJEU”) decision in the Rīgas satiksme case.[5]
As for the first step, the Opinion reiterates that an interest may be regarded as legitimate if three cumulative criteria are met, namely if “the interest (1) is lawful; (2) is clearly and precisely articulated; and (3) is real and present (i.e. not speculative)” (point 68). In this regard, the Opinion offers concrete examples of what such an interest may be during the development phase – examples include the development of a “conversational” user-assistance agent or of systems that improve threat detection. These considerations, although broad, help to anchor the intention and purpose of the AI model and carry both legal and technical effects.
From a regulatory point of view, having a stated purpose for each AI model makes it easier to check whether the model has been developed improperly and whether deployment is going off track (discussed in point 64) by deviating from its stated purpose. From a technical point of view, having an explicit goal for the AI model plays a role in the selection of training data. For example, an AI model aimed at customer support would theoretically not need to be fed with data other than customer-support questions and answers. Because of the way large language models are trained and operate, this initial data selection can play a positive role in the overall quality of the AI during implementation by specialising the dataset, reducing redundant, unrelated or illegitimate answers – training only for a specific, intended purpose. This ab initio limitation of datasets can also help screen out personal data during training, by suppressing or selecting it, although, as the Opinion makes clear, the onus for this choice lies squarely on the controller, with the SA left to verify it.
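As an illustration of what such ab initio dataset limitation and suppression might look like in practice, the following hedged sketch (in Python, with invented records, tags and patterns; real pipelines rely on far more robust personal-data detection and documented governance processes) keeps only the records matching the model’s stated purpose and redacts obvious identifiers before training:

# Hedged sketch of the ab initio dataset limitation discussed above: keep only
# records tagged for the model's stated purpose (here, customer support) and
# suppress obvious personal data before training. The tags, records and
# regular expressions are illustrative assumptions only.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact(text):
    """Replace obvious personal identifiers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def curate(records, purpose):
    """Keep only records matching the stated purpose, with identifiers suppressed."""
    return [redact(r["text"]) for r in records if r.get("purpose") == purpose]

raw = [
    {"purpose": "customer_support",
     "text": "How do I reset my password? Mail me at ana@example.com"},
    {"purpose": "marketing",
     "text": "Unrelated promotional copy that the stated purpose does not require"},
]

print(curate(raw, "customer_support"))
# ['How do I reset my password? Mail me at [EMAIL]']

The design choice matters legally as much as technically: by documenting the purpose-driven selection and suppression steps, the controller produces precisely the kind of evidence an SA would need in order to verify the claims made under the legitimate interest assessment.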
With respect to the second step, regarding necessity (points 70 to 75), the Opinion introduces a twofold assessment: “(1) whether the processing activity will allow for the pursuit of the legitimate interest; and (2) whether there is no less intrusive way of pursuing this interest” (point 72). In this regard, it makes clear that processing should be placed in the full context of the fundamental rights of the data subjects, in conjunction with the data minimisation principle (point 71), which limits the collection of personal data to what is directly relevant and necessary to accomplish the specified purpose and proportionate to the legitimate interest at stake.
Point 73 broaches the sensitivity of context as regards the intended processing of personal data. The possibility of using less intrusive means of data processing may depend on “whether the controller has a direct relationship with the data subjects (first-party data) or not (third-party data).” In this regard, the Opinion rightly points to CJEU jurisprudence on the disclosure of first-party data to third parties for the purpose of legitimate interest(s), namely the judgment in Koninklijke Nederlandse Lawn Tennisbond of 4 October 2024, case C‑621/22, paragraphs 51-53.
With respect to the third step, the Opinion, in a familiar pattern, “recalls that the balancing test should be conducted considering the specific circumstances of each case.” It then provides an overview of the elements that SAs may consider when evaluating whether the interest of a controller or a third party is overridden by the interests, fundamental rights and freedoms of data subjects.
As part of the third-step balancing test (points 76 to 108), the Opinion offers a particularly broad overview of a range of subjects: specific risks to fundamental rights (drawing heavily on the Charter of Fundamental Rights of the European Union, under point 80), considerations of positive impacts (including on mental health, under point 81), impact assessment criteria (with another three-part test, under point 83), the nature of data and its varying sensitivity (point 84), the disparate nature of data breaches (point 89), data subjects’ reasonable expectations (points 92 and 94), and how GDPR compliance is not an end point for investigation and enforcement (point 97).
In this lengthy exploration of balance, a place of prominence should be given to point 102. Here, tentative solutions are advanced, such as allowing a reasonable amount of time between data collection and processing, creating unconditional opt-outs from the outset (in compliance with Article 21 GDPR) and allowing personal data breaches to be claimed under the specific terms in which they occur.
Under points 107 and 108, deployment-specific balancing measures are presented. Highlights include the deletion of personal data [paragraph (b) of point 107], but more important is the way paragraph (a) tackles the storage, regurgitation or generation of personal data.
As we advanced previously, due to the nature of large language models, they remain mathematically capable of producing “correct” results even after the deletion of data; this undercuts the solution put forward by paragraph (b), leaving only paragraph (a)’s fleeting reference to “output filters” as the most operative solution. These filters prevent regurgitation or generation by ensuring that whatever the model produces first is not presented to the user, depriving us of some of the machine’s capabilities in the name of protecting personal data.
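By way of illustration, the following minimal sketch shows the logic of such an output filter: the model’s raw completion is checked before being shown to the user, and withheld if it appears to contain personal data. The generate() stub and the patterns are assumptions made for the sake of the example, not a description of any particular provider’s actual safeguard.

# Hedged sketch of an "output filter" of the kind paragraph (a) alludes to:
# the model's raw completion is inspected before being presented, and anything
# resembling personal data causes the answer to be withheld. The model call is
# a stand-in stub and the patterns are illustrative assumptions.
import re

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # e-mail addresses
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),     # date-like strings (e.g. birth dates)
]

def generate(prompt):
    """Stand-in for a model call; a real system would query the LLM here."""
    return "You can reach the data subject at maria@example.org."

def filtered_answer(prompt):
    """Run the raw output through the filter before presenting it to the user."""
    raw = generate(prompt)
    if any(p.search(raw) for p in PII_PATTERNS):
        return "[Answer withheld: the generated text appeared to contain personal data.]"
    return raw

print(filtered_answer("Who is the complainant?"))

The sketch also makes the trade-off visible: the filter operates only on what the model outputs, leaving whatever was memorised during training untouched, which is precisely why it protects the data subject at the cost of the system’s capabilities rather than by removing the underlying data.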
Where the balancing test so dictates, data subjects’ interests, rights and freedoms override whatever legitimate interest(s) a controller, be it a first or a third party, invokes or pursues; to overcome this, controllers may implement bespoke mitigating measures to limit the impact of the processing on those data subjects. In conclusion, Opinion 28/2024 does offer a litany of examples and mitigating measures, but one can tell that, due to the rapidly evolving nature of the technology and of the challenges faced, extraordinarily wide latitude was granted to SAs, in a way that allows for less than robust solutions to be implemented.
In a document in which the expression “case-by-case basis” is used repeatedly, the protection of European data subjects’ data will also depend on the individual requirements, understandings and standards of each national supervisory authority, in a future that is still uncertain.
[1] Throughout the document, one may find another 7 uses of the term “circumstances of each case”, while other equivalent turns of phrase are also used.
[2] To this end, we refer to the suggestion of media campaigns to inform potential data subjects about the possibility, risks and consequences of their personal data being used to train AI models. Although a convincing and potentially effective means, no significant progress has been made on exactly how, in the context of data scraping, data controllers would limit their scope to the national borders of the competent authority, or on the means or mechanisms by which an international campaign could be launched. See point 103 of the Opinion.
[3] See EDPB, “EDPB opinion on AI models: GDPR principles support responsible AI”, 18 December 2024, https://www.edpb.europa.eu/news/news/2024/edpb-opinion-ai-models-gdpr-principles-support-responsible-ai_en.
[4] For an example on this very subject, the recent Authors Guild v. OpenAI Inc. [(1:23-cv-08292), District Court, S.D. New York] concerns several authors who claimed their books were used to train AI, allowing users to generate sequels or books of their purported authorship, written in their style. For a quick overview, see Stella Haynes Kiehn, “Plot twist: understanding the Authors Guild v. OpenAI Inc Complaint”, Washington Journal of Law, Technology & Arts, 5 March 2024, https://wjlta.com/2024/03/05/plot-twist-understanding-the-authors-guild-v-openai-inc-complaint/.
[5] Judgment CJEU Rīgas satiksme, 4 May 2017, case C-13/16, ECLI:EU:C:2017:336.
Picture credit: by Markus Spiske on pexels.com.