ICO’s Consultation on Generative AI

Understanding personal data protection risks for web-scraped data to train Generative AI models and the principle of purpose limitation 

Generative AI mostly relies on data from the internet. But what is the right way to collect this data? In this article, we focus on the data protection risks outlined in the Information Commissioner’s Office’s (ICO) recent consultation series, namely the first chapter, which dealt with the lawful basis for using web-scraped data to train Generative AI models, and the second chapter, which elaborates on purpose limitation in the Generative AI lifecycle. Although the consultations are now closed, we thought it would be helpful to highlight some of the key takeaways from each.

Generative AI models learn from a corpus of existing content to produce new outputs that resemble the learned material without exact replication of the original. These models are usually trained on large datasets, which enables them to have a wide range of general-purpose abilities. However, how the existing content is collected to be fed into the model at the learning stage is a contentious issue. 

Generative AI has garnered particular attention in the past couple of years, especially for its natural language processing capabilities, for which ChatGPT is the best-known example. Organisations are now looking for ways to adopt Generative AI solutions to automate routine tasks and make work more efficient.

However, not enough consideration is usually given to the potential risks of using Generative AI, including data protection risks. The lack of skills in the workforce to use available AI tools, as well as the absence of a unified data strategy across organisations, are also causes for concern.

The ICO consultation series examines areas where organisations need more guidance on how data protection law affects Generative AI. These areas include having a suitable legal basis for training Generative AI models, how the principle of purpose limitation applies to Generative AI development and use, and the responsibilities for complying with the accuracy principle and data subject rights. 

The working fuel of Generative AI is data such as text, images and videos, which can be collected from various sources, including huge swathes of publicly available content. Developers of Generative AI models may collect data directly through web-scraping, obtain already web-scraped data from another organisation, or use a combination of both approaches.

The goal of web-scraping is to convert the data available on web pages, which is typically designed for human reading and interaction, into a structured format that can be used for analysis, storage in databases, or further processing by automated systems.
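To illustrate the conversion described above, the following is a minimal sketch of the scraping step: turning an HTML page designed for human reading into a structured record suitable for storage or analysis. The page content and field names are hypothetical, and this uses only the Python standard library; real pipelines typically rely on dedicated libraries (such as BeautifulSoup or Scrapy) and must respect robots.txt and applicable law.

```python
# Hypothetical sketch: extract a page title and paragraph text from raw
# HTML into a structured dict, using only the standard library.
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Collects the page <title> and the text of <p> elements."""

    def __init__(self):
        super().__init__()
        self._current = None      # tag we are currently inside
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def scrape(html: str) -> dict:
    """Convert raw HTML into a structured record."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "paragraphs": [p.strip() for p in parser.paragraphs],
    }


page = ("<html><head><title>Example post</title></head>"
        "<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>")
record = scrape(page)
print(record["title"])       # Example post
print(record["paragraphs"])  # ['First paragraph.', 'Second paragraph.']
```

A record like this is exactly the kind of structured output that may incidentally capture personal data posted on the page, which is why the data protection questions below arise.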

This data may contain personal data posted on various websites, blogs and social media, and the ICO notes that the internet may also contain information that was not placed there by the person to whom it relates, for example, discussion forums or leaked information. 

Web-scraping carries data protection risks, particularly when it involves personal data. Key concerns include the exposure of personal information, potentially leading to exploitation; data breaches with legal and financial repercussions; and the violation of user expectations and consent, as scraping often occurs without users' knowledge.

Acquiring web-scraped data from another organisation can raise additional personal data protection challenges. It prompts questions about the lawfulness of both the original collection and the transfer of the gathered data from the provider organisation.

By way of analogy, parallels can be drawn with obtaining personal data databases from third-party organisations for direct marketing activities. The organisations receiving those databases are accountable for ensuring the integrity of the data provided. They must adhere to legal requirements regarding the received data, and they must address any complaints concerning its use.

It is essential to navigate these issues carefully to protect individuals' privacy rights.

In the view of the ICO, five out of the six possible lawful bases for web-scraping are unlikely to be available to organisations. The ICO therefore decided to focus its consultation on legitimate interests as the potential lawful basis for web-scraping. The ICO does not expand on why the other lawful bases would be inappropriate in this case, as this appears to be self-explanatory. For example, consent must be freely given, specific, informed and unambiguous to be valid, and it is unlikely that data subjects would have consented to such use of their personal data published on the internet.

The ICO further highlights that, for the processing to be lawful, it must also not infringe upon any other laws beyond data protection. Developers, therefore, must be mindful of the requirements of other laws when carrying out web-scraping or obtaining web-scraped data from another organisation. 

Organisations relying on legitimate interests as the lawful basis for personal data processing must carry out the three-part test, which is as follows:

Purpose test: are you pursuing a legitimate interest?

The first step would be for the organisation to formulate their legitimate interest in web-scraping data. In the ICO’s view, an organisation would need to do so in a “specific, rather than open-ended way, based on what information they have access to at the time of collecting the training data.” 

Such legitimate interests could include commercial gain as well as wider societal interests. The key is for developers to be able “to evidence the model’s specific purpose and use” to make sure that downstream use of the Generative AI model will comply with data protection requirements and respect individuals’ rights and freedoms.

Necessity test: is the processing necessary for that purpose?

The organisation needs to evidence that the processing is necessary to achieve the purpose and that the same results cannot reasonably be achieved in a less intrusive way. The ICO recognises that “most Generative AI training is only possible using the volume of data obtained through large-scale scraping”.

Balancing test: do the individual’s interests override the legitimate interest?

The third and final step is to balance the organisation’s interest(s) against the rights and freedoms of individuals. 

Collecting data through web-scraping is a form of ‘invisible processing’, where individuals do not know that their personal data is being processed in this way. This can lead to individuals losing control over their personal information and how organisations use it, making it hard for them to exercise their rights under UK data protection law. Both invisible processing and AI-related activities are considered high-risk and require a Data Protection Impact Assessment (DPIA), as recommended by ICO guidance.

As well as the upstream risks discussed above, there may be downstream risks and harms involved, such as generating inaccurate information, which may lead to distress or reputational damage. These also include the use of social engineering tactics to create phishing emails and other adversarial attacks.

The degree to which organisations developing Generative AI can reduce downstream risks and harms depends on how the models are brought to market. The ICO’s consultation outlines risk mitigation measures to consider when carrying out a balancing test. These depend on how the models are deployed: by the initial developer, by a third party through an API, or by providing the model itself to a third party.

The ICO notes that where the initial developer makes an AI model available to third parties, it will have much less control over its downstream use. This means that any wider societal interest may not be realised in practice. For this reason, the ICO recommends that organisations carefully consider the balancing test, especially in cases where they will not be able to exercise meaningful control over the model’s downstream use.

The purpose limitation principle requires organisations developing Generative AI to demonstrate a clear understanding of the purpose for processing personal data before launching such a process. The purpose for processing personal data at the lifecycle stages of Generative AI may vary.

The ICO emphasises that organisations must be transparent about their reasons for processing personal data, ensuring it aligns with what individuals would reasonably expect. This purpose must be lawful and not infringe on other regulations. 

Without prioritising separate data protection objectives at all stages of Generative AI development/deployment, compliance with the following core data protection principles will be very difficult:

  • data minimisation principle: the data is necessary for the purpose
  • lawfulness principle: the use of the data for that purpose is lawful
  • transparency principle: the purpose has been explained to the individuals the data relates to
  • fairness principle: the purpose falls within people’s reasonable expectations or it can be explained why any unexpected processing is justified
  • whether the stated purpose aligns with the scope of the processing activity and the organisation’s ability to determine that scope

By clarifying and narrowing the purposes of data processing for each stage of the Generative AI lifecycle, organisations can determine what personal data they need to process and establish the appropriate rights and responsibilities for the data controllers and processors involved.

On a more practical note, the ICO stresses the importance and mandatory nature of the data protection by design and by default approach. In addition, a basic questionnaire is provided to be completed by organisations seeking to process personal data for the development or deployment of Generative AI.

As Generative AI advances, it is vital for organisations to stay informed about privacy implications and engage with the evolving legal frameworks. The ICO's ongoing consultations offer a forum for addressing these complex matters.

Our additional thoughts: Actions to mitigate the privacy risks 

We understand that Generative AI brings both potential benefits and privacy concerns. In addition to the ICO’s approach set out above, we believe it is important to recognise the main risks associated with this technology: 

  • Data privacy concerns with AI using personal data to create synthetic outputs
  • Privacy breach or data leakage, where AI-generated content leads to unintended personal information disclosure
  • Authenticity issues and other deceptive AI practices with AI creating convincing but false content like deepfakes
  • AI hallucinations: producing believable but non-factual or nonsensical outputs
  • Data leakage risks, where 'prompt injection' attacks reveal sensitive training data

Addressing these risks is crucial and requires strong data protection and ethical standards in AI development and use.

Organisations can address privacy concerns with Generative AI by implementing both technical measures and procedural safeguards. Key strategies to reduce privacy threats and promote the ethical application of Generative AI technologies may include:

  • Training: Initiating user education and awareness programs to promote the responsible use of AI and safeguard against data breaches
  • Governance: Instituting an AI governance framework to delineate oversight responsibilities and protocols, encompassing strategies for pinpointing, evaluating, and mitigating privacy risks inherent in these systems, as well as establishing procedures for documenting and addressing any incidents or breaches concerning privacy
  • Risk assessment: Classifying AI systems by their impact on data protection and privacy, and conducting risk assessments to identify potential threats, considering both the technical and human factors that may affect privacy in Generative AI systems

If you have any queries or would like further information, please visit our data protection services section or contact Christopher Beveridge.
