Data Privacy: Machine Learning and the GDPR
Ana Guerra
Datasets are essential for the research and development of models in the fields of Natural Language Processing (NLP) and Machine Learning (ML). However, as the collection, storage, and use of data increase, so do concerns about data privacy.
To stay in line with best practices, it is important to understand what data privacy means and how it is regulated. This post therefore offers a brief overview of how data privacy is regulated within the European Union. Beyond complying with EU regulation, data-driven projects also have to be ethically responsible, so this article closes with a few words on ethics when processing personal data.
Data privacy
In a nutshell, data privacy can be described as a person’s right to decide whether and how their personal data may be used by businesses and organizations. It guarantees the right to express opinions in private, without being surveilled, and to keep one’s personal data private.
Within the European legal framework, data privacy is itself considered a fundamental right (Art. 8, Charter of Fundamental Rights of the European Union). It is also connected to other human rights, such as the rights to privacy and free speech.
Concerns about data privacy arise whenever personally identifiable information is collected, used, or stored by companies or organizations. In this sense, data privacy is different from data security, which is concerned with protecting personal data against attacks and unauthorized access.
In Germany, it is European regulation that defines what counts as personal data and under which circumstances it may be processed.
European regulation
Besides being considered a fundamental right by the European Charter of Fundamental Rights, data privacy and protection are regulated by the General Data Protection Regulation, commonly referred to as the GDPR. It is one of the strictest data privacy regulations in the world and has been enforced since 2018.
In general, the GDPR applies if at least one of the following two conditions is met:
The data is collected or processed by any organization based in the EU or the UK.
The personal data belongs to a person based in any of the EU member states or the UK.
The second condition is particularly relevant, since it gives the regulation extraterritorial scope. That means that foreign businesses and organizations are also subject to the GDPR if the personal data they process originates from people located in the EU/UK (Art. 3(2), GDPR). In this case, it does not matter where the company is based or where the data is stored.
As pointed out by Rogers et al., a direct consequence of this is an expanded application of the European regulation to large-scale NLP models and resources, which are likely to use web and social media data produced by people based in the EU or the UK.
Principles
The GDPR sets out a number of principles regarding data privacy and protection. These principles apply to the collection, storage, and use of personal data.
Personal data is defined, in this context, as “any information relating to an identified or identifiable natural person” (Art. 4(1), GDPR). That includes, of course, names and addresses, but it may also include other data that can be linked to a person, such as gender, web cookies, and location.
The general idea behind these principles is that individuals should be able to control how and for which purposes their personal data is used. They should also be able to know and control who has access to their data.
According to the GDPR, personal data should (Art. 5, Principles, GDPR):
be used fairly, lawfully and transparently;
be collected, stored and used for specific purposes;
be kept only for the time necessary to achieve those specific purposes (see the retention sketch after this list);
be processed only if necessary and in a manner compatible with the purposes defined during data collection;
be kept up to date; and
be protected from unlawful use.
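To make the purpose- and storage-limitation principles a bit more concrete, here is a minimal sketch in Python (the purposes and retention periods are hypothetical; nothing here comes from the GDPR itself) in which every record carries the purpose it was collected for, and records are purged once the retention period attached to that purpose has expired:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical retention periods per collection purpose.
RETENTION = {
    "newsletter": timedelta(days=365),
    "model_training": timedelta(days=180),
}

@dataclass
class Record:
    user_id: str
    purpose: str            # the specific purpose stated at collection time
    collected_at: datetime
    payload: dict           # the personal data itself

def purge_expired(records: list[Record], now: datetime | None = None) -> list[Record]:
    """Keep only records whose purpose-specific retention period has not expired."""
    now = now or datetime.utcnow()
    # Records with an unknown purpose get a retention period of zero, i.e. they are dropped.
    return [r for r in records if now - r.collected_at <= RETENTION.get(r.purpose, timedelta(0))]
```

The point of the sketch is simply that purpose and retention are attached to the data itself, so that “keep it only as long as necessary” can be enforced mechanically instead of relying on ad-hoc clean-ups.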
In addition to following the principles above, consent is in most cases required before any personal data is processed.
Consent
Consent, which is also regulated by the GDPR (Art. 7), needs to be:
informed and specific: the consent request states the name of the controller that is going to process the data, how and why the personal data is going to be used, and what information is going to be stored and processed.
freely given: a person has a genuine free choice to give or not to give consent to the use of their personal data. It is possible to consent to only some of the data operations. And consent must not be tied to the performance of a contract, unless the processing is necessary for its fulfillment (Recital 43, GDPR).
easy to refuse or to withdraw: a person can refuse or withdraw a given consent, and demand to have their personal data erased. This should be as easy as giving consent, and it derives from individuals’ right to be forgotten (Art. 17, Recitals 65 and 66, GDPR). This right is, however, not absolute, and the GDPR specifies when the right to be forgotten does not apply (Recital 65).
clear and unambiguous: the pre-formulated declaration of consent provided by the controller needs to be written in an intelligible form, using clear and plain language.
In general, personal data can only be collected after consent has been given. Yet, for some special categories of personal data, such as health data, consent may not be required if the processing is carried out on grounds of public interest.
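To illustrate what these requirements mean for the bookkeeping of a data-driven project, here is a minimal sketch in Python (field names are hypothetical, and this is an illustration, not legal advice) of a consent record that is tied to a specific controller and purpose, and that can be withdrawn as easily as it was given:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ConsentRecord:
    user_id: str
    controller: str                      # who is going to process the data
    purpose: str                         # the specific, stated purpose
    given_at: datetime
    withdrawn_at: datetime | None = None

    def withdraw(self) -> None:
        """Withdrawing consent should be as easy as giving it."""
        self.withdrawn_at = datetime.utcnow()

    @property
    def active(self) -> bool:
        return self.withdrawn_at is None

def may_process(consents: list[ConsentRecord], user_id: str, purpose: str) -> bool:
    """Only process data for a purpose if an active consent for that exact purpose exists."""
    return any(c.user_id == user_id and c.purpose == purpose and c.active for c in consents)
```

Storing consent per purpose, rather than as a single yes/no flag, is what makes it possible to honour partial consent and later withdrawal.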
Ethics and best practice
In a previous article, we discussed ethics in NLP projects and how sampling can shape NLP models. In addition, data-driven projects which make use of personally identifiable information should also follow ethical principles.
The many data privacy scandals of recent years have shown how harmful the inappropriate use of private data can be, and how damaging it can be to a company’s reputation.
In this context, and taking the GDPR into consideration, before building any data-driven model with personal data, a company or organization has to define the specific purposes for which this data will be used and ask for informed consent. Moreover, it is important to lay out an honest and open data privacy policy. In some cases, solutions that de-identify personal data can also be useful to ensure data protection.
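As an example of what de-identification can look like in practice, the sketch below (plain Python; the patterns and the salted hashing are illustrative choices, not a complete solution) redacts e-mail addresses and phone-like numbers from free text and replaces user identifiers with stable pseudonyms. Note that this is pseudonymization rather than anonymization: under the GDPR, pseudonymized data is still personal data (Recital 26), so it reduces risk but does not lift the regulation’s obligations.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact_text(text: str) -> str:
    """Replace obvious direct identifiers in free text with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

def pseudonymize_id(user_id: str, salt: str) -> str:
    """Map a user identifier to a stable pseudonym; keep the salt secret and stored separately."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]

print(redact_text("Contact me at jane.doe@example.com or +49 170 1234567."))
# -> Contact me at [EMAIL] or [PHONE].
print(pseudonymize_id("user-42", salt="keep-me-secret"))
```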
Unfortunately, there is no magic recipe for implementing data privacy standards in ML and NLP projects. While some projects don’t rely on personal data, others need to find solutions to ensure data protection. In the latter cases, an open and transparent strategy for handling private information should be designed according to the project’s requirements. But one thing is certain: besides having to comply with EU standards, any data solution should also be ethically responsible.