On synthetic data: a brief introduction for data protection law dummies
This blog post delimits the notion of synthetic data generated from personal data and examines its legal qualification within the European data protection framework.
by César Augusto Fontanillo López and Abdullah Elbi
Synthetic data has attracted increasing attention from technologists and legal scholars in recent years. This is especially noticeable among entities and people working on data-driven technologies, particularly in the artificial intelligence application development and testing sector, where vast volumes of data are needed. In these circles, synthetic data has become a growing trend under the “fake it till you make it” concept, promising to alleviate existing data access and analytics challenges while respecting data protection rules. Given the rising prospects and acceptance of data synthesis, there is a need to assess the legal implications of its generation and use, the starting point being the legal qualification of synthetic data.
Synthetic data is a broad concept encompassing both personally and non-personally identifiable information. In this blog entry, we focus on the intersection between synthetic data and personal data. The reason for doing so is that generating synthetic data from personal data (including hybrid data) allows a more straightforward assessment and is more suitable for the introductory purposes of this blog entry. We acknowledge that issues surrounding the qualification as personal data of existing models and background knowledge used in data synthesis may be particularly relevant to this topic. However, these issues will not be dealt with in this entry.
The present blog post hence delimits the notion of synthetic data generated from personal data and studies its legal qualification within the European data protection framework. Three main conclusions are drawn from our analysis: first, full data protection compliance prior to data synthesis would be applicable in many cases; second, according to the identifiability test, synthetic data can be considered pseudonymous or anonymous data depending on the appropriateness of the data synthesis and the related ex-post control mechanisms; third, there is a need for further discussion on how current identifiability criteria affect the concept of personal data. This blog entry has been heavily influenced by the work of Khaled El Emam.
What is synthetic data?
Synthetic data has many names, such as “fake data” or “artificial data”. Regardless of the terminology, synthetic data is, at a fundamental level, data artificially generated from original data that preserves the statistical properties of said original data.
Given an original dataset X, a synthetic dataset X' can be generated by building, for instance, a machine learning model that captures its structure and statistical distribution. The conservation of the statistical properties of X in X' is critical, as it allows data analysts to draw meaningful conclusions from the synthetic data as if they were drawn from the original data.
At the same time, it is possible to introduce a certain degree of randomness, unrelated to the original dataset, into the synthetic data generation process in order to produce datasets with high variability. On this point, it is important to control the level of randomness to ensure that the synthetic data is sufficiently diverse, yet still realistic.
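To make this concrete, the following minimal sketch (our own illustration, not a reference implementation) fits a simple generative model, a Gaussian mixture, to a toy “original” dataset X and samples a synthetic dataset X' from it. The features, the number of mixture components, and the seeds are illustrative assumptions; the random sampling step is what injects the variability described above.

```python
# A minimal sketch of model-based data synthesis (illustrative only).
# We fit a Gaussian mixture to a toy "original" dataset X and draw a
# synthetic dataset X' from the fitted distribution: X' preserves the
# overall statistical structure of X without copying its records.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(seed=0)

# Stand-in for an original dataset with, say, an age-like and an
# income-like feature (purely invented for illustration).
X = np.column_stack([
    rng.normal(45, 12, size=1_000),
    rng.lognormal(10, 0.5, size=1_000),
])

# Fit a generative model that captures the structure and distribution of X.
model = GaussianMixture(n_components=5, random_state=0).fit(X)

# Sample a synthetic dataset X' of the same size. The sampling step is
# inherently random, which yields diversity unrelated to any single record.
X_synth, _ = model.sample(n_samples=len(X))

print("original means :", X.mean(axis=0))
print("synthetic means:", X_synth.mean(axis=0))
```

In practice, far richer models (including deep generative models) are used, but the principle is the same: the fitted model, not the individual records, is the source of X'.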
Why synthetic data?
Synthetic data presents technical advantages which justify its increasing popularity. In particular, synthetic data can overcome some of the drawbacks of general database access and use, as well as provide efficient data processing in terms of time and cost.
A widespread concern about the processing of databases containing original (personal) data relates to the characteristics of such data. Quantitatively, databases may lack a sufficient amount of data. Qualitatively, databases may lack sufficient variability of data. Where the characteristics of a database are inadequate to meet particular processing needs, such as training machine learning models or testing mathematical models, its access and use might be hindered or discouraged. In such cases, given the capacity of data synthesis to produce large amounts of data with high variability, this technique can be leveraged to feed the models developed by data analysts with valuable data that is both quantitatively and qualitatively tailored to the specific processing aims, thus overcoming the deficiencies of general database access and use.
The other advantage of synthetic data is that it can optimise model development processes in terms of time and cost. Under normal circumstances, acquiring and preparing datasets can be a resource-intensive task, especially where datasets need to be labelled for supervised learning. Since the labelling of those datasets is, in many cases, done manually, human and technical resources need to be mobilised, making this activity expensive and time-consuming. Against this backdrop, it can be desirable to perform cost-effective preliminary evaluations on synthetic data models to validate assumptions and demonstrate the kind of results that could be obtained with actual models. In these scenarios, data synthesis allows rapid iteration of model training and experimentation to explore and test hypotheses before engaging in original data collection and processing for actionable model results.
Given the technical advantages offered by data synthesis and its potential as an effective anonymisation technique to access, analyse, share, reuse, and publish data without revealing personal information, a growing number of scholars view data synthesis as a tool that respects data protection requirements while stimulating technological innovation. Some authors have even compared synthetic data to a rich, calorie-free cake because of its advantages over original (personal) data processing.
How to generate appropriate synthetic data?
Data synthesis is subject to a balancing test between utility and anonymity. Utility can be understood as a measure of how well synthetic data produces analysis results similar to those that the original data would produce. Anonymity should be understood as the lack of identifiability, as derived from the definition of personal data in the GDPR. This means that the individual can neither be identified nor be identifiable, whether through direct or indirect identifiers or in combination with additional pieces of information.
As a rule of thumb, the higher the utility of a synthetic dataset, the lower its anonymity. If a synthetic dataset X' maximises utility by fitting the original dataset X very closely, anonymity would be lost because X' would be a replication of X. If a synthetic dataset X' maximises anonymity by fitting the original dataset X very loosely, utility would be lost because X' would be statistically different from X. Since, as has been argued elsewhere, it is as important to optimise the utility of the synthetic dataset as it is to prevent the re-identification of the natural person, the trade-off between utility and anonymity must therefore be correctly navigated to generate appropriate synthetic data.
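The following toy sketch (ours; the bandwidth values, proxy measures, and seed are illustrative assumptions) makes the trade-off tangible with a kernel density estimator: a tiny bandwidth makes X' near-replicate X (high utility, low anonymity), while a very large bandwidth makes X' statistically drift away from X (high anonymity, low utility).

```python
# An illustrative look at the utility-anonymity trade-off. A kernel
# density estimator's bandwidth controls how closely X' fits X: a tiny
# bandwidth near-replicates records, a huge one loses statistical utility.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(seed=1)
X = rng.normal(0, 1, size=(1_000, 2))  # stand-in original dataset

for bandwidth in (0.01, 0.5, 5.0):
    kde = KernelDensity(bandwidth=bandwidth).fit(X)
    X_synth = kde.sample(n_samples=len(X), random_state=1)

    # Utility proxy: drift of the synthetic standard deviation from X's.
    utility_gap = abs(X_synth.std() - X.std())

    # Anonymity proxy: mean distance from each synthetic record to its
    # closest original record (near zero means X' near-replicates X).
    dcr = cKDTree(X).query(X_synth)[0].mean()

    print(f"bandwidth={bandwidth}: utility gap={utility_gap:.3f}, "
          f"mean distance to closest record={dcr:.3f}")
```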
At the same time, one must consider that the very nature of this trade-off is at odds with the plausibility of generating completely anonymous datasets, that is, datasets with zero risk of re-identification, if utility also needs to be preserved. Consequently, this pushes data controllers to consider the anonymity of synthetic datasets in probabilistic terms. This implies that determining whether a synthetic dataset complies with the required anonymity standards involves, inter alia, assessing the probability of re-identification relative to an acceptable probabilistic threshold. Thus, if data synthesis is carried out poorly, the risk of re-identification can become higher, given the greater chance of record replication. On the contrary, if data synthesis is carried out properly, the risk of re-identification can be minimised. The probability of re-identification can be measured using different metrics.
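As one illustration of such a metric (a simplistic proxy of our own devising, not a standard prescribed by the GDPR or the literature), the sketch below estimates a record-replication rate and compares it against an acceptable-risk threshold; both the metric and the threshold value are illustrative assumptions.

```python
# A simplistic re-identification risk proxy compared against an
# acceptable probability threshold (both purely illustrative).
import numpy as np
from scipy.spatial import cKDTree

def replication_rate(X, X_synth, eps=1e-6):
    """Fraction of synthetic records that (near-)replicate an original record."""
    distances, _ = cKDTree(X).query(X_synth)
    return float((distances < eps).mean())

THRESHOLD = 0.05  # illustrative acceptable-risk threshold, not a legal standard

rng = np.random.default_rng(seed=2)
X = rng.normal(size=(500, 3))
# Toy "badly synthesised" dataset: 30 of its 500 records are verbatim copies.
X_synth = np.vstack([X[:30], rng.normal(size=(470, 3))])

risk = replication_rate(X, X_synth)
print(f"estimated risk proxy: {risk:.2%} "
      f"({'above' if risk > THRESHOLD else 'within'} threshold)")
```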
Is synthetic data pseudonymous data?
According to the GDPR, pseudonymous data is personal data that can no longer be attributed to a specific data subject without the use of additional information, which is to be kept separately and subject to technical and organisational measures to prevent re-identification. While the GDPR does not define the concept of attribution, we understand that the concept refers here to the use of additional information that would make the data subject identifiable. If properly generated, synthetic data cannot be attributed to a specific data subject, given its artificially generated nature. This means that the use of additional information may not pinpoint the data subject, thereby escaping the identifiability test.
Nonetheless, synthetic data can still show sufficient structural equivalence with the original dataset, or share essential properties or patterns, to trigger attribution. For instance, if the synthetic data is generated by a one-to-one transformation of the original dataset, so that each synthetic data point corresponds to an original data point, source features would be substantially maintained in the synthetic dataset, which would hence fall under the definition of pseudonymous data. This might be the case where the trade-off of data synthesis is not properly navigated, and the original dataset is kept by the controller and used as additional information to draw personal attribution. In such cases, the data protection obligations will apply tout court.
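A toy sketch of this failure mode (ours, with invented numbers): each “synthetic” record below is merely a lightly perturbed copy of one original record, so a controller who retains the original dataset can re-link almost every record.

```python
# Toy one-to-one "synthesis": each synthetic record is just a lightly
# perturbed copy of one original record, so row correspondence survives.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(seed=3)
X = rng.normal(50, 10, size=(100, 2))           # original records
X_synth = X + rng.normal(0, 0.1, size=X.shape)  # small per-record noise

# A controller holding X (the "additional information") can re-attribute
# synthetic records by matching each one to its nearest original record.
_, matched = cKDTree(X).query(X_synth)
share_relinked = (matched == np.arange(len(X))).mean()
print(f"correctly re-linked records: {share_relinked:.0%}")  # ~100%
```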
Is synthetic data anonymous data?
Anonymous data is defined as information that does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. At its core, anonymisation encompasses not only a set of techniques but also technical and organisational safeguards designed to prevent re-identification over time.
Supporters of synthetic data argue that, where synthetic data is properly generated, there is no one-to-one mapping from synthetic records back to the person, and therefore they consider synthetic data to be anonymous data. Of course, this assertion is made in statistical terms, taking into account the above-mentioned utility-anonymity trade-off. This means that, where synthetic data is properly generated, it is, statistically speaking, indistinguishable from the original data at the aggregate level, while no individual synthetic record corresponds to an actual person, thereby meeting the anonymisation standards. Under these circumstances, synthetic data is considered to eliminate the risk of re-identification and provide strong data protection guarantees.
On the other hand, opponents of synthetic data’s anonymous nature contend that, even where it is properly generated, one-to-one relationships are still possible, particularly if the synthetic dataset preserves the characteristics of the original dataset with high accuracy and/or statistical outliers are present. Based on these assumptions, they consider synthetic data to be identifiable information.
Beyond these frictions, a significant concern in data synthesis is the possibility of inferring sensitive information about the individual. For instance, an adversary might know of the presence of an individual in the original dataset and, even if the individual’s record cannot be singled out or linked, sensitive inferences might still be possible. According to the opinion of the Article 29 Working Party (A29WP), for personal data to be considered 'truly' anonymous, inferences about the characteristics of the individual must be ruled out. A pertinent data protection question is therefore whether merely meeting the inference criterion is enough to classify allegedly anonymous, synthetically generated data as personal data.
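To illustrate the inference concern (a toy sketch of our own; the features, model, and numbers are illustrative assumptions, not an attack described by the A29WP), a model trained only on synthetic data can still predict a sensitive attribute for a target whose quasi-identifiers are known, precisely because synthesis preserves the statistical link between the two.

```python
# Toy attribute-inference sketch: a model trained only on synthetic data
# can still predict a sensitive attribute from known quasi-identifiers,
# because synthesis preserves the statistical link between them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=4)

# Invented synthetic dataset: two quasi-identifiers and a correlated
# binary sensitive attribute (the correlation mimics the original data).
quasi = rng.normal(size=(1_000, 2))
sensitive = quasi[:, 0] + 0.5 * quasi[:, 1] + rng.normal(0, 0.3, 1_000) > 0

clf = LogisticRegression().fit(quasi, sensitive)

# An adversary who knows a target's quasi-identifier values can infer
# the sensitive attribute with high confidence, without any record linkage.
target = np.array([[1.2, 0.8]])
print("inferred P(sensitive):", clf.predict_proba(target)[0, 1])
```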
How to legally qualify synthetic data? Takeaways and open questions
All in all, it should be considered that synthetic data would, in many cases, be generated from original (personal) data, except where data synthesis is based on existing models or background knowledge. While the latter scenarios need separate consideration, where original (personal) data processing is at stake, compliance with European data protection legislation would be necessary, at least in the phases prior to data synthesis. This implies that the controller would still need a lawful basis to collect personal data and would be subject to the corresponding data protection obligations in relation to the type and sensitivity of the collected data and the aims pursued. Thereafter, since anonymisation is widely accepted as further processing compatible with the initial purposes, further processing of personal data for data synthesis purposes should not be problematic, provided that the data synthesis is carried out adequately and synthetic data is reliably produced.
Only after personal data has been rendered synthetic in such a manner that the data subject is no longer identifiable may European data protection law cease to apply. Yet one must note that the bar of anonymisation has been set very high by the European legislator. It comprises anonymisation techniques and post-anonymisation control mechanisms, both technical and organisational. In this sense, the question of whether synthetic data remains anonymous is not a discrete but a continuous issue. It depends on the extent to which the synthetic data deviates sufficiently from the original data to avoid identifiability, and the extent to which anonymity is sustained over time. To validate the former, the controller must check whether identification is still possible after data synthesis. To validate the latter, technical and organisational measures must be in place. Still, the possibility of deducing, with significant probability, attribute values from synthetic datasets remains an unresolved issue.
In a world where trust in anonymisation techniques has eroded in recent years, it is, therefore, necessary to discuss and agree on the desired model of identifiability on which to build data protection law.
The authors would like to thank Naser Damer, senior researcher at Fraunhofer Institute for Computer Graphics Research IGD, and Khaled El Emam, Canada Research Chair in Medical AI and Professor in the School of Epidemiology and Public Health at the University of Ottawa, for their valuable insights and suggestions in the elaboration of this blog entry.