In this article I’ll present four possible solutions that organisations can implement to protect the privacy of their users and customers:
1. Data pseudonymisation.
2. Data anonymisation.
3. Collecting only anonymous data and using open source software, such as DuckDuckGo.
4. Collecting anonymous, purpose-defined data together with open source software.
The last of these solutions came out of the workshop “AI 360º”, which we attended in Copenhagen.
Ulrik Jørgensen, a philosopher and professor at Aalborg University (Denmark); Elise Lassus, a researcher at the European Union Agency for Fundamental Rights (FRA); Steen Rasmussen; and I contributed to this solution. We also include the solutions of Stephan Engberg from his project, Citizen Key.
Is data protection by design and by default, in the context of the GDPR, the solution to protect data subjects from being profiled?
Article 25 of the GDPR mentions two moments at which companies have to think about data protection: ‘at the time of the determination of the means for processing’ and ‘at the time of the processing itself’.
This means that when engineers are planning a product or establishing a particular business process, companies (and their engineers) must already think about privacy and have a plan for how to protect it.
Privacy by design establishes a privacy that is proactive and that reduces risks. It is embedded in whatever an organisation is doing, both in building products and in business operations, as Ann Cavoukian established in her seven principles of Privacy by Design.
Article 25 of the GDPR continues by saying that the key requirements for organisations are to adopt privacy measures that are both organisational and technical.
Two privacy-by-design techniques are explored below: pseudonymisation and anonymisation of personal data.
1. Data pseudonymisation
Recital 26 equates data that has undergone pseudonymisation to personal data.
Why do we need to pseudonymise if pseudonymous data are still personal data? Answers to this question can be found in Recital 28 and Recital 29.
Recital 28 says that pseudonymisation can ‘reduce the risks to the data subjects concerned and help controllers and processors to meet their data-protection obligations’. Because pseudonymisation is a preventive measure, it helps with GDPR compliance.
Recital 29 adds that, in order to create incentives to apply pseudonymisation when processing personal data, measures of pseudonymisation should be possible whilst still allowing general analysis.
With pseudonymisation strategies, data subject IDs are replaced by a pseudonym (for example, a hash of the user ID). This means that a company does not know which user carried out a specific action, but it can still know that one and the same individual did A, B, C and D.
If a static hash is used, a company can follow an individual’s actions over an extended period of time. This makes business decisions easier but, as shown in Figure 4, it also implies less privacy, because the company can still profile users.
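The difference between a static and a changing pseudonym can be illustrated with a minimal Python sketch. It is only an illustration under my own assumptions: it uses a keyed hash (HMAC-SHA256) instead of a plain hash, because a plain hash of a guessable ID can be reversed by simply hashing candidate IDs, and the key, the function names and the monthly period label are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the data controller, stored apart from the dataset.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def static_pseudonym(user_id: str) -> str:
    """The same user always gets the same pseudonym (long-term profiling stays possible)."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def rotating_pseudonym(user_id: str, period: str) -> str:
    """The pseudonym changes whenever the period label (e.g. '2024-05') changes."""
    message = f"{period}:{user_id}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


# Hypothetical event log: (user ID, period, action).
events = [
    ("alice@example.com", "2024-05", "A"),
    ("alice@example.com", "2024-05", "B"),
    ("alice@example.com", "2024-06", "C"),
]

static_view = [(static_pseudonym(u), action) for u, _, action in events]
rotating_view = [(rotating_pseudonym(u, p), action) for u, p, action in events]

# Static: all three actions share one pseudonym, so they form one long profile.
print(len({pseudonym for pseudonym, _ in static_view}))
# Rotating: actions from different months fall under different pseudonyms.
print(len({pseudonym for pseudonym, _ in rotating_view}))
```

With the rotating variant, actions A and B can still be linked within a month, but action C can no longer be attached to the same profile without the key. In both cases the data remain personal data, since whoever holds the key can recompute the mapping.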
2. Data anonymisation
Anonymised datasets are more secure than pseudonymised datasets, as there is no direct way to recover the identity of an individual. How can anonymised data be created and validated? There are multiple anonymisation models and schemes that are considered sufficient under the regulations.
The WP29 Opinion 05/2014 gives two options to check whether a dataset is anonymised:
Option 1: Your dataset has none of the following properties:
a. Singling out, which corresponds to the possibility to isolate some or all records that identify an individual in the dataset.
b. Linkability, which is the ability to link at least two records concerning the same data subject or group of data subjects (either in the same database or in two different databases). If an attacker can establish (e.g. by means of correlation analysis) that two records are assigned to the same group of individuals but cannot single out individuals in this group, the technique provides resistance against ‘singling out’ but not against linkability.
c. Inference, which is the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.
OR Option 2: Perform a re-identification risk analysis.
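To make two of these properties more concrete, here is a small Python sketch that checks a toy table for singling out and for inference. It is a simplified illustration, not the method prescribed by the Opinion: the dataset, the column names and the choice of quasi-identifiers are all hypothetical, the grouping approach is borrowed from the k-anonymity and l-diversity models, and linkability across databases is not checked here.

```python
from collections import defaultdict

# Hypothetical toy dataset: two quasi-identifiers plus one sensitive attribute.
records = [
    {"age_band": "30-39", "postcode": "21**", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "21**", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode": "22**", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "22**", "diagnosis": "asthma"},
    {"age_band": "50-59", "postcode": "23**", "diagnosis": "flu"},
]
QUASI_IDENTIFIERS = ("age_band", "postcode")  # assumed to be known to an attacker
SENSITIVE = "diagnosis"


def group_by_quasi_identifiers(rows):
    """Group records that look identical on the quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        groups[key].append(row)
    return groups


def k_anonymity(rows):
    """Size of the smallest group; k = 1 means someone can be singled out."""
    return min(len(group) for group in group_by_quasi_identifiers(rows).values())


def singled_out(rows):
    """Quasi-identifier combinations that match exactly one record."""
    groups = group_by_quasi_identifiers(rows)
    return [key for key, group in groups.items() if len(group) == 1]


def inferable(rows):
    """Groups of two or more records that all share one sensitive value."""
    groups = group_by_quasi_identifiers(rows)
    return [
        key
        for key, group in groups.items()
        if len(group) > 1 and len({row[SENSITIVE] for row in group}) == 1
    ]


print(k_anonymity(records))  # 1
print(singled_out(records))  # [('50-59', '23**')]
print(inferable(records))    # [('40-49', '22**')]
```

In this toy table the person aged 50-59 is unique and can be singled out, while the two people aged 40-49 cannot be told apart but both have the same diagnosis, so the attribute can be inferred without identifying either of them.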
The first option is much stronger than the second in the sense that, in the first option, you have to show that you can prevent what is called attribute inference. Companies have to show that they are not able to infer any attribute of the individuals that are part of the dataset.
Figure 5 shows how the datasets must be anonymised so that the training set is fed only with anonymised data.
In the second option, companies are only concerned with re-identification, which is also called identity inference. You have to show that you are not able to recover the identity of the individuals that are part of the anonymised dataset.
However, re-identification is not the only problem, because many things can be learned from an anonymised dataset even if individuals cannot be identified. Big data techniques are already actively used for exactly that.
The main issue is attribute inference, yet privacy advocates tend to focus on unique identification as the main attack on our privacy. Data minimisation and anonymisation are strategies to avoid being singled out.
Companies often tell users that sharing their data is safe because they ‘anonymise’ the data by first removing or obfuscating the personal information; however, this depersonalisation leads to only partial anonymity, as companies still usually store and share data grouped together.
This grouped data can be analysed and, in many cases, linked back to the user’s identity based on its content.
Data de-anonymisation of this nature has taken place time and time again when companies release so-called ‘anonymised data’, even with good intentions, for example, for research purposes.
For instance, even though efforts were made to anonymise the data, individuals were still de-anonymised from the movie ratings released by Netflix and from the search histories released by AOL.
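The general mechanism behind this kind of de-anonymisation, often called a linkage attack, can be sketched in a few lines of Python. This is a generic illustration and not a reconstruction of the Netflix or AOL cases: both tables, the names and the chosen matching fields are hypothetical.

```python
from collections import defaultdict

# Hypothetical 'anonymised' release: names removed, quasi-identifiers kept.
released = [
    {"postcode": "2100", "birth_year": 1984, "sex": "F", "searches": "knee surgery recovery"},
    {"postcode": "2100", "birth_year": 1990, "sex": "M", "searches": "cheap flights"},
    {"postcode": "2200", "birth_year": 1984, "sex": "F", "searches": "mortgage rates"},
]

# Hypothetical public source that contains names and the same fields.
public = [
    {"name": "Person A", "postcode": "2100", "birth_year": 1984, "sex": "F"},
    {"name": "Person B", "postcode": "2200", "birth_year": 1975, "sex": "M"},
]

MATCH_FIELDS = ("postcode", "birth_year", "sex")


def link(released_rows, public_rows):
    """Re-attach names to released rows that match exactly one public record."""
    index = defaultdict(list)
    for row in released_rows:
        index[tuple(row[f] for f in MATCH_FIELDS)].append(row)

    matches = {}
    for person in public_rows:
        candidates = index.get(tuple(person[f] for f in MATCH_FIELDS), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches[person["name"]] = candidates[0]["searches"]
    return matches


print(link(released, public))  # {'Person A': 'knee surgery recovery'}
```

No name was ever stored in the released table, yet one person’s search history is recovered simply because the combination of postcode, birth year and sex is unique.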
3. Collect only anonymous data and use only open source software
As we saw in the previous section, companies and institutions can anonymise their datasets, but this is not enough to ensure privacy, transparency and control over personal data. In this section, we consider a third possibility: collecting only anonymous data as a way of empowering individuals and protecting their privacy and confidentiality.
Anonymous data are not connected to information that can identify an individual; however, this is not about ‘just’ collecting anonymous data. In order to protect data anonymity, companies and organisations must take further security measures so datasets cannot be linked and users cannot be de-anonymised.
Un-linkability is crucial for disabling cross-contextual aggregation of individual profiles, for example, by using credentials or attributes instead of full identification. Further security best practices also have to be followed, including encryption and technical measures at the organisational level.
Combining this with open source automated decision-making (ADM) processes, together with transparency about how the resulting ADM conclusions are used, provides the highest degree of transparency and privacy protection.
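The idea of sharing attributes instead of a full identity can be sketched as follows. This is only a data-minimisation illustration under my own assumptions: the profile, the context names and the attribute names are hypothetical, and a real credential system would additionally need cryptographic proof that the disclosed attribute is genuine.

```python
from datetime import date

# Hypothetical full identity record, which stays on the user's side.
profile = {
    "name": "Person A",
    "birth_date": date(1990, 6, 15),
    "country": "DK",
}

# Each context lists the only attributes it is allowed to receive.
CONTEXT_ATTRIBUTES = {
    "age_restricted_shop": ("is_adult",),
    "regional_pricing": ("country",),
}


def age_on(birth_date: date, today: date) -> int:
    """Completed years between birth_date and today."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def disclose(user_profile: dict, context: str, today: date) -> dict:
    """Return only the derived attributes that the given context needs."""
    derived = {
        "is_adult": age_on(user_profile["birth_date"], today) >= 18,
        "country": user_profile["country"],
    }
    return {name: derived[name] for name in CONTEXT_ATTRIBUTES[context]}


print(disclose(profile, "age_restricted_shop", date(2024, 1, 1)))  # {'is_adult': True}
```

The shop learns that the customer is an adult, but it receives neither the name nor the birth date, so its records cannot easily be joined with those of another service.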
4. Anonymous and purpose-defined data together with open source software.
Complete anonymity is neither possible nor desirable for all types of cyberspace interactions. You want your doctor to know who you are to be able to help you with health issues.
Utility companies need to know the addresses to which they deliver power and water, and for most private person-to-person communications there is a desire at both ends to be certain about the identity of the other party. Also, cybercrime is difficult to manage in an anonymous cyberspace.
Do solutions exist that address the above? A solution should use anonymous data as much as possible, minimise the use of personal data, and transmit personal data in encrypted form.
Minimising the use of personal data is the intent of the GDPR, and it can be achieved, for example, by creating purpose-determined data for each cyberspace process, as suggested by Engberg, among others.
The content of the purpose-determined data can be either anonymous or identifiable, depending on the purpose. Thus, each new process in cyberspace would have a new identifier, depending on whether we communicate for e-commerce, health, banking, private conversations, etc. For example, if we want to provide some of our health data or other relevant data for a research project, we could do that anonymously.
In contrast, some of these same data should be identifiable if they concern an ongoing treatment by our doctor. Implementing software systems that enable the creation of purpose-determined data sessions on top of our current digital infrastructure is highly recommended, as it would either eliminate or reduce most of our current complex online personal data protection and security issues.
It could be both developed and implemented piecemeal, e.g. one sector or group of individuals at a time, and it could then grow organically to include more sectors and groups.
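A purpose-determined identifier, as described above, could look roughly like the following Python sketch. It is my own minimal illustration and not the design of Engberg’s Citizen Key project: the class, the function, the purpose labels and the idea of deriving identifiers from a user-held secret with HMAC-SHA256 are all assumptions.

```python
import hashlib
import hmac
import secrets
from typing import Optional


class PurposeIdentity:
    """Holds a user-side master secret and derives one identifier per purpose."""

    def __init__(self, master_secret: Optional[bytes] = None):
        # The secret never leaves the user's device in this sketch.
        self._master = master_secret or secrets.token_bytes(32)

    def identifier_for(self, purpose: str) -> str:
        """Stable within one purpose, different and unlinkable across purposes."""
        digest = hmac.new(self._master, purpose.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]


def open_session(identity: PurposeIdentity, purpose: str,
                 identifiable: bool, real_name: Optional[str] = None) -> dict:
    """Build the data a service receives for one purpose-determined session."""
    session = {"purpose": purpose, "session_id": identity.identifier_for(purpose)}
    if identifiable:
        session["name"] = real_name  # only disclosed when the purpose requires it
    return session


identity = PurposeIdentity()

# Anonymous contribution of health data to a research project.
research = open_session(identity, "research-study", identifiable=False)

# Identifiable session for an ongoing treatment by our doctor.
treatment = open_session(identity, "doctor-treatment", identifiable=True,
                         real_name="Person A")

print(research)
print(treatment)
```

The research project and the doctor see different identifiers for the same person, and only the doctor receives a name, so the two datasets cannot be joined by anyone who does not hold the master secret.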
We discuss anonymous and purpose-defined data in Figure 6. Collecting only anonymous or purpose-determined data reduces data privacy issues significantly.
Adding strong data protection protocols (e.g. anonymised databases) for purpose-determined data (when necessary) eliminates the identification of individuals as well as the tracking of their presence on the internet.
Inference, linkability and singling out also become impossible. Combining this with open source automated decision-making (ADM) processes, together with transparency about how the resulting ADM decisions are used, provides the highest degree of transparency.
The solution in Figure 6 is an option for companies and organisations to pursue if they want to demonstrate honesty and transparency and to maximise privacy protection in their data collection, processing and profiling decisions.
Please note that no single solution fits all cases. Automated decision-making (ADM) should not be made public for certain critical infrastructures, e.g. automated coordination of metro traffic or power distribution.
Complete openness could expose potential infrastructure vulnerabilities for misuse. In such situations independent experts, under democratic oversight, should have access to the ADM, while ADM details should be kept out of the public eye.
Here ends today’s article, which proposed several practical solutions to protect the privacy of individuals and groups of individuals. Once again, thanks to Ulrik Jørgensen, Elise Lassus and Stephan Engberg for their clear vision.
As always, thanks for reading.