Privacy has taken center stage in recent times: as more and more of our data is digitized and shared online, data privacy is at greater risk. A single company may possess the personal information of millions of customers, which it uses to understand their needs and build products. Companies (and even governments) need to keep these data private, both to protect sensitive information and to keep their image from being tarnished. At the same time, such a large data set can be used to understand the problems of the population at scale, and hence is genuinely useful. So how do we find a way that enables the use of this data for good while still protecting the privacy of the individual?
To resolve this conflict of interest, we have something called differential privacy.
Let’s say we are working with a sensitive data set and would like to release only some aggregate statistics to the public. However, an adversary could take the released statistics and reverse engineer them to recover parts of the sensitive data set. This reverse engineering is a breach of privacy. Differential privacy can be used to solve problems when three ingredients are present: sensitive data, curators (who need to release the statistics), and an adversary who wants to recover the sensitive data.
In this article, we start by understanding what differential privacy is; then we look at how it works in general and briefly touch on its technicalities, its applications, and its limitations; and finally we conclude by looking at its prospects.
What is differential privacy?
Generally, a company releases data by simply removing or altering sensitive fields like name, address, age, etc. This is called ‘data anonymization’. But this technique has certain problems. Firstly, the owner of the data set decides what has to be removed or changed, so it is completely at their discretion. Secondly, it raises the question: how anonymous is the data, really? In 2007, Netflix released a data set of user ratings as part of a competition to see if anyone could outperform its collaborative filtering algorithm. It had removed the names of its users and perturbed some ratings to make the data anonymous. However, two computer scientists from the University of Texas were still able to re-identify users, by their estimate with up to 99% accuracy, by combining this data set with auxiliary information available on IMDb (this type of breach is called a linkage attack).
Since data anonymization is not enough to protect sensitive information, differential privacy offers a solution. Differential privacy is a rigorous mathematical definition of privacy. An algorithm is said to be differentially private if, by looking at its aggregate output, one cannot decode or reverse engineer any individual’s data. The algorithm guarantees that individual-level data is not revealed.
How does it work? Let’s take the example of a database of credit ratings. There may be adversaries who want to know the names of the people who have a bad credit rating. Therefore, a differentially private algorithm adds “noise” to the true answer of a query. So if the number of people who have a bad credit rating is, say, 3, then the algorithm adds some noise to it, and the released output will be around 3 but not the exact number. This noise protects individual data by introducing a controlled amount of uncertainty. Since the curator knows the distribution of the added noise, the error can be accounted for when aggregating, and the statistics remain accurate at the population level. So if the reported data has inaccuracies, how does it help? Essentially, differential privacy strikes a balance between statistical utility for the public and privacy for the individual.
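The credit-rating example above can be sketched in a few lines of code. This is a minimal illustration, not any company’s production implementation; it assumes a simple counting query (whose answer changes by at most 1 when one person’s record is added or removed) and the Laplace noise mechanism mentioned later in the article. The names `laplace_noise` and `noisy_count` are made up for this sketch.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponential variables with mean
    # `scale` is Laplace-distributed with that scale.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(true_count: int, epsilon: float) -> float:
    # A counting query has sensitivity 1 (one person changes the
    # answer by at most 1), so Laplace noise with scale 1/epsilon
    # gives epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)

# The true number of people with a bad credit rating is 3;
# each released value is "around 3" but rarely exactly 3.
print(noisy_count(3, epsilon=0.5))
```

Because the noise has mean zero, averaging many such noisy answers (or aggregating over a large population) recovers an accurate statistic, which is how utility is preserved while any single answer stays deniable.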
A crucial feature of differential privacy is that it defines privacy in quantifiable terms: not as a binary notion of whether a person’s data was exposed or not, but rather as a cumulative risk. Every time a person’s data is processed, the risk of their being exposed increases. Therefore, the rigorous mathematical definition of differential privacy has parameters (epsilon and delta) that quantify this privacy loss.
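In symbols, the standard definition (as given by Dwork and Roth) says that a randomized algorithm \(M\) is \((\varepsilon, \delta)\)-differentially private if, for any two data sets \(D\) and \(D'\) that differ in a single person’s record, and for any set of possible outputs \(S\):

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
```

A smaller epsilon means the output distributions with and without any one person are nearly indistinguishable, so less can be learned about that person; delta allows a small probability of exceeding the epsilon bound.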
Let’s look at a simple working example of such an algorithm.
Suppose we ask individuals in a sample a simple yes-or-no question like “Do you have health insurance?”, and an individual’s true answer is yes. The mechanism takes the entry and flips a coin. If the result is heads, the final output is the original answer, yes. However, if the result is tails, the system flips a second coin: heads means the output is yes, and tails means the output is no. As a result, any reported “yes” has at least a 25% chance of coming from a person who has not actually taken the policy, which adds noise (a known probability of inaccuracy) to the data. Now an adversary looking at the data and other auxiliary information cannot know for sure that a given record reflects an individual’s true answer, which prevents linkage attacks and breaches of privacy.
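The coin-flipping procedure above is the classic “randomized response” technique, and it can be sketched directly. This is a simplified illustration under the two-fair-coins assumption described in the text; the helper name `estimate_true_fraction` is made up for this sketch.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    # First coin: heads -> report the true answer.
    if random.random() < 0.5:
        return true_answer
    # Tails -> second coin decides the reported answer at random.
    return random.random() < 0.5

def estimate_true_fraction(reported_yes_fraction: float) -> float:
    # A true "yes" is reported as "yes" with probability 0.75, and a
    # true "no" with probability 0.25, so the reported fraction is
    # r = 0.75*p + 0.25*(1 - p) = 0.25 + 0.5*p. Invert to recover p.
    return (reported_yes_fraction - 0.25) / 0.5
```

This shows the balance the article describes: each individual answer is deniable (any “yes” may be a coin flip), yet the curator, who knows the noise process, can still estimate the population-level proportion accurately.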
The coin-toss algorithm is just a simplified illustration of a differentially private system. Real-world differential privacy systems typically use the Laplace distribution, which spreads the noise over a larger range and increases the level of anonymity. In “The Algorithmic Foundations of Differential Privacy” by Cynthia Dwork and Aaron Roth, the mathematical definition of differential privacy guarantees that the outcome of a survey remains essentially the same whether or not any single individual participates in it. Therefore, you have no reason not to participate in the survey, as your individual data will not be exposed.
So now we know what differential privacy is and how it works. Let’s see where it is being applied and what its limitations are.
Differential privacy is used in many privacy tools that companies can use to protect their customers’ data. Currently, Apple and Google use differential-privacy-based software. Apple started rolling out differential privacy in iOS 10 and macOS Sierra. It uses it to collect data on which websites use the most power, which emojis are used in a certain context, and the kinds of words people type that are not in the dictionary. Apple’s implementation is documented, but it is not open source. Google, however, has been developing an open-source differential privacy library. It uses differential privacy in Chrome to conduct studies on malware, and in Google Maps to collect data about traffic in cities.
However, not many companies use differential privacy because of certain limitations. Firstly, differential privacy is useful only for large data sets; on a small data set, the added noise can overwhelm the signal and lead to inaccurate output. Secondly, its implementation is more difficult than simply releasing the data after anonymization.
Consider the famous incident from the mid-1990s, when the Group Insurance Commission decided to release records of hospital visits by Massachusetts state employees. The computer scientist Latanya Sweeney showed how easy it was to identify individuals in the data set using just three indicators: zip code, gender, and date of birth. In this way she was able to retrieve the records of the Governor of Massachusetts. This example reminds us why it is important to adopt differential privacy: simple anonymization is not enough.
With the growing use of digitized data, there is a need to protect our data, and differential privacy offers a way to strike a balance between data utility and privacy loss. Though its adoption is limited now, it has a lot of scope.
- Differential Privacy: A Survey of Results, by Cynthia Dwork
- The Algorithmic Foundations of Differential Privacy, by Cynthia Dwork and Aaron Roth
- A Brief Introduction to Differential Privacy, by Georgian
- Understanding Differential Privacy: From Intuitions behind a Theory to a Private AI Application, by Nguyen
- Differential Privacy, Harvard University Privacy Tools Project