Anonymization and Pseudonymization
How a frozen pizza can kill your trustworthiness.
“In order to protect your privacy, we only use anonymized data.”
I guess we have all read or heard this statement an increasing number of times in the last few years, but anonymization is not as easy as one would think. There are a number of famous examples of why it may not be enough to simply remove identifiers, such as the names of the individuals in a database. Much has been written on this subject that I don’t want to repeat here. If you haven’t encountered them yet, some of the most famous stories are the ones about the Massachusetts state employees' health database (see Example 1), the AOL search queries, or the Netflix prize data set.
Instead, I would like to give you an artificial example. The reason for this is that I will re-use it in the follow-up blog post on differential privacy, so give me a few moments to introduce this example here.
Let’s assume we have a database of customers of a particular grocery store, who get their purchases logged via customer cards. An effort to anonymize the data has already been made, so the names of these customers have been replaced by aliases. However, there is still quite some personal data left in, which is supposed to be used for internal customer studies. Say that in our example, the database contains age, place of residence and gender of the customers, as well as information about their purchases and purchase habits. Say that the intended use of the database is to study different customer groups, in order to understand them better and consequently adapt product selection and pricing.
Now imagine that I am an employee of the grocery store with access to the database, and I am extremely interested in my neighbor’s purchases. (My neighbor claims that she usually prepares all food from scratch, but I believe she just heats up a pizza from the freezer more often than not.) So instead of doing my usual customer group studies, I try to identify my neighbor, who I happen to know does all of her grocery shopping at this store. This turns out to be pretty easy. I know where she lives, of course, and I roughly know her age as well. It turns out that there is only one person in the database with these attributes, who must then be my neighbor. So now I just let the database show me the purchase profile of this customer.
There it is! Now I know for sure that my neighbor just pretends to cook everything from scratch, but instead mostly just puts a pizza in the oven. Also, I’ve exposed her as a liar and hypocrite.
Let’s look at this example from a more conceptual viewpoint. Apparently, the anonymization of the database was not sufficiently strong to prevent a so-called linkage attack. By linking the database with side-information about my neighbor, I was able to re-identify her and thus to obtain some possibly sensitive personal data (about both her grocery shopping habits and her moral values).
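To make the linkage attack concrete, here is a minimal sketch in Python. The records, aliases, and attribute names are all invented for illustration, and a real attacker would query a database rather than filter a list in memory, but the mechanics are the same: combine side-information (approximate age, place of residence) until only one record is left.

```python
# A tiny mock-up of the pseudonymized customer database from the example.
# All aliases, attributes, and values are invented for illustration.
customers = [
    {"alias": "C001", "age": 34, "residence": "Elm Street", "gender": "F",
     "top_purchase": "frozen pizza"},
    {"alias": "C002", "age": 51, "residence": "Oak Avenue", "gender": "M",
     "top_purchase": "fresh vegetables"},
    {"alias": "C003", "age": 28, "residence": "Oak Avenue", "gender": "F",
     "top_purchase": "pasta"},
]

def linkage_attack(db, age_range, residence):
    """Link side-information (approximate age, place of residence)
    against the pseudonymized records to narrow down the candidates."""
    return [c for c in db
            if age_range[0] <= c["age"] <= age_range[1]
            and c["residence"] == residence]

# I roughly know my neighbor's age, and I know exactly where she lives:
matches = linkage_attack(customers, age_range=(30, 40), residence="Elm Street")
print(matches)  # -> exactly one record: the alias is broken
```

The alias never had to be reversed; the remaining "harmless" attributes were identifying enough on their own.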
Just replacing identifiers with aliases, without caring about re-identifiability, is also called pseudonymization. For example, the ISO 29100 standard, which is currently the only published privacy standard, follows this terminology. The term anonymization then usually stands for a stronger notion: removing personally identifiable data from the overall data so that it is no longer possible to re-identify an individual from the remaining data, for example by using side-information.
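As a rough illustration of how little pseudonymization does by itself, it can be as simple as swapping the identifier column for a random alias. This is a hypothetical sketch (the function and field names are my own invention), and note that the alias table itself becomes sensitive data that must be kept secret:

```python
import secrets

def pseudonymize(records, id_field="name"):
    """Replace the identifier field with a random alias, keeping all
    other attributes untouched. The alias table must be stored
    separately and kept secret."""
    alias_table = {}
    out = []
    for rec in records:
        alias = alias_table.setdefault(rec[id_field],
                                       "C" + secrets.token_hex(4))
        pseudo = dict(rec)
        pseudo[id_field] = alias
        out.append(pseudo)
    return out, alias_table

records = [
    {"name": "Alice", "age": 34, "residence": "Elm Street"},
    {"name": "Bob",   "age": 51, "residence": "Oak Avenue"},
    {"name": "Alice", "age": 34, "residence": "Elm Street"},  # repeat visit
]
pseudonymized, alias_table = pseudonymize(records)
# The names are gone, but age and residence are still there --
# exactly the attributes the linkage attack in the story needed.
```

The same person always maps to the same alias, which is what makes the data useful for customer studies, and also what keeps it linkable.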
Well, getting the terms right is a first step, but how does one actually anonymize if it’s not so simple? There is a family of more sophisticated anonymization methods, namely k-anonymity and its siblings, including l-diversity and t-closeness. The basic idea is that there should be sufficiently large groups of people with the same identifying attributes, and that the spread of the sensitive personal data within each group should be sufficiently large. More on these methods in an older blog post.
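To give an idea of what k-anonymity measures, here is a small sketch (with invented data and helper names) that computes the k of a data set: the size of the smallest group of records that agree on all quasi-identifier values. Generalizing exact ages into decades is one common way to make those groups larger:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k of a data set: the size of the smallest group of
    records sharing the same values on all quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return min(groups.values())

records = [
    {"age": 34, "residence": "Elm Street"},
    {"age": 38, "residence": "Elm Street"},
    {"age": 51, "residence": "Oak Avenue"},
    {"age": 55, "residence": "Oak Avenue"},
]
print(k_anonymity(records, ["age", "residence"]))      # -> 1: each record is unique

# Generalizing the exact age into decades merges records into groups:
generalized = [dict(r, age=r["age"] // 10 * 10) for r in records]
print(k_anonymity(generalized, ["age", "residence"]))  # -> 2
```

With k = 1 my neighbor is uniquely identifiable; after generalization every quasi-identifier combination covers at least two people, so the linkage attack no longer singles anyone out (though l-diversity and t-closeness are needed to also protect the sensitive values within each group).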
An alternative to these anonymization methods, which does not alter the data but instead restricts access to it, is differential privacy. There is a lot of research on differential privacy, and if you like statistics and/or mathematics you will also enjoy its elegant theory. But I think the basic ideas can be explained in pretty concrete terms, sweeping most of the theory under the rug. So now I’ll do a classic cliffhanger and tell you that I’ll give you an introduction in the next blog post tomorrow.
Christine Jost, Ericsson Research