The second Netflix Prize Contest will release even more customer information, raising concerns that the movie rental data could be linked back to individual customers, in violation of the Video Privacy Protection Act.
Paul Ohm is concerned that the data might reveal too much information:
True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of "information entropy": even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, zip code, and age, to tie many people directly to these supposedly anonymized new records.
Because of this, if it releases the data, Netflix might be breaking the law.
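The "simple arithmetic" Ohm mentions can be sketched roughly as follows. This is only a back-of-the-envelope illustration, not Ohm's own calculation: the population and ZIP code counts are round-number assumptions, the function names are hypothetical, and the model assumes people are spread evenly across every combination of ZIP code, gender, and age, which real populations are not.

```python
import math

def average_group_size(population, zip_codes, genders, age_buckets):
    """Average number of people sharing one (ZIP, gender, age) combination,
    assuming a perfectly uniform spread across all combinations."""
    return population / (zip_codes * genders * age_buckets)

def bits_revealed(zip_codes, genders, age_buckets):
    """Bits of identifying information carried by the three attributes,
    again assuming every combination is equally likely."""
    return math.log2(zip_codes * genders * age_buckets)

# Round-number assumptions for illustration only; none of these
# figures come from Netflix or from Ohm.
POPULATION = 300_000_000   # rough U.S. population
ZIP_CODES = 40_000         # rough count of U.S. ZIP codes
GENDERS = 2

# Compare exact age in years against a coarser ten-year age range.
for label, buckets in [("single year of age", 80), ("ten-year age range", 8)]:
    size = average_group_size(POPULATION, ZIP_CODES, GENDERS, buckets)
    bits = bits_revealed(ZIP_CODES, GENDERS, buckets)
    needed = math.log2(POPULATION)  # bits needed to single out one person
    print(f"{label}: about {size:.0f} people per group, "
          f"{bits:.1f} of {needed:.1f} bits revealed")
```

Under these assumptions a group averages about 47 people with exact ages and about 470 with ten-year ranges. Sparsely populated ZIP codes would produce far smaller groups than the average suggests.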
Netflix's Steve Swasey wants to assure customers that the data cannot be traced back to them:
There’s no foundation for the concerns raised.
Netflix zealously guards its members’ privacy. All the information we’re giving in The Netflix Prize 2 dataset is completely anonymous. It contains no personally identifiable information. It does not contain anyone’s name, address, date of birth or any means to connect a particular record with a specific Netflix member. As in Netflix Prize 1, the dataset contains some movie ratings from select anonymous members. It also includes some Queue adds and taste preferences, broad age ranges, gender and zip codes but, again, completely anonymous. All that data is modified – our scientists call it perturbed – to make it anonymous.
Thanks to Jim for sending this in.