Last week was the Statistical and Learning-Theoretic Challenges in Data Privacy, which I co-organized with Cynthia Dwork, Steve Fienberg and Sesa Slavkovic. As I explained in my initial post on the workshop, the goal was to tie together work on privacy in statistical databases with the theoretical foundations of learning and statistics.
- Slides for most talks are online
- Blog posts: Arvind N., Jon K. #1, #2 (see also an older post by Ben R.)
The workshop was a success. For one thing, I got a new result out of it and lots of ideas for problems to work on. I even had fun most of the time1.
— A shift in tone —
More importantly, I felt a different tone in the conversations and talks at this workshop than at a previous ones involving a similar crowd. For the first time, most participants seemed to agree on what the important issues are. I’ve spent lots of time hanging out with statisticians recently, so this feeling may not have been shared by everyone. But one change was objectively clear: the statisticians in the crowd have become much better at describing their problems in computational terms. I distinctly remember encountering fierce resistance, at the original 2005 CS-Stats privacy workshop in Bertinoro, when we reductionist CS types tried to get statisticians to spell out the procedures they use to analyze social science data.
“Analysis requires judgement. It is as much art as science,” they said (which we translated as, “Recursion, shmecursion. We do not know our own programs!”).
“But can’t you try to pin down some common objectives?”, we answered.
— The “computational lens” at work —
An interesting feature of several talks was the explicit role of “computational” perspective. Both Frank McSherry and Yuval Nardi used techniques from numerical analysis, namely gradient ascent and the Newton-Raphson method, to design protocols which were both more efficient and easier to analyze than previous attempts based on a more global, structural perspective. Frank described a differentially private algorithm for logistic regression, joint with Ollie Williams; Yuval described an efficient SFE protocol for linear regression, joint with Steve Fienberg, Rob Hall, and others.
— Two under-investigated ideas —
At the wrap-up session (see the notes), I pointed out two directions that I think have been investigated with much less rigor than they deserve:
“Cryptanalysis” for database privacy
It would be nice to have a systematic study of, and standard nomenclature for, attacks on privacy/anonymity in statistical databases. Right now it seems every paper ends up defining (or not defining) a model from scratch, yet many papers are doing essentially the same thing in different domains. Even an incomplete taxonomy would be helpful. Here are a few terms I’d like to see becoming standard:
- linkage attack
- reconstruction attack
- composition attack (my personal favorite)
On a related point, it would be nice to see a good categorization of the kinds of side information that gets used. For example, Johannes Gehrke at Cornell and his students have a few papers laying out categories of side information (I have issues with some of the positive results in those papers, but I think the quantification of side information is interesting).
Relaxed definitions of privacy with meaningful semantics
This is probably a topic for a much longer post, but briefly: it would be nice to see meaningful definitions of privacy in statistical databases that exploit the adversary’s uncertainty about the data. The normal approach to this is to specify a set of allowable prior distributions on the data (from the adversary’s point of view). However, one has to be careful. The versions I have seen are quite brittle. Some properties to keep in mind when considering new definitions:
- Side information: is the class of priors rich enough to incorporate complex side information, such as an anonymization of a related database? [see composition above]
- Convexity and post-processing, as in Dan Kifer’s talk
- Equivalent, “semantic” characterizations [e.g. here, here]
— Other notes —
- The majority of the talks were completely or partly on differential privacy. Notable exceptions: Brad Malin, Xiaofeng Wang, Ravi Kumar, Jiashun Jin, Yuval Nardi. Our goal was not to have such a preponderance of differential privacy talks, but some of the people we expected to talk about other things (like Jerry Reiter) decided to focus on differential privacy. Tailoring the talk to the crowd?
- The nonspeaker participants were heavily skewed towards CS. In particular, at least [see comments!] four professors (Gerome Miklau, Anupam Gupta, Jonathan Katz, Yevgeniy Dodis) and three postdocs (Katrina Liggett, Anand Sarwate, Arvind Narayanan) from CS departments attended just to listen to the talks; I recognized only one stats postdoc (Saki Kinney). I also recognized lots of UCLA locals there too from CS (Yuval Ishai, Rafi Ostrovsky, Amit Sahai) but none from statistics.
- The rump session + posters combination worked very well (despite my earlier doubts). Rump session slides are online.
1 Serious sleep deprivation due to jet-lagged kids and talk prep made the “fun” part occasionally difficult.
Once again this year, theorists and their friends were well represented among the Sloan Research Fellows. These are the names I recognize:
- Joel Tropp
- Jonathan Kelner
- Constantinos Daskalakis
- Amin Saberi
- Brent Waters
- Balázs Szegedy
Congratulations to all!
Like most theorists I collaborate a lot with remote colleagues1. Recent technological developments like the telephone have made this much easier, but I am ready to move on to something even more advanced.
I have started to use this Internet thing. For example, I use cvs or subversion on most projects to track and synchronize files (see Suresh’s recent post for an explanation of why). I call people on Skype when the connection is good enough for video, and use a mix of email, IM and Wave2 to jot down notes in real time about the conversation.
One thing is missing in my collaborative online life, though, and that is a good white board. Ideally, I would like a live holographic image of my colleagues in my office, together with sound and a remotely controllable marker for my whiteboard. Until I write a paper with the doctor from Star Trek Voyager, I’m willing to settle for something less ambitious: an interactive white board on which several people can simultaneously draw, paste/drop in graphics, and write text and latex equations that get formatted in real time.
There are a few free web services and software packages out there that offer something along these lines:
- Scriblink: this has a lot of what I want but a very clunky interface.
- Imagination^3: better drawing interface, no latex/math support
- Latex plugins for Gaim: adds simple latex support to most IM platforms. No drawing. Also, annoying to install on a Mac since it requires a bunch of other packages to be installed first.
- Jarnal: haven’t tried it yet.
A question for readers, then: what other similar tools are out there? More generally, what software has improved your remote collaborations?
1 I mean physically remote. Some of them are also emotionally remote, but so far neither the telephone nor the Internet have helped them.
2 Although Google’s recent Buzz privacy fiasco made me reconsider using Wave.
It wasn’t exactly in stealth mode, but I heard about Data Privacy Day 2010 only after it happened.
Born of an effort to promote awareness of data privacy issues by the non-profit The Privacy Projects, this year’s celebration (?) included events at a several universities. Most interesting to me was a roundtable discussion at UC Berkeley sponsored by the Federal Trade Commission. I’m skeptical about how much the federal government will do about protecting privacy, but it is good to see serious interest.
This year’s events concentrated on consumer privacy and its apparent conflict with emerging business models. My recent research has been on handling privacy concerns in “statistical databases” — large collections of sensitive information that we would like to open up to wider scrutiny and analysis. Unsurprisingly, I would like to see “Data Privacy Day” also cover this aspect of data privacy. There is a danger, though, that the topic becomes too diffuse. What are really the most pressing privacy issues, and what should a broad “data privacy” awareness event cover?
UPDATE (2/3/10): Arvind was part of the roundtable and has some notes on it at 33 bits. He includes there some interesting comments on academics’ participation in policy discussions. I’ll add only that at Penn State, quite a few faculty members are involved in policy, but mostly away from the public eye. For example, two weeks ago I met with a White House official about privacy issues in the release of government data sets; I’ve also been involved in (executive branch) panels on government handling of biometric data. However, it is true that public participation in policy discussions by academics is limited. That may be because many academics realize they would make bad politicians; as Arvind notes, misaligned incentives also play a role.
Two interesting reads this morning on military security.
First, Matt Blaze blogs about physical and procedural security measures at a decommissioned nuclear ICBM silo. Startling pictures and lots of food for thought. The cultural and game theoretic aspects of our current conflicts are pretty different from those of the cold war; I was surprised to find myself looking back with something like nostalgia at the bright ideological lines of my childhood.
“MAD [Mutually Assured Destruction] may well be the most perfectly evocative acronym in the English language, but for 40 years, it actually seemed to work. Leaders on both sides evidently knew a Nash equilibrium when they saw one. […]
A few hundred of the successors to the Titans, the “Minuteman III” missiles, remain active in silos throughout the northern US. […] Looking up from the bottom of the silo at the little crack of sunlight 150 feet above, an obvious fact hit home for me. I realized at that moment that these things are actually aimed somewhere, somewhere not at all abstract.”
On an unrelated (?) topic, the Washington Post (discussed by Harry Lewis) reports that US drones have been transmitting video in the clear, and that militants in Iraq and Afghanistan have been watching avidly.
“The U.S. government has known about the flaw since the U.S. campaign in Bosnia in the 1990s… But the Pentagon assumed local adversaries wouldn’t know how to exploit it, [current and former] officials said.”
That explains it. But surely, in 15+ years, someone would have had time to patch the hole.
“Fixing the security gap would have caused delays, according to current and former military officials. It would have added to the Predator’s price.”
Huh? Software encryption is cheap! All right, maybe dedicated hardware costs more. Maybe even thousands of dollars. Ok, now I get it.
“Today, the Air Force is buying hundreds of Reaper drones, a newer model, whose video feeds could be intercepted in much the same way as with the Predators, according to people familiar with the matter. A Reaper costs between $10 million and $12 million…”
PS: Of course, the security of video transmissions may be the least of the problems that the extensive use of drones in Pakistan and Afghanistan raises. Reading the Post article this morning reminded me of this Fresh Air interview from a few months ago on the use of robotic weapons in general.
— 1. Differential Privacy —
Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.
The idea behind the definition is that users–that is, people getting access to aggregate information–should not be able to tell if a given individual’s data has been changed.
More formally, a data set is just a subset of items in a domain . For a given data set , we think of the server holding the data as applying a randomized algorithm , producing a random variable (distributed over vectors, strings, charts, or whatever). We say two data sets are neighbors if they differ in one element, that is, .
Definition 1 A randomized algorithm is -differentially private if, for all pairs of neighbor data sets , and for all events in the output space of :
This definition has the flavor of indistinguishability in cryptography: it states that the random variables and must have similar distributions. The difference with the normal cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.
I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.
— 2. Sampling to Amplify Privacy —
To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.
Roughly it states that given a -differentially private algorithm, one can get an -differentially private algorithm at the cost of shrinking the size of the data set by a factor of .
Suppose is a -differentially private algorithm that expects data sets from a domain as input. Consider a new algorithm , which runs on a random subsample of points from its input:
Algorithm 2 (Algorithm ) On input and a multi-set
- Construct a set by selecting each element of independently with probability .
- Return .
Lemma 3 (Amplification via sampling) If is -differentially private, then for any , is -differentially private.
Leo Reyzin pointed out the following EFF article on locational privacy. The article does a reasonable job of explaining the role that new technology can have in both violating and protecting privacy.
I can’t count how many times I’ve had a similar conversation (ok, monologue) with government folks and non-technical friends. Maybe next time I can spare myself the trouble and email them the link.