IPAM Workshop Wrap-Up
Last week was the Statistical and Learning-Theoretic Challenges in Data Privacy, which I co-organized with Cynthia Dwork, Steve Fienberg and Sesa Slavkovic. As I explained in my initial post on the workshop, the goal was to tie together work on privacy in statistical databases with the theoretical foundations of learning and statistics.
- Slides for most talks are online
- Blog posts: Arvind N., Jon K. #1, #2 (see also an older post by Ben R.)
The workshop was a success. For one thing, I got a new result out of it and lots of ideas for problems to work on. I even had fun most of the time1.
— A shift in tone —
More importantly, I felt a different tone in the conversations and talks at this workshop than at a previous ones involving a similar crowd. For the first time, most participants seemed to agree on what the important issues are. I’ve spent lots of time hanging out with statisticians recently, so this feeling may not have been shared by everyone. But one change was objectively clear: the statisticians in the crowd have become much better at describing their problems in computational terms. I distinctly remember encountering fierce resistance, at the original 2005 CS-Stats privacy workshop in Bertinoro, when we reductionist CS types tried to get statisticians to spell out the procedures they use to analyze social science data.
“Analysis requires judgement. It is as much art as science,” they said (which we translated as, “Recursion, shmecursion. We do not know our own programs!”).
“But can’t you try to pin down some common objectives?”, we answered.
— The “computational lens” at work —
An interesting feature of several talks was the explicit role of “computational” perspective. Both Frank McSherry and Yuval Nardi used techniques from numerical analysis, namely gradient ascent and the Newton-Raphson method, to design protocols which were both more efficient and easier to analyze than previous attempts based on a more global, structural perspective. Frank described a differentially private algorithm for logistic regression, joint with Ollie Williams; Yuval described an efficient SFE protocol for linear regression, joint with Steve Fienberg, Rob Hall, and others.
— Two under-investigated ideas —
At the wrap-up session (see the notes), I pointed out two directions that I think have been investigated with much less rigor than they deserve:
“Cryptanalysis” for database privacy
It would be nice to have a systematic study of, and standard nomenclature for, attacks on privacy/anonymity in statistical databases. Right now it seems every paper ends up defining (or not defining) a model from scratch, yet many papers are doing essentially the same thing in different domains. Even an incomplete taxonomy would be helpful. Here are a few terms I’d like to see becoming standard:
- linkage attack
- reconstruction attack
- composition attack (my personal favorite)
On a related point, it would be nice to see a good categorization of the kinds of side information that gets used. For example, Johannes Gehrke at Cornell and his students have a few papers laying out categories of side information (I have issues with some of the positive results in those papers, but I think the quantification of side information is interesting).
Relaxed definitions of privacy with meaningful semantics
This is probably a topic for a much longer post, but briefly: it would be nice to see meaningful definitions of privacy in statistical databases that exploit the adversary’s uncertainty about the data. The normal approach to this is to specify a set of allowable prior distributions on the data (from the adversary’s point of view). However, one has to be careful. The versions I have seen are quite brittle. Some properties to keep in mind when considering new definitions:
- Side information: is the class of priors rich enough to incorporate complex side information, such as an anonymization of a related database? [see composition above]
- Convexity and post-processing, as in Dan Kifer’s talk
- Equivalent, “semantic” characterizations [e.g. here, here]
— Other notes —
- The majority of the talks were completely or partly on differential privacy. Notable exceptions: Brad Malin, Xiaofeng Wang, Ravi Kumar, Jiashun Jin, Yuval Nardi. Our goal was not to have such a preponderance of differential privacy talks, but some of the people we expected to talk about other things (like Jerry Reiter) decided to focus on differential privacy. Tailoring the talk to the crowd?
- The nonspeaker participants were heavily skewed towards CS. In particular, at least [see comments!] four professors (Gerome Miklau, Anupam Gupta, Jonathan Katz, Yevgeniy Dodis) and three postdocs (Katrina Liggett, Anand Sarwate, Arvind Narayanan) from CS departments attended just to listen to the talks; I recognized only one stats postdoc (Saki Kinney). I also recognized lots of UCLA locals there too from CS (Yuval Ishai, Rafi Ostrovsky, Amit Sahai) but none from statistics.
- The rump session + posters combination worked very well (despite my earlier doubts). Rump session slides are online.
1 Serious sleep deprivation due to jet-lagged kids and talk prep made the “fun” part occasionally difficult.