## Differential privacy and the secrecy of the sample

(This post was laid out lazily, using Luca's lovely latex2wp.)

** — 1. Differential Privacy — **

Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (*e.g.* the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.

The idea behind the definition is that *users* (that is, people getting access to aggregate information) should not be able to tell whether a given individual's data has been changed.

More formally, a data set is just a subset of items in a domain $D$. For a given data set $x \subseteq D$, we think of the server holding the data as applying a randomized algorithm $A$, producing a random variable $A(x)$ (distributed over vectors, strings, charts, or whatever). We say two data sets $x, x'$ are *neighbors* if they differ in one element, that is, $|x \,\triangle\, x'| = 1$.

Definition 1. A randomized algorithm $A$ is $\epsilon$-differentially private if, for all pairs of neighbor data sets $x, x'$, and for all events $S$ in the output space of $A$:

$$\Pr[A(x) \in S] \;\le\; e^{\epsilon} \cdot \Pr[A(x') \in S].$$

This definition has the flavor of *indistinguishability* in cryptography: it states that the random variables $A(x)$ and $A(x')$ must have similar distributions. The difference from the usual cryptographic setting is that the distance measure is multiplicative rather than additive; this is important for the semantics of differential privacy (see this paper for a discussion).
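As a toy illustration (mine, not from the post), consider *randomized response*, the textbook $\epsilon$-differentially private mechanism on a single bit: report the true bit with probability $e^{\epsilon}/(1+e^{\epsilon})$, and the flipped bit otherwise. A minimal Python sketch, checking Definition 1's multiplicative bound exactly:

```python
import math

def rr_prob(true_bit: int, reported: int, eps: float) -> float:
    """Probability that randomized response on `true_bit` outputs `reported`:
    tell the truth with probability e^eps / (1 + e^eps), lie otherwise."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return p_truth if reported == true_bit else 1.0 - p_truth

eps = 0.5
# The two possible inputs, 0 and 1, are neighboring data sets of size one.
for out in (0, 1):
    ratio = rr_prob(0, out, eps) / rr_prob(1, out, eps)
    assert ratio <= math.exp(eps) + 1e-12  # Definition 1's inequality holds
```

For this mechanism the likelihood ratio is exactly $e^{\pm\epsilon}$, so randomized response saturates the bound in Definition 1.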

I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.

** — 2. Sampling to Amplify Privacy — **

To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.

Roughly, it states that given a $1$-differentially private algorithm, one can get a $2\beta$-differentially private algorithm, for any $\beta \in (0,1)$, at the cost of shrinking the size of the data set by a factor of $\beta$.

Suppose $A$ is a $1$-differentially private algorithm that expects data sets from a domain $D$ as input. Consider a new algorithm $A^{\beta}$, which runs $A$ on a random subsample of the points from its input:

Algorithm 2 (Algorithm $A^{\beta}$). On input $\beta \in (0,1)$ and a multi-set $x \subseteq D$:

- Construct a set $T \subseteq x$ by selecting each element of $x$ independently with probability $\beta$.
- Return $A(T)$.
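In code, the wrapper is tiny. A sketch in Python (here `A` stands for any private subroutine; the construction is oblivious to what it computes):

```python
import random

def subsample_then_run(A, x, beta, rng=None):
    """Algorithm A^beta: keep each element of x independently with
    probability beta, then run the private algorithm A on the subsample."""
    rng = rng or random.Random()
    T = [item for item in x if rng.random() < beta]
    return A(T)
```

Note that the subsample's size is itself random (Binomial in the input size and $\beta$); what the argument exploits is that whether any particular element made it into the subsample stays secret.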

Lemma 3 (Amplification via sampling). If $A$ is $1$-differentially private, then for any $\beta \in (0,1)$, the algorithm $A^{\beta}$ is $2\beta$-differentially private.
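Here is a sketch of the standard argument (my reconstruction; the full details are in the FOCS 2008 paper). Let $x' = x \cup \{i\}$ be a neighbor of $x$, let $T$ denote the random subsample drawn from $x$, and condition on whether $i$ is selected:

```latex
\Pr[A^{\beta}(x') \in S]
  \;=\; \beta \,\mathbb{E}_{T}\Pr[A(T \cup \{i\}) \in S]
        \;+\; (1-\beta)\,\mathbb{E}_{T}\Pr[A(T) \in S]
  \;\le\; \bigl(\beta e + 1 - \beta\bigr)\,\Pr[A^{\beta}(x) \in S]
  \;=\; \bigl(1 + \beta(e - 1)\bigr)\,\Pr[A^{\beta}(x) \in S],
```

where the inequality uses $\Pr[A(T \cup \{i\}) \in S] \le e \cdot \Pr[A(T) \in S]$, which holds since $A$ is $1$-differentially private. Since $1 + \beta(e-1) \le e^{\beta(e-1)} \le e^{2\beta}$, and the matching lower bound follows the same way from $\Pr[A(T \cup \{i\}) \in S] \ge e^{-1}\Pr[A(T) \in S]$, the algorithm $A^{\beta}$ is $2\beta$-differentially private.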

## IPAM Workshop on Privacy and Statistical (Learning) Theory

I am on the organizing committee (with Cynthia Dwork, Steve Fienberg, and Sesa Slavkovic) for an upcoming workshop at UCLA’s Institute for Pure and Applied Mathematics (IPAM). The workshop is on the relationship between database privacy and the theoretical foundations of statistics and machine learning. It is imaginatively titled:

Statistical and Learning-Theoretic Challenges in Data Privacy

(February 22-26, 2010)

(because the catchier “What Can We Learn Privately?” was already taken).

The workshop web page describes the basic thrust pretty concisely:

The goal of the workshop is to establish a coherent theoretical foundation for research on data privacy. This implies work on (1) how the conflicting goals of privacy and utility can or should be formulated mathematically; and (2) how the constraints of privacy—in their various incarnations—affect the accuracy of statistical inference and machine learning. In particular, the goal is to shed light on the interplay between privacy and concepts such as consistency and efficiency of estimators, generalization error of learning, robustness and stability of estimation algorithms, and the generation of synthetic data.

The workshop is born of (what I consider) an exciting research program with potential payoffs both for how sensitive data is managed (see, *e.g.*, Abe Flaxman’s post on a recommendation for HIPAA’s overhaul) as well as statistics and statistical learning theory. For more detailed discussion, see:

- The survey paper by Cynthia Dwork and me, based on these two papers.
- The workshop proposal.

Participation is open to essentially anyone; to make it easier, **IPAM has funding to help some attendees with their travel costs**, especially students and other junior researchers. You can apply through the IPAM web page.

Several excellent researchers have already confirmed that they will speak (see the web page for the current list). I am especially happy about the breadth of the areas they hail from: crypto, algorithms, social science statistics, nonparametric statistics, theoretical and applied machine learning, and health data privacy, among others. Of special note, there will be four tutorials aimed at helping the diverse audience actually communicate:

- Larry Wasserman and Martin Wainwright will speak about the basic foundations of statistics and statistical learning theory;
- Two other people (possibly Cynthia Dwork and myself) will discuss the definitional approaches to privacy that have come out of the CS literature, especially differential privacy, and also the worst-case analysis perspective that is common to TCS papers.

The exact format and content of the tutorials is still t.b.d., so suggestions (either directly to me or via comments on this post) would be welcome.

### Why the workshop?

Good interdisciplinary work is notoriously hard. The first barrier is linguistic: new terminology, definitions, measures of accuracy/loss, etc (“like a U.N. meeting without the benefit of translators”, as Dick Lipton recently put it, describing some of Blum and Furst’s initial interactions with AI folks). Nevertheless, the *terminology* barrier can be overcome relatively easily (*i.e.*, on a scale of months or years) in theoretical fields with clean definitions and theorems, such as theoretical statistics and learning.

The more subtle barrier, and one that usually takes much more work to overcome, is one of *perspective*. Merely using the right language will get you partway, but “the wider point of view of [people in other fields] can be harder to grok” (Noam Nisan). What problem are they *really* trying to solve? What is their criterion for evaluating the quality or interest of a new idea? To add to the confusion, a field that looks monolithic from the outside may in fact be a highly heterogeneous mix of subdisciplines. For example, the question “what do statisticians actually (need to) do?”, which many of us working on data privacy have wondered aloud, has approximately as many answers as there are statisticians doing things…

As far as I can tell, these perspective barriers are best overcome by osmosis: spend as much time as possible interacting with a wide variety of people from the other field. I think theoretical work provides an especially fruitful venue for interaction because its products (namely definitions and theorems) are more universally interpretable. Of course, this opinion may simply be a result of my own preference for theoretical work…

So how does one get these interactions going? External stimuli, like deadlines for collaborative grant proposals, can help, but grant proposals require people who are already committed to working together. Workshops and conferences are also an important venue. Regarding data privacy, there were several successful workshops encouraging CS-statistics interaction: Bertinoro in 2005, NSF in 2007, CMU in 2007, DIMACS in 2008, NCHS in 2008 (no more web pages for two of those, unfortunately). The upcoming IPAM workshop is the first with an explicitly theoretical bent; I am hoping it will be an even greater success.