Differential privacy and the secrecy of the sample
— 1. Differential Privacy —
Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.
The idea behind the definition is that users (that is, people getting access to aggregate information) should not be able to tell if a given individual’s data has been changed.
More formally, a data set is just a finite multiset of items from a domain $D$. For a given data set $x$, we think of the server holding the data as applying a randomized algorithm $A$, producing a random variable $A(x)$ (distributed over vectors, strings, charts, or whatever). We say two data sets $x, x'$ are neighbors if they differ in one element, that is, if $|x \triangle x'| = 1$.
Definition 1 A randomized algorithm $A$ is $\epsilon$-differentially private if, for all pairs of neighbor data sets $x, x'$, and for all events $S$ in the output space of $A$:

$$\Pr[A(x) \in S] \;\leq\; e^{\epsilon} \cdot \Pr[A(x') \in S].$$
This definition has the flavor of indistinguishability in cryptography: it states that the random variables $A(x)$ and $A(x')$ must have similar distributions. The difference from the usual cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.
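To make the definition concrete, here is a small numeric illustration (my own, not from the post): randomized response on a single bit, a classic $\epsilon$-differentially private mechanism. The function names are mine; the check confirms that the multiplicative bound of Definition 1 holds with equality for this mechanism.

```python
import math

def randomized_response_dist(bit, eps):
    """Return {output: probability} for one person's report: the true bit
    is reported with probability e^eps / (1 + e^eps), the flipped bit
    otherwise. This mechanism is eps-differentially private."""
    p_true = math.exp(eps) / (1 + math.exp(eps))
    return {bit: p_true, 1 - bit: 1 - p_true}

def max_privacy_loss(eps):
    """Largest ratio Pr[A(x) = s] / Pr[A(x') = s] over the two neighboring
    inputs (bit 0 vs bit 1) and all outputs s; Definition 1 requires this
    to be at most e^eps."""
    d0 = randomized_response_dist(0, eps)
    d1 = randomized_response_dist(1, eps)
    return max(d0[s] / d1[s] for s in (0, 1))

eps = 0.5
assert max_privacy_loss(eps) <= math.exp(eps) + 1e-12
```

For randomized response the bound is tight: the worst-case ratio is exactly $e^{\epsilon}$.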
I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.
— 2. Sampling to Amplify Privacy —
To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.
Roughly, it states that given a $1$-differentially private algorithm, one can get an $\epsilon$-differentially private algorithm at the cost of shrinking the size of the data set by a factor of $\epsilon$.
Suppose $A$ is a $1$-differentially private algorithm that expects data sets from a domain $D$ as input. Consider a new algorithm $A^{\epsilon}$, which runs $A$ on a random subsample of points from its input:
Algorithm 2 (Algorithm $A^{\epsilon}$) On input $\epsilon \in (0,1)$ and a multi-set $x$ of elements of $D$:
- Construct a sub-multi-set $T \subseteq x$ by selecting each element of $x$ independently with probability $\epsilon$.
- Return $A(T)$.
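A minimal Python sketch of Algorithm 2 (the name `subsample_and_run` is mine; `base_alg` plays the role of $A$, and here the stand-in just reports the subsample size):

```python
import random

def subsample_and_run(base_alg, x, eps, rng=random):
    """Algorithm A^eps: keep each element of x independently with
    probability eps, then run the base algorithm on the subsample."""
    t = [item for item in x if rng.random() < eps]
    return base_alg(t)

# Example with a stand-in base algorithm (len): the subsample of a
# 10,000-element input with eps = 0.1 has expected size 1,000.
random.seed(0)
sample_size = subsample_and_run(len, list(range(10_000)), 0.1)
```

If `base_alg` is itself $1$-differentially private, the composed algorithm enjoys the amplified guarantee of the lemma below.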
Lemma 3 (Amplification via sampling) If $A$ is $1$-differentially private, then for any $\epsilon \in (0,1)$, $A^{\epsilon}$ is $2\epsilon$-differentially private.
Proof: Fix an event $S$ in the output space of $A$, and two data sets $x, x'$ that differ by a single individual, say $x' = x \cup \{i\}$.
Consider a run of $A^{\epsilon}$ on input $x'$. If $i$ is not included in the sample $T$, then the output is distributed the same as a run of $A^{\epsilon}$ on $x$, since the inclusion of $i$ in the sample is independent of the inclusion of the other elements. On the other hand, if $i$ is included in the sample $T$, then the behavior of $A$ on $T$ is only a factor of $e$ off from the behavior of $A$ on $T \setminus \{i\}$. Again, because of independence, the distribution of $T \setminus \{i\}$ is the same as the distribution of $T$ conditioned on the omission of $i$. For a multi-set $y$, let $p_y$ denote the distribution of $A(y)$. In symbols, we have that for any event $S$:

$$\Pr[A^{\epsilon}(x') \in S \mid i \notin T] = \Pr[A^{\epsilon}(x) \in S] \qquad\text{and}\qquad \Pr[A^{\epsilon}(x') \in S \mid i \in T] \leq e \cdot \Pr[A^{\epsilon}(x) \in S],$$

since $p_T(S) \leq e \cdot p_{T \setminus \{i\}}(S)$ for every fixed sample $T$ containing $i$.
We can put the pieces together, using the fact that $i$ is in $T$ with probability only $\epsilon$:

$$\begin{aligned} \Pr[A^{\epsilon}(x') \in S] &= \epsilon \cdot \Pr[A^{\epsilon}(x') \in S \mid i \in T] + (1-\epsilon) \cdot \Pr[A^{\epsilon}(x') \in S \mid i \notin T] \\ &\leq \epsilon \cdot e \cdot \Pr[A^{\epsilon}(x) \in S] + (1-\epsilon) \cdot \Pr[A^{\epsilon}(x) \in S] \\ &= \big(1 + \epsilon(e-1)\big) \cdot \Pr[A^{\epsilon}(x) \in S] \;\leq\; e^{2\epsilon} \cdot \Pr[A^{\epsilon}(x) \in S]. \end{aligned}$$
We can get a similar lower bound:

$$\Pr[A^{\epsilon}(x') \in S] \;\geq\; \epsilon \cdot e^{-1} \cdot \Pr[A^{\epsilon}(x) \in S] + (1-\epsilon) \cdot \Pr[A^{\epsilon}(x) \in S] \;=\; \big(1 - \epsilon(1-e^{-1})\big) \cdot \Pr[A^{\epsilon}(x) \in S] \;\geq\; e^{-2\epsilon} \cdot \Pr[A^{\epsilon}(x) \in S].$$

The last inequality uses the fact that $1 - \epsilon(1-e^{-1}) \geq e^{-2\epsilon}$ for $\epsilon \in (0,1)$.
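As I read the derivation, the two elementary facts it needs are $1 + \epsilon(e-1) \leq e^{2\epsilon}$ (for the upper bound) and $1 - \epsilon(1-e^{-1}) \geq e^{-2\epsilon}$ (for the lower bound), for $\epsilon \in (0,1)$. A quick numeric spot-check:

```python
import math

def inequalities_hold(eps):
    """Check the two elementary inequalities behind the 2*eps bound:
       1 + eps*(e - 1) <= e^(2*eps)   and
       1 - eps*(1 - 1/e) >= e^(-2*eps)."""
    upper_ok = 1 + eps * (math.e - 1) <= math.exp(2 * eps)
    lower_ok = 1 - eps * (1 - math.exp(-1)) >= math.exp(-2 * eps)
    return upper_ok and lower_ok

# Both hold across a fine grid of eps in (0, 1).
assert all(inequalities_hold(k / 1000) for k in range(1, 1000))
```

The first inequality follows from $e^{2\epsilon} \geq 1 + 2\epsilon$ and $e - 1 < 2$; the check above confirms both numerically.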
Notes on the proof: (a) I did not attempt to optimize constants. (b) We don’t need a lower bound on $p_{T \cup \{i\}}(S)$ in terms of $p_T(S)$, only an upper bound. That is, it suffices that adding elements to a set should not cause big jumps in probability — a sudden drop is ok. That extra generality was actually needed in the analysis of the private learner for parity in [KLNRS’08]. (c) A similar amplification could work starting from a $t$-differentially private algorithm for any $t > 1$, but the parameters would get worse very quickly when $t$ is much larger than 1. One would have to retain elements with probability only about $\epsilon / e^{t}$, meaning that almost nothing would be left when $t \gg \ln n$, where $n$ is the size of the data set.
— 3. Interpretation: The Secrecy of the Sample —
The amplification lemma works by sampling a random subset of the input and so it makes the most sense to apply it in settings where the input is itself a random sample from some underlying population. The price, then, of amplifying is that the effective input size shrinks. This simplifies design at the cost of throwing away data. In settings where data is cheap (think streaming algorithms), the tradeoff makes sense.
There is a different interpretation, however, which I like more. It says that if you are doing a survey and you can reasonably expect that the sample itself will remain secret, then you get $\epsilon$-differential privacy for very small $\epsilon$ essentially for free. This formalizes a common intuition among, say, statisticians at the census bureau, that the very uncertainty about which members of the US population were surveyed (for long form data) provides a large degree of protection.
— 4. Update: Proving the lemma with a weaker hypothesis —
Omkant in the comments asked about proving the lemma with only an upper bound on $p_{T \cup \{i\}}(S)$ in terms of $p_T(S)$. The idea is that the trivial lower bound of 0 suffices:

$$\Pr[A^{\epsilon}(x') \in S] \;\geq\; \epsilon \cdot 0 + (1-\epsilon) \cdot \Pr[A^{\epsilon}(x) \in S] \;\geq\; e^{-2\epsilon} \cdot \Pr[A^{\epsilon}(x) \in S].$$
The last inequality holds as long as $\epsilon$ is less than a particular constant (roughly $0.797$, the positive solution of $1 - \epsilon = e^{-2\epsilon}$). This leads to a slight weakening of the lemma statement, in exchange for a weaker hypothesis.
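The constant comes from solving $1 - \epsilon = e^{-2\epsilon}$ (as I reconstruct the bound); a quick bisection (my own check, not from the post) locates the root near $0.797$:

```python
import math

# Find the positive root of 1 - eps = e^(-2*eps) by bisection; below this
# threshold the inequality (1 - eps) >= e^(-2*eps) holds.
lo, hi = 0.5, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if 1 - mid >= math.exp(-2 * mid):
        lo = mid  # inequality still holds here: the root is to the right
    else:
        hi = mid

print(round(lo, 3))  # roughly 0.797
```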