# Oddly Shaped Pegs

An inquiry into the Nature and Causes of Stuff

## Differential privacy and the secrecy of the sample

(This post was laid out lazily, using Luca‘s lovely latex2wp.)

— 1. Differential Privacy —

Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.

The idea behind the definition is that users–that is, people getting access to aggregate information–should not be able to tell if a given individual’s data has been changed.

More formally, a data set is just a subset of items in a domain ${D}$. For a given data set ${x\subset D}$, we think of the server holding the data as applying a randomized algorithm ${A}$, producing a random variable ${A(x)}$ (distributed over vectors, strings, charts, or whatever). We say two data sets ${x,x'}$ are neighbors if they differ in one element, that is, ${x\ \triangle\ x' =1}$.

Definition 1 A randomized algorithm ${A}$ is ${\epsilon}$-differentially private if, for all pairs of neighbor data sets ${x,x'}$, and for all events ${S}$ in the output space of ${A}$: $\displaystyle \Pr(A(x)\in S) \leq e^\epsilon \Pr(A(x')\in S\,.$

This definition has the flavor of indistinguishability in cryptography: it states that the random variables ${A(x)}$ and ${A(x')}$ must have similar distributions. The difference with the normal cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.

I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.

— 2. Sampling to Amplify Privacy —

To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.

Roughly it states that given a ${O(1)}$-differentially private algorithm, one can get an ${\epsilon}$-differentially private algorithm at the cost of shrinking the size of the data set by a factor of ${\epsilon}$.

Suppose ${A}$ is a ${1}$-differentially private algorithm that expects data sets from a domain ${D}$ as input. Consider a new algorithm ${A'}$, which runs ${A}$ on a random subsample of ${ \approx\epsilon n}$ points from its input:

Algorithm 2 (Algorithm ${A'}$) On input ${\epsilon \in (0,1 )}$ and a multi-set ${x\subseteq D}$

1. Construct a set ${T\subseteq x}$ by selecting each element of ${x}$ independently with probability ${\epsilon}$.
2. Return ${A(T)}$.

Lemma 3 (Amplification via sampling) If ${A}$ is ${1}$-differentially private, then for any ${\epsilon\in(0,1)}$, ${A'(\epsilon,\cdot)}$ is ${2\epsilon}$-differentially private.

Proof: Fix an event ${S}$ in the output space of ${A'}$, and two data sets ${x,x'}$ that differ by a single individual, say ${x=x'\cup \{i\}}$.

Consider a run of ${A'}$ on input ${x}$. If ${i}$ is not included in the sample ${T}$, then the output is distributed the same as a run of ${A'}$ on ${x'=x\setminus\{i\}}$, since the inclusion of ${i}$ in the sample is independent of the inclusion of other elements. On the other hand, if ${i}$ is included in the sample ${T}$, then the behavior of ${A}$ on ${T}$ is only a factor of ${e}$ off from the behavior of ${A}$ on ${T\setminus\{i\}}$. Again, because of independence, the distribution of ${T\setminus\{i\}}$ is the same as the distribution of ${T}$ conditioned on the omission of ${i}$. For a set ${T\subseteq D}$, let ${p_T}$ denote the distribution of ${A(T)}$. In symbols, we have that for any event ${S}$: $\displaystyle p_{x}(S\mid i\not\in T) = p_{x'}(S) \quad\text{and}\quad p_{x}(S\mid i\in T) \in e^{\pm 1}p_{x'}(S)\,.$

We can put the pieces together, using the fact that ${i}$ is in ${T}$ with probability only ${\epsilon}$: $\displaystyle \begin{array}{rcl} p_x(S) &=& (1-\epsilon)\cdot p_{x}(S\mid i\not\in T) + \epsilon\cdot p_{x}(S\mid i\in T) \\ &\leq& (1-\epsilon) \cdot p_{x'}(S) + \epsilon \cdot e \cdot p_{x'}(S)\\ &=& (1+\epsilon(e-1)) p_{x'}(S)\\& \leq& \exp(2\epsilon)\cdot p_{x'}(S) \end{array}$

We can get a similar lower bound: $\displaystyle \begin{array}{rcl} p_x(S) &=& (1-\epsilon)\cdot p_{x}(S\mid i\not\in T) + \epsilon\cdot p_{x}(S\mid i\in T) \\ &\geq& (1-\epsilon) \cdot p_{x'}(S) + \epsilon \cdot \frac1e \cdot p_{x'}(S)\\ &=& (1-\epsilon(1-e^{-1})) \cdot p_{x'}(S)\\ &\geq& \exp(-\epsilon) \cdot p_{x'}(S) \end{array}$

The last inequality uses the fact that ${\epsilon\leq 1}$. $\Box$

Notes on the proof: (a) I did not attempt to optimize constants. (b) We don’t need a lower bound on ${\Pr(A(T)\in S)}$ in terms of ${\Pr(A(T\setminus\{i\})\in S)}$, only an upper bound. That is, it suffices that adding elements to a set should not cause big jumps in probability — a sudden drop is ok. That extra generality was actually needed in the analysis of the private learner for parity in [KLNRS’08]. (c) A similar amplification could work starting from a ${c}$-differentiall private algorithm for any ${c}$, but the parameters would get worse very quickly when ${c}$ is much larger than 1. One would have to retain elements with probability only ${\epsilon\cdot e^{-c}}$, meaning that almost nothing would be left when ${c>\ln(n)}$.

— 3. Interpretation: The Secrecy of the Sample —

The amplification lemma works by sampling a random subset of the input and so it makes the most sense to apply it in settings where the input is itself a random sample from some underlying population. The price, then, of amplifying ${\epsilon}$ is that the effective input size shrinks. This simplifies design at the cost of throwing away data. In settings where data is cheap (think streaming algorithms), the tradeoff makes sense.

There is a different interpretation, however, which I like more. It says that if you are doing a survey and you can reasonably expect that the sample itself will remain secret, then you get ${\epsilon}$-differential privacy for very small ${\epsilon}$ essentially for free. This formalizes a common intuition among, say, statisticians at the census bureau, that the very uncertainty about which members of the US population were surveyed (for long form data) provides a large degree of protection.

— 4. Update: Proving the lemma with a weaker hypothesis —

Omkant in the comments asked about proving the lemma with only an upper bound on ${p_x(S)}$ in terms of ${p_{x'}(S)}$. The idea is that the trivial lower bound of 0 suffices: $\displaystyle \begin{array}{rcl} p_x(S) &=& (1-\epsilon)\cdot p_{x}(S\mid i\not\in T) + \epsilon\cdot p_{x}(S\mid i\in T) \\ &\geq& (1-\epsilon) \cdot p_{x'}(S) + \epsilon \cdot \mathbf{0}\\ &=& (1-\epsilon) \cdot p_{x'}(S)\\ &\geq& \exp(-2\epsilon) \cdot p_{x'}(S) \end{array}$

The last inequality holds as long as ${\epsilon}$ is less than a particular constant (roughly ${0.79}$). This leads to a slight weakening of the lemma statement, in exchange for a weaker hypothesis.

September 2, 2009 at 12:19 pm

### 5 Responses

1. Quite cool. The “secrecy of the sample” interpretation is actually what I liked the most. Btw, could you elaborate more on why we don’t need the lower-bound to conclude the lemma?

Omkant

September 7, 2009 at 5:28 am

• Updated to add the details.

September 7, 2009 at 8:44 am

2. Ahh… I believe I got confused with the statement “we don’t need lower-bound….”. You were saying that we don’t need to use the lower-bound coming from A being 1-dp (so 1/e * pr) to conclude the other direction. I took it as if we did not need to conclude the lower-bound itself in your proof of Lemma 3.

Omkant

September 7, 2009 at 5:46 pm