# Oddly Shaped Pegs

An inquiry into the Nature and Causes of Stuff

## Differential privacy and the secrecy of the sample

(This post was laid out lazily, using Luca‘s lovely latex2wp.)

— 1. Differential Privacy —

Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.

The idea behind the definition is that users–that is, people getting access to aggregate information–should not be able to tell if a given individual’s data has been changed.

More formally, a data set is just a subset of items in a domain ${D}$. For a given data set ${x\subset D}$, we think of the server holding the data as applying a randomized algorithm ${A}$, producing a random variable ${A(x)}$ (distributed over vectors, strings, charts, or whatever). We say two data sets ${x,x'}$ are neighbors if they differ in one element, that is, ${x\ \triangle\ x' =1}$.

Definition 1 A randomized algorithm ${A}$ is ${\epsilon}$-differentially private if, for all pairs of neighbor data sets ${x,x'}$, and for all events ${S}$ in the output space of ${A}$:

$\displaystyle \Pr(A(x)\in S) \leq e^\epsilon \Pr(A(x')\in S\,.$

This definition has the flavor of indistinguishability in cryptography: it states that the random variables ${A(x)}$ and ${A(x')}$ must have similar distributions. The difference with the normal cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.

I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.

— 2. Sampling to Amplify Privacy —

To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.

Roughly it states that given a ${O(1)}$-differentially private algorithm, one can get an ${\epsilon}$-differentially private algorithm at the cost of shrinking the size of the data set by a factor of ${\epsilon}$.

Suppose ${A}$ is a ${1}$-differentially private algorithm that expects data sets from a domain ${D}$ as input. Consider a new algorithm ${A'}$, which runs ${A}$ on a random subsample of ${ \approx\epsilon n}$ points from its input:

Algorithm 2 (Algorithm ${A'}$) On input ${\epsilon \in (0,1 )}$ and a multi-set ${x\subseteq D}$

1. Construct a set ${T\subseteq x}$ by selecting each element of ${x}$ independently with probability ${\epsilon}$.
2. Return ${A(T)}$.

Lemma 3 (Amplification via sampling) If ${A}$ is ${1}$-differentially private, then for any ${\epsilon\in(0,1)}$, ${A'(\epsilon,\cdot)}$ is ${2\epsilon}$-differentially private.

September 2, 2009 at 12:19 pm

## Predicate Privacy, Symmetric Keys and Obfuscation

Why are there so few blogs about research problems in crypto? For one thing, cryptographers are notoriously cagey about their open questions. (One could come up with several unflattering conjectures for why this is the case; perhaps the simplest one is that paranoia is a job qualification for cryptographers…) This post is a meager attempt to counter the trend.

At TCC 2009, Shen, Shi and Waters (SSW) presented a paper on predicate privacy in predicate encryption schemes. In this post, I wanted to point out some open questions implicit in their work, and a connection to program obfuscation.

Roughly, predicate encryption allow the decryptor to delegate a limited part of the decryption work to someone else. First, let’s consider the public-key version: Bob generates a public/secret key pair $(pk,sk)$ and publishes $pk$, which anybody can use to send him a message. Using the corresponding secret key $sk$, he can read the plaintext of these messages.

Now suppose that Bob wants to let his mail server partially decrypt his incoming mail, just enough to determine whether to route messages to Bob’s phone or to his desktop. For example, he might want to hand the mail server (Hal) a key that allows Hal to determine whether the word “urgent” appears in the email subject line. A predicate encryption scheme supports a class $C$ of predicates on the plaintext space if, for every predicate $p\in C$, Bob can generate a subkey $k_p$ that allows to Hal to compute $p(m)$ from a valid encryption of a message $m$. However, Hal should learn nothing else about the plaintext, that is, the encryptions of message pairs $m,m'$ with the same value of $p$
should remain indistinguishable.

There has been a flurry of recent work enlarging the classes of predicates  for which predicate encryption is possible. The eventual goal would be a scheme which allows creating keys for any efficiently computable predicate. This post is not about those works.

In a different direction, Shen, Shi and Waters pointed out an important caveat about the definition (and existing constructions) of predicate encryption: Hal learns the predicate $p$ itself. This might not be desirable: for example, I do not want to publicize the rules I use for prioritizing email. They set out to remedy this by constructing of a symmetric-key predicate encryption scheme. If Alice and Bob share a symmetric key $k$, the SSW scheme allows either of them to generate a key $k_p$ such that Hal, given only $k_p$, learns nothing about $p$, and such that encrypted messages with the same predicate value remain indistinguishable to Hal. Their constructions works for a family of inner product predicates.

Why the symmetric-key version?  Well, as SSW point out, the restriction is in some sense necessary. In a public key scheme, Hal learns both $k_p$ and the public key $pk$. Hence, he can encrypt messages of his choice and evaluate $p$ on the resulting ciphertext, thus giving himself oracle access to $p$. For simple classes of predicates (e.g. inner products modulo 2), this type of oracle access allows one to learn the predicate $p$ completely.

“Predicate privacy,” they conclude,  “is inherently impossible to achieve in the public-key setting.”

There are at least two good reasons to be skeptical about SSW’s conclusion, namely that symmetric-key schemes are the way to go for predicate privacy:

1. The impossibility result just mentioned is really about obfuscation: public-key predicate encryption schemes can, at best, provide the level of security that is possible for program obfuscation. Namely, that access to $k_p$ and $pk$ should allow one to learn no more about $p$ than one could from oracle access to $p$. Barak, Goldreich, Impagliazzo, Rudich, Sahai, Vadhan and Yang showed that even this type of security is impossible for many classes of predicates. On the other hand, some classes of predicates are obfuscatable. Learnable classes of predicates (such as the inner products mod 2 example above) are trivially obfuscatable since oracle access allows you to learn the predicate. More interestingly, point functions, which allow you to test if the input is equal to a particular value, can be obfuscated in the random oracle model, or assuming the existence of a very hard one-way permutation.
2. Even in the symmetric-key setting, these considerations are important. The problem is that an attacker might know, or be able to influence, the email that Alice sends to Bob. If the attacker can choose Alice’s messages, we say he is mounting a chosen-plaintext attack (CPA). For example, in the second world war, allied cryptographers famously were able to learn the location of an impending Japanese attack by broadcasting fake news of a water shortage in the Midway Islands, in the clear, and watching the resulting Japanese communications. Thus, the standard model of security for symmetric-key crypto includes, at the least, access to an encryption oracle by the adversary. In the context of predicate privacy, a CPA attack gives Hal access to an oracle for the predicate $p$.

The relationship to program obfuscation is less clear in the symmetric-key setting, since Hal must still use an oracle, which knows the secret key, to access $p$ and hence he never obtains a description of a small circuit for $p$. In particular, the impossibility results of Barak et al. do not apply.

The observations above lead to some natural open questions (which I may say more about in a future post, if the comments don’t get there first).

Open questions

• What kind of security does the SSW provide in the face of a CPA attack? (My guess: oracle access to the inner product predicate; note that such predicates are not trivially learnable over large fields, where the predicate tells only if the inner product is 0 or non-zero.)
• Is predicate privacy with CPA security impossible in general, much the same way that obfuscation is impossible in general?  (My guess: it is possible in general. In a recent conversation, Guy Rothblum guessed that it is impossible.)
• Are there interesting classes of predicates for which public-key predicate privacy is possible? (My guess: yes).