# Oddly Shaped Pegs

An inquiry into the Nature and Causes of Stuff

## Predicate Privacy, Symmetric Keys and Obfuscation

Why are there so few blogs about research problems in crypto? For one thing, cryptographers are notoriously cagey about their open questions. (One could come up with several unflattering conjectures for why this is the case; perhaps the simplest one is that paranoia is a job qualification for cryptographers…) This post is a meager attempt to counter the trend.

At TCC 2009, Shen, Shi and Waters (SSW) presented a paper on predicate privacy in predicate encryption schemes. In this post, I wanted to point out some open questions implicit in their work, and a connection to program obfuscation.

Roughly, predicate encryption allow the decryptor to delegate a limited part of the decryption work to someone else. First, let’s consider the public-key version: Bob generates a public/secret key pair $(pk,sk)$ and publishes $pk$, which anybody can use to send him a message. Using the corresponding secret key $sk$, he can read the plaintext of these messages.

Now suppose that Bob wants to let his mail server partially decrypt his incoming mail, just enough to determine whether to route messages to Bob’s phone or to his desktop. For example, he might want to hand the mail server (Hal) a key that allows Hal to determine whether the word “urgent” appears in the email subject line. A predicate encryption scheme supports a class $C$ of predicates on the plaintext space if, for every predicate $p\in C$, Bob can generate a subkey $k_p$ that allows to Hal to compute $p(m)$ from a valid encryption of a message $m$. However, Hal should learn nothing else about the plaintext, that is, the encryptions of message pairs $m,m'$ with the same value of $p$
should remain indistinguishable.

There has been a flurry of recent work enlarging the classes of predicates  for which predicate encryption is possible. The eventual goal would be a scheme which allows creating keys for any efficiently computable predicate. This post is not about those works.

In a different direction, Shen, Shi and Waters pointed out an important caveat about the definition (and existing constructions) of predicate encryption: Hal learns the predicate $p$ itself. This might not be desirable: for example, I do not want to publicize the rules I use for prioritizing email. They set out to remedy this by constructing of a symmetric-key predicate encryption scheme. If Alice and Bob share a symmetric key $k$, the SSW scheme allows either of them to generate a key $k_p$ such that Hal, given only $k_p$, learns nothing about $p$, and such that encrypted messages with the same predicate value remain indistinguishable to Hal. Their constructions works for a family of inner product predicates.

Why the symmetric-key version?  Well, as SSW point out, the restriction is in some sense necessary. In a public key scheme, Hal learns both $k_p$ and the public key $pk$. Hence, he can encrypt messages of his choice and evaluate $p$ on the resulting ciphertext, thus giving himself oracle access to $p$. For simple classes of predicates (e.g. inner products modulo 2), this type of oracle access allows one to learn the predicate $p$ completely.

“Predicate privacy,” they conclude,  “is inherently impossible to achieve in the public-key setting.”

There are at least two good reasons to be skeptical about SSW’s conclusion, namely that symmetric-key schemes are the way to go for predicate privacy:

1. The impossibility result just mentioned is really about obfuscation: public-key predicate encryption schemes can, at best, provide the level of security that is possible for program obfuscation. Namely, that access to $k_p$ and $pk$ should allow one to learn no more about $p$ than one could from oracle access to $p$. Barak, Goldreich, Impagliazzo, Rudich, Sahai, Vadhan and Yang showed that even this type of security is impossible for many classes of predicates. On the other hand, some classes of predicates are obfuscatable. Learnable classes of predicates (such as the inner products mod 2 example above) are trivially obfuscatable since oracle access allows you to learn the predicate. More interestingly, point functions, which allow you to test if the input is equal to a particular value, can be obfuscated in the random oracle model, or assuming the existence of a very hard one-way permutation.
2. Even in the symmetric-key setting, these considerations are important. The problem is that an attacker might know, or be able to influence, the email that Alice sends to Bob. If the attacker can choose Alice’s messages, we say he is mounting a chosen-plaintext attack (CPA). For example, in the second world war, allied cryptographers famously were able to learn the location of an impending Japanese attack by broadcasting fake news of a water shortage in the Midway Islands, in the clear, and watching the resulting Japanese communications. Thus, the standard model of security for symmetric-key crypto includes, at the least, access to an encryption oracle by the adversary. In the context of predicate privacy, a CPA attack gives Hal access to an oracle for the predicate $p$.

The relationship to program obfuscation is less clear in the symmetric-key setting, since Hal must still use an oracle, which knows the secret key, to access $p$ and hence he never obtains a description of a small circuit for $p$. In particular, the impossibility results of Barak et al. do not apply.

The observations above lead to some natural open questions (which I may say more about in a future post, if the comments don’t get there first).

Open questions

• What kind of security does the SSW provide in the face of a CPA attack? (My guess: oracle access to the inner product predicate; note that such predicates are not trivially learnable over large fields, where the predicate tells only if the inner product is 0 or non-zero.)
• Is predicate privacy with CPA security impossible in general, much the same way that obfuscation is impossible in general?  (My guess: it is possible in general. In a recent conversation, Guy Rothblum guessed that it is impossible.)
• Are there interesting classes of predicates for which public-key predicate privacy is possible? (My guess: yes).

June 30, 2009 at 11:03 pm

## Crypto and politics

While the interactions between cryptography and politics are usually far from the public eye, the crackdown in Iran has given anonymization and covert communication technologies (ironically!) a chance to shine. For a layman’s introduction, see this Irish Times article and this Washington Post one (via Michael M.). I’d appreciate people using the comments to post links to articles with more technical analysis of what is currently going.

The geek in me is curious to see how the conflict between surveillers and surveillance evaders will play out in Iran; but most of me is just plain angry. This blog is now green…

June 26, 2009 at 2:15 am

Posted in Uncategorized

## IPAM Workshop on Privacy and Statistical (Learning) Theory

I am on the organizing committee (with Cynthia Dwork, Steve Fienberg, and Sesa Slavkovic) for an upcoming workshop at UCLA’s Institute for Pure and Applied Mathematics (IPAM). The workshop is on the relationship between database privacy and the theoretical foundations of statistics and machine learning. It is imaginatively titled:

Statistical and Learning-Theoretic Challenges in Data Privacy
(February 22-26, 2010)

(because the catchier “What Can We Learn Privately?” was already taken).

The workshop web page describes the basic thrust pretty concisely:

The goal of workshop is to establish a coherent theoretical foundation for research on data privacy. This implies work on (1) how the conflicting goals of privacy and utility can or should be formulated mathematically; and (2) how the constraints of privacy—in their various incarnations—affect the accuracy of statistical inference and machine learning. In particular, the goal is to shed light on the interplay between privacy and concepts such as consistency and efficiency of estimators, generalization error of learning, robustness and stability of estimation algorithms, and the generation of synthetic data.

The workshop is born of (what I consider) an exciting research program with potential payoffs both for how sensitive data is managed (see, e.g., Abe Flaxman’s post on a recommendation for HIPAA’s overhaul) as well as statistics and statistical learning theory. For more detailed discussion, see:

Participation is open to essentially anyone; to make it easier, IPAM has funding to help some attendees with their travel costs, especially students and other junior researchers. You can apply through the IPAM web page.

Several excellent researchers have already confirmed that they will speak (see the web page for the current list). I am especially happy about the breadth of the areas they hail from: crypto, algorithms, social science statistics, nonparametric statistics, theoretical and applied machine learning, and health data privacy, among others.  Of special note, there will be four tutorials aimed at helping the diverse audience actually communicate:

• Larry Wasserman and Martin Wainwright will speak about the basic foundations of statistics and statistical learning theory;
• Two other people (possibly Cynthia Dwork and myself) will discuss the definitional approaches to privacy that have come out of the CS literature, especially differential privacy, and also the worst-case analysis perspective that is common to TCS papers.

The exact format and content of the tutorials is still t.b.d., so suggestions (either directly to me or via comments on this post) would be welcome.

### Why the workshop?

Good interdisciplinary work is notoriously hard. The first barrier is linguistic: new terminology, definitions, measures of accuracy/loss, etc (“like a U.N. meeting without the benefit of translators”, as Dick Lipton recently put it, describing some of Blum and Furst’s initial interactions with AI folks). Nevertheless, the terminology barrier can be overcome relatively easily (i.e., on a scale of months or years) in theoretical fields with clean definitions and theorems, such as theoretical statistics and learning.

The more subtle barrier, and one that usually takes much more work to overcome, is one of perspective. Merely using the right language will get you partway, but “the wider point of view of [people in other fields] can be harder to grok” (Noam Nisan). What problem are they really trying to solve? What is their criterion for evaluating the quality or interest of a new idea? To add to the confusion, a field that looks monolithic from the outside may in fact be a highly heterogeneous mix of subdisciplines. For example, the question “what do statisticians actually (need to) do?”, which many of us working on data privacy have wondered aloud, has approximately as many answers as there are statisticians doing things…

As far as I can tell, these perspective barriers are best overcome by osmosis: spend as much time as possible interacting with a wide variety of people from the other field. I think theoretical work provides an especially fruitful venue for interaction because its products (namely definitions and theorems) are more universally interpretable. Of course, this opinion may simply be a result of my own preference for theoretical work…

So how does one get these interactions going? External stimuli, like deadlines for collaborative grant proposals, can help, but grant proposals require people who are already committed to working together. Workshops and conferences are also an important venue. Regarding data privacy, there were several successful workshops encouraging CS-statistics interaction: Bertinoro in 2005, NSF in 2007, CMU in 2007, DIMACS in 2008, NCHS in 2008 (no more web pages for two of those, unfortunately). The upcoming IPAM workshop is the first with an explicitly theoretical bent; I am hoping it will be an even greater success.