Differential privacy and the secrecy of the sample

(This post was laid out lazily, using Luca’s lovely latex2wp.)

— 1. Differential Privacy —

Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.

The idea behind the definition is that users–that is, people getting access to aggregate information–should not be able to tell if a given individual’s data has been changed.

More formally, a data set is just a subset of items in a domain {D}. For a given data set {x\subset D}, we think of the server holding the data as applying a randomized algorithm {A}, producing a random variable {A(x)} (distributed over vectors, strings, charts, or whatever). We say two data sets {x,x'} are neighbors if they differ in one element, that is, {x\ \triangle\ x' =1}.

Definition 1 A randomized algorithm {A} is {\epsilon}-differentially private if, for all pairs of neighbor data sets {x,x'}, and for all events {S} in the output space of {A}:

\displaystyle \Pr(A(x)\in S) \leq e^\epsilon \Pr(A(x')\in S\,.

This definition has the flavor of indistinguishability in cryptography: it states that the random variables {A(x)} and {A(x')} must have similar distributions. The difference with the normal cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.

I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.

— 2. Sampling to Amplify Privacy —

To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.

Roughly it states that given a {O(1)}-differentially private algorithm, one can get an {\epsilon}-differentially private algorithm at the cost of shrinking the size of the data set by a factor of {\epsilon}.

Suppose {A} is a {1}-differentially private algorithm that expects data sets from a domain {D} as input. Consider a new algorithm {A'}, which runs {A} on a random subsample of { \approx\epsilon n} points from its input:

Algorithm 2 (Algorithm {A'}) On input {\epsilon \in (0,1 )} and a multi-set {x\subseteq D}

  1. Construct a set {T\subseteq x} by selecting each element of {x} independently with probability {\epsilon}.
  2. Return {A(T)}.

Lemma 3 (Amplification via sampling) If {A} is {1}-differentially private, then for any {\epsilon\in(0,1)}, {A'(\epsilon,\cdot)} is {2\epsilon}-differentially private.

Continue reading

Locational Privacy

Leo Reyzin pointed out the following EFF article on locational privacy. The article does a reasonable job of explaining the role that new technology can have in both violating and protecting privacy.

I can’t count how many times I’ve had a similar conversation (ok, monologue) with government folks and non-technical friends. Maybe next time I can spare myself the trouble and email them the link.

Je m’en (af)fiche

Tucked neatly between (US) Independence Day and (French) Bastille Day this year was YESS, a joint French-US workshop for “young scientists and engineers”, held at the French embassy. This year’s theme was Identity Management, which largely meant biometrics, database privacy and anonymous communication. I wasn’t sure from the initial invitation how serious the whole thing was (note to organizers: explicit mention of freely flowing champagne in a workshop invitation is appealing, but a little suspicious), but it turned out to be a great workshop.

Ok, almost great. There were a number of program/department overviews by French and US officials, some of which made me feel like Bill Gasarch did in this picture. (To the speakers’ credit, no one made jokes about freedom fries even though fries were served in the Embassy cafeteria.) But the scientific talks were interesting, and I got to meet a cross-section of French researchers I don’t normally interact with. I also learned a lot about Tor and it’s role in Iran from Roger Dingledine.

I walked away from “YESS” with a few distinct impressions.

  • It is extremely difficult to give a program or departmental overview that is fun to listen to.
  • The French are a few years into the process of switching to a competitive, proposal-driven mode of funding research.
  • A significant fraction of the biometrics/security community read the fuzzy extractor papers and got the point (in a nutshell: it is possible to be rigorous about the security properties of biometric systems). Unfortunately, that fraction is a lot less than 1.
  • Making scientific posters is a questionable use of anyone’s time.

The first point is self-explanatory, I’ll save the second for a future discussion and I haven’t figured out what to make of the third.

But the posters! Called post-ère in French (why not affiche? Nobody knew.), the scientific poster lies somewhere between talk slides and a regular written article. In its ideal form it provides both a quick overall impression of a work, for the casual passer-by, and more in-depth explanations, for the viewer with more time and patience. In practice, it is  hacked together in a hurry and provides neither. For a typical failure, see my own feeble attempt here.

However, the real problem with posters is that even the good ones don’t seem to get read. It is almost always more compelling to listen to a bad talk than to read a good poster. What gives? By all rights, a poster session should be more interesting than a typical session of talks since you can allocate your time and attention more flexibly. But it doesn’t work — posters just don’t seem to make people care.

What is the best format for allowing everyone to get a short overviews of all the papers, and also have the option of learning more about a given paper? Is the web pushing science into a post-post-ère era? Or do I just need to learn to read?

I am troubled by this since the IPAM workshop I’m co-organizing will have time for a relatively small number of talks, but we want to give all attendees a chance to present their ideas. The initial plan was to have a poster session, but I am having my doubts. Maybe a rump session with many five-minute talks is more effective, or  perhaps a combination of the two (five-minute poster ads)?

Systematization of Knowledge track at Oakland

I am on the PC for “Oakland” this year (a.k.a. the IEEE Symposium on Security and Privacy).

I have been on the PC of a few conferences in areas outside my immediate expertise and so far I’ve enjoyed the experience. Usually, I am asked to join because they need someone to help them reject carefully review the few crypto/privacy papers that get submitted. Along the way, I get to learn about a different area of research, and about the taste in problems that is prevalent in the other community. Oakland is different because it is nominally about  my area, but the author community is essentially disjoint from the STOC/FOCS/Crypto/Eurocrypt crowd that I hang out with; consequently, the focus of the submissions is very different from that of the papers I am used to reading. I promise to share any (constructive…) comments I have on my experience.

Anyway, my main point: this year’s Oakland will include a new “systematization of knowledge” track. The call for papers says it all:

“The goal of this [track] is to encourage work that evaluates, systematizes, and contextualizes existing knowledge. These papers will provide a high value to our community but would otherwise not be accepted because they lack novel research contributions. Suitable papers include survey papers that provide useful perspectives on major research areas, papers that support or challenge long-held beliefs with compelling evidence, or papers that provide an extensive and realistic evaluation of competing approaches to solving specific problems. … [Submissions] will be reviewed by the full PC and held to the same standards as traditional research papers, except instead of emphasizing novel research contributions the emphasis will be on value to the community. Accepted papers will be presented at the symposium and included in the proceedings.”

Well-written regular papers already include a “systematization of knowledge” component in their related work sections: the obligation to summarize related papers often results in a clean, concise presentation of their high-level ideas. Unfortunately, the quality of the related work section rarely makes or breaks a conference submission, so mileage varies; hence the need for a separate track.

A few questions jump to mind:

  1. If this “systemization” track becomes standard, how will job candidates be viewed if their publication lists contain many such systemization papers? A successful textbook can dramatically increase a researcher’s profile; is the same true of survey papers?
  2. What areas of crypto/security/privacy are in direst need of “systemization”? Here a few suggestions for Oakland-appropriate topics:
    • definitions of security for anonymous communication systems (e.g. Tor)
    • techniques for de-anonymization of sanitized data (hopefully tying together papers published at Oakland, KDD, SIGMOD, VLDB, ICDE, etc)
    • Notions of “computation with encrypted data”: homomorphic encryption,predicate encryption, deterministic encryption, order-preserving encryption, etc.
  3. Assuming the Oakland systemization track is a success, what other conferences would benefit from adding such a track?

Insensitive attributes

And the award for best blog post title of the day goes to…

There is an Elephant in the Room; & Everyone’s Social Security Numbers are Written on Its Hide.

The post, by Richard Power, reports on an article by Alessandro Acquisti and Ralph Gross,Predicting Social Security numbers from public data”, (faq, PNAS paper) which highlights how one can narrow down a US citizen’s social security number to a relatively small range based only on his or her state and date of birth.

As the Social Security Administration explained (see the elephantine blog post linked above), this was not really a secret; the SSA’s algorithm for generating SSN’s is public. The virtue of the Acquisti-Gross article is in pointing out the security implications of this clearly.

One of the interesting notions the study puts to rest is the distinction between “insensitive” and “sensitive” attributes. Almost anything can be used to identify a person, and once someone has a handle on you it is remarkably easy to predict, or find out, even more.

Copyright, copywrong

Michaels Nielsen and Mitzenmacher pointed out a recent post by Harvard’s Stuart Shieber about the “don’t ask, don’t tell” policy that is the implicit norm in scholarly publications, at least in computer science.

“Publishers officially forbid online distribution, authors do it anyway without telling the publishers, and publishers don’t ask them to stop even though it violates contractual obligations. What happens when you refuse to play that game?”

I recommend reading the whole thing. Shieber does post his papers online, unlike many authors, he makes sure to attach an addendum to any copyright agreements with publishers to ensure that he is not in breach of contract. Publishers almost never complain, he says.

“In retrospect, this may make sense.  Since the contractual modification applies only to a single article by a single author, it is unlikely that anyone looking for copyright clearance would even know that all copyright hadn’t been assigned to the publisher.  And in any case publishers must realize that authors act as if they have a noncommercial distribution license…”

I will consider using the Science Commons addenda for future copyright agreements with publishers. But just to share my own story: When we submitted the final version of the fuzzy extractors paper to SICOMP (SIAM Journal on Computing), Leo Reyzin suggested we explicitly modify SIAM’s copyright agreement to make it a “publication agreement” that confers only non-exclusive publication rights to SIAM. The revised agreement let us retain all other publications rights, including free online distribution via sites of our choice. For my readers’ entertainment, here is our modified agreement with SIAM, which SIAM accepted without comment.

Finally, David Eppstein points out that free online journals make all the hassle so last century.

P.S.: For a great radio show about what people usually mean by “don’t ask, don’t tell”, listen to the June 16 episode of NPR’s Fresh Air, in which Terry Gross interviews Nathaniel Frank, author of Unfriendly Fire.

FOCS ‘09 crypto accepts

The FOCS 2009 accepted papers are posted, with abstracts. See the chair’s comments here, and other topic-specific discussions here, here and here. Despite some excellent submissions not making it in, there are still some (few!) crypto papers, all of which look interesting. In no particular order: 

  • Steven Myers and abhi shelat. One bit encryption is complete
  • Yi Deng, Vipul Goyal and Amit Sahai. Resolving the Simultaneous Resettability Conjecture and a New Non-Black-Box Simulation Strategy
  • Yuval Ishai, Eyal Kushilevitz, Rafail Ostrovsky and Amit Sahai. Extracting Correlations
  • Yael Tauman Kalai, Xin Li and Anup Rao. 2-Source Extractors Under Computational Assumptions and Cryptography with Defective Randomness

Quantum crypto papers:

Not exactly crypto, but highly relevant:

  • Iftach Haitner. A Parallel Repetition Theorem for Any Interactive Argument
  • Falk Unger. A Probabilistic Inequality with Applications to Threshold Direct Product Theorems

Scanning over the FOCS abstracts is hard because of information overload. I will try to read all of the papers on the list above (maybe I’ll even attend the conference) but for now two stand out because they resolve problems I have thought about:

First, Steve and abhi’s surprising paper (not yet available online), which gives a black-box construction of many-bit CCA-secure encryption from 1-bit CCA-secure encryption. This question is tied to very basic notions of authenticity and secrecy in cryptography. For CPA-secure encryption (that is, encryption secure against passive attacks), increasing the message length is straightforward: Goldwasser and Micali showed that encrypting each bit separately works (a classic example of a “hybrid” argument). However, for schemes that must resist active attacks, such as chosen-ciphertext attacks (CCA), bit-by-bit encryption fails miserably. Prior to this paper, there existed (limited) impossibility results, but no evidence that a black-box construction was possible.

Second, André and Iordanis’ paper on optimal strong quantum coin flipping. Information-theoretically secure quantum coin flipping was proved to be impossible in the late 90’s by Mayers and Lo and Chau, using the same techniques that rule out information-theoretically secure oblivious transfer and bit commitment. That result only rules out protocols that produce a very good biased coin (with bias 1/2+o(1)). However, protocols were constantly being proposed (and occasionally broken) which  produced weakly biased coins. This new paper gives a protocol matching Kitaev’s lower bound of 1/\sqrt{2}\approx 0.707 on the minimal bias. This is not of critical importance in practice, but it does elucidate one of the key phenomena which distinguish quantum from classical cryptographic protocols.

Predicate Privacy, Symmetric Keys and Obfuscation

Why are there so few blogs about research problems in crypto? For one thing, cryptographers are notoriously cagey about their open questions. (One could come up with several unflattering conjectures for why this is the case; perhaps the simplest one is that paranoia is a job qualification for cryptographers…) This post is a meager attempt to counter the trend.

At TCC 2009, Shen, Shi and Waters (SSW) presented a paper on predicate privacy in predicate encryption schemes. In this post, I wanted to point out some open questions implicit in their work, and a connection to program obfuscation.

Roughly, predicate encryption allow the decryptor to delegate a limited part of the decryption work to someone else. First, let’s consider the public-key version: Bob generates a public/secret key pair (pk,sk) and publishes pk, which anybody can use to send him a message. Using the corresponding secret key sk, he can read the plaintext of these messages.

Now suppose that Bob wants to let his mail server partially decrypt his incoming mail, just enough to determine whether to route messages to Bob’s phone or to his desktop. For example, he might want to hand the mail server (Hal) a key that allows Hal to determine whether the word “urgent” appears in the email subject line. A predicate encryption scheme supports a class C of predicates on the plaintext space if, for every predicate p\in C, Bob can generate a subkey k_p that allows to Hal to compute p(m) from a valid encryption of a message m. However, Hal should learn nothing else about the plaintext, that is, the encryptions of message pairs m,m' with the same value of p
should remain indistinguishable.

There has been a flurry of recent work enlarging the classes of predicates  for which predicate encryption is possible. The eventual goal would be a scheme which allows creating keys for any efficiently computable predicate. This post is not about those works.

In a different direction, Shen, Shi and Waters pointed out an important caveat about the definition (and existing constructions) of predicate encryption: Hal learns the predicate p itself. This might not be desirable: for example, I do not want to publicize the rules I use for prioritizing email. They set out to remedy this by constructing of a symmetric-key predicate encryption scheme. If Alice and Bob share a symmetric key k, the SSW scheme allows either of them to generate a key k_p such that Hal, given only k_p, learns nothing about p, and such that encrypted messages with the same predicate value remain indistinguishable to Hal. Their constructions works for a family of inner product predicates.

Why the symmetric-key version?  Well, as SSW point out, the restriction is in some sense necessary. In a public key scheme, Hal learns both k_p and the public key pk. Hence, he can encrypt messages of his choice and evaluate p on the resulting ciphertext, thus giving himself oracle access to p. For simple classes of predicates (e.g. inner products modulo 2), this type of oracle access allows one to learn the predicate p completely.

“Predicate privacy,” they conclude,  “is inherently impossible to achieve in the public-key setting.”

There are at least two good reasons to be skeptical about SSW’s conclusion, namely that symmetric-key schemes are the way to go for predicate privacy:

  1. The impossibility result just mentioned is really about obfuscation: public-key predicate encryption schemes can, at best, provide the level of security that is possible for program obfuscation. Namely, that access to k_p and pk should allow one to learn no more about p than one could from oracle access to p. Barak, Goldreich, Impagliazzo, Rudich, Sahai, Vadhan and Yang showed that even this type of security is impossible for many classes of predicates. On the other hand, some classes of predicates are obfuscatable. Learnable classes of predicates (such as the inner products mod 2 example above) are trivially obfuscatable since oracle access allows you to learn the predicate. More interestingly, point functions, which allow you to test if the input is equal to a particular value, can be obfuscated in the random oracle model, or assuming the existence of a very hard one-way permutation.
  2. Even in the symmetric-key setting, these considerations are important. The problem is that an attacker might know, or be able to influence, the email that Alice sends to Bob. If the attacker can choose Alice’s messages, we say he is mounting a chosen-plaintext attack (CPA). For example, in the second world war, allied cryptographers famously were able to learn the location of an impending Japanese attack by broadcasting fake news of a water shortage in the Midway Islands, in the clear, and watching the resulting Japanese communications. Thus, the standard model of security for symmetric-key crypto includes, at the least, access to an encryption oracle by the adversary. In the context of predicate privacy, a CPA attack gives Hal access to an oracle for the predicate p.

The relationship to program obfuscation is less clear in the symmetric-key setting, since Hal must still use an oracle, which knows the secret key, to access p and hence he never obtains a description of a small circuit for p. In particular, the impossibility results of Barak et al. do not apply.

The observations above lead to some natural open questions (which I may say more about in a future post, if the comments don’t get there first).

Open questions

  • What kind of security does the SSW provide in the face of a CPA attack? (My guess: oracle access to the inner product predicate; note that such predicates are not trivially learnable over large fields, where the predicate tells only if the inner product is 0 or non-zero.)
  • Is predicate privacy with CPA security impossible in general, much the same way that obfuscation is impossible in general?  (My guess: it is possible in general. In a recent conversation, Guy Rothblum guessed that it is impossible.)
  • Are there interesting classes of predicates for which public-key predicate privacy is possible? (My guess: yes).

Crypto and politics

While the interactions between cryptography and politics are usually far from the public eye, the crackdown in Iran has given anonymization and covert communication technologies (ironically!) a chance to shine. For a layman’s introduction, see this Irish Times article and this Washington Post one (via Michael M.). I’d appreciate people using the comments to post links to articles with more technical analysis of what is currently going.

The geek in me is curious to see how the conflict between surveillers and surveillance evaders will play out in Iran; but most of me is just plain angry. This blog is now green…

IPAM Workshop on Privacy and Statistical (Learning) Theory

I am on the organizing committee (with Cynthia Dwork, Steve Fienberg, and Sesa Slavkovic) for an upcoming workshop at UCLA’s Institute for Pure and Applied Mathematics (IPAM). The workshop is on the relationship between database privacy and the theoretical foundations of statistics and machine learning. It is imaginatively titled:

Statistical and Learning-Theoretic Challenges in Data Privacy
(February 22-26, 2010)

(because the catchier “What Can We Learn Privately?” was already taken).

The workshop web page describes the basic thrust pretty concisely:

The goal of workshop is to establish a coherent theoretical foundation for research on data privacy. This implies work on (1) how the conflicting goals of privacy and utility can or should be formulated mathematically; and (2) how the constraints of privacy—in their various incarnations—affect the accuracy of statistical inference and machine learning. In particular, the goal is to shed light on the interplay between privacy and concepts such as consistency and efficiency of estimators, generalization error of learning, robustness and stability of estimation algorithms, and the generation of synthetic data.

The workshop is born of (what I consider) an exciting research program with potential payoffs both for how sensitive data is managed (see, e.g., Abe Flaxman’s post on a recommendation for HIPAA’s overhaul) as well as statistics and statistical learning theory. For more detailed discussion, see:

Participation is open to essentially anyone; to make it easier, IPAM has funding to help some attendees with their travel costs, especially students and other junior researchers. You can apply through the IPAM web page.

Several excellent researchers have already confirmed that they will speak (see the web page for the current list). I am especially happy about the breadth of the areas they hail from: crypto, algorithms, social science statistics, nonparametric statistics, theoretical and applied machine learning, and health data privacy, among others.  Of special note, there will be four tutorials aimed at helping the diverse audience actually communicate:

  • Larry Wasserman and Martin Wainwright will speak about the basic foundations of statistics and statistical learning theory;
  • Two other people (possibly Cynthia Dwork and myself) will discuss the definitional approaches to privacy that have come out of the CS literature, especially differential privacy, and also the worst-case analysis perspective that is common to TCS papers.

The exact format and content of the tutorials is still t.b.d., so suggestions (either directly to me or via comments on this post) would be welcome.

Why the workshop?

Good interdisciplinary work is notoriously hard. The first barrier is linguistic: new terminology, definitions, measures of accuracy/loss, etc (“like a U.N. meeting without the benefit of translators”, as Dick Lipton recently put it, describing some of Blum and Furst’s initial interactions with AI folks). Nevertheless, the terminology barrier can be overcome relatively easily (i.e., on a scale of months or years) in theoretical fields with clean definitions and theorems, such as theoretical statistics and learning.

The more subtle barrier, and one that usually takes much more work to overcome, is one of perspective. Merely using the right language will get you partway, but “the wider point of view of [people in other fields] can be harder to grok” (Noam Nisan). What problem are they really trying to solve? What is their criterion for evaluating the quality or interest of a new idea? To add to the confusion, a field that looks monolithic from the outside may in fact be a highly heterogeneous mix of subdisciplines. For example, the question “what do statisticians actually (need to) do?”, which many of us working on data privacy have wondered aloud, has approximately as many answers as there are statisticians doing things…

As far as I can tell, these perspective barriers are best overcome by osmosis: spend as much time as possible interacting with a wide variety of people from the other field. I think theoretical work provides an especially fruitful venue for interaction because its products (namely definitions and theorems) are more universally interpretable. Of course, this opinion may simply be a result of my own preference for theoretical work…

So how does one get these interactions going? External stimuli, like deadlines for collaborative grant proposals, can help, but grant proposals require people who are already committed to working together. Workshops and conferences are also an important venue. Regarding data privacy, there were several successful workshops encouraging CS-statistics interaction: Bertinoro in 2005, NSF in 2007, CMU in 2007, DIMACS in 2008, NCHS in 2008 (no more web pages for two of those, unfortunately). The upcoming IPAM workshop is the first with an explicitly theoretical bent; I am hoping it will be an even greater success.