## Archive for the ‘**Data privacy**’ Category

## Tutorial Videos on and around Differential Privacy

http://dimacs.rutgers.edu/Workshops/DifferentialPrivacy/Slides/slides.html

The tutorial speakers covered connections between DP and a range of areas:

- Moritz Hardt: Differential private algorithms via learning theory
- Gerome Miklau: Query optimization techniques from the DB community
- Benjamin Pierce: Using PL techniques to automate and verify proofs of privacy
- Aaron Roth: Game-theoretic perspectives on privacy

- Slides (with corrections): http://www.cse.psu.edu/~asmith/talks/2012-08-21-crypto-tutorial.pdf
- Video (no corrections!): http://www.youtube.com/watch?v=dbWx62C5Q4o&list=PL3C6A9D61E40300E6&index=26

## DIMACS Workshop on Differential Privacy

Aaron Roth and I are running a 3 day interdisciplinary workshop on differential privacy at DIMACS (Rutgers), on October 24-26. This is immediately following FOCS, which is being held nearby, in downtown New Brunswick. The workshop will begin with a day of tutorials on differential privacy as understood in various communities (theory, databases, programming languages, and game theory), and will continue with two days of research talks and discussion.

Details of the workshop can be found here: http://dimacs.rutgers.edu/Workshops/DifferentialPrivacy/

(n.b.: some extra speakers have confirmed who are not yet on the web page).

As part of the program, we will also have a session of short (5-10 minute) talks from students, postdocs, and other interested parties. We encourage submission of abstracts for short talks. The solicitation is below.

See you all in October!

Aaron and Adam

**DIMACS Workshop on Differential Privacy across Computer Science**

October 24-26, 2012

(immediately after FOCS 2012)

Call for Abstracts — Short Presentations

The upcoming DIMACS workshop on differential privacy will feature invited talks by experts from a range of areas in computer science as well as short talks (5 to 10 minutes) by participants.

Participants interested in giving a short presentation should send an email to asmith+dimacs@psu.edu containing a proposed talk title, abstract, and the speaker’s name and affiliation. We will try to

accommodate as many speakers as possible, but

a) requests received before October 1 will get full consideration

b) priority will be given to junior researchers, so students and postdocs should indicate their status in the email.

More information about the workshop:

The last few years have seen an explosion of results concerning differential privacy across many distinct but overlapping communities in computer science: Theoretical Computer Science, Databases, Programming Languages, Machine Learning, Data Mining, Security, and Cryptography. Each of these different areas has different priorities and techniques, and despite very similar interests, motivations, and choice of problems, it has become difficult to keep track of this large literature across so many different venues. The purpose of this workshop is to bring researchers in differential privacy across all of these communities together under one roof to discuss recent results and synchronize our understanding of the field. The first day of the workshop will include tutorials, representing a broad cross-section of research across fields. The remaining days will be devoted to talks on the exciting recent results in differential privacy across communities, discussion and formation of interesting open problems, and directions for potential inter-community collaborations.

A tentative program and registration information can be found at

http://dimacs.rutgers.edu/Workshops/DifferentialPrivacy/

## Privacy in the NYT

An article in yesterday’s New York Times (front “page” on the web last night) does a good job of highlighting some of the intricacies of “privacy” in online social networks. The article links to a surprising number of technical research articles. There were also two quotes that stuck out.

‘“Technology has rendered the conventional definition of personally identifiable information obsolete,” said Maneesha Mithal, associate director of the Federal Trade Commission’s privacy division.’

This is not news to most computer scientists, but it is nice to hear it from the FTC. [On a related point, the FTC is holding the third of a series of roundtable discussions on electronic privacy today. Webcast here.]

The ending quote of the article, from Jon Kleinberg, was more of a downer:

“When you’re doing stuff online, you should behave as if you’re doing it in public — because increasingly, [you are].”

I disagree with the most literal interpretation of the quote, since there are still many ways to do things privately online. But keeping your privacy increasingly requires both technical sophistication and great care. And of course that endangers some of the coolest things about the Internet.

## IPAM Workshop Wrap-Up

Last week was the Statistical and Learning-Theoretic Challenges in Data Privacy, which I co-organized with Cynthia Dwork, Steve Fienberg and Sesa Slavkovic. As I explained in my initial post on the workshop, the goal was to tie together work on privacy in statistical databases with the theoretical foundations of learning and statistics.

- Slides for most talks are online
- Blog posts: Arvind N., Jon K. #1, #2 (see also an older post by Ben R.)

The workshop was a success. For one thing, I got a new result out of it and lots of ideas for problems to work on. I even had fun most of the time^{1}.

### — A shift in tone —

More importantly, I felt a different tone in the conversations and talks at this workshop than at a previous ones involving a similar crowd. For the first time, most participants seemed to agree on what the important issues are. I’ve spent lots of time hanging out with statisticians recently, so this feeling may not have been shared by everyone. But one change was objectively clear: the statisticians in the crowd have become much better at describing their problems in computational terms. I distinctly remember encountering fierce resistance, at the original 2005 CS-Stats privacy workshop in Bertinoro, when we reductionist CS types tried to get statisticians to spell out the procedures they use to analyze social science data.

“Analysis requires judgement. It is as much art as science,” they said (which we translated as, “Recursion, shmecursion. We do not know our own programs!”).

“But can’t you try to pin down some common objectives?”, we answered.

This week, there were algorithms and well-defined objectives galore. It helped that we had some polyglots, like Martin Wainwright and Larry Wasserman, around to translate.

### — The “computational lens” at work —

An interesting feature of several talks was the explicit role of “computational” perspective. Both Frank McSherry and Yuval Nardi used techniques from numerical analysis, namely gradient ascent and the Newton-Raphson method, to design protocols which were both more efficient and easier to analyze than previous attempts based on a more global, structural perspective. Frank described a differentially private algorithm for logistic regression, joint with Ollie Williams; Yuval described an efficient SFE protocol for linear regression, joint with Steve Fienberg, Rob Hall, and others.

### — Two under-investigated ideas —

At the wrap-up session (see the notes), I pointed out two directions that I think have been investigated with much less rigor than they deserve:

#### “Cryptanalysis” for database privacy

It would be nice to have a systematic study of, and standard nomenclature for, attacks on privacy/anonymity in statistical databases. Right now it seems every paper ends up defining (or not defining) a model from scratch, yet many papers are doing essentially the same thing in different domains. Even an incomplete taxonomy would be helpful. Here are a few terms I’d like to see becoming standard:

- linkage attack
- reconstruction attack
- composition attack (my personal favorite)

On a related point, it would be nice to see a good categorization of the kinds of side information that gets used. For example, Johannes Gehrke at Cornell and his students have a few papers laying out categories of side information (I have issues with some of the positive results in those papers, but I think the quantification of side information is interesting).

#### Relaxed definitions of privacy with meaningful semantics

This is probably a topic for a much longer post, but briefly: it would be nice to see meaningful definitions of privacy in statistical databases that exploit the adversary’s uncertainty about the data. The normal approach to this is to specify a set of allowable prior distributions on the data (from the adversary’s point of view). However, one has to be careful. The versions I have seen are quite brittle. Some properties to keep in mind when considering new definitions:

- Composition
- Side information: is the class of priors rich enough to incorporate complex side information, such as an anonymization of a related database? [see composition above]
- Convexity and post-processing, as in Dan Kifer’s talk
- Equivalent, “semantic” characterizations [e.g. here, here]

### — Other notes —

- The majority of the talks were completely or partly on differential privacy. Notable exceptions: Brad Malin, Xiaofeng Wang, Ravi Kumar, Jiashun Jin, Yuval Nardi. Our goal was not to have such a preponderance of differential privacy talks, but some of the people we expected to talk about other things (like Jerry Reiter) decided to focus on differential privacy. Tailoring the talk to the crowd?
- The nonspeaker participants were heavily skewed towards CS. In particular, at least [see comments!] four professors (Gerome Miklau, Anupam Gupta, Jonathan Katz, Yevgeniy Dodis) and three postdocs (Katrina Liggett, Anand Sarwate, Arvind Narayanan) from CS departments attended just to listen to the talks; I recognized only one stats postdoc (Saki Kinney). I also recognized lots of UCLA locals there too from CS (Yuval Ishai, Rafi Ostrovsky, Amit Sahai) but none from statistics.
- The rump session + posters combination worked very well (despite my earlier doubts). Rump session slides are online.

^{1} Serious sleep deprivation due to jet-lagged kids and talk prep made the “fun” part occasionally difficult.

## A private event

It wasn’t exactly in stealth mode, but I heard about Data Privacy Day 2010 only after it happened.

Born of an effort to promote awareness of data privacy issues by the non-profit The Privacy Projects, this year’s celebration (?) included events at a several universities. Most interesting to me was a roundtable discussion at UC Berkeley sponsored by the Federal Trade Commission. I’m skeptical about how much the federal government will do about protecting privacy, but it is good to see serious interest.

This year’s events concentrated on consumer privacy and its apparent conflict with emerging business models. My recent research has been on handling privacy concerns in “statistical databases” — large collections of sensitive information that we would like to open up to wider scrutiny and analysis. Unsurprisingly, I would like to see “Data Privacy Day” also cover this aspect of data privacy. There is a danger, though, that the topic becomes too diffuse. What are really the most pressing privacy issues, and what should a broad “data privacy” awareness event cover?

**UPDATE (2/3/10)**: Arvind was part of the roundtable and has some notes on it at 33 bits. He includes there some interesting comments on academics’ participation in policy discussions. I’ll add only that at Penn State, quite a few faculty members are involved in policy, but mostly away from the public eye. For example, two weeks ago I met with a White House official about privacy issues in the release of government data sets; I’ve also been involved in (executive branch) panels on government handling of biometric data. However, it is true that public participation in policy discussions by academics is limited. That may be because many academics realize they would make bad politicians; as Arvind notes, misaligned incentives also play a role.

## Differential privacy and the secrecy of the sample

(This post was laid out lazily, using Luca‘s lovely latex2wp.)

** — 1. Differential Privacy — **

Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (*e.g.* the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.

The idea behind the definition is that *users*–that is, people getting access to aggregate information–should not be able to tell if a given individual’s data has been changed.

More formally, a data set is just a subset of items in a domain . For a given data set , we think of the server holding the data as applying a randomized algorithm , producing a random variable (distributed over vectors, strings, charts, or whatever). We say two data sets are *neighbors* if they differ in one element, that is, .

Definition 1A randomized algorithm is -differentially private if, for all pairs of neighbor data sets , and for all events in the output space of :

This definition has the flavor of *indistinguishability* in cryptography: it states that the random variables and must have similar distributions. The difference with the normal cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.

I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.

** — 2. Sampling to Amplify Privacy — **

To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.

Roughly it states that given a -differentially private algorithm, one can get an -differentially private algorithm at the cost of shrinking the size of the data set by a factor of .

Suppose is a -differentially private algorithm that expects data sets from a domain as input. Consider a new algorithm , which runs on a random subsample of points from its input:

Algorithm 2 (Algorithm )On input and a multi-set

- Construct a set by selecting each element of independently with probability .
- Return .

Lemma 3 (Amplification via sampling)If is -differentially private, then for any , is -differentially private.

## Insensitive attributes

And the award for best blog post title of the day goes to…

There is an Elephant in the Room; & Everyone’s Social Security Numbers are Written on Its Hide.

The post, by Richard Power, reports on an article by Alessandro Acquisti and Ralph Gross,* “*Predicting Social Security numbers from public data”, (faq, PNAS paper) which highlights how one can narrow down a US citizen’s social security number to a relatively small range based only on his or her state and date of birth.

As the Social Security Administration explained (see the elephantine blog post linked above), this was not really a secret; the SSA’s algorithm for generating SSN’s is public. The virtue of the Acquisti-Gross article is in pointing out the security implications of this clearly.

One of the interesting notions the study puts to rest is the distinction between “insensitive” and “sensitive” attributes. Almost anything can be used to identify a person, and once someone has a handle on you it is remarkably easy to predict, or find out, even more.

## IPAM Workshop on Privacy and Statistical (Learning) Theory

I am on the organizing committee (with Cynthia Dwork, Steve Fienberg, and Sesa Slavkovic) for an upcoming workshop at UCLA’s Institute for Pure and Applied Mathematics (IPAM). The workshop is on the relationship between database privacy and the theoretical foundations of statistics and machine learning. It is imaginatively titled:

Statistical and Learning-Theoretic Challenges in Data Privacy

(February 22-26, 2010)

(because the catchier “What Can We Learn Privately?” was already taken).

The workshop web page describes the basic thrust pretty concisely:

The goal of workshop is to establish a coherent theoretical foundation for research on data privacy. This implies work on (1) how the conflicting goals of privacy and utility can or should be formulated mathematically; and (2) how the constraints of privacy—in their various incarnations—affect the accuracy of statistical inference and machine learning. In particular, the goal is to shed light on the interplay between privacy and concepts such as consistency and efficiency of estimators, generalization error of learning, robustness and stability of estimation algorithms, and the generation of synthetic data.

The workshop is born of (what I consider) an exciting research program with potential payoffs both for how sensitive data is managed (see, *e.g.*, Abe Flaxman’s post on a recommendation for HIPAA’s overhaul) as well as statistics and statistical learning theory. For more detailed discussion, see:

- The survey paper by Cynthia Dwork and me, based on these two papers.
- The workshop proposal.

Participation is open to essentially anyone; to make it easier, **IPAM has funding to help some attendees with their travel costs**, especially students and other junior researchers. You can apply through the IPAM web page.

Several excellent researchers have already confirmed that they will speak (see the web page for the current list). I am especially happy about the breadth of the areas they hail from: crypto, algorithms, social science statistics, nonparametric statistics, theoretical and applied machine learning, and health data privacy, among others. Of special note, there will be four tutorials aimed at helping the diverse audience actually communicate:

- Larry Wasserman and Martin Wainwright will speak about the basic foundations of statistics and statistical learning theory;
- Two other people (possibly Cynthia Dwork and myself) will discuss the definitional approaches to privacy that have come out of the CS literature, especially differential privacy, and also the worst-case analysis perspective that is common to TCS papers.

The exact format and content of the tutorials is still t.b.d., so suggestions (either directly to me or via comments on this post) would be welcome.

### Why the workshop?

Good interdisciplinary work is notoriously hard. The first barrier is linguistic: new terminology, definitions, measures of accuracy/loss, etc (“like a U.N. meeting without the benefit of translators”, as Dick Lipton recently put it, describing some of Blum and Furst’s initial interactions with AI folks). Nevertheless, the *terminology* barrier can be overcome relatively easily (*i.e.*, on a scale of months or years) in theoretical fields with clean definitions and theorems, such as theoretical statistics and learning.

The more subtle barrier, and one that usually takes much more work to overcome, is one of *perspective*. Merely using the right language will get you partway, but “the wider point of view of [people in other fields] can be harder to grok” (Noam Nisan). What problem are they *really* trying to solve? What is their criterion for evaluating the quality or interest of a new idea? To add to the confusion, a field that looks monolithic from the outside may in fact be a highly heterogeneous mix of subdisciplines. For example, the question “what do statisticians actually (need to) do?”, which many of us working on data privacy have wondered aloud, has approximately as many answers as there are statisticians doing things…

As far as I can tell, these perspective barriers are best overcome by osmosis: spend as much time as possible interacting with a wide variety of people from the other field. I think theoretical work provides an especially fruitful venue for interaction because its products (namely definitions and theorems) are more universally interpretable. Of course, this opinion may simply be a result of my own preference for theoretical work…

So how does one get these interactions going? External stimuli, like deadlines for collaborative grant proposals, can help, but grant proposals require people who are already committed to working together. Workshops and conferences are also an important venue. Regarding data privacy, there were several successful workshops encouraging CS-statistics interaction: Bertinoro in 2005, NSF in 2007, CMU in 2007, DIMACS in 2008, NCHS in 2008 (no more web pages for two of those, unfortunately). The upcoming IPAM workshop is the first with an explicitly theoretical bent; I am hoping it will be an even greater success.