# Oddly Shaped Pegs

An inquiry into the Nature and Causes of Stuff

## Tutorial Videos on and around Differential Privacy

Aaron Roth and I organized a workshop on “Differential Privacy Across Computer Science” at DIMACS in the fall. Videos from the tutorials are now up (presumably they have been for a while, but I did not know it).
http://dimacs.rutgers.edu/Workshops/DifferentialPrivacy/Slides/slides.html
The tutorial speakers covered connections between DP and a range of areas:

• Moritz Hardt: Differentially private algorithms via learning theory
• Gerome Miklau: Query optimization techniques from the DB community
• Benjamin Pierce: Using PL techniques to automate and verify proofs of privacy
• Aaron Roth: Game-theoretic perspectives on privacy
All four talks were excellent, and they are a great resource for people (interested in) getting into the field.
Those talks all assume at least passing familiarity with differential privacy. For a gentler introduction, my tutorial from CRYPTO 2012 is online. The first third or so of the talk is not on differential privacy at all, but rather surveys the attacks and privacy breaches that motivated approaches such as differential privacy.
Watching the video, I realize that my talk was very slow-paced, so you may prefer to just read the slides (or maybe watch the video at 2x?).
Comments on any of the tutorials are welcome.

February 8, 2013 at 2:21 pm

## Differential privacy and the secrecy of the sample

(This post was laid out lazily, using Luca’s lovely latex2wp.)

— 1. Differential Privacy —

Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.

The idea behind the definition is that users (that is, people getting access to aggregate information) should not be able to tell if a given individual’s data has been changed.

More formally, a data set is just a subset of items in a domain ${D}$. For a given data set ${x\subset D}$, we think of the server holding the data as applying a randomized algorithm ${A}$, producing a random variable ${A(x)}$ (distributed over vectors, strings, charts, or whatever). We say two data sets ${x,x'}$ are neighbors if they differ in one element, that is, ${|x\ \triangle\ x'| =1}$.

Definition 1 A randomized algorithm ${A}$ is ${\epsilon}$-differentially private if, for all pairs of neighbor data sets ${x,x'}$, and for all events ${S}$ in the output space of ${A}$:

$\displaystyle \Pr(A(x)\in S) \leq e^\epsilon \Pr(A(x')\in S)\,.$

This definition has the flavor of indistinguishability in cryptography: it states that the random variables ${A(x)}$ and ${A(x')}$ must have similar distributions. The difference with the normal cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy—see this paper for a discussion.
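To make the multiplicative guarantee concrete, here is a minimal sketch (not from any paper) that checks Definition 1 numerically for one toy mechanism. The mechanism, a count of the data set plus two-sided geometric noise, and all function names are assumptions chosen for illustration. Because the output is discrete, bounding the ratio at every single output also bounds it for every event ${S}$.

```python
import math

def geometric_pmf(k: int, epsilon: float) -> float:
    """Pr[Z = k] for two-sided geometric noise with alpha = exp(-epsilon)."""
    alpha = math.exp(-epsilon)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(k)

def output_prob(dataset: frozenset, output: int, epsilon: float) -> float:
    """Pr[A(x) = output] for the toy mechanism A(x) = |x| + Z (hypothetical example)."""
    return geometric_pmf(output - len(dataset), epsilon)

def max_ratio(x: frozenset, x_prime: frozenset, epsilon: float, outputs) -> float:
    """Largest value of Pr[A(x) = o] / Pr[A(x') = o] over the given outputs."""
    return max(
        output_prob(x, o, epsilon) / output_prob(x_prime, o, epsilon)
        for o in outputs
    )

epsilon = 0.5
x = frozenset({"alice", "bob", "carol"})
x_prime = x - {"carol"}  # neighbor: the symmetric difference has size 1

outputs = range(-20, 25)
worst = max(
    max_ratio(x, x_prime, epsilon, outputs),
    max_ratio(x_prime, x, epsilon, outputs),
)

# The ratio never exceeds e^epsilon (up to floating-point error).
print(worst, math.exp(epsilon))
```

The check only covers one pair of neighbors and a finite window of outputs, so it illustrates the definition rather than proving anything about the mechanism.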

I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.

— 2. Sampling to Amplify Privacy —

To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.

Roughly, it states that given an ${O(1)}$-differentially private algorithm, one can get an ${\epsilon}$-differentially private algorithm at the cost of shrinking the size of the data set by a factor of ${\epsilon}$.

Suppose ${A}$ is a ${1}$-differentially private algorithm that expects data sets from a domain ${D}$ as input. Consider a new algorithm ${A'}$, which runs ${A}$ on a random subsample of ${ \approx\epsilon n}$ points from its input:

Algorithm 2 (Algorithm ${A'}$) On input ${\epsilon \in (0,1 )}$ and a multi-set ${x\subseteq D}$

1. Construct a set ${T\subseteq x}$ by selecting each element of ${x}$ independently with probability ${\epsilon}$.
2. Return ${A(T)}$.
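The two steps of Algorithm 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `subsample`, `amplified`, and `noisy_count` are hypothetical names, and `noisy_count` (a count plus Laplace noise of scale 1) is just a stand-in for some ${1}$-differentially private ${A}$.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

Item = TypeVar("Item")
Result = TypeVar("Result")

def subsample(x: Sequence[Item], epsilon: float, rng: random.Random) -> List[Item]:
    """Step 1: keep each element of x independently with probability epsilon."""
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie in (0, 1)")
    return [item for item in x if rng.random() < epsilon]

def amplified(
    A: Callable[[List[Item]], Result],
    epsilon: float,
    x: Sequence[Item],
    rng: Optional[random.Random] = None,
) -> Result:
    """Step 2: run the base algorithm A on the subsample T."""
    if rng is None:
        rng = random.Random()
    return A(subsample(x, epsilon, rng))

def noisy_count(data: List[int], rng: random.Random) -> float:
    """Stand-in for A: count plus Laplace(0, 1) noise.

    The difference of two independent Exp(1) variables is Laplace(0, 1).
    """
    return len(data) + rng.expovariate(1.0) - rng.expovariate(1.0)

rng = random.Random(0)
data = list(range(10_000))
answer = amplified(lambda t: noisy_count(t, rng), 0.1, data, rng)
print(answer)  # roughly 0.1 * len(data), plus sampling and Laplace noise
```

Note that the output describes the subsample, so an estimate of a count over the full data set would have to be rescaled by ${1/\epsilon}$, which is where the loss in accuracy shows up.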

Lemma 3 (Amplification via sampling) If ${A}$ is ${1}$-differentially private, then for any ${\epsilon\in(0,1)}$, ${A'(\epsilon,\cdot)}$ is ${2\epsilon}$-differentially private.
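
A short sketch of why the lemma holds may help here. It is one possible argument under the definitions above, and not necessarily the proof intended for this post. Take neighbors ${x}$ and ${x' = x\cup\{i\}}$, and let ${T}$ denote the random subsample of ${x}$. The subsample of ${x'}$ equals ${T}$ with probability ${1-\epsilon}$ and ${T\cup\{i\}}$ with probability ${\epsilon}$. Since ${T}$ and ${T\cup\{i\}}$ are neighbors and ${A}$ is ${1}$-differentially private, ${\Pr(A(T\cup\{i\})\in S)\leq e\cdot\Pr(A(T)\in S)}$. With probabilities taken over both ${T}$ and the coins of ${A}$, so that ${\Pr(A(T)\in S)=\Pr(A'(\epsilon,x)\in S)}$:

$\displaystyle \Pr(A'(\epsilon,x')\in S) = (1-\epsilon)\Pr(A(T)\in S) + \epsilon\Pr(A(T\cup\{i\})\in S) \leq (1+\epsilon(e-1))\Pr(A'(\epsilon,x)\in S) \leq e^{2\epsilon}\Pr(A'(\epsilon,x)\in S)\,.$

The last step uses ${1+t\leq e^{t}}$ and ${e-1<2}$. In the other direction, ${\Pr(A(T\cup\{i\})\in S)\geq e^{-1}\Pr(A(T)\in S)}$ gives

$\displaystyle \Pr(A'(\epsilon,x')\in S) \geq (1-\epsilon(1-e^{-1}))\Pr(A'(\epsilon,x)\in S) \geq e^{-\epsilon}\Pr(A'(\epsilon,x)\in S) \geq e^{-2\epsilon}\Pr(A'(\epsilon,x)\in S)\,,$

where the middle inequality holds on ${[0,1]}$ because the line ${1-\epsilon(1-e^{-1})}$ is the chord of the convex function ${e^{-\epsilon}}$ between ${\epsilon=0}$ and ${\epsilon=1}$.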

September 2, 2009 at 12:19 pm

## Insensitive attributes

And the award for best blog post title of the day goes to…

The post, by Richard Power, reports on an article by Alessandro Acquisti and Ralph Gross, “Predicting Social Security numbers from public data” (faq, PNAS paper), which highlights how one can narrow down a US citizen’s social security number to a relatively small range based only on his or her state and date of birth.

As the Social Security Administration explained (see the elephantine blog post linked above), this was not really a secret; the SSA’s algorithm for generating SSNs is public. The virtue of the Acquisti-Gross article is in pointing out the security implications of this clearly.

One of the interesting notions the study puts to rest is the distinction between “insensitive” and “sensitive” attributes. Almost anything can be used to identify a person, and once someone has a handle on you it is remarkably easy to predict, or find out, even more.