Tutorial Videos on and around Differential Privacy
http://dimacs.rutgers.edu/Workshops/DifferentialPrivacy/Slides/slides.html
The tutorial speakers covered connections between DP and a range of areas:
- Moritz Hardt: Differentially private algorithms via learning theory
- Gerome Miklau: Query optimization techniques from the DB community
- Benjamin Pierce: Using PL techniques to automate and verify proofs of privacy
- Aaron Roth: Game-theoretic perspectives on privacy
- Slides (with corrections): http://www.cse.psu.edu/~asmith/talks/2012-08-21-crypto-tutorial.pdf
- Video (no corrections!): http://www.youtube.com/watch?v=dbWx62C5Q4o&list=PL3C6A9D61E40300E6&index=26
Differential privacy and the secrecy of the sample
(This post was laid out lazily, using Luca's lovely latex2wp.)
— 1. Differential Privacy —
Differential privacy is a definition of “privacy” for statistical databases. Roughly, a statistical database is one which is used to provide aggregate, large-scale information about a population, without leaking information specific to individuals. Think, for example, of the data from government surveys (e.g. the decennial census or epidemiological studies), or data about a company’s customers that it would like a consultant to analyze.
The idea behind the definition is that users–that is, people getting access to aggregate information–should not be able to tell if a given individual’s data has been changed.
More formally, a data set is just a subset of items in a domain $D$. For a given data set $x \subseteq D$, we think of the server holding the data as applying a randomized algorithm $A$, producing a random variable $A(x)$ (distributed over vectors, strings, charts, or whatever). We say two data sets $x, x'$ are neighbors if they differ in one element, that is, $|x \triangle x'| = 1$.
Definition 1 A randomized algorithm $A$ is $\epsilon$-differentially private if, for all pairs of neighbor data sets $x, x'$, and for all events $S$ in the output space of $A$:

$$\Pr[A(x) \in S] \le e^{\epsilon} \cdot \Pr[A(x') \in S].$$
This definition has the flavor of indistinguishability in cryptography: it states that the random variables $A(x)$ and $A(x')$ must have similar distributions. The difference with the usual cryptographic setting is that the distance measure is multiplicative rather than additive. This is important for the semantics of differential privacy; see this paper for a discussion.
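To make the multiplicative guarantee concrete, here is a small Python sketch (my own illustration, not part of the original definition) of randomized response, a textbook $\epsilon$-differentially private mechanism: each participant reports their true bit with probability $e^{\epsilon}/(1+e^{\epsilon})$ and the flipped bit otherwise. Changing one person's bit changes the probability of any output by at most a factor of $e^{\epsilon}$, exactly the bound in Definition 1.

```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_truth else 1 - bit

def report_all(bits, epsilon):
    """Apply randomized response independently to each person's bit."""
    return [randomized_response(b, epsilon) for b in bits]

# Sanity check of the multiplicative bound for a single report:
# Pr[report = 1 | bit = 1] / Pr[report = 1 | bit = 0] = e^eps.
eps = 0.5
p = math.exp(eps) / (1.0 + math.exp(eps))
print(f"probability ratio = {p / (1 - p):.4f}, e^eps = {math.exp(eps):.4f}")
```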
I hope to write a sequence of posts on differential privacy, mostly discussing aspects that don’t appear in published papers or that I feel escaped attention.
— 2. Sampling to Amplify Privacy —
To kick it off, I’ll prove here an “amplification” lemma for differential privacy. It was used implicitly in the design of an efficient, private PAC learner for the PARITY class in a FOCS 2008 paper by Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova and myself. But I think it is of much more general usefulness.
Roughly, it states that given a 1-differentially private algorithm, one can get an $\epsilon$-differentially private algorithm at the cost of shrinking the size of the data set by a factor of $\epsilon$.
Suppose $A$ is a 1-differentially private algorithm that expects data sets from a domain $D$ as input. Consider a new algorithm $A^{\beta}$, which runs $A$ on a random subsample containing, in expectation, a $\beta$ fraction of the points from its input:
Algorithm 2 (Algorithm $A^{\beta}$) On input $\beta \in (0,1)$ and a multi-set $x \subseteq D$:
- Construct a set $T \subseteq x$ by selecting each element of $x$ independently with probability $\beta$.
- Return $A(T)$.
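In code, $A^{\beta}$ is just a thin wrapper around $A$. Here is a minimal Python sketch (the function names are mine, and `A` stands for any differentially private routine that accepts a list of records):

```python
import random

def subsample(x, beta, rng=random):
    """Keep each element of the multi-set x independently with probability beta."""
    return [record for record in x if rng.random() < beta]

def algorithm_A_beta(A, x, beta, rng=random):
    """Algorithm 2: run the private algorithm A on a random subsample of x."""
    return A(subsample(x, beta, rng))

# Hypothetical usage, where noisy_count is some 1-differentially private routine:
# result = algorithm_A_beta(noisy_count, data, beta=0.05)
```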
Lemma 3 (Amplification via sampling) If $A$ is 1-differentially private, then for any $\beta \in (0,1)$, the algorithm $A^{\beta}$ is $2\beta$-differentially private.
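To get a feel for the numbers, here is a quick check (my own, not from the post) using the commonly cited Poisson-subsampling bound $\ln(1 + \beta(e^{\epsilon} - 1))$; for $\epsilon = 1$ it sits below the $2\beta$ guarantee of Lemma 3.

```python
import math

def amplified_epsilon(epsilon: float, beta: float) -> float:
    """Commonly cited amplification bound: an epsilon-DP algorithm run on a
    subsample (each element kept with probability beta) satisfies this epsilon."""
    return math.log(1.0 + beta * (math.exp(epsilon) - 1.0))

# For epsilon = 1, the bound stays below the 2*beta of Lemma 3.
for beta in (0.5, 0.1, 0.01):
    print(f"beta = {beta:5.2f}: amplified eps = {amplified_epsilon(1.0, beta):.4f}"
          f"  (2*beta = {2 * beta:.2f})")
```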
Insensitive attributes
And the award for best blog post title of the day goes to…
There is an Elephant in the Room; & Everyone’s Social Security Numbers are Written on Its Hide.
The post, by Richard Power, reports on an article by Alessandro Acquisti and Ralph Gross, "Predicting Social Security numbers from public data" (faq, PNAS paper), which shows how one can narrow down a US citizen's Social Security number to a relatively small range based only on his or her state and date of birth.
As the Social Security Administration explained (see the elephantine blog post linked above), this was not really a secret; the SSA's algorithm for generating SSNs is public. The virtue of the Acquisti-Gross article is that it spells out the security implications clearly.
One of the interesting notions the study puts to rest is the distinction between “insensitive” and “sensitive” attributes. Almost anything can be used to identify a person, and once someone has a handle on you it is remarkably easy to predict, or find out, even more.