
Defining a Datatrust

A datatrust will be an online service that allows organizations to make sensitive data available to the public and gives researchers, policymakers and application developers a way to query that data directly. We believe the datatrust is only possible with two kinds of innovation: technical innovations that allow us to offer a new breed of privacy guarantee, one that is quantifiable and enforceable, and community-centered governance and policy innovations that inspire public confidence.

The following is a largely technical discussion of datatrust functionality. See our work on Governance and Policies for more on how the datatrust will be run and the rights and responsibilities of Data Donors and Data Users.

  1. How will it work?
  2. How will you protect privacy?
  3. What do you mean by a "quantifiable and enforceable" privacy guarantee?
  4. Additional resources

I. HOW WILL A DATATRUST WORK?

A datatrust offers the following functionality:

  1. A database of sensitive data records stored in raw form, not pre-digested aggregate reports.
  2. A way to search and browse (i.e. a data catalog and data profiles)
  3. A way to query data

Community-driven activities will include:

  1. A way to donate data
  2. A way to apply to use data
  3. A way to collaboratively curate data
  4. A catalog of data donors, data uses and data users

An important distinction between the datatrust and most other open data portals you may be familiar with (e.g. data.gov) is that the datatrust offers direct query access to the data, but not the data itself. In that sense, the datatrust is a data-driven online service, not a data provider.
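
To make that distinction concrete, here is a minimal sketch in Python (the class and method names are our own illustration, not a specification of the datatrust's interface): raw records live inside the service, and data users receive only answers, never the records. Section II adds the noise that makes those answers safe to release.

    class Datatrust:
        """Sketch: raw records stay inside the service; only answers leave."""

        def __init__(self, records):
            self._records = list(records)   # raw data is never exported

        def count(self, predicate):
            """Answer an aggregate query without revealing any record."""
            return sum(1 for r in self._records if predicate(r))

    # A data user asks a question and receives a number, never the dataset.
    trust = Datatrust([{"age": 34, "smoker": True}, {"age": 51, "smoker": False}])
    print(trust.count(lambda r: r["smoker"]))   # -> 1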

II. HOW WILL YOU PROTECT PRIVACY?

Unlike most open government data releases, the datatrust will not rely on labor-intensive and subjective anonymization methods. Existing methods like scrubbing, swapping or synthesizing data limit the accuracy and usefulness of the data. Instead, privacy protection will happen on the fly: answers to queries are obscured using differential privacy, an area of research that protects privacy by applying randomly-generated noise to answers. In this way, the datatrust will be able to standardize and automate the “anonymization” process of releasing data.
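
As an illustration, the best-known differentially private mechanism, the Laplace mechanism, answers a count query by adding noise whose scale is calibrated to the query's sensitivity and a privacy parameter epsilon. The sketch below is a textbook version in Python, not the datatrust's actual implementation:

    import random

    def dp_count(records, predicate, epsilon):
        """Differentially private count: true count plus Laplace(1/epsilon) noise.

        A count query has sensitivity 1 (adding or removing one person changes
        the answer by at most 1), so Laplace noise with scale 1/epsilon yields
        epsilon-differential privacy.
        """
        true_count = sum(1 for r in records if predicate(r))
        # Laplace(0, b) sampled as the difference of two Exponential(1/b) draws.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

    # Smaller epsilon -> more noise -> stronger privacy, at the cost of accuracy.
    records = [{"smoker": True}] * 40 + [{"smoker": False}] * 60
    print(dp_count(records, lambda r: r["smoker"], epsilon=0.1))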

This also explains why the datatrust must remain a query interface to data and cannot simply "give away" its data.

III. WHAT DO YOU MEAN BY A "QUANTIFIABLE AND ENFORCEABLE" PRIVACY GUARANTEE?

We believe that in order for the datatrust to be viable, we need to shift away from an "either-or" mental model of privacy to a more graduated approach: privacy cost can be measured and therefore spent in increments, and agreed-upon limits (i.e. a privacy budget) can be enforced against data users who exceed them. A handy side effect of differential privacy is that because the "noise" applied to protect privacy is mathematically generated, it can also be used to quantify the privacy risk of exposing data to queries.
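
Concretely, the same epsilon that sets the noise scale also prices the query: a smaller epsilon means more noise and less privacy risk, and under the standard sequential composition theorem the epsilons of successive queries on the same data simply add. That additivity is what makes a spendable privacy budget possible; a trivial sketch of the arithmetic (the budget value is assumed for illustration):

    # Each answered query "spends" its epsilon. Sequential composition bounds
    # the total privacy cost of a series of queries by the sum of their epsilons.
    query_epsilons = [0.1, 0.05, 0.25]   # three queries against one dataset
    total_cost = sum(query_epsilons)     # roughly 0.4 spent so far
    remaining = 1.0 - total_cost         # against an assumed budget of 1.0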

This means that the datatrust must:

  1. Track all queries and answers
  2. Track the privacy cost of each query
  3. Remove datasets when their privacy budget has been used up
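
A minimal sketch of that bookkeeping in Python, assuming a fixed per-dataset epsilon budget (the names and structure are illustrative, not the datatrust's actual design):

    class PrivacyLedger:
        """Tracks every query against a dataset's fixed privacy budget."""

        def __init__(self, budget):
            self.budget = budget   # total epsilon this dataset may ever spend
            self.spent = 0.0
            self.log = []          # (query, epsilon) pairs, kept for auditing

        def charge(self, query, epsilon):
            """Record a query's cost; refuse it once the budget is exceeded."""
            if self.spent + epsilon > self.budget:
                raise RuntimeError("privacy budget exhausted: retire this dataset")
            self.spent += epsilon
            self.log.append((query, epsilon))

    # Usage: every query is charged before its (noisy) answer is released.
    ledger = PrivacyLedger(budget=1.0)
    ledger.charge("count of smokers", epsilon=0.1)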