Datatrust Governance and Policies: Questions, Concerns and Bright Ideas.
A running list of open issues for governing a datatrust.
- What is the datatrust? What is its purpose?
- Who builds the datatrust technology? Who gets to use it?
- Who holds the data?
- Who runs the datatrust and how?
  - The Community.
  - The Board.
  - The Staff.
- How is the datatrust funded?
- How do we monitor datatrust health?
- Can the datatrust change its mission? Does it have a living will?
I. WHAT IS THE DATATRUST? WHAT IS ITS PURPOSE?
The datatrust is an online service that will allow organizations to open sensitive data to the public and provide researchers, policymakers and application developers with a way to directly query the data, all without compromising individual privacy.
Today, most of the sensitive data about us (e.g. medical records, personal finance data, online search history) is inaccessible to us and to those who represent the public: elected officials, government agencies, advocacy groups, researchers.
Our goal for the datatrust is to create an open marketplace for information to democratize access to some of the most sensitive and valuable data we have.
To realize this goal, we believe the datatrust must also be a data catalog, a registry of queries and their privacy risks, and a collaboration network for Data Donors and Data Users.
II. WHO BUILDS THE DATATRUST? WHO GETS TO USE IT?
Currently, a prototype of the datatrust is being developed by Shan Gao Ma, the consulting company of CDP founder Alex Selkirk. We will need to work out:
- Who owns this software, and how will it be licensed for use?
- Will the software be open sourced? Will we license it to other organizations interested in setting up their own datatrusts?
- In general, what is our process for vetting and accepting software donations?
- What operational audits and transparency measures will be needed to assure the public that the technology is functioning as expected?
III. WHO HOLDS THE DATA?
CDP may store the data, but the datatrust could instead exist as a “query plus privacy filter” layer that pulls data from diverse sources and holds it in RAM just long enough for questions to be asked and answered before the data is deleted.
Is this possible given the amount of data curation that needs to happen in order to make sense of data from diverse sources? What are the advantages and disadvantages of either model?
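To make the second model concrete, here is a minimal sketch of a "query plus privacy filter" layer. Everything in it is hypothetical: `ephemeral_records`, `answer_query`, and the idea that each source is a callable returning a list of records are assumptions made for illustration, not a description of the prototype. It shows only the shape of the flow: pull, query, filter, discard.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_records(sources):
    """Pull records from every source into memory and drop them on exit.

    `sources` is a hypothetical list of zero-argument callables, each
    returning a list of records.
    """
    records = []
    try:
        for fetch in sources:
            records.extend(fetch())
        yield records
    finally:
        # Drop our references once the query is done. Python cannot promise
        # the underlying memory is wiped, so this is illustrative only.
        records.clear()


def answer_query(sources, query, privacy_filter):
    """Run `query` on transient data and release only the filtered answer."""
    with ephemeral_records(sources) as records:
        raw_answer = query(records)
    # Only the privacy-filtered answer ever leaves this function.
    return privacy_filter(raw_answer)
```

The sketch sidesteps the curation question raised above: it assumes every source already returns records in a shape the query understands, which is exactly the work a transient layer has no obvious place to do.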
We will NEVER sell the data that is entrusted to us. We may charge maintenance fees for access to the datatrust (see How is the Datatrust Funded?), but we will never seek to make a profit by selling proprietary data. That is because the data will not belong to us. Rather, Data Donors will entrust us to hold the data, the way account holders entrust banks to hold their financial assets. We will hold the data only so Data Users from the public can access it safely. Any other use would be a violation of our mission.
We will not promise that our data is "anonymized," a word that has no technical or agreed-upon meaning. Rather, we will make a mathematical and therefore verifiable and enforceable promise.
The mathematical promise will require us to set a "privacy budget" for each data set, which limits the degree of confidence with which anyone can draw conclusions about you based on information given out by the datatrust. The privacy budget will determine both the number of questions that can be asked of a data set and the amount of noise that is added to obfuscate answers, which in turn obfuscates individual identities in the data set.
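As an illustration of how a privacy budget ties together the number of questions and the amount of noise, here is a minimal sketch. It assumes a differential-privacy style guarantee with Laplace noise, which this document does not itself specify. The names `BudgetedDataset`, `noisy_count`, and `epsilon` are hypothetical and chosen only for the example.

```python
import random


class PrivacyBudgetExceeded(Exception):
    """Raised when a question would cost more budget than remains."""


class BudgetedDataset:
    """Hypothetical data set wrapper that answers counts with noise."""

    def __init__(self, records, total_epsilon, rng=None):
        self._records = list(records)
        self.remaining_epsilon = total_epsilon
        self._rng = rng or random.Random()

    def _laplace_noise(self, scale):
        # The difference of two Exp(1) draws is Laplace(0, 1); rescale it.
        return scale * (self._rng.expovariate(1.0) - self._rng.expovariate(1.0))

    def noisy_count(self, predicate, epsilon):
        """Answer "how many records match?" at a cost of `epsilon`."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining_epsilon:
            raise PrivacyBudgetExceeded("not enough budget left for this question")
        self.remaining_epsilon -= epsilon
        true_count = sum(1 for record in self._records if predicate(record))
        # One person changes a count by at most 1, so the noise scale is
        # 1 / epsilon: cheaper questions get noisier answers.
        return true_count + self._laplace_noise(1.0 / epsilon)
```

Under these assumptions, a data set with a total budget of 1.0 could answer two questions at `epsilon=0.5` or ten noisier ones at `epsilon=0.1`, after which it would refuse further questions.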
Providing a quantifiable privacy guarantee also allows individuals to make meaningful demands about exactly how much privacy risk they're willing to take on in handing over their data.
You can read more about our quantifiable privacy guarantee on our blog.
- With a limited budget of questions allowed for each data set, how will we apportion budget to Data Users? Who will manage and have input into the application process? By what standards will we evaluate applications? Will we favor certain uses over others (e.g. research over business applications)? Will we set quotas for how much privacy budget any one Data User can consume?
- We know we will need to crowdsource the curation of data donations. How will the community evaluate and curate donated data without using up too much privacy budget?
- We know Data Users will need to "inspect" data donations in order to decide whether and how to make use of them. How will we provide enough information about the data to enable inspections without using up too much privacy budget?
- Are there ways to "maximize" budget? For example, could we find more efficient ways to query data, keep track of which questions have already been asked and release the same answers when they are asked again, or require Data Users to share their answers?
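Two of the ideas in the questions above, per-user quotas and re-releasing answers that have already been given, can be sketched together. This is an illustration under assumptions, not a proposed design: `BudgetLedger`, `ask`, and `query_key` are hypothetical names, and the sketch assumes the noisy answer is produced by a caller-supplied function. It relies on the fact that repeating an answer that is already public reveals nothing new.

```python
class BudgetLedger:
    """Hypothetical ledger tracking budget per data set and per Data User."""

    def __init__(self, total_epsilon, per_user_quota):
        self.remaining = total_epsilon
        self.per_user_quota = per_user_quota
        self._spent_by_user = {}
        self._released = {}  # query_key -> answer already given out

    def spent_by(self, user):
        return self._spent_by_user.get(user, 0.0)

    def ask(self, user, query_key, epsilon, compute_noisy_answer):
        # A question that was already answered costs nothing to repeat:
        # everyone gets the same released answer.
        if query_key in self._released:
            return self._released[query_key]
        if self.spent_by(user) + epsilon > self.per_user_quota:
            raise PermissionError(f"{user} would exceed their quota")
        if epsilon > self.remaining:
            raise RuntimeError("data set budget exhausted")
        answer = compute_noisy_answer(epsilon)
        # Charge the budget only after an answer was actually produced.
        self.remaining -= epsilon
        self._spent_by_user[user] = self.spent_by(user) + epsilon
        self._released[query_key] = answer
        return answer
```

One design choice is visible here: the first Data User to ask a question is charged for it and everyone after them rides free. Whether that is fair is one of the governance questions this list is meant to raise.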
We do not want to be a magnet for data mining by law enforcement, as we believe that would undermine our goal of encouraging more open data. Law enforcement will have the same access as the general public to the aggregated data.
- What recourse do we have if warrants or subpoenas are obtained to bypass the datatrust's privacy technology and gain unlimited access to the raw data? How can we limit the scope of what law enforcement can access?
- Is there legislation we can support that would protect Data Donors and the data they've deposited with us, the way the Right to Financial Privacy Act protects financial records from being accessed by law enforcement without a warrant, subpoena or other formal legal process? How might proposed changes to the Electronic Communications Privacy Act affect the datatrust?