Balancing Privacy With Data Sharing for the Public Good
Socially valuable data can be combined with standards that safeguard individual privacy, an economist says.
Governments and technology companies are increasingly collecting vast amounts of personal data, prompting new laws, myriad investigations and calls for stricter regulation to protect individual privacy.
Yet despite these issues, economics tells us that society needs more data sharing rather than less, because the benefits of publicly available data often outweigh the costs. Public access to sensitive health records sped up the development of lifesaving medical treatments like the messenger-RNA coronavirus vaccines produced by Moderna and Pfizer. Better economic data could vastly improve policy responses to the next crisis.
Data increasingly powers innovation, and it needs to be used for the public good, while individual privacy is protected. This is new and unfamiliar terrain for policymaking, and it requires a careful approach.
The pandemic has brought the increasing dominance of big, data-gobbling tech companies into sharp focus. From online retail to home entertainment, digitally savvy businesses are collecting data and deploying it to anticipate product demand and set prices, lowering costs and outwitting more traditional competitors.
Data provides a record of what has already happened, but its main value comes from improving predictions. Companies like Amazon choose products and prices based on what you, and others like you, bought in the past. Your data improves their decision-making, boosting corporate profits.
Private companies also depend on public data to power their businesses. Redfin and Zillow disrupted the real estate industry thanks to access to public property databases. Investment banks and consulting firms make economic forecasts and sell insights to clients using unemployment and earnings data collected by the Department of Labor. By 2013, one study estimated, public data contributed at least $3 trillion per year to seven sectors of the economy worldwide.
The buzzy refrain of the digital age is that “data is the new oil,” but this metaphor is inaccurate. Data is indeed the fuel of the information economy, but it is more like solar energy than oil — a renewable resource that can benefit everyone at once, without being diminished.
One of the best examples of the transformative power of open data is the U.S. government-led Human Genome Project, which began in 1990 as an effort to map the entire sequence of human DNA by 2005. Before this, private labs would target and patent specific genes for research or for commercial applications such as developing drugs to treat genetic diseases. Instead of guarding their discoveries, the labs participating in the Human Genome Project posted their data on a public website within 24 hours of sequencing it and made it freely available, an arrangement known as the Bermuda Principles.
This commitment to open data saved lives and ushered in a new era of scientific progress in genetics. A clever study by the economist Heidi Williams, now at Stanford, compared the Human Genome Project to a contemporaneous gene sequencing effort by the company Celera. When Celera mapped a gene first, it protected its intellectual property by requiring other firms to negotiate licensing agreements or pay high fees before using the data. Years later, the genes mapped by Celera led to many fewer innovations and commercial products than those that were immediately put in the public domain. One study estimates that a $3.8 billion public investment in the Human Genome Project generated $796 billion in benefits and, in 2010 alone, 310,000 new jobs.
The data sharing norms established by the Bermuda Principles greatly sped up the development of the mRNA coronavirus vaccines. A Chinese lab announced the discovery of the novel coronavirus on Jan. 9, 2020; sequenced it over the next weekend; and released the genome sequence to the public immediately thereafter. By the end of January, labs around the world were developing vaccines based on the genome sequence, despite not yet having an actual sample. Without a commitment to open data, coronavirus vaccines might still be months away.
To be sure, the use of consumers’ genetic data raises serious privacy concerns. While it is common practice to remove identifiers such as surnames from genetic data before releasing it to the public, researchers have sometimes managed to identify individuals anyway by combining anonymous gene sequences with genealogical databases and other public information such as age and state of residence. These problems can be solved with further protections, but they require constant vigilance.
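The re-identification risk described above can be made concrete with a small sketch. This is a simplified illustration, not the method the researchers used: it joins "anonymous" records to a public list on shared attributes (here age and state of residence) and reports a match whenever the combination is unique. The function name, field names and all records are hypothetical.

```python
def link_records(anonymized, public, keys):
    """Return (record_id, name) pairs for anonymized records whose
    quasi-identifiers match exactly one person in the public list."""
    # Group the public records by their combination of quasi-identifiers.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # A unique candidate means the "anonymous" record is re-identified.
        if len(candidates) == 1:
            matches.append((record["record_id"], candidates[0]["name"]))
    return matches


# Hypothetical data: names were removed, but age and state remain.
anonymized = [
    {"record_id": "g1", "age": 47, "state": "UT"},
    {"record_id": "g2", "age": 33, "state": "CA"},
]
public = [
    {"name": "Person A", "age": 47, "state": "UT"},
    {"name": "Person B", "age": 33, "state": "CA"},
    {"name": "Person C", "age": 33, "state": "CA"},
]

print(link_records(anonymized, public, keys=("age", "state")))
```

In this toy example only the first record is exposed, because two people share the second record's age and state. The more attributes a release includes, the fewer people share any given combination, which is why removing names alone is not enough.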
Privacy can never be guaranteed with absolute certainty. The risks should always be minimized, and balanced against the benefits of the innovations that may arise from increased data availability.
Similar logic applies to economic data. Consider the U.S. policy response to the coronavirus. The Paycheck Protection Program provision of the Coronavirus Aid, Relief and Economic Security (CARES) Act provided hundreds of billions of dollars in forgivable loans to small businesses. Despite the large amount of relief available, demand for loans greatly exceeded supply. Ideally, loans would have been based on expected need, but the Treasury had no information about firms’ financial health.
In the absence of good data, the loans were distributed based on expediency rather than expected need. The program used local banks as intermediaries, and those banks lent disproportionately to firms with which they had strong connections. Economists estimate that the program spent between $150,000 and $377,000 per job saved, a high price for jobs that were guaranteed for only a few months.
A better program would target aid to business sectors and geographies that most need help, using real-time data from the businesses themselves. This data already exists, but only behind company walls. It should be anonymized as carefully as possible and assembled for public use, so that local policymakers and entrepreneurs can direct the relief to those who need it most.
One promising model is the Opportunity Insights Economic Tracker, a publicly available repository of anonymized data contributed by private companies. The tracker was started in May 2020 by researchers at Harvard and Brown. (I collaborate with Opportunity Insights, although I was not part of the work on the tracker.) Real-time analysis of economic effects, enabled by better data sharing, can improve the targeting of policies to those in greatest need.
Federal regulation of data needs a dual mandate, weighing privacy concerns against the social benefits of greater access. Two legislative proposals from the last Congress, by Senators Kirsten Gillibrand and Sherrod Brown, called for the creation of a federal agency devoted to protecting consumer data. This agency would take complaints, conduct investigations and keep a close eye on emerging technologies that threaten individual privacy.
This data protection agency could be combined with Data.gov, a government website created in 2009 that assembles and hosts hundreds of thousands of data sets for public use. Together they could form a kind of federal data library, democratizing knowledge for the digital age.
Just as traditional libraries curate and organize their collections, so could a digital library, adding new data sources and cleaning and assembling them for public use. A federal data library could also take the lead in developing and using new tools such as differential privacy, a technique designed to preserve important features of data while protecting individual identities.
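For readers curious how differential privacy works in practice, here is a minimal sketch of one standard approach, the Laplace mechanism, applied to a simple count. It is an illustration under stated assumptions, not a description of any agency's actual system: the function name `private_count`, the field names and the sample firms are all hypothetical. The idea is to add random noise scaled to 1/epsilon, so that adding or removing any one record changes the published answer's distribution only slightly.

```python
import random


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching predicate.

    Adding or removing one record changes a count by at most 1, so
    Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponential draws with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical data: which firms in a county cut staff last month.
firms = [
    {"sector": "retail", "cut_staff": True},
    {"sector": "retail", "cut_staff": False},
    {"sector": "dining", "cut_staff": True},
    {"sector": "dining", "cut_staff": True},
    {"sector": "health", "cut_staff": False},
]

rng = random.Random()
print(private_count(firms, lambda f: f["cut_staff"], epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger privacy; a larger one means a more accurate published figure. Choosing that value is exactly the trade-off between individual privacy and social benefit that a data library would have to manage.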
Data’s increasing value as an economic resource requires a new way of thinking. Strict privacy protections are needed to make socially valuable data available for the public good.