Democratizing Our Data: A Manifesto (excerpt)

September 16, 2020

Excerpted from Democratizing Our Data: A Manifesto, by Julia Lane (footnotes omitted). Copyright © MIT Press. Excerpted by permission of MIT Press. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.

1. THE PROBLEM, WHY IT MATTERS, AND WHAT TO DO

Public data are foundational to our democratic system. We know about income inequality and job trends thanks to data from the Bureau of Labor Statistics. We know what’s happening to economic growth thanks to data from the US Census Bureau. We know about the impact of business tax changes thanks to data from the Statistics of Income Division. Data like these are profoundly important for most of us, and especially for individuals and small businesses who can’t pay for expensive experts to produce customized reports.

One of government’s jobs is to level the data playing field. Statistical agencies have historically been the source of accurate and objective information for democracies, due to the limitations of private sector–produced data. For example, emergency supplies probably shouldn’t be allocated to an area based on the frequency of tweets from that location. Why? Because that would mean more supplies going to the people who tweet, underserving babies and elderly residents who are less likely to have Twitter accounts. Emergency supplies should be allocated based on information about the people likely to need such supplies, and government data are the way in which we ensure that the right people are counted. If people aren’t counted, they don’t count, and that threatens our democracy.

But recently, the playing field has been tilting against public data. Our current statistical system is under stress, too often based on old technology and with too little room for innovation. Our public statistical institutions often are not structurally capable of taking advantage of massive changes in the availability of data and the public need for new and better data to make decisions. And without breakthroughs in public measurement, we’re not going to get intelligent public decision-making. Over twenty years ago, one of the great statistical administrators of the twentieth century, Janet Norwood, pointed to the failing organizational structure of federal statistics and warned that, “In a democratic society, public policy choices can be made intelligently only when the people making the decisions can rely on accurate and objective statistical information to inform them of the choices they face and the results of choices they make.”

We must rethink ways to democratize data. There are successful models to follow and new legislation that can help effect change. The private sector’s Data Revolution—where new types of data are collected and new measurements created by the private sector to build machine learning and artificial intelligence algorithms—can be mirrored by a public sector Data Revolution, one that is characterized by attention to counting all who should be counted, measuring what should be measured, and protecting privacy and confidentiality. Just as US private sector companies—Google, Amazon, Microsoft, Apple, and Facebook—have led the world in the use of data for profit, the US can show the world how to produce data for the public good.

There are massive challenges to be addressed. The national statistical system—our national system of measurement—has ossified. Public agencies struggle to change the approach to collecting the statistics that they have produced for decades—in some cases, as we shall see, since the Great Depression. Hamstrung by excessive legislative control, inertia, lack of incentives, ill-advised budget cuts, and the “tyranny of the established,” they have largely lost the ability to innovate or respond to quickly changing user needs. Despite massive increases in the availability of new types of data, such as administrative records (data produced through the administration of government programs, such as tax records) or by digital activities (such as social media or cell phone calls), the US statistical agencies struggle to operationalize their use. Worse still, the government agencies that produce public data are at the bottom of the funding chain—staffing is being cut, funding is stagnant if not being outright slashed, and entire agencies are being decimated.

If we don’t move quickly, the cuts that have already affected physical, research, and education infrastructures will also eventually destroy our public data infrastructure and threaten our democracy. Trust in government institutions will be eroded if government actions are based on political preference rather than grounded in statistics. The fairness of legislation will be questioned if there is not impartial data whereby the public can examine the impact of legislative changes in, for example, the provision of health care and the imposition of taxes. National problems, like the opioid crisis, will not be addressed, because governments won’t know where or how to allocate resources. Lack of access to public data will increase the power of big businesses, which can pay for data to make better decisions, and reduce the power of small businesses, which can’t. The list is endless because the needs are endless.

This book provides a solution to the impending critical failure in public data. Our current approach and the current budget realities mean that we cannot produce all the statistics needed to meet today’s expectations for informing increasingly complex public decisions. We must design a new statistical system that will produce public data that are useful at all levels of government—and make scientific, careful, and responsible use of many newly available data, such as administrative records from agencies that administer government programs, data generated from the digital lives of citizens, and even data generated within the private sector.

This book will paint a picture of what this new system could look like, focusing on the innovations necessary to disrupt the existing federal statistical system, with the goal of providing useful and timely data from trusted sources so that we, the people, have the information necessary to make better decisions.

WHY IT MATTERS

Measurement is at the core of democracy, as Simon Winchester points out: “All life depends to some extent on measurement, and in the very earliest days of social organization a clear indication of advancement and sophistication was the degree to which systems of measurement had been established, codified, agreed to and employed.” Yet public data and measurement have to be paid for out of the public purse, so there is great scrutiny of costs and quality. The challenge public agencies face is that, as Erik Brynjolfsson, the director of MIT’s Initiative on the Digital Economy, points out, we have become used to getting digital goods that are free . . . and instant and useful. Yet in a world where private data are getting cheaper, the current system of producing public data costs a lot of money—and costs are going up, not down. One standard is how much it costs the Census Bureau to count the US population. In 2018 dollars, the 1960 Census cost about $1 billion, or about $5.50 per person. The 1990 Census cost about $20 per head. The 2020 Census is projected to cost about $16 billion, or about $48 per head. And the process is far from instant: Census Day is April 1, 2020, but the results won’t be delivered until December.
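The per-head figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it uses just the rounded figures quoted in this excerpt (all in 2018 dollars), so the implied populations are approximations rather than official counts, and the function and variable names are hypothetical.

```python
# Rounded figures quoted in the text, all in 2018 dollars.
COST_PER_HEAD = {1960: 5.50, 1990: 20.00, 2020: 48.00}
# The text gives total costs only for 1960 and 2020.
TOTAL_COST = {1960: 1e9, 2020: 16e9}


def growth_multiple(start_year, end_year):
    """How many times more the later census costs per head than the earlier one."""
    return COST_PER_HEAD[end_year] / COST_PER_HEAD[start_year]


def implied_population(year):
    """Population implied by dividing the quoted total cost by the quoted per-head cost."""
    return TOTAL_COST[year] / COST_PER_HEAD[year]


for start, end in [(1960, 1990), (1990, 2020), (1960, 2020)]:
    print(f"{start} to {end}: per-head cost multiplied by {growth_multiple(start, end):.1f}")

for year in sorted(TOTAL_COST):
    print(f"{year}: implied population of about {implied_population(year) / 1e6:.0f} million")
```

Because every input is rounded, the results only indicate orders of magnitude: the per-head cost in real terms rises severalfold over sixty years, while the implied populations land close to the actual size of the country in each census year.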

Another standard is the quality of data that are collected. Take a look, for example, at the National Center for Health Statistics report to the Council of Professional Associations on Federal Statistics. Response rates on the National Health Interview Survey have dropped by over 20 percentage points, increasing the risk of nonresponse bias, and the rate at which respondents “break off” or fail to complete the survey has almost tripled over a twenty-year period.

As a result, communities are not getting all the information they need from government for decision-making. If we made a checklist of features of data systems that have made private sector businesses like Amazon and Google successful, it might include producing data that are: (1) real-time so customers can make quick decisions; (2) accurate so customers aren’t misled; (3) complete so there is enough information for the customer to make a decision; (4) relevant to the customer; (5) accessible so the customer can easily get to information and use it; (6) interpretable so everyone can understand what the data mean; (7) innovative so customers have access to new products; and (8) granular enough so each customer has customized information.

If we were to look at the flagship programs of the federal system, they don’t have those traits. Take, for example, the national government’s largest survey—the Census Bureau’s American Community Survey (ACS). It was originally designed to consistently measure the entire country so that national programs that allocated dollars to communities based on various characteristics were comparing the whole country on the same basis. It is an enormous and expensive household survey. It asks questions of 295,000 households every month—about 3.5 million households a year. The cost to the Census Bureau is about $220 million and another $64 million can be attributed to the respondents in the value of the time taken to answer the questions. Because there is no high-quality alternative, it is used in hundreds if not thousands of local decisions—as the ACS website says, it “helps local officials, community leaders, and businesses understand the changes taking place in their communities.” In New York alone, the police department must report on priority areas that are determined, in part, using ACS poverty measures, pharmacies must provide translations for top languages as defined by the ACS, and the New York Department of Education took 2008 ACS population estimates into account when it decided to make Diwali a school holiday.

Yet while reliable local data are desperately needed, the very expensive ACS data are too error prone for reliable local decision-making. The reasons for this include the survey design, sample sizes that are too small, public interpretation of margins of error when sample sizes are small, and lack of timely dissemination of data.

I’ll discuss some of the details of these reasons in chapter 2—but one core problem is the reliance on old technology. The data are collected by means of mailing a survey to a random set of households (one out of 480 households in any given month). One person is asked to fill out the survey on behalf of everyone else in the household, as well as to answer questions about the housing unit itself. To give you a sense of the issues with this approach: there is no complete national list of households (the Census Bureau’s list misses about 6 percent of households), about a third of recipients refuse to respond, and of those who respond, many do not fill out all parts of the survey. There is follow-up of a subset of nonresponders by phone, internet, and in-person interviews, but each one of these introduces different sources of bias in terms of who responds and how they respond. Because response rates vary by geography and demography, those biases can be very difficult to adjust for. Such problems are not unique to the ACS; surveys in general are less and less likely to be truly representative of the people in the United States and the mismatch between intentions and reality can result in the systematic erasure of millions of Americans from governmental decision-making.

Statistical agencies face major privacy challenges as well. The increased availability of data on the internet means that it is much easier to reidentify survey respondents, so more and more noise has to be introduced into the data in order to protect respondent privacy. This noise results in reduced data reliability, particularly for small populations. For example, the Census Bureau is systematically making data worse to protect privacy.

Census data from 2010 showed that a single Asian couple—a 63-year-old man and a 58-year-old woman—lived on Liberty Island, at the base of the Statue of Liberty. That was news to David Luchsinger, who had taken the job as the superintendent for the national monument the year before. On Census Day in 2010, Mr. Luchsinger was 59, and his wife, Debra, was 49. In an interview, they said they had identified as white on the questionnaire, and they were the island’s real occupants.

Before releasing its data, the Census Bureau had “swapped” the Luchsingers with another household living in another part of the state, who matched them on some key questions. This mechanism preserved their privacy, and kept summaries like the voting age population of the island correct, but also introduced some uncertainty into the data.

Yes, you read that right. Not only are US taxpayers paying $48 a head to be counted (not including the cost of taxpayer time to fill out the forms), but then the numbers are systematically distorted and made less useful to protect privacy. Community input into the tradeoff between data quality and privacy protection is sorely needed.

In addition, the data that are produced are hard to interpret and apply at the state or local level. The ACS has come under fire for the fact that the estimates the survey produces are “simply too imprecise for small area geography.”

What is the practical implication? Take the measurement of child poverty, for example, which is used by state and local governments to figure out how to allocate taxpayer dollars to poor children. In one county (Autauga) in Alabama, with a total population of about fifty-five thousand, the ACS estimates that 139 children under age 5 live in poverty—plus or minus 178! So the plausible range is somewhere between 0 and 317.
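The range quoted above follows directly from the estimate and its margin of error. The following is a minimal sketch, not the Census Bureau's own procedure: the function names are hypothetical, and the conversion to a standard error assumes the margin of error is stated at the 90 percent confidence level (z = 1.645), which is not spelled out in this excerpt.

```python
def plausible_range(estimate, margin_of_error):
    """Estimate plus or minus its margin of error, floored at zero
    because a count of children cannot be negative."""
    lower = max(0, estimate - margin_of_error)
    upper = estimate + margin_of_error
    return lower, upper


def relative_margin(estimate, margin_of_error):
    """Margin of error as a fraction of the estimate itself."""
    return margin_of_error / estimate


def coefficient_of_variation(estimate, margin_of_error, z=1.645):
    """Standard error divided by the estimate.
    Assumption: the margin of error is z standard errors wide (90 percent level)."""
    standard_error = margin_of_error / z
    return standard_error / estimate


# Autauga County example from the text: 139 children, plus or minus 178.
low, high = plausible_range(139, 178)
print(f"Plausible range: {low} to {high}")
print(f"Margin of error relative to estimate: {relative_margin(139, 178):.0%}")
print(f"Coefficient of variation: {coefficient_of_variation(139, 178):.2f}")
```

A margin of error larger than the estimate itself means the data cannot distinguish between no poor children under 5 and more than twice the reported number, which is why such figures are hard to use for allocating funds.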

The problem is not just that the errors are large, but also that they are larger, reflecting lower-quality estimates, for lower-income and central city neighborhoods—precisely the areas that are often targeted for policy interventions. This undemocratic distortion of millions of Americans will result in the inaccurate estimation of, for example, the effects of health and tax policy on low-income individuals, and, unfortunately, the inequality in data coverage is increasing.

Even worse, the data are not timely. They are made available two years after they are collected (in May of 2019, 2017 data are available on the ACS website) and the five-year rolling average approach (aggregating over a moving window of the previous five years of data as the survey date rolls through time) means that the 2017 data include survey information from 2013. Obviously, the information is largely unsuitable for areas that are rapidly changing due to immigration, outmigration, or the opening or closure of major employers.
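The timing described above can be made concrete with a short calculation. This is a sketch under the assumptions stated in the text (a five-year window ending in the labeled year, viewed two years later); the function names are hypothetical.

```python
def survey_years(label_year, window=5):
    """Calendar years whose survey responses feed an estimate labeled `label_year`."""
    first_year = label_year - window + 1
    return list(range(first_year, label_year + 1))


def oldest_data_age(viewing_year, label_year, window=5):
    """Age in years of the oldest responses in the estimate when it is read."""
    return viewing_year - survey_years(label_year, window)[0]


years = survey_years(2017)
print(f"The 2017 five-year estimate pools responses from {years[0]} through {years[-1]}")
print(f"Read in 2019, its oldest responses are {oldest_data_age(2019, 2017)} years old")
```

For a community reshaped by a plant closure or a wave of migration within that window, most of the pooled responses describe conditions that no longer hold.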

WHAT TO DO

Because the system the federal government uses to produce statistical data is large and complex, a number of systemic changes need to be made. The organizational structure, as well as the composition and skill of the government workforce, needs to change. And the ties to community and local demand need to be institutionalized and made stronger.

The details will be discussed later in the book, but simply put, the government structure has to change so that innovation can occur and new data can be produced. In the private sector, new businesses are born and expand, replacing older businesses and providing new services. For example, firms like Waze figured out how to combine and analyze massive amounts of information about individual car trips in order to provide instant information to travelers about the best way to get from A to B. Their business, and others like them, replaced the business of producing physical maps that were difficult to use and often out of date.

Since that solution doesn’t work for governments, we need to identify what parts of the federal statistical system should be retained and what parts should be reallocated. The challenge is identifying an alternative. An important argument in this book is that the Data Revolution makes it possible.

Changing the workforce is critical. For data to have value, the employees in an organization have to have the skills necessary to translate that data into information. The entire structure of the private sector has been transformed in the past twenty years to reflect the need for such skills. In 2018, one of the biggest US companies, Facebook, grounded in data, had a market value per employee of about $20.5 million, with very little physical capital and a workforce skilled in manipulating data. Twenty years ago, one of the biggest US companies, General Motors, grounded in manufacturing, had a market value of $230,000 per employee, with a great deal of physical capital and a skilled manufacturing workforce. Such change is difficult to effect in the public sector. Government salary structures make it difficult to hire and retain enough in-house data analysts, let alone respond quickly to reward employees for acquiring new skills. The government is competing against Facebook and Google not only for salaries but also prestige. The occupational classification of “data scientist” didn’t even exist in the federal government until June of 2019. Open source tools, like Python, which are commonly used in private-sector data analysis, are regarded with suspicion by many government IT organizations. The pressures to meet existing program needs make it difficult for agency staff to try something new, and while failure is celebrated in the private sector, it can be career ending in the public sector. These combined challenges have led to the current situation—agencies cannot get the significant resources necessary to make use of new data, and because they don’t use new data, they don’t get new resources.

New products that respond to community needs must be developed. There is a huge opportunity to do so. The amount of new data available is overwhelming. Real-time data can be collected on cell phones, from social media sites, as a result of retail transactions, and by sensors or simply driving your car. Turning the data into useable information requires a very different set of skills than the ones deployed in the survey world. Data need to be gathered, prepared, transformed, cleaned, and explored, using different tools. The results need to be stored using new database tools, and analyzed using new techniques like machine learning and network analysis. Visualization and computational techniques are fundamentally different with data on a massive scale, rather than simply tens of thousands of survey answers. The privacy issues are different, as are the requirements for data search and discovery and reproducibility.

While today’s data world is, in many ways, a Wild West, data being produced for the public sector need to be designed carefully. The key elements of the federal statistical infrastructure are too important to lose: we need to expand the current statistical system to think about how public data should be produced, and how they must be trustworthy and measured well and consistently over time, and how confidential information should be protected. A world in which all data are produced by a market-driven private sector could be a dangerous one—where there are many unidentified or unreported biases; where privacy is not protected; where national statistics could be altered for the right price; where if a business changes its data collection approach, the unemployment numbers could skyrocket (or drop); where respondents’ information could be sold to the highest bidder.

Action is required because the way governments produce statistics won’t change by itself. In the private sector, market forces create the impetus for change, because organizations that don’t adapt are driven out of business. There’s no similar force driving government change. Over the past thirty years, I’ve worked with people at all levels of government—federal, state, county, and city—in the United States and throughout the world. I’ve developed tremendous respect and admiration for the highly skilled and dedicated workforce that brings us the information driving our economy. These professionals know what needs to be done to make change happen. Hundreds of studies have provided useful recommendations. But when, in the course of thirty years, hundreds of good people try to change the system and the system doesn’t change, it’s clear nothing is going to happen without disruption.

This book proposes a new and, yes, disruptive approach that spells out what to do. It keeps the best elements of the current model—the trust, professionalism, and continuity—while taking away the worst elements—the bureaucracy and rigidity. It proposes a restructuring to create a system that will:

1. Produce public statistics that are useful at all levels—federal, state, and local.

2. Empower a government workforce to innovate in response to new needs.

3. Create a trusted organization that is incentivized to respond to community demand.

This is a golden moment to rethink data use by establishing, codifying, agreeing to, and employing new systems of measurement. Governments at the state and local levels are upping their investment in developing analytics teams to support better management. At the federal level, Congress passed the Foundations for Evidence-Based Policymaking Act of 2018 and the White House published the first Federal Data Strategy. Both efforts require agencies to invest in data, analytical approaches, and more thorough evaluation activities to get rid of programs that don’t work and expand programs that do. Many state and local governments are turning to data- and evidence-driven decision-making and forming new partnerships with universities, with the private sector, and with each other to do so.

The challenge is making sure that the focus is on creating new value rather than creating new processes. In the private sector, thousands of firms get started; only the successful ones survive. Federal, state, and local governments don’t have the pressure of failure, so their response is to establish new positions. The federal government’s response has been to require each of the twenty-four major US government agencies to have a chief data officer (CDO), a chief evaluation officer, and a senior statistical officer; at last count, nearly fifty states, counties, and cities had also hired CDOs. Ensuring that the people in these positions have the support or control that they need to succeed is essential: if an ineffective system is introduced in government, it can be hard to course-correct. Governments at all levels are investing in training their staff to acquire data skills; it will be similarly critical to ensure those investments are substantive rather than perfunctory.

* * *

Julia Lane is a founder of the Coleridge Initiative, Professor at the NYU Wagner Graduate School of Public Service and the NYU Center for Urban Science and Progress, and an NYU Provostial Fellow for Innovation Analytics.