A Plan to Make Police Data Open Source Started on Reddit

The Police Data Accessibility Project aims to request, download, clean, and standardize public records that right now are overly difficult to find.

On May 18, Kristin Tynski dropped a link into the Reddit community r/privacy: “I scraped court records to find dirty cops.” Tynski, who owns a marketing firm, had collected the public police records in Palm Beach County, where she lives, and wrote up her findings on data like traffic citations and race. She wondered if other Redditors might want to do the same in their counties. “If cops can watch us, we should watch them,” she wrote.

Exactly one week later, George Floyd was killed in the custody of the Minneapolis police, his death captured on video by witnesses. As outrage began building in the streets of that city, Tynski once again took to Reddit. “I think I accidentally started a movement,” she wrote on May 26, describing how dozens of people had already joined her effort, which was now being organized in Slack. This time, there were more than just stirrings of interest. Tynski had no way to know, but the timing of her small data-mining experiment coincided with what some experts say is the biggest protest movement in US history. Thousands of Redditors upvoted her post, and then migrated to a new subreddit, r/DataPolice, coordinating an effort to collect public police records en masse. Their mission: “to enable a more transparent and empowered society by making law enforcement public records open source and easily accessible to the public.”

That kind of centralized, nationwide database doesn’t exist in the US right now. For years, researchers, journalists, and activists have turned to official records, from incident reports to misconduct complaints, as one window into police behavior in the United States. “The problem is that all of this data, although it’s public, is buried inside of these really crappy or antiquated public records portals,” says Tynski. Few states make it easy to mass-export law enforcement data, which can make the process tedious. Some states require a formal public records request to access the documents; sometimes people have had to sue for the data. And once the data has been downloaded, it has to be cleaned, combined, and standardized to create a national data set—the kind that might help researchers find patterns of racial bias, excessive use of force, or repeat complaints of misconduct. Tynski’s group, which calls itself the Police Data Accessibility Project, aims to do just that.

The Police Data Accessibility Project isn’t the first to try to amass public police data for analysis, but previous efforts have mostly fallen to universities and journalists. (The government has also made some effort: The FBI launched a new national use-of-force database in 2019, but participation by law enforcement agencies is voluntary.) The Police Data Accessibility Project, on the other hand, is a grassroots effort. More than 2,000 interested internet users have joined an associated Slack group, and over 6,000 have subscribed to r/DataPolice. (Advance Publications, which owns WIRED’s publisher, Condé Nast, is a Reddit shareholder.) Tynski’s project is also, in some ways, larger in scope. Unlike previous projects bound by geography or types of records, the Police Data Accessibility Project aims to aggregate all public police records nationwide into one easily searchable database. “The parameters are, what are local police forces publishing? We want all of that public data,” says Eddie Brown, a US Army veteran who has taken the role of chief operating officer for the group.

Doing so will be difficult, tedious, and technical work. So far, the members of the Police Data Accessibility Project have mostly spent their time building the custom scrapers needed to export files from data portals, rather than gathering the data itself. With so many volunteers chipping in, there have also been a number of debates about the ethics of the project: Should they include the names of police officers in their database? Should they use sources like Blue Leaks, a trove of stolen police documents released in June? The group has decided no on both counts, citing privacy and the importance of data custody, or having a legal right to the data in the set.

The large scope of the project, combined with the distributed volunteer force, has posed challenges. “It’s certainly a concern that we’ll lose momentum just in not being able to get organized well enough, fast enough,” Tynski says. While protests are still taking place across the country regularly, they peaked in early June. Shifting attention might worsen retention; already, Tynski says she’s seen hundreds of “members” drop off the Slack and Reddit groups.

Tynski hopes that people will continue to see the value in data-gathering as a form of civilian action. “This is a technical challenge,” she says. “With a lot of technical Americans who feel they could do something tangible, it’s something actionable.” To that end, the group has made plans to transform itself from a volunteer workforce into a nonprofit. Brown, who is participating in Stanford Graduate School of Business’ Ignite Program, also successfully pitched PDAP as a venture project there to further develop its business plan.

Tynski has also been adamant that the group’s job is to collect the data, not to analyze it, a delicate task that she believes is better left to experts. Many are already on the case: In 2017, Stanford researchers created the Open Policing Project to collect and standardize data on traffic stops across the country. By now, it has added more than 200 million records to its repository and standardized them into a single database, and has found evidence of systemic bias against Black and Hispanic drivers. The Henry A. Wallace Police Crime Database, created by Bowling Green State University in 2017, tracks criminal arrests of police officers in all 50 states. Those researchers found that only a fraction of police officers are ever criminally charged for killing suspects in custody, and an even smaller number are convicted. City-specific projects, like the Invisible Institute’s Citizen Police Data Project in Chicago or the Legal Aid Society’s Cop Accountability Project in New York, have also made startling discoveries from public data, such as the high percentage of officers who had more than 10 complaints lodged against them, or specific officers who had been sued more than a dozen times for inappropriate use of force without any discipline from the department.

Some projects emerged to fill gaps in official public record systems: The Washington Post has been trying to track every fatal shooting by on-duty police officers in the US since 2015. The FBI also collects this data, but because all contributions from law enforcement agencies are voluntary, it’s been criticized as incomplete.

Police data can also tell just one side of the story. The records from police departments can leave out much of the behavior that, when captured on camera, has led to public outrage, disgust, and protest. The rise of body cameras has shown that in some police departments, for example, officers drastically underreport their use of force. For that reason, some projects—like Raheem, in Oakland—have endeavored to collect data on police interactions from citizens, rather than relying solely on the police’s interpretation from public documents.

It’s one thing for researchers to collect the data and draw inferences, but data alone does not lead to better policing. “The number of people being killed by the police year over year has not gone down,” says Samuel Sinyangwe, a data scientist with Campaign Zero, a police reform group. (According to the Post tracker, police have fatally shot around 1,000 people in the US every year since 2015.) “So it becomes important, beyond the rhetoric and the policy proposals, to look at the outcomes and see if institutions are doing what they say they’re doing.”

A year ago, Sinyangwe founded the Police Scorecard to evaluate police departments using public data in California, which releases more detailed records than most states. Officers must report demographic information, like race and gender, in every interaction, which is supposed to make it easier to track bias. California police departments are also required to report officers’ use of force, including when an officer perceives a suspect to be in possession of a weapon. “Some departments have a huge proportion of cases where [police officers have] killed people, and they’ve said they thought the person had a gun but they didn’t have a gun,” says Sinyangwe. All of this data can offer clues into whether departments, or even specific officers, have problems.

Data, ultimately, is a tool, and like any tool it can be mishandled—even with the best intentions. Another Campaign Zero project, #8CantWait, offers a recent cautionary tale. The campaign, launched in the aftermath of George Floyd’s death, promoted a platform of eight policies for cities to adopt, like banning chokeholds. “Data proves that together these eight policies can decrease police violence by 72 percent,” the group stated on its website and social media—a claim that was taken up by the project’s many celebrity supporters. Digging into the data, however, some critics found that number misleading and based on weak data science; others noted that killings had continued in cities with similar policies in place. “The use of statistics is largely a matter of interpretation,” Cherrell Brown and Philip V. McHarris, two activists, wrote in a post criticizing the campaign and requesting those statistics be removed. “When people invoke data and statistics it can serve as a veneer of empirical proof that renders something difficult to critique. Police also use statistics and interpret them in a way to justify their actions.” #8CantWait has since updated its platform claims. (Sinyangwe himself published a statement admitting the campaign rollout and messaging was “flawed.” “Forty years of research shows that places with more restrictive use-of-force standards are less likely to kill people, but it's extremely difficult to prove causation,” he told WIRED.)

Still, data is an important piece of understanding what law enforcement looks like in the US now, and what it could look like in the future. And making that information more accessible, and the stories people tell about policing more transparent, is a first step.

Correction on 7/8/2020: An earlier version of this article misstated the name of the subreddit associated with the Police Data Accessibility Project. It is r/DataPolice, not r/PoliceData.

