Published Dec 13, 2019
How much police misconduct is there?
A data visualization project by A Wilde
Police Misconduct?
“We demand an end to police brutality NOW!” reads a printed placard from the 1963 March on Washington that is now archived in the Smithsonian. Police misconduct, or brutality, is not a new phenomenon. Throughout history, the police have reinforced the brutal, racist, fundamentally unjust regimes of Jim Crow, of slum segregation, and of worker suppression. As actors who enforce what crime is and who commits it, they, themselves, are rarely held accountable to the communities in which they act. And unfortunately they mostly act in poor communities of color, places that have the least amount of political power and access to influence the “dialogue.”
In the 1980s, the funding of the War on Drugs and the ensuing policies of mass incarceration greatly expanded the presence of police in our society. Between 1985 and 2005, spending on policing quadrupled, mostly under the auspices of the War on Drugs (Gascón and Foglesong 2010). In this same period the rate of incarceration increased 300% to a higher rate than anywhere else in the world.
This expansion of policing was applied extremely unequally, focusing most intensely on poor communities of color. Between 1980 and 2000, drug arrests increased only 30% for white people, while for black people they increased 350% (Beckett et al. 2006). Many of these arrests resulted in prison sentences for those involved, leading to the well-documented racial and class disparities within the prison population today, where poor, black people are disproportionately behind bars.
Police abuse of power is a well-known fact of life within the communities that are aggressively policed, as illustrated by the testimony in a 2018 report of the New York Advisory Committee to the US Commission on Civil Rights on “The Civil Rights Implications of ‘Broken Windows’ Policing in NYC and General NYPD Accountability to the Public”. The Movement for Black Lives has catapulted the worst instances of this brutality into popular conversation, but the bigger picture of consistent low-grade abuse of power has not yet been recognized as a systemic national problem. Advocates are slowly building quantitative evidence of systemic problems by demanding more visibility into the lawsuits against police and the internal discipline records of officers. In Chicago, the Invisible Institute, in partnership with the University of Chicago Law School’s Legal Aid Clinic, has produced the Citizens Police Data Project, a comprehensive, searchable database of the complaints made against police officers, and the Chicago Reporter has made a similarly comprehensive database of misconduct lawsuits involving police. Both organizations used these databases to issue a series of stories and reports about the systemic implications of their aggregated data.
The CAPstat data
In June of 2019, the Legal Aid Society of New York published a website, CAPstat.nyc, presenting a database that was developed through many months of research into police misconduct lawsuits. The website provides information about the lawsuits in one view, the police commands involved in these lawsuits in another, and the individual cops involved in the lawsuits in a third view. However, the data was not presented in any aggregate form. This project was intended to aggregate the data available.
I was interested to know how precincts compared to each other in terms of settlements for police misconduct. Where were the precincts with the most lawsuits? What was the rate of a settlement outcome for these lawsuits? More generally, what exactly was the set of lawsuits contained on the site? Could you extrapolate larger trends by aggregating them?
In the process of researching the data from Legal Aid, I learned that the city government had also started publishing allegations of misconduct in June of 2019, as Excel files via the New York City Law Department. These files did not include information about the involved commands, and given that I knew I wanted to map this data, I decided to pursue the Legal Aid data and leverage the research labor that they had put in to identify the involved commands. I left the city-published misconduct data to a future project.
Unfortunately, the CAPstat.nyc website did not have an API by which to easily access the underlying data on the site, so I had to scrape the data. To do this I used the Python library Scrapy. Again, because I was focused on mapping the data, I initially wrote a web crawler that targeted the “command” views of the data. This web crawler went command by command, followed the links to each of the lawsuits associated with a command, and pulled all of the lawsuit data on that page, recording the associated precincts as well.
In my first attempts at scraping the site, I found that the CAPstat website was fairly fragile and not able to respond quickly to a large volume of requests, so I had to throttle down the rate of requests per minute to only 2 or 3. This rate of requests meant that it would be extremely difficult to try to scrape the records of the 50,000 individual officers that they had, which drove me to focus just on the lawsuits in this project.
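To give a sense of what this throttling looks like in practice, here is a minimal sketch of per-spider Scrapy settings. The setting names are standard Scrapy settings, but the values are illustrative assumptions rather than the configuration actually used, and the time estimate assumes one request per officer page.

```python
# Illustrative values only: a sketch of throttling a crawler to ~3 requests/minute.
REQUESTS_PER_MINUTE = 3          # assumed upper end of the "2 or 3" described above
OFFICER_PAGES = 50_000           # officer records mentioned in the text

# A dict like this can be assigned to a spider's `custom_settings` attribute.
custom_settings = {
    "DOWNLOAD_DELAY": 60 / REQUESTS_PER_MINUTE,  # seconds to wait between requests
    "RANDOMIZE_DOWNLOAD_DELAY": False,           # keep the delay fixed, not jittered
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,         # never hit the site in parallel
}

# Rough crawl time for the officer pages, assuming one request per officer.
days_for_officers = OFFICER_PAGES / REQUESTS_PER_MINUTE / 60 / 24
print(f"Estimated crawl time: {days_for_officers:.1f} days")
```

At that pace, the officer pages alone would take well over a week of uninterrupted crawling, which is why the lawsuits were the more practical target.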
Comparing the number of lawsuits that the command web crawler yielded (1,168) to the number listed on the lawsuit search page, I realized that over half of the lawsuits in the dataset did not have associated commands. To get the remaining 1,404 lawsuits, I wrote a separate web crawler that went through the lawsuit search interface to extract the data from each individual lawsuit. I used the ids of the lawsuits from the URLs (e.g. SDNY15cv491) to de-duplicate them and join them with the lawsuit data that was pulled via the command web crawl. I stored these all in a MongoDB instance.
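The de-duplication and join can be sketched roughly as follows. This is not the original code: the URL layout, the record fields (`url`, `command`), the command names, and the second lawsuit id are all made up for illustration. The only detail taken from the project is that the lawsuit id is the part of the URL that identifies the case.

```python
def lawsuit_id_from_url(url):
    """Take the last path segment of a lawsuit URL as its id (assumed layout)."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def merge_crawls(command_records, search_records):
    """Combine both crawls into one dict keyed by lawsuit id."""
    merged = {}
    # Every lawsuit from the search crawl gets one entry, initially with no command.
    for record in search_records:
        lawsuit_id = lawsuit_id_from_url(record["url"])
        merged[lawsuit_id] = {**record, "id": lawsuit_id, "commands": []}
    # The command crawl can list the same lawsuit several times, once per command.
    for record in command_records:
        lawsuit_id = lawsuit_id_from_url(record["url"])
        entry = merged.setdefault(
            lawsuit_id, {**record, "id": lawsuit_id, "commands": []}
        )
        command = record.get("command")
        if command and command not in entry["commands"]:
            entry["commands"].append(command)
    return merged


# Hypothetical records from the two crawlers.
command_records = [
    {"url": "https://example.org/lawsuit/SDNY15cv491", "command": "75th Precinct"},
    {"url": "https://example.org/lawsuit/SDNY15cv491", "command": "PSA 2"},
]
search_records = [
    {"url": "https://example.org/lawsuit/SDNY15cv491/"},
    {"url": "https://example.org/lawsuit/EDNY16cv100"},
]
merged = merge_crawls(command_records, search_records)
```

Keying everything by the lawsuit id means a lawsuit reached through several commands, or through both crawlers, still ends up as a single record, with lawsuits that have no known command keeping an empty list.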
Preparing the data
For upload into Tableau, I cleaned the settlement outcome field. On the site, this field held a dollar amount if the lawsuit was settled and a text category if it was not; I normalized it so that settled lawsuits read “Settled”, with the amount in a separate column. I also had to turn the settlement string into a number. For lawsuits that were not associated with a command, I had to extract the year that the lawsuit was filed from the id (e.g. SDNY15xxxx was filed in 2015).
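A minimal sketch of these two cleaning steps is below. It assumes that settled outcomes appear as strings like "$25,000", that the id starts with a court code followed by a two-digit year, and that every filing is from 2000 or later; the function names and sample values are hypothetical.

```python
import re

# Assumed id shape: letters for the court, then a two-digit year, then the rest.
ID_YEAR = re.compile(r"^[A-Za-z]+(\d{2})")


def filing_year(lawsuit_id):
    """Return the four-digit filing year encoded in an id like 'SDNY15cv491'."""
    match = ID_YEAR.match(lawsuit_id)
    if match is None:
        return None  # id does not follow the assumed pattern
    return 2000 + int(match.group(1))  # assumes no filings before 2000


def split_outcome(raw):
    """Split a raw outcome string into (category, settlement amount or None)."""
    text = raw.strip()
    if text.startswith("$"):
        # A dollar figure means the lawsuit settled; strip '$' and thousands commas.
        return "Settled", float(text[1:].replace(",", ""))
    return text, None


print(filing_year("SDNY15cv491"))
print(split_outcome("$25,000"))
print(split_outcome("Pending"))
```

Separating the category from the amount keeps the outcome column purely categorical, which makes it straightforward to color and filter by outcome in Tableau while still summing the amounts.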
There was also differentiation on the CAPstat site between federal cases and “Other” cases, so I was originally doing some cleaning to maintain this differentiation, but the “Other” category ended up being such a small segment of the overall lawsuits that I did not end up using this information in the final visualizations.
Considering the data
My first exercise was to look at the settlement data and see how much of it was associated with a numeric precinct and thus could be mapped. I could not find shapefiles for other configurations of police commands like the Transit Districts or the Housing PSAs, so the settlements had to be associated with numeric precincts. It turned out that relatively few lawsuits were available in the dataset until 2015. I still mapped out the data year-by-year to see which parts of the city had consistent settlements in the dataset. What this revealed was a consistent pattern of more settlements in predominantly black parts of the city like central Brooklyn, Eastern Queens, and the Bronx.
It was only from here that I started looking at the dataset in more general terms. I looked at the lawsuits over time and the distribution of the lawsuits by outcome over time. I also looked at the overall distribution of settlements. Lastly, I looked at the breakdown of lawsuits by command, which was dominated by lawsuits with an unknown command.
Designing the visualizations
For my overall color palette, I chose to use dark green to signify settled lawsuits, a yellow-green to signify lawsuits that did not settle but also did not have an outcome in favor of the defendant, and a yellow to signify lawsuits that were decided in favor of the defendant. Different shades of grey were used for pending and unknown outcomes. Eventually I also added blue as a signifier for information about the police command that was involved in a lawsuit. I did not have the will to explore different fonts.
The visualizations I created for the pin-up were a series of dashboards that tried to move the user through the dataset. I started with a big number at the top, the total number of lawsuits, just to be very clear about what it was that was being investigated. I added an embedded view of the CAPstat.nyc webpage, but unfortunately it didn’t port to Tableau Public, and in class no one seemed to realize that it was something they could interact with, so I took it out for later versions. I initially also had a graph of the lawsuits by year with a breakdown of the cases by type. People did not seem to understand or, at least, take the time to read my embedded annotations about the two different datasets. From that I understood that I had to take more page space to emphasize them. This led me to move this graph to a second page and to break it down more clearly by first eliminating all distracting colors and categories.
My second dashboard was originally a series of six tree-maps breaking down the proportion of the lawsuits by outcome, first in sum and then year-by-year. I had hoped to use this as a way for people to compare how the proportion of pending lawsuits expanded the more recently they were filed. In reality I think it was a very busy page, and was mostly visually overwhelming. I ended up splitting this out into two different visualizations. The proportion of the whole I added to the first page as part of a dot-matrix of the lawsuits, as this shows visually the 2,752 lawsuits that make up the data visualization, as well as the clear proportion of them that are pending or settled compared to other outcomes. To get the year-by-year effect, I added outcome to a by-year bar graph on the second page of the visualization. This uses adjacent multiples to show the same data with different breakdowns applied. I used this in my previous project on my phone and I think it is an effective way to orient the viewer to different things that are important in the same graph. Perhaps I am treating the visualization too serially (like a PowerPoint presentation), and this does not hold up for visualizations that stand in the wild, where the viewer can go through them in whatever order they please. I would be interested to hear feedback on this. I am certainly more interested in data visualization that follows a paged, story-book approach than in visualizations that simply give you a search bar and tell you to find the story on your own. I think most people will never take the time to get anything meaningful out of it in the latter case.
My third dashboard was focused on the police commands that are associated with the lawsuits. For the first pin-up, I had a dot-matrix on the left with different colors for each of the commands and, on the right, a bar chart that aggregated the number of lawsuits by command. People were confused by the same colors in the two plots representing different things. They did not get at all that I was trying to show that the majority of the lawsuits had an unknown involved command. So I broke this dashboard out into two different boards. The first illustrates the “hole” in the data literally, using a disaggregated bubble-plot that is basically a dot-matrix in a circle. It shows very clearly the number of lawsuits with an unknown command compared to those with commands associated. I then try to break this proportion down further in a treemap below by showing how many of those that do have a known associated command are still pending, so that it is only 17% of the total that you can actually map. The second dashboard about police commands was the bar chart with the number of cases per command. Between the pin-up and the presentation, I simplified the outcome categories to show only pending, settled, and all other case outcomes. The intention is that if there is a particular precinct that a person is interested in, they can scroll to find it. I also wanted to illustrate that not all of the police commands that lawsuits were associated with were the numeric, easily mappable precincts.
In the pin-up my final dashboard was a series of three maps showing the total amount of settlements per precinct, the total number of lawsuits per precinct, and the average lawsuit settlement per precinct. For my final product, at your suggestion, I also added a fourth map to show the median settlement per precinct.
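The per-precinct measures behind these maps were computed in Tableau; the sketch below just shows the same aggregation in plain Python on made-up numbers, to illustrate why the median is a useful companion to the average. The function name and the sample settlements are hypothetical, and it only covers settled lawsuits.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_precinct(settlements):
    """Aggregate (precinct, settlement amount) pairs into per-precinct measures."""
    by_precinct = defaultdict(list)
    for precinct, amount in settlements:
        by_precinct[precinct].append(amount)
    return {
        precinct: {
            "settled_count": len(amounts),  # number of settled lawsuits
            "total": sum(amounts),          # total settlement dollars
            "mean": mean(amounts),          # pulled upward by large payouts
            "median": median(amounts),      # the typical settlement
        }
        for precinct, amounts in by_precinct.items()
    }


# Made-up settlements: one very large payout in precinct 75.
sample = [(75, 10_000), (75, 20_000), (75, 300_000), (40, 15_000), (40, 25_000)]
summary = summarize_by_precinct(sample)
for precinct, stats in summary.items():
    print(precinct, stats)
```

In this toy data, a single large settlement makes the average for one precinct several times its median, while the other precinct's average and median agree. Mapping both measures makes that kind of skew visible.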
For the final presentation I also added a summary dashboard that is intended to draw out some of the major questions that arise out of this dataset. Namely, given this limited dataset, what can we learn? What would we need to measure police misconduct? Do lawsuit settlements offer a good measure? And why do certain precincts, like Precinct 75, have much higher rates of lawsuits than others?
For the future
As mentioned previously, there is the misconduct allegation data from the city itself to analyze, although it contains nothing about the command involved, only the officers involved and the settlement payout, if any. I wonder if it is possible to identify the precinct in which the incident behind each lawsuit occurred via other data that is available through NYC Open Data or the NYPD.
Additionally, the CAPstat dataset contains, in an incomplete fashion, information about the number of police officers involved in the lawsuits, the allegations against the officers, the false criminal charges that the victim suffered, and the kinds of use of force alleged to have been involved in the incident. I would like to look in more detail at what, if any, patterns emerge here, even given the limited nature of the data.
Lastly, given the extremely high bar for filing a lawsuit, I would be very interested to see, as has been done in Chicago, the data about complaints made to the Civilian Complaint Review Board about officers’ conduct. Given the lower bar for filing a complaint compared to a lawsuit, this may be a better reflection of the breadth of police misconduct.

