Month: June 2019

What tools do we have to combat disinformation?

June 24, 2019 (updated June 11, 2020)

What types of defenses against disinformation are possible? And which of these would we actually want to use in a democracy, where approaches like censorship can impinge on important freedoms? To try to answer these questions, I looked at what three counter-disinformation organizations are actually doing today, and categorized their tactics.

The EU East StratCom Task Force is a contemporary government counter-propaganda agency. Facebook has made numerous changes to its operations to try to combat disinformation, and is a good example of what platforms can do. The Chinese information regime is a marvel of networked information control, and provokes questions about what a democracy should and should not do.

The result is the paper Institutional Counter-disinformation Strategies in a Networked Democracy (pdf). Here’s a video of me presenting this work at the recent Misinfoworkshop.


An Introduction to Algorithmic Bias and Quantitative Fairness

June 15, 2019 (updated September 1, 2019)

There are many kinds of questions about discrimination, fairness, or bias where data is relevant. Who gets stopped on the road by the police? Who gets admitted to college? Who gets approved for a loan, and who doesn’t? The data-driven analysis of fairness has become even more important as we start to deploy algorithmic decision making across society.

I attempted to synthesize an introductory framework for thinking about what fairness means in a quantitative sense, and how these mathematical definitions connect to legal and moral principles and our real world institutions of criminal justice, employment, lending, and so on. I ended up with two talks.

This short talk (20 minutes), part of a panel at the Investigative Reporters & Editors conference, has no math. (Slides)

This longer talk (50 minutes), presented at Code for America SF, gets into a lot more depth, including the mathematical definitions of different types of fairness, and the whole tricky issue of whether or not algorithms should be “blinded” to attributes like race and gender. It also includes several case studies of real algorithmic systems, and discusses how we might design such systems to reduce bias. (Slides)
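As a rough illustration of what a quantitative fairness definition looks like (this sketch is not taken from the talk or slides), the Python below compares two commonly used group measures: the rate of positive predictions per group, which underlies demographic parity, and the false positive rate per group. The function name `group_rates`, the record layout, and the data are all made up for the example.

```python
from collections import defaultdict


def group_rates(records):
    """Compute per-group positive rate and false positive rate.

    records: iterable of (group, predicted_positive, actual_positive) tuples.
    This layout is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "neg": 0, "false_pos": 0})
    for group, predicted, actual in records:
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += int(predicted)
        if not actual:
            # Only actual negatives can produce false positives.
            c["neg"] += 1
            c["false_pos"] += int(predicted)

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # Share of the group that the model flags as positive.
            "positive_rate": c["pred_pos"] / c["n"],
            # Share of the group's actual negatives that are flagged anyway.
            "false_positive_rate": c["false_pos"] / c["neg"] if c["neg"] else None,
        }
    return rates


# Hypothetical toy data: (group, predicted_positive, actual_positive)
records = [
    ("A", True, True), ("A", True, False), ("A", False, False), ("A", False, False),
    ("B", True, True), ("B", False, True), ("B", False, False), ("B", False, False),
]
rates = group_rates(records)
for group in sorted(rates):
    print(group, rates[group])
```

In this toy data the two groups differ on both measures. Real systems often cannot equalize every such measure at once, which is part of why the choice of definition matters.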

My favorite resources on these topics:

The Workbench workflow analyzing Massachusetts traffic ticket data.
Sandra Mayson, Bias In, Bias Out. One of my favorite overall discussions of algorithmic bias.
Megan Stevenson, Assessing Risk Assessment in Action. What happens with criminal justice risk assessment in the real world?
Corbett-Davies and Goel, The Measure and Mismeasure of Fairness. A well-done, more mathematical discussion of fairness measures.
Open Policing Project findings. A very clearly thought out analysis of US national traffic stop data.
Workbench Open Policing Project tutorial. An interactive introduction to working with this data.
Arvind Narayanan, 21 Definitions of Fairness and Their Politics. More on the connection between quantitative and political concepts of fairness.

Extracting campaign finance data from gnarly PDFs using deep learning

June 13, 2019 (updated October 9, 2020)

Update, Oct 2020: we’ve done a lot more since this post! If you want to try working on this problem, Weights & Biases is very kindly hosting a public benchmark.

I’ve just completed an experiment to extract information from TV station political advertising disclosure forms using deep learning. In the process I’ve produced a challenging journalism-relevant dataset for NLP/AI researchers. The original data comes from ProPublica’s Free The Files project.

The resulting model achieves 90% accuracy extracting total spending from the PDFs in the (held out) test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. I expect it could be made much more accurate through some feature engineering (see below).

You can find the code and documentation here. Full thanks to my collaborator Nicholas Bardy of Weights & Biases.

Why?

TV stations are required to disclose their sale of political advertising, but there is no requirement that this disclosure be machine readable. Every election, tens of thousands of PDFs are posted to the FCC Public File, available at https://publicfiles.fcc.gov/. All of these contain essentially the same information, but in hundreds of different formats.


Jonathan Stray

Information, culture, and belief
