What tools do we have to combat disinformation?

What types of defenses against disinformation are possible? And which of these would we actually want to use in a democracy, where approaches like censorship can impinge on important freedoms? To try to answer these questions, I looked at what three counter-disinformation organizations are actually doing today, and categorized their tactics.

The EU East StratCom Task Force is a contemporary government counter-propaganda agency. Facebook has made numerous changes to its operations to try to combat disinformation, and is a good example of what platforms can do. The Chinese information regime is a marvel of networked information control, and provokes questions about what a democracy should and should not do.

The result is the paper Institutional Counter-disinformation Strategies in a Networked Democracy (pdf). Here’s a video of me presenting this work at the recent Misinfoworkshop.

I should say from the start that I make no attempt to define “disinformation.” Adjudicating which speech is harmful is a profound problem with millennia of history, and what sort of narratives are “false” is one of the major political battles of our time. While I do have my own opinions, that’s not what this work is about. Instead, my goal here is to describe methods: what kinds of responses are there, and how do they align with the values of an open society?

The core of my analysis is this chart, which organizes the tactics of the above organizations into six groups.

Institutional counter-disinformation strategies

I’ll describe each of these strategies briefly; for more depth (and references) see the talk or the paper.

Refutation, rebuttal, or debunking might be the most obvious counter-strategy. It’s also well within the bounds of democracy, as it’s simply “more speech.” It’s most effective if it’s done consistently over the long term, and in any case it’s practiced by most counter-disinformation organizations.

Exposing inauthenticity combats one of the oldest and best-recognized forms of disinformation: pretending to be someone you are not. Bot networks, “astroturfing,” and undisclosed agendas or conflicts of interest could all be considered inauthentic communication. The obvious response is to discredit the source by exposing it.

Alternative narratives. A long line of experimentation suggests that merely saying that something is false is less effective than providing an alternative narrative, and the non-platforms in this analysis combat disinformation in part by promoting their own narrative.

Algorithmic filter manipulation. The rise of platforms creates a truly new way of countering disinformation: demote it by decreasing its ranking in search results and algorithmically generated feeds. Conversely, it is possible to promote alternative narratives by increasing their ranking.

Speech laws. The U.S. Supreme Court has held that the First Amendment generally protects lying; the major exceptions concern defamation and fraud. In Europe, the recent report of the High Level Expert Group on Fake News and Online Disinformation recommended against attempting to regulate disinformation. But in most democracies platforms are still legally liable for hosting certain types of content. For example, Germany requires platforms to remove Nazi-related material within 24 hours or face fines.

Censorship. One way of combatting disinformation is simply to remove it from public view. In the 20th century, censorship was sometimes possible through control over broadcast media. This is difficult with a free press, and it is even harder to eliminate information from a networked ecosystem. Yet platforms do have the power to remove content entirely and often do, both for their own reasons and as required by law. (This differs from speech laws because the latter may impose fines or require disclaimers or otherwise restrict speech without removing it.)

Despite their differences, there are many common patterns among the East StratCom Task Force, Facebook, and the Chinese government. Each of the methods they use has certain advantages and disadvantages in terms of efficacy and legitimacy, that is, alignment with the values of an open society.

A cross-sector response — both distributed and coordinated — is perhaps the biggest challenge. In societies with a free press there is no one with the power to direct all media outlets and platforms to refute or ignore or publish particular items, and it seems unlikely that people across different sectors of society would even agree on what is disinformation and what is not. In the U.S. the State Department, the Defense Department, academics, journalists, technologists and others have all launched their own more-or-less independent counter-disinformation efforts. In many countries, a coordinated response will require coming to terms with a deeply divided population.

But no matter what we collectively choose to do, citizens will require strong assurances that the strategies employed to counter disinformation are both effective and aligned with democratic values.

An Introduction to Algorithmic Bias and Quantitative Fairness

There are many kinds of questions about discrimination, fairness, or bias where data is relevant. Who gets stopped on the road by the police? Who gets admitted to college? Who gets approved for a loan, and who doesn’t? The data-driven analysis of fairness has become even more important as we start to deploy algorithmic decision making across society.

I attempted to synthesize an introductory framework for thinking about what fairness means in a quantitative sense, and how these mathematical definitions connect to legal and moral principles and our real world institutions of criminal justice, employment, lending, and so on. I ended up with two talks.

This short talk (20 minutes), part of a panel at the Investigative Reporters & Editors conference, has no math. (Slides)

This longer talk (50 minutes), presented at Code for America SF, gets into a lot more depth, including the mathematical definitions of different types of fairness, and the whole tricky issue of whether or not algorithms should be “blinded” to attributes like race and gender. It also includes several case studies of real algorithmic systems, and discusses how we might design such systems to reduce bias. (Slides)

My favorite resources on these topics:

Extracting campaign finance data from gnarly PDFs using deep learning

I’ve just completed an experiment to extract information from TV station political advertising disclosure forms using deep learning. In the process I’ve produced a challenging journalism-relevant dataset for NLP/AI researchers. Original data from ProPublica’s Free The Files project.

The resulting model achieves 90% accuracy extracting total spending from the PDFs in the (held out) test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. I expect it could be made much more accurate through some feature engineering (see below).

You can find the code and documentation here. Full thanks to my collaborator Nicholas Bardy of Weights & Biases.


TV stations are required to disclose their sale of political advertising, but there is no requirement that this disclosure be machine-readable. Every election, tens of thousands of PDFs are posted to the FCC Public File, available at https://publicfiles.fcc.gov/. All of these contain essentially the same information, but in hundreds of different formats, like these:

In 2012, ProPublica ran the Free The Files project (you can read how it worked) and hundreds of volunteers hand-entered information for over 17,000 of these forms. That data drove a bunch of campaign finance coverage and is now available from their data store.

Can we replicate this data extraction using modern deep learning techniques? This project aimed to find out, and successfully extracted the easiest of the fields (total amount) at 90% accuracy using a relatively simple network.

How it works

I settled on a relatively simple design, using a fully connected three-layer network trained on 20-token windows of the data. Each token is hashed to an integer mod 500, then converted to a one-hot representation and embedded into 32 dimensions. This embedding is combined with geometry information (bounding box and page number) and also some hand-crafted “hint” features, such as whether the token matches a regular expression for dollar amounts. For details, see the talk.
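To make the input pipeline a little more concrete, here is a minimal Python sketch of how one token might be turned into a feature vector and how 20-token windows could be assembled. It is an illustration under stated assumptions, not the project's actual code: the `crc32` hash, the dollar-amount regular expression, the randomly initialized embedding table (the real one is learned during training), and all function names are stand-ins chosen for this example.

```python
import random
import re
import zlib

VOCAB_BUCKETS = 500   # each token is hashed to an integer mod 500
EMBED_DIM = 32        # each bucket is embedded into 32 dimensions
WINDOW_SIZE = 20      # the network is trained on 20-token windows

# Assumed pattern for the "looks like a dollar amount" hint.
DOLLAR_RE = re.compile(r"^\$?(\d{1,3}(,\d{3})+|\d+)(\.\d{2})?$")

# Stand-in embedding table with random values. Multiplying a one-hot
# vector by this table is the same as looking up one of its rows.
rng = random.Random(0)
EMBEDDING = [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_BUCKETS)]


def token_bucket(token):
    """Hash a token string to a bucket in [0, 500). crc32 is an assumption."""
    return zlib.crc32(token.encode("utf-8")) % VOCAB_BUCKETS


def token_vector(token, page, bbox):
    """Concatenate the embedding, geometry, and hint features for one token."""
    embedded = EMBEDDING[token_bucket(token)]
    geometry = [float(page), *map(float, bbox)]   # page number + bounding box
    hints = [1.0 if DOLLAR_RE.match(token) else 0.0]
    return embedded + geometry + hints


def window_inputs(token_vectors, size=WINDOW_SIZE):
    """Yield one flat input vector per contiguous window of `size` tokens."""
    for start in range(len(token_vectors) - size + 1):
        window = token_vectors[start:start + size]
        yield [value for vec in window for value in vec]
```

With these assumed sizes, each token becomes a 38-value vector (32 embedding values, 5 geometry values, 1 hint), so each window fed to a fully connected network would have 20 × 38 = 760 inputs.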

Deepform network

Although 90% is a good result, it’s probably not high enough for production use. However, I believe this approach has lots of room for improvement. The advantage of this type of system is that it can elegantly integrate multiple manual extraction methods — the “hint” features — each of which can be individually crappy. The network actually learns when to trust each method. In ML speak this is “boosting over weak learners.”
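Here is a rough Python sketch of what a small set of “hint” features could look like. Only the dollar-amount regular expression comes from the description above; the other two hints, the regular expression itself, and every function name are hypothetical examples invented for illustration. Each hint is deliberately unreliable on its own, and the idea is that a network receiving all of them as inputs can learn how much weight each one deserves.

```python
import re

# Assumed pattern; the real project's expression may differ.
DOLLAR_RE = re.compile(r"^\$?(\d{1,3}(,\d{3})+|\d+)(\.\d{2})?$")


def parse_amount(token):
    """Return the numeric value of a dollar-like token, or None."""
    if not DOLLAR_RE.match(token):
        return None
    return float(token.replace("$", "").replace(",", ""))


def looks_like_dollars(tokens, i):
    """Hint from the text: the token matches a dollar-amount pattern."""
    return 1.0 if DOLLAR_RE.match(tokens[i]) else 0.0


def follows_total_keyword(tokens, i):
    """Hypothetical hint: one of the previous three tokens contains 'total'."""
    previous = tokens[max(0, i - 3):i]
    return 1.0 if any("total" in t.lower() for t in previous) else 0.0


def is_largest_amount(tokens, i):
    """Hypothetical hint: the token is the largest amount in the document."""
    amounts = [a for a in map(parse_amount, tokens) if a is not None]
    mine = parse_amount(tokens[i])
    return 1.0 if mine is not None and mine == max(amounts) else 0.0


HINTS = [looks_like_dollars, follows_total_keyword, is_largest_amount]


def hint_features(tokens):
    """Return one list of hint values per token, in HINTS order."""
    return [[hint(tokens, i) for hint in HINTS] for i in range(len(tokens))]


tokens = ["Spots", "12", "Gross", "Total", "$4,500.00", "Net", "$3,825.00"]
features = hint_features(tokens)
```

In this toy document the spot count "12" fires the dollar hint and "Net" fires the keyword hint, so no single hint is trustworthy; only "$4,500.00" fires all three.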

A research data set

Are you an AI researcher looking for challenging research problems that are relevant to investigative journalism? Have I got a data set for you!

Deepform data set

There is a great deal left to do on this extraction project. For example, we still need to try extracting the other fields such as advertiser and TV station call sign. This will probably be harder than totals as it’s harder to identify tokens which “look like” the correct answer.

There is also more data preparation work to do. We discovered that about 30% of the PDF documents still need OCR, which should increase our training data set from 9k to ~17k documents.

But even in its current form, this is a difficult data set that is very relevant to journalism, and improvements in technique will be immediately useful to campaign finance reporting.

The general problem is known as “knowledge base construction” in the research community, and the current state of the art is achieved by multimodal systems such as Fonduer.

I would love to hear from you! Contact me on Twitter or here.