Folks who want to use AI/ML for good generally think of things like building predictive models, but smart methods for extracting data from forms would do more for journalism, climate science, medicine, democracy etc. than almost any other application. Since March, I’ve been working with a small team on applying deep learning to a gnarly campaign finance dataset that journalists have been struggling with for years. Today we are announcing Deepform, a baseline ML model, training data set, and public benchmark where anyone can submit their solution. I think this type of technology is important, not just for campaign finance reporters but for everyone.
This post has four parts:
Why form extraction is an important problem
Why it’s hard
The start of the art of deep learning for form extraction
The Deepform model and dataset, our contribution to this problem
Form extraction is incredibly useful
Form extraction could help climate science because a lot of old weather data is locked in forms. These forms come in a crazy variety of different formats from all over the world and across centuries.
What types of defenses against disinformation are possible? And which of these would we actually want to use in a democracy, where approaches like censorship can impinge on important freedoms? To try to answer these questions, I looked at what three counter-disinformation organizations are actually doing today, and categorized their tactics.
The EU East StratCom Task Force is a contemporary government counter-propaganda agency. Facebook has made numerous changes to its operations to try to combat disinformation, and is a good example of what platforms can do. The Chinese information regime is a marvel of networked information control, and provokes questions about what a democracy should and should not do.
There are many kinds of questions about discrimination fairness or bias where data is relevant. Who gets stopped on the road by the police? Who gets admitted to college? Who gets approved for a loan, and who doesn’t? The data-driven analysis of fairness has become even more important as we start to deploy algorithmic decision making across society.
I attempted to synthesize an introductory framework for thinking about what fairness means in a quantitative sense, and how these mathematical definitions connect to legal and moral principles and our real world institutions of criminal justice, employment, lending, and so on. I ended up with two talks.
This short talk (20 minutes), part of a panel at the Investigative Reporters & Editors conference, has no math. (Slides)
This longer talk (50 minutes), presented at Code for America SF, gets into a lot more depth, including the mathematical definitions of different types of fairness, and the whole tricky issue of whether or not algorithms should be “blinded” to attributes like race and gender. It also includes several case studies of real algorithmic systems, and discusses how we might design such systems to reduce bias. (Slides)
My favorite resources on these topics:
The Workbench workflow analyzing Massachusetts traffic ticket data.
Sandra Mayson, Bias In, Bias Out. One of my favorite overall discussions of algorithmic bias.
Update, Oct 2020: we’ve done a lot more since this post! If you want to try working on this problem, Weights and Biases is very kindly hosting a public benchmark.
I’ve just completed an experiment to extract information from TV station political advertising disclosure forms using deep learning. In the process I’ve produced a challenging journalism-relevant dataset for NLP/AI researchers. Original data from ProPublica’s Free The Files project.
The resulting model achieves 90% accuracy extracting total spending from the PDFs in the (held out) test set, which shows that deep learning can generalize surprisingly well to previously unseen form types. I expect it could be made much more accurate through some feature engineering (see below.)
You can find the code and documentation here. Full thanks to my collaborator Nicholas Bardy of Weights & Biases.
TV stations are required to disclose their sale of political advertising, but there is no requirement that this disclosure is machine readable. Every election, tens of thousands of PDFs are posted to the FCC Public File, available at https://publicfiles.fcc.gov/. All of these contain essentially the same information, but in in hundreds of different formats, like these:
There is now, at long last, wide concern over the negative effects of technology, along with calls to teach ethics to engineers. But critique is not enough. What tools are available to the working engineer to identify and mitigate the potential harms of their work?
I’ve been teaching the effects of technology on society for some time, and we cover a lot of it in my computational journalism course. This is an outline for a broader hands-on course, which I’m calling the Ethical Engineering Lab.
This eight-week course is a hands-on introduction to the practice of what you might call harm-aware software engineering. I’ve structured it around the Institute for the Future’s Ethical OS, a framework I’ve found useful for categorizing the places where technology intersects with personal and social harm. Each class is three hours long, split between lecture and lab time. Students must complete a project investigating actual or potential harms from technology, and their mitigations.
Each lecture is structured around a set of issues, cases where technology is or could be involved in harm, and tools, methods for mitigating these harms. The goal is to train students in the current state-of-the-art of these problems, which often requires a deep dive into both the social and technical perspectives. We will study both differential privacy algorithms and HIPAA health data privacy. In many cases there is disagreement over the potential for certain harms and their seriousness, so we will explore the tradeoffs of possible design choices.
Some of you may have heard about by new data journalism project — The Computational Journalism Workbench. This is an integrated platform for data journalism, combining scraping, analysis, and visualization in one easy tool. It works by assembling simple modules into a “workflow,” a repeatable, sharable, automatically updating pipeline that produces a publishable chart or a live API endpoint.
I demonstrated a prototype at the NICAR conference. UPDATE: Workbench is now in production at workbenchdata.com and has now been used in teaching in dozens of schools.
I’ll be working on Workbench for at least the next few years. My previous large data journalism project is the Overview document mining system, which continues active development.
In honor of MisinfoCon this weekend, it’s time for a brain dump on propaganda — that is, getting large numbers of people to believe something for political gain. Many of my journalist and technologist colleagues have started to think about propaganda in the wake of the US election, and related issues like “fake news” and organized trolling. My goal here is to connect this new wave of enthusiasm to history and research.
This post is about persuasion. I’m not going to spend much time on the ethics of these techniques, and even less on the question of who is actually right on any particular point. That’s for another conversation. Instead, I want to talk about what works. All of these methods are just tools, and some are more just than others. Think of this as Defense Against the Dark Arts.
Let’s start with the nation states. Modern intelligence services have been involved in propaganda for a very long time and they have many names for it: information warfare, political influence operations, disinformation, psyops. Whatever you want to call it, it pays to study the masters.
Many people have realized that natural language processing (NLP) techniques could be extraordinarily helpful to journalists who need to deal with large volumes of documents or other text data. But although there have been many experiments and much speculation, almost no one has built NLP tools that journalists actually use. In part, this is because computer scientists haven’t had a good description of the problems journalists actually face. This talk and paper, presented at the Computation + Journalism Symposium, are one attempt to remedy that. (Talk slides here.)
This all comes out of my experience both building and using Overview, an open source document mining system built specifically for investigative journalists. The paper summarizes every story completed with Overview, and also discusses the five cases I know where journalists used custom NLP code to get the story done.
I feel we’re on the precipice of some delightfully weird and possibly very alarming developments at the intersection of code and money. There is something deep in the rules that is getting rewritten, only we can’t quite see how yet. I’ve had this feeling before, as a self-described Cypherpunk in the 1990s. We knew or hoped that encrypted communication would change global politics, but we didn’t quite know how yet. And then Wikileaks happened. As Bruce Sterling wrote at the time,
At last — at long last — the homemade nitroglycerin in the old cypherpunks blast shack has gone off.
That was exactly how I felt when that first SIGACT dump hit the net, by then a newly hired editor at the Associated Press. Now I’m studying finance, and I can’t shake the feeling that cryptocurrencies — and their abstracted cousins, “smart contracts” and other computational financial instruments — are another explosion of weirdness waiting to happen.
I’m hardly alone in this. Lots of technologists think the “block chain” pioneered by bitcoin is going to be consequential. But I think they think this for the wrong reasons. Bitcoin itself is never going to replace our current system of money transfer and clearing; it’s much slower than existing payment systems, often more expensive, uses far too much energy, and don’t scale well. Rather, bitcoin is just a taste, a hint: it shows that we can mix computers and money in surprising and consequential ways. And there are more ominous portents, such as contracts that are actually code and the very first “distributed autonomous organizations.” But we’ll get to that.
There is a just-so story that explains the existence of money. Before money, the story goes, we all had to barter for the goods we wanted. If I wanted wheat and had chickens, I needed to find someone who wanted chickens and had extra wheat. Money solves this “double coincidence” problem by letting me sell my chickens to buy your wheat. If we didn’t have money we’d invent it immediately.
The problem with this simple story is that it may not match history. There has never been a pure barter economy, according to anthropologists. Pre-money economies were organized in a variety of other ways, including central planning, informal gift economies, and IOUs denominated in cows.
Sir Jon Hicks’ classic A Market Theory of Moneyfills this gap. Hicks was a major figure in 20th Century economics who eventually won a Nobel, and here at last is a straightforward story that explains why we have banks at all. It’s still not clear to me that this account is historically grounded – or that we can understand what a modern bank does, or should do, on the basis of historical parable — but at least this account provides a better history than barter.
With that cautionary note, here’s Hicks’ story of banking. He begins in a world where money is already the usual form of payment, and breaks down a transaction into three pieces: Continue reading The Origin of Banking