Machine Learning Could Jolt Legal Research — We Just Need the Data

This model is hunting for systemic patterns and biases. Paywalls be damned.

Written by Stephen Gossett
Published on May 3, 2023

In federal court, why would one judge waive filing fees 20 percent of the time, while another judge from the very same district waives them some 80 percent of the time?

It’s a curious discrepancy surfaced last year by the research team at SCALES, a project that’s applying machine learning to court-record data so legal researchers can search for systemic patterns, inconsistencies and biases in the judicial system.

The fee-waiver discovery was notable, but researchers hope the platform, which is now entering the beta development stage, will soon let users make queries and unearth patterns that are broader and deeper still.

For instance, “What is the relationship between the length of time a case takes and the district in which it’s raised?” Kristian Hammond, professor of computer science at Northwestern University and SCALES team member, asked. Or, “Is there a change over time of how long it takes different cases to flow through the courts?”
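To make Hammond's first question concrete, here is a minimal sketch of the kind of aggregation it implies: the median number of days a case stays open, grouped by district. The records, the district names and the function name are invented for illustration; nothing here reflects SCALES's actual data or interface.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical records: (district, date filed, date closed). Not real court data.
CASES = [
    ("District A", date(2016, 1, 1), date(2016, 1, 31)),
    ("District A", date(2016, 1, 1), date(2016, 3, 1)),
    ("District A", date(2016, 1, 1), date(2016, 4, 10)),
    ("District B", date(2016, 1, 1), date(2016, 1, 11)),
    ("District B", date(2016, 1, 1), date(2016, 1, 21)),
]


def median_days_by_district(cases):
    """Return {district: median days between filing and closing}."""
    durations = defaultdict(list)
    for district, filed, closed in cases:
        durations[district].append((closed - filed).days)
    return {district: median(days) for district, days in durations.items()}


result = median_days_by_district(CASES)
for district, days in sorted(result.items()):
    print(f"{district}: median {days} days open")
```

The median is used rather than the mean because a handful of very long cases can otherwise dominate a district's figure.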

Researchers have so far built the proof-of-concept application, completed an early round of user testing for interface and functionality feedback, and abstracted the data analytics and data configuration within the system.

“It’ll reason through what questions it can answer and provide those as a way of shaping the input,” Hammond said.

The project — which includes law and computer engineering professors from Northwestern, the University of Texas–Austin, Georgia State University and the University of Richmond — takes on several key challenges in wide-scope legal analytics: building the analytical platform, making it user friendly and assembling the necessary court data.

The Data Access Problem

A long-standing hurdle for projects like SCALES is data access. Federal court records in the U.S. are public, but they aren’t free. They’re paywalled behind a system called Public Access to Court Electronic Records (PACER), which charges 10 cents per page.

“Ten cents per page may not sound like much, but it adds up fast,” Sarath Sanga, a law professor at Northwestern, said last year. “A single case could easily cost $100. A year’s worth of cases would cost tens of millions.”
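Sanga's arithmetic is easy to check. The sketch below uses only the 10-cents-per-page figure quoted above. It ignores any per-document caps or fee exemptions, and the $10 million budget in the last line is a hypothetical number chosen for illustration, not a figure from the researchers.

```python
FEE_CENTS_PER_PAGE = 10  # the per-page PACER fee cited in the article


def cost_dollars(pages):
    """Dollar cost of downloading `pages` pages at the flat per-page fee."""
    return pages * FEE_CENTS_PER_PAGE / 100


def pages_for(dollars):
    """Number of pages a given dollar amount buys at the flat per-page fee."""
    return dollars * 100 // FEE_CENTS_PER_PAGE


print(pages_for(100))           # 1000 pages is all it takes to reach $100
print(cost_dollars(1_000_000))  # 100000.0 dollars for a million pages

# Hypothetical: how many $100 cases a $10 million budget would cover.
cases_covered = 10_000_000 // 100
print(cases_covered)            # 100000
```

In other words, a case file of 1,000 pages hits the $100 mark, and at that rate even a budget of $10 million covers only about 100,000 such cases.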

Advocates for open court records have long decried the financial hurdle.

“The fact that the government doesn’t have an easily accessible, searchable, well-constructed interface where the public can access ... the judicial decisions that constitute case law is just completely ridiculous,” said Michael Livermore, a law professor at University of Virginia and co-editor of Law as Data: Computation, Text, and the Future of Legal Analysis.

The PACER fee, critics argue, puts up a roadblock to public access while generating only a trivial amount of revenue relative to the full annual judicial budget. PACER reportedly brings in some $145 million per year in access fees, roughly 2 percent of the $7 billion or so the judiciary receives in annual discretionary funding. Lawmakers should simply cover that relatively small sum in the budget, critics say.

And we’re inching closer to that sort of solution. In 2020, the House of Representatives passed the Open Courts Act, which, if signed into law, would eventually eliminate PACER fees. Money generated before the elimination would fund a long-overdue system upgrade.

SCALES researchers championed the legislative proposal, but they’re not waiting around to begin data collection. Instead of using PACER, which would cost millions of dollars, the team is working directly with various courts on permission requests while also discussing options to access the data without exposing it directly to the public.

SCALES isn’t the first project to tackle this problem. The Free Law Project and Harvard University’s Caselaw Access Project have a similar focus and allow for bulk data downloads. Hammond said the hope for SCALES is that it’ll go one step further in terms of usability.

“The data is valuable, but unless you know how to access [it] and apply different kinds of analysis, it’s still just data, and you need someone else to come in,” he said. “We’re trying to get rid of the someone else.”

 

Transforming, Enriching and Securing Court Data

Data, of course, has to be cleaned and wrangled to be useful for machine learning. That’s always a chore, one that by common estimates consumes as much as 80 percent of a data project’s time. Luckily, the kinds of data sets SCALES deals with are at least semi-structured. Still, they present ambiguities that need clarification.

For instance, filed motions have different descriptions — motions to exclude, waive, introduce, etc. So a group of students manually tagged motions in several records; then the team used that tagged data to train an automated tagger.
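The article does not say which model the team trained, so the following is only a sketch of the general workflow: a small set of hand-tagged motion descriptions is used to fit a classifier that then labels new ones. A tiny Naive Bayes classifier stands in for whatever SCALES actually uses, and the example descriptions and tag names are invented.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged docket text; stands in for the students' labels.
TAGGED = [
    ("motion to exclude expert testimony", "exclude"),
    ("motion in limine to exclude evidence", "exclude"),
    ("motion to waive filing fee", "waive"),
    ("application to waive fees and proceed in forma pauperis", "waive"),
    ("motion to introduce new exhibits", "introduce"),
    ("motion to introduce additional evidence", "introduce"),
]


def train(examples):
    """Count words per tag; these counts are the whole Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the tag with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TAGGED)
print(predict(model, "motion to exclude witness statement"))
print(predict(model, "request to waive the filing fee"))
```

With only six training examples this is a toy, but the shape is the same at scale: the manual tags are the expensive part, and once they exist, the tagger can label records nobody has read.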

At the same time, court records alone might not unearth as many clarifying insights as they would if integrated with other data sets. That’s why SCALES is also normalizing the court data against outside, complementary data sets about lawyers, judges, corporate litigants and more.

“Right now, we can ask, ‘Is there a relationship between numbers of motions to exclude and outcomes?’ And we can look at questions like, ‘Is there a correlation between the size of companies and outcomes?’” Hammond said.

The intent is to allow researchers to surface big-picture trends and inconsistencies, but does the possibility of greater openness also risk broadcasting personal plaintiff information?

To guard against that, SCALES is building the system to allow for individual- and case-level redactions in the documents. More importantly, it’s also pursuing federated learning. The technique, popular in healthcare, allows models to be trained across large data sets that stay where they are, so identifying information is never pooled in one place.
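The core idea of federated learning can be shown in a few lines. In the sketch below, which assumes the common federated averaging scheme rather than anything SCALES has described, two hypothetical sites each fit a simple linear model on their own data and send back only the fitted parameters; a coordinator averages them, weighted by how many records each site holds. The site data, learning rate and round counts are all made up.

```python
# Hypothetical private data held at two sites; both follow y = 2x + 1.
SITES = [
    ([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]),
    ([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]),
]


def local_update(w, b, xs, ys, lr=0.05, steps=5):
    """Run a few gradient-descent steps on one site's own data."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(sites, rounds=200):
    """Average locally trained parameters; raw records never leave a site."""
    w, b = 0.0, 0.0
    total = sum(len(xs) for xs, _ in sites)
    for _ in range(rounds):
        updates = [(local_update(w, b, xs, ys), len(xs)) for xs, ys in sites]
        w = sum(params[0] * n for params, n in updates) / total
        b = sum(params[1] * n for params, n in updates) / total
    return w, b


w, b = federated_average(SITES)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Only `w` and `b` cross the boundary between a site and the coordinator. On its own that is not a formal privacy guarantee, which is why production systems typically layer further protections on top of this basic loop.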

“Privacy is a massive issue for us,” Hammond added.

 

Why Are Paywalls Still Around?

Even as projects like SCALES work to resolve technical considerations, the question remains: Why do such data-access problems persist? How has the PACER payment model, widely considered an archaic holdover, been allowed to stick around?

For Livermore, it boils down to a simple answer: revenue. Even if it’s a drop in the bucket compared to general funding, it’s a revenue stream the system has grown accustomed to.

“In the scope of the U.S. government, this is not a lot of money, [so] it’s absurd that this is a way of raising revenue — that we put a barrier between the people and their laws in order to raise a little dough,” Livermore said.

There’s another element at play too, according to Hammond: “People just don’t want to be monitored.”

The Open Courts Act is the biggest recent reason for hope among transparency advocates, but there have been other positive developments. In April 2020, the Supreme Court held that legal annotations to laws cannot be copyrighted, after the state of Georgia claimed that material annotated to its statutes (which a hired third party, LexisNexis, had entered) was subject to copyright and therefore eligible to be paywalled.

Some anecdotal evidence also points to shifting attitudes. Remember that district with judges who had wildly different fee-waiver rates? Those judges “expressed interest in using [SCALES] data to improve the decision-making process,” the researchers wrote in a policy recommendation report published in Science magazine.

“We count this as an early and encouraging validation of our claim that judges will be especially receptive to quantitative feedback that is straightforward, apolitical, and incontrovertible,” they wrote.
