Case Data that Does Good

Reid Harris
8 min read · Mar 4, 2021

So What’s It All About?

Human Rights First (HRF) is an advocacy and action organization that creates policy solutions to ensure human rights are not violated here in the US. One of their campaigns is to provide legal counsel for those seeking asylum. One of the unique challenges that lawyers in immigration court face is that much of the case documentation is not available to the public, unlike in criminal court, for example. In order for lawyers to increase their chances of success in court, it would be helpful if they could access past immigration court cases to research things like how a certain appeals judge usually rules on certain types of cases. This could not only help lawyers build a case tailored to the judge who will be hearing it, but also provide descriptive statistics about trends in court rulings.

To tackle this problem, my team produced a website consisting of a queryable database, a method to upload new case PDFs, and models that would extract information from them. The extracted information would auto-complete fields when the user uploaded a PDF, both for ease of use and to reduce the chance of typos. I worked on a team composed of three silos: Front End Web Development, Back End Web Development, and Data Science. As a member of the data science team, my goal was to help develop an API that could take in a PDF, convert it to text, perform natural language processing to extract information about the verdict, and then output the information to an endpoint accessible to the back end developers.

Going into the experience, I initially feared that the scope of the project might be too large or might grow over time to become unmanageable. Legal cases are complicated, and the language used in published rulings is arcane at times. We only had the appellate verdict PDFs with no original hearings, and on top of that, only a few hundred case files to work with. It seemed like such a small amount of unstructured data might not be enough to provide actionable insights to HRF. On top of all that, some of the PDFs were grainy, different cases had different pieces of information redacted, their formatting was non-uniform, and many of them contained no unique identifier. Below, you can see some of the PDFs that we were provided:

How Did We Go About It, You Ask?

My main contribution to the project was in natural language processing (NLP). About halfway through the project, a representative from HRF reached out to see if we could get our API to identify cases that argued against the one-year guideline. Asylum-seekers who enter the US have one year from their date of entry to apply for asylum, with a few exceptions. HRF wanted to know what percentage of appeals cases were related to the one-year guideline, and whether any trends accompanied them. For instance, many LGBTQ+ people are fearful of announcing their sexuality in open court if they come from countries where they would be persecuted for doing so. This may constitute an extraordinary circumstance, but in order to know how effective this argument is on appeal, HRF asked us to investigate.

In order to get this information, we faced a few technical challenges. First, the optical character recognition (OCR) model responsible for converting PDFs to text was not perfect. At times, it would output “asy1um seeker” or “excep|ion to the rule”. This could pose difficulties for NLP, for obvious reasons. Lucky for us, Python is an open-source language, and a number of spell checking libraries are available. We used Pyspellchecker to catch those errors before continuing on to the next step. Still, the text was not perfect, but remembering the scope was key. The user uploading the PDF would be the lawyer who represented the case, and was thus completely familiar with all aspects. Any information retrieved by our API would autofill fields as suggestions to the user. Thus, high accuracy was preferred so that the user would have minimal corrections to implement before submission, but fine-tuning the model past a certain point would be a waste of valuable time.
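
To give a feel for that cleanup step, here is a minimal sketch of OCR spell-correction. The actual project used the Pyspellchecker library with a full English dictionary; this stand-in uses Python's standard library `difflib` and a tiny hypothetical legal vocabulary just to show the idea:

```python
import difflib
import re

# Hypothetical domain vocabulary for illustration only; the real
# project relied on Pyspellchecker's full dictionary instead.
LEGAL_VOCAB = ["asylum", "exception", "seeker", "guideline",
               "circumstance", "extraordinary", "removal"]

def correct_ocr_errors(text, vocab=LEGAL_VOCAB):
    """Replace OCR-garbled tokens with their closest vocabulary match."""
    def fix(word):
        # Only attempt a fix when the token contains OCR debris,
        # i.e. digits or pipes inside an otherwise alphabetic word.
        if re.search(r"[0-9|]", word):
            matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
            if matches:
                return matches[0]
        return word
    return " ".join(fix(w) for w in text.split())
```

With this, `correct_ocr_errors("asy1um seeker")` yields `"asylum seeker"`, and clean tokens pass through untouched.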

After correcting spelling and formatting issues to a reasonable degree, the next challenge was to find a way to identify the cases where the one-year guideline was at issue. To begin, I pored over case after case, hoping to notice some sort of pattern emerge. Not surprisingly, it took me hours to locate even a few cases that argued against the deadline. To expedite my search, I thought of some logical terms that might accompany an argument against the one-year guideline. I started by creating a corpus of all of the text from the case files that mentioned the word ‘asylum’, then iterating through it to find cases that used the term “year”. Of these, only a portion were relevant.
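
That first-pass filter can be sketched in a few lines. The directory layout and function name below are assumptions for illustration, not the project's actual code:

```python
from pathlib import Path

def build_candidate_corpus(txt_dir):
    """Collect case texts that mention 'asylum', then keep only
    the subset that also uses the term 'year'."""
    corpus = {}
    for path in Path(txt_dir).glob("*.txt"):
        text = path.read_text(errors="ignore").lower()
        if "asylum" in text:
            corpus[path.name] = text
    # Second pass: narrow to files that also mention 'year'.
    return {name: text for name, text in corpus.items() if "year" in text}
```

As the article notes, a keyword filter this blunt still returns plenty of irrelevant cases, which is what motivated the next step.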

It was at this point that I realized that a bit of human insight might set me up better. After speaking with our stakeholder, an immigration attorney, I gained some insights that provided some huge momentum in the right direction. Since all of our cases were appeals, the only time that the one-year guideline would be mentioned is when the attorney argued for exemption from it. There are only two possible cases in which this happens: changed circumstances or extraordinary circumstances. Since court rulings tend to stick to a strict nomenclature, searching for the terms “changed circumstance” and “extraordinary circumstance” was a great starting point.

After locating all documents where asylum was positioned after “APPLICATION” and the body contained one of the two exemption terms, I was able to quickly find where the terms appeared in the text and read the few sentences surrounding them to decide whether they referred to the one-year guideline. After looking over about 10–15 files, a handful of words seemed to show up relatively often around the exemption terms when they referred to the one-year guideline: ‘time’, ‘year’, ‘period’, ‘delay’, and ‘deadline’. Of course, there were many cases that contained one of the exemption terms but did not relate to the guideline, and in all of those cases, the five context words were nowhere near the exemption terms. A visualization is below. Exemption terms are highlighted red when they appear in a case that does not argue for exemption from the one-year guideline and green when they appear in one that does. The context words are highlighted yellow.

The only step left was to convert the logic into Python, incorporate it into the scraper, and add the endpoints to the backend:
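
A minimal sketch of that logic might look like the following. The function name, window size, and exact word lists are illustrative assumptions, not the production code:

```python
import re

EXEMPTION_TERMS = ("changed circumstance", "extraordinary circumstance")
CONTEXT_WORDS = {"time", "year", "period", "delay", "deadline"}

def is_one_year_case(text, window=20):
    """Return True when an exemption term appears with one of the
    one-year context words within `window` words on either side."""
    words = re.findall(r"[a-z]+", text.lower())
    joined = " ".join(words)
    for term in EXEMPTION_TERMS:
        for match in re.finditer(term, joined):
            # Convert the character offset into a word index, then
            # scan the surrounding window for a context word.
            idx = joined[:match.start()].count(" ")
            lo, hi = max(0, idx - window), idx + window
            if CONTEXT_WORDS & set(words[lo:hi]):
                return True
    return False
```

A sentence like “argues extraordinary circumstances excuse the one-year filing deadline” triggers the flag, while “changed circumstances based on her religious conversion” does not, mirroring the red/green distinction in the visualization above.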

And What Came Of It All?

As of 3/4/21, our product can do the following:

First, the user uploads a PDF, and the front end sends it to the backend, which stores the file in the database and passes it to the data science API. Then, the API does the following:

  1. The OCR module converts it to a .txt file
  2. The scraper corrects for errors from OCR and extracts the following information from the text: application type (asylum/convention against torture/withholding of removal), judge(s) hearing the case, date, verdict, one-year guideline case (true/false), mentions of gang-related violence, and applicant information (country of origin, sex, language, indigeneity, and protected social groups).

The API’s database module then encodes the CSV results and returns a .json file to the backend, which in turn sends the information to the front end to auto-fill fields for the user to validate. The validated information is sent back to the backend and stored in a database using AWS EBS. This allows the front end to display the database in table format, giving the user the option to query it, filtering by field.
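
To make the hand-off concrete, a payload along these lines could flow from the API to the backend. Every field name and value here is an illustrative assumption, not the production schema:

```python
import json

# Illustrative payload only: the deployed API's exact field names
# and values are not documented here, so every key is an assumption.
extracted = {
    "application_type": "asylum",
    "judges": ["Jane Doe"],
    "date": "2021-03-04",
    "verdict": "remanded",
    "one_year_guideline": True,
    "gang_related_violence": False,
    "applicant": {
        "country_of_origin": "Honduras",
        "sex": "female",
        "language": "Spanish",
        "indigenous": False,
        "protected_groups": ["particular social group"],
    },
}

# The backend receives this as JSON and uses it to auto-fill the form.
payload = json.dumps(extracted)
```

Returning one flat, predictable document per case is what lets the front end map each key directly onto a form field for the lawyer to confirm or correct.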

This functionality allows lawyers to prepare for individual cases, but it’s important to note that it provides no statistical analysis to help inform policy change. HRF hopes to accomplish both of these goals through our analysis. Future features could be added, like a dashboard with current statistics on what kinds of cases are being approved or denied. Interactive maps may also be a great way to communicate statistics while boosting engagement among the legal community. Since a major hurdle for HRF is getting their hands on data that is not publicly available, creating an interactive feature that draws in users could be a key ingredient in getting reliable statistics.

Even more improvements could be made by simply adding more fields to the database. Some fields that may serve the interests of HRF are mentions of criminality in immigration appeals cases, a subject matter that has been coined “crimmigration” and is known among litigators to be a particularly tricky application of criminal law in immigration court.

With a growing database of cases, our deployed website will grant other institutions access to previously unavailable data, letting them conduct their own analyses. To this end, there’s really no telling just how many positive externalities may come of the project.

This project really boosted my confidence in my ability to apply my skills to a real-world organization. Consulting with stakeholders to find innovative solutions across different functional silos is something that is crucial to being a good data scientist, and it was great practice in eliciting and receiving feedback about my work. For instance, when working with HRF’s legal counsel, we found that it became a bit of a “Which came first? The chicken or the egg?” type of situation. As data scientists we were asking the lawyers what information to look for. The lawyers, in turn, were asking, “What kind of information can you find for us?” This allowed for some creative back-and-forth, and it allowed me to figure out which of my ideas were in the scope of the project and what parts of the cases I was misunderstanding. On top of all of the professional growth I underwent, the real-world impact that the project has on those less fortunate could truly be life-changing to some people down the road.
