Natural language processing, or NLP, is a fundamental tool in our everyday lives — whether it’s a search engine automatically completing our query or our phone autocorrecting a misspelling. At the Urban Institute, we employ NLP to transform large volumes of unstructured human language into structured data that our researchers can analyze to formulate key findings. We’ve used NLP applications in both the Connecting the Police to Communities project and in the new dataset collected on land-use reforms.
Recently, the Biden administration issued an executive order requiring federal agencies to assess how their programs and policies contribute to systemic barriers, with the goal of increasing fairness and transparency in federal decisionmaking. In response, more than 90 federal agencies submitted Equity Action Plans. To understand the commitments federal agencies made to increasing social equity, Urban’s Office of Race and Equity Research and the Racial Equity Analytics Lab applied NLP and text mining to the Equity Action Plans.
In this blog post, we explore what research questions text mining and NLP can answer when applied to a nuanced topic like social equity. In our work, we sought to answer the following questions:
1. What is the relationship between text mining and NLP?
2. Why use text mining with public documents?
3. How does text mining work on existing documents (rather than interactions)?
4. What are the challenges around applying text mining to social equity concepts?
5. Where do we see opportunities to apply text mining to understand and track social equity?
Text mining is the process of turning vast amounts of unstructured text into structured tabular data via NLP. Often, text mining is used to derive insights about the content of huge volumes of text by calculating the absence or presence of keywords, the frequency distribution of words, and the length of sentences. Generally, text mining begins with preprocessing the text of interest: removing noise and reducing words to their root stem (e.g., converting “equities” to its singular “equity”). Once we process the data this way, we can identify patterns and interpret the results.
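The preprocessing step can be sketched in a few lines of Python. This is a toy illustration using only the standard library: a real pipeline would rely on a library stemmer or lemmatizer, and the sample sentence is a placeholder.

```python
import re

def preprocess(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(token: str) -> str:
    """Toy stemmer: map simple plural forms back to a root
    (e.g., 'equities' -> 'equity'). A production pipeline would
    use a proper stemmer or lemmatizer instead."""
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

tokens = [normalize(t) for t in preprocess("Advancing equities across agencies.")]
# tokens -> ['advancing', 'equity', 'across', 'agency']
```

Once every document is reduced to normalized tokens like these, counting keywords and comparing documents becomes straightforward.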
NLP is a form of machine learning and one of many text mining methodologies used to capture texts’ meaning and context. NLP techniques enable computers (or machines) to understand text by analyzing sentence structure and grammar. In short, text mining encompasses various methodologies for analyzing unstructured text data, and NLP is one of those methodologies.
Traditionally, researchers conduct qualitative coding and manual review when they want to understand commonalities across a wide range of documents. The process involves significant time, which is costly and may not capture certain themes unless they’re prespecified. By contrast, automated processes like text mining can review huge volumes of texts for a list of key mentions and produce a list of common topics within seconds.
To conduct our analysis of the equity action plans, we created a rubric based on key equity principles, such as the acknowledgment of past harm and commitments to improve not only access but also outcomes. The qualitative team analyzed 27 agencies’ and subagencies’ equity action plans, then passed the work on to 10 subject matter experts who validated or amended the initial reviews.
Together, we took an “agile” approach, which emphasizes collaboration across teams without overly structured processes, enabling the teams to develop in response to user feedback. Our data mining team frequently consulted with the qualitative team, sharing what the text mining returned, and fine-tuning the NLP algorithm to unearth deeper insights. This kind of “agile development” let us fail and learn quickly. With the text mining techniques, we were able to analyze 83 supplementary documents in addition to the 27 equity action plans analyzed by the qualitative team. Across the 110 agency-specific documents, all federal agencies had at least one document in the pool, and some had as many as five documents.
Before applying text mining techniques to our documents, we first identified 13 initial research questions. Of these, we narrowed the pool based on what questions could be answered using counts of frequency, presence, or absence of key words for each agency and plan. Ultimately, we focused the text mining exercise on eight questions. These questions sought to answer how many federal agencies mention the following:
· an external commission, advisory board, or feedback group
· disaggregated data
· BIPOC communities
· language access, interpretation, or other related barriers
· a chief data officer
· an accountability plan
· efforts to train their staff to support equity actions
· barriers or challenges with enrollment, participation, or completion of their programs
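The presence-or-absence counting behind these questions can be sketched as follows. The document texts and the keyword dictionary below are illustrative placeholders, not the team’s actual data or search terms.

```python
# Count how many documents mention at least one keyword per research question.
docs = {
    "agency_a": "We will appoint a chief data officer and publish disaggregated data.",
    "agency_b": "Our advisory board will review barriers to program enrollment.",
    "agency_c": "Staff training will support equity actions across the agency.",
}

# Hypothetical keyword dictionary keyed by research question.
keywords = {
    "chief_data_officer": ["chief data officer"],
    "disaggregated_data": ["disaggregated data"],
    "external_feedback": ["commission", "advisory board", "feedback group"],
}

def agencies_mentioning(docs, terms):
    """Return the set of documents containing any of the given terms."""
    return {name for name, text in docs.items()
            if any(term in text.lower() for term in terms)}

counts = {q: len(agencies_mentioning(docs, terms)) for q, terms in keywords.items()}
# counts -> {'chief_data_officer': 1, 'disaggregated_data': 1, 'external_feedback': 1}
```

Simple substring matching like this is exactly what makes the ambiguity challenges discussed later so important: the count records that a phrase appears, not what it means in context.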
For each question, we provided keywords for an algorithm to find in the text. The results revealed complexities in whether a keyword truly captured the underlying concept the research question asked about. For example, finding mentions of a “chief data officer” helped us determine whether designated officials are proposing data collection or governance solutions. In this instance, the qualitative team could review equity action plans from agencies that mentioned a chief data officer to better understand the context of this role within the agency.
The technical process of our text mining work followed four steps:
First, we created the corpus for analysis, using web scraping to compile documents in portable document format (PDF) — a time-saving technique compared with manually locating and downloading each Equity Action Plan. The corpus consisted of 110 documents — 27 Equity Action Plans and 83 supplementary documents, including learning agendas, annual evaluation plans, capacity assessments, and evaluation policies.
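A minimal sketch of this compilation step, using only Python’s standard library, is shown below. The URLs and folder name are hypothetical placeholders; the team’s actual scraper and sources may have differed.

```python
from pathlib import Path
from urllib.parse import urlparse
import urllib.request  # stdlib downloader; a real scraper might use requests instead

def filename_from_url(url: str) -> str:
    """Derive a local filename from a PDF URL, appending .pdf if missing."""
    name = Path(urlparse(url).path).name
    return name if name.endswith(".pdf") else name + ".pdf"

def download_pdfs(urls, dest="corpus"):
    """Download each PDF into the destination folder, skipping files already saved."""
    dest_dir = Path(dest)
    dest_dir.mkdir(exist_ok=True)
    for url in urls:
        target = dest_dir / filename_from_url(url)
        if not target.exists():
            urllib.request.urlretrieve(url, target)  # makes a network request
```

With a list of plan URLs in hand, one call to `download_pdfs` assembles the whole corpus, which is far faster than locating and saving each file by hand.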
Second, we used AWS Textract to turn the PDF documents into text files that contained only the text (body, headings, and titles) for the analysis. We then imported the text files and tokenized them, which is the first step in any NLP pipeline. Tokenization breaks unstructured natural language text into discrete elements that can be used to represent the document of interest. The text can then be turned into a numerical data structure, such as a tabular format, suitable for machine learning or NLP applications.
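The tokenization and tabulation described above can be sketched with the standard library. The document texts below are placeholders, and a production pipeline would typically use an NLP library’s tokenizer rather than a bare regular expression.

```python
from collections import Counter
import re

def tokenize(text: str) -> list[str]:
    """Break raw text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# One extracted text file per document (contents illustrative).
corpus = {
    "plan_1": "Equity requires accountable data governance.",
    "plan_2": "Accountable agencies collect equity data.",
}

# Term-frequency rows: a simple tabular structure ready for downstream analysis.
term_freqs = {doc: Counter(tokenize(text)) for doc, text in corpus.items()}
# term_freqs['plan_2']['data'] -> 1
```

Each `Counter` is effectively one row of a document-term table, which is the numerical structure that later keyword and frequency analyses operate on.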
Third, we created a dictionary of the key words we wanted to analyze. To do so, we applied key word searches to the preprocessed corpus, creating frequency tables for the qualitative team to review and provide feedback on the preliminary results. Below are the keywords we analyzed:
Lastly, we restricted the sentences returned by the keyword search to those that included other terms of interest (co-occurrence). We hoped that by understanding which keywords commonly appeared together, we could start understanding relationships between concepts.
Text mining doesn’t have to be restricted simply to whether a word appears. Instead, it can be combined with other factors, like co-occurrence, minimum or maximum word counts, and other more complex logic. The key is to keep the questions in scope. If we’re only interested in mentions of accessibility related to language, for example, searching for the word “accessibility” isn’t enough. Instead, we would need to search for sentences or paragraphs containing the word “accessibility” AND “language” (or other words related to language barriers).
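The accessibility-and-language example above can be sketched as a co-occurrence filter. The sample document and the list of language-related terms are illustrative, not the team’s actual search dictionary.

```python
import re

def sentences(text: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cooccurring(text, anchor, related):
    """Return sentences containing the anchor term AND any related term."""
    hits = []
    for sent in sentences(text):
        low = sent.lower()
        if anchor in low and any(term in low for term in related):
            hits.append(sent)
    return hits

doc = ("We will improve accessibility for people with limited English proficiency. "
       "Accessibility of our buildings is also a priority.")

language_terms = ["language", "interpretation", "limited english"]
matches = cooccurring(doc, "accessibility", language_terms)
# Only the first sentence matches: it pairs "accessibility" with a language-barrier term.
```

Restricting hits to sentences where both concepts appear filters out mentions of accessibility that have nothing to do with language, at the cost of missing cases where the two ideas span separate sentences.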
The agile and iterative process enabled the qualitative and text mining teams to focus on areas of interest in a timely and complementary fashion that would have been challenging for one team to complete alone. But NLP still faces challenges around deciphering context because of the natural ambiguities in language itself.
Lexical and semantic ambiguity occurs when the same word has multiple meanings or can be interpreted to have more than one meaning depending upon the sentence. Across all 90 federal agencies, there are varied interpretations of “equity.” One of our research questions was, “How many agencies mention efforts to train their staff to support equity?” which was challenging to answer because the word equity can have different definitions, interpretations, and associated actions. The US Department of Labor, for example, may refer to gender equity in workforce development, whereas the Environmental Protection Agency may refer to climate equity in that everyone should benefit from a clean environment.
Syntactic ambiguity occurs when a sentence can have two or more distinct meanings as a result of the word order within a phrase or sentence. “Accountability,” for example, can have different meanings depending on which order the word appears within a phrase. To answer our question, “How many agencies mention a plan for accountability?” we searched for the key word “accountable.” Depending on where “accountable” is in a sentence, however, it may refer to the federal agency taking responsibility for past historical harms or to specific performance metrics to measure the action plan’s success. Text mining doesn’t differentiate between the two meanings.
In instances where there was ambiguity, the text mining team worked with the qualitative team to fine-tune the keyword search. When searching for specific identity groups, the keywords included the names of each population. We also used other keywords, such as “disadvantaged” or “overburdened” in place of “underserved.” (We do not consider these equivalent concepts or meanings, but rather adjacent or associated concepts.)
Variability in depth, format, substance, and length of the documents we studied also posed challenges. Obtaining accurate counts proved challenging for some of our initial research questions because our key words appeared in document headers and subheaders. In these cases, the frequency counts weren’t directly comparable because their differences could be from formatting decisions rather than agency focus.
Frequency counts also have inherent limitations. Even if we could obtain accurate counts, they wouldn’t help us assess the “quality” of the mention, as a document may have more mentions of a subject for a number of reasons. Therefore, we can’t say an agency has a stronger focus on an equity subject because it mentions a concept more often than another agency.
For our purposes, we identified three ways in which text mining can be helpful for qualitative social science research:
1. quickly obtaining a descriptive analysis on a large body of documents to facilitate more efficient qualitative and quantitative research
2. providing guidance for further examination of anomalies, such as a high frequency or absence of mentions for a specific keyword
3. collaborating between text mining and qualitative research
For the first and second opportunities, text mining provides a high-level overview of a large body of documents, offering a clear picture of which policies federal agencies are considering to promote equity. This overview could help us start conversations with agencies about their existing proposals and how they compare to their peer agencies. Also, the absence of keywords was informative in directing our attention to equity action plans that may have presented particular equity dimensions differently.
Ultimately, the qualitative rubric we developed was more robust and in-depth than what we found with text mining alone. It was designed to answer what the text mining couldn’t. Research questions that involved context, judgment, and domain knowledge weren’t well suited for the text mining analysis. In these cases, text mining is best used to unearth key insights in collaboration with a qualitative rubric that supports new research. The detailed findings from the qualitative rubric analysis can be found in Pathways to Equity at Scale.
There remain inherent limitations with text mining and NLP when analyzing nuanced topics like equity. Equity is a noisy and evolving conceptual framework that has drawn increased attention. There are many opportunities for text mining and NLP applications to further equity analysis, such as sentiment analysis of equity documents, topic modeling of large document sets, and enumeration of equity research papers using NLP to follow the broader equity conversation. But for now, government and researchers aren’t in a position to ask a machine to decipher meaning from text when we, ourselves, are unclear on what equity means in all dimensions of our complex lives.
-Manuel Alcala Kovalski
Want to learn more? Sign up for the Data@Urban newsletter.