Summary
- A new CEPR working paper, co-authored by UCL Professor Stephen Hansen, provides an overview of the methods used for algorithmic text analysis in economics.
- The paper covers the fundamentals, explaining how semantic meaning embedded in words can be captured by algorithms like BERT and GPT. We covered this in Part I (NLP in Economics – Touching Base With the Basics).
- In Part II, we summarise how such data representations of text can be used to solve four common problems in economics, and the challenges they face in doing so.
Introduction
This Deep Dive is the second of two covering a new CEPR working paper, which provides a conceptual overview of the building blocks of algorithmic text analysis in economics.
In this part, we go through the common applications that encompass most text-as-data research. These include measuring document similarity, detecting concepts, measuring how concepts are related, and associating text with metadata. We also cover the paper's discussion of their limitations.
Measuring Document Similarity
The paper discusses four common measurement tasks that applied researchers use NLP for. The first is measuring document similarity – a popular use-case for search engine output, recommendation systems, and plagiarism detection.
All methods for computing document similarity begin with some vector representation of documents. The distance between document vectors then captures how similar they are: the more similar two documents are, the smaller the angle between their vectors (i.e., the more they point in the same direction). The standard measure, based on this angle, is cosine similarity.
The simplest method for document vector representation uses the bag-of-words count vector, which we discussed in Part I. In this case, a document vector just consists of the number of times different words pop up in the text.
A variant of this is to upweight words that are specific to particular documents, i.e., terms that appear in relatively few documents across the corpus. This weighting scheme is called term frequency-inverse document frequency (tf-idf).
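To make this concrete, here is a minimal sketch of both representations using scikit-learn on a few made-up documents: raw term counts, tf-idf-weighted counts, and the pairwise cosine similarities between the resulting document vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "inflation expectations rose as energy prices increased",
    "energy prices pushed inflation higher in the euro area",
    "the labour market remained tight with low unemployment",
]

# Raw term counts: each row is a bag-of-words document vector.
counts = CountVectorizer().fit_transform(docs)

# tf-idf weighting upweights terms concentrated in few documents.
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities between document vectors (3 x 3 matrices).
print(cosine_similarity(counts))
print(cosine_similarity(tfidf))
```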
Examples of bag-of-words-based approaches to similarity include measuring the distance between online news articles and social media posts to group items into common stories, and measuring similarity between college syllabi and academic journal articles to proxy the gap between course content and the newest research.
Sometimes, though, word count vector representations of documents can be unhelpful. The vocabulary can run to tens of thousands of words, most of which do not appear in any given document, so the vectors are high-dimensional and sparse. This makes measuring distances between documents very noisy.
Instead, researchers can reduce the dimensionality by applying LSA or LDA to bring out similar themes. Another approach uses word embeddings to represent documents. In this case, the document vector is the average over the word embeddings corresponding to words in the document.
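As an illustration of the embedding-averaging approach, the sketch below builds a document vector as the mean of its word vectors; the tiny `embeddings` dictionary is a made-up stand-in for a pre-trained model such as GloVe or Word2Vec.

```python
import numpy as np

# Made-up 4-dimensional vectors standing in for pre-trained word embeddings.
embeddings = {
    "inflation": np.array([0.9, 0.1, 0.0, 0.2]),
    "prices":    np.array([0.8, 0.2, 0.1, 0.1]),
    "energy":    np.array([0.5, 0.7, 0.0, 0.3]),
}

def document_vector(tokens, embeddings):
    """Represent a document as the average of its in-vocabulary word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

print(document_vector("energy prices drove inflation higher".split(), embeddings))
```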
Rather than working with pairwise distances between document vectors, one could instead form clusters of related documents. A popular method for clustering is k-means. An advantage of clustering, relative to topic models, is that it works on arbitrary vector representations of documents, rather than being limited to term counts, as in LDA. Documents are also assigned to a single cluster, rather than having a distribution over multiple topics.
In labour economics, a clustering method could be applied to job descriptions to construct occupational categories. In retail, companies may use clustering to identify similar groups of consumers for targeted marketing. Clustering can even help detect fake news. There are many popular use cases.
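A minimal k-means sketch with scikit-learn, assuming `doc_vectors` is any documents-by-features array (term counts, tf-idf weights, or averaged embeddings); note that each document ends up in exactly one cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder document vectors; in practice these would be term counts,
# tf-idf weights, or averaged word embeddings.
doc_vectors = np.random.rand(100, 50)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(doc_vectors)
print(kmeans.labels_[:10])  # each document belongs to exactly one cluster
```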
Concept Detection
Sometimes, textual data is the only source of information about economically crucial concepts. It can provide insights into economic policy uncertainty, skills demand in the labour force, economic sentiment and more. The second measurement task, therefore, is detecting such concepts in economic texts.
Pattern Matching
One way to detect concepts is to employ dictionaries within the bag-of-words model. These can be general-purpose dictionaries (e.g. AFINN or VADER), domain-specific dictionaries (e.g. the Loughran-McDonald (LM) dictionary in finance), or ones chosen based on their ability to predict human-annotated documents.
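As a stylised illustration, the sketch below scores a document against a tiny, made-up positive/negative word list; a real application would swap in a full dictionary such as AFINN or LM.

```python
# Tiny illustrative word lists; placeholders for a full sentiment dictionary.
positive = {"growth", "improve", "strong", "expansion"}
negative = {"decline", "weak", "uncertainty", "recession"}

def dictionary_score(text):
    """Net count of positive minus negative terms, scaled by document length."""
    tokens = text.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

print(dictionary_score("Strong growth offset lingering uncertainty"))
```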
Algorithmic Approaches
Algorithms can also be used to associate documents with concepts and can uncover more complex semantic rules. We already saw this in Part I, where topic models summarise commonly occurring words into themes. For example, LDA can be used to study how specific topics in central bank statements relate to market movements.
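For concreteness, here is a minimal LDA sketch using scikit-learn; the documents and the number of topics are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "inflation outlook and interest rate decisions",
    "bank lending conditions and credit growth",
    "the interest rate path depends on the inflation outlook",
]

dtm = CountVectorizer().fit_transform(docs)  # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
print(lda.transform(dtm))  # each row: a document's distribution over (unlabelled) topics
```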
The challenge with topic models, however, is that the topics they detect are not labelled, making interpretation more difficult. Also, unsupervised learning tools cannot be targeted toward identifying specific concepts, making it difficult to link topics to economic concepts. For this, an initial filter may be needed to remove unrelated content.
Merging human domain knowledge with algorithms is at the heart of the 'seed' word approach. These pre-selected words reflect the concept of interest; word embeddings can then expand the list (using, for example, cosine similarity to the seed words) to create a part-human, part-ML dictionary.
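A rough sketch of the seed-word idea: rank the rest of the vocabulary by cosine similarity to the centroid of the seed words and take the top candidates as additions to the dictionary. The random vectors below are placeholders for a trained embedding model.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Random vectors standing in for a trained Word2Vec/GloVe model.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50)
              for w in ["uncertainty", "risk", "volatility", "growth", "holiday"]}

seeds = ["uncertainty", "risk"]  # human-chosen seed words
seed_centroid = np.mean([embeddings[w] for w in seeds], axis=0)

# Rank the remaining vocabulary by similarity to the seed centroid.
candidates = sorted((w for w in embeddings if w not in seeds),
                    key=lambda w: cosine(embeddings[w], seed_centroid),
                    reverse=True)
print(candidates)  # words to consider adding to the part-human, part-ML dictionary
```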
This is becoming increasingly popular in macro and finance applications. However, standard applications do not address polysemy, that is, words with multiple meanings. To overcome this issue, embedding algorithms like ELMo draw on neighbouring words to produce context-sensitive embeddings that distinguish word senses.
Dictionaries, whether fully human-made or part-human, part-machine, do not capture all the nuances of semantic meaning. For example, 'fantastic' is coded the same as 'good'. A workaround is to compute the proximity between documents and dictionaries in a semantic space defined by word embeddings, yielding a continuous measure of association.
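To illustrate context-sensitive embeddings, the sketch below uses a BERT model from the Hugging Face transformers library (the paper's example is ELMo; BERT is used here purely as a readily available illustration) to show that the same word, 'bank', receives different vectors in different contexts.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence, word):
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embed("the central bank raised interest rates", "bank")
v2 = embed("they walked along the river bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1: the two senses differ
```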
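A minimal sketch of such a continuous measure: average the word embeddings of the document and of the dictionary, then take the cosine similarity between the two centroids. The embedding vectors below are random placeholders; any pre-trained vectors would do.

```python
import numpy as np

def avg_vector(words, embeddings):
    """Centroid of the in-vocabulary word embeddings."""
    return np.mean([embeddings[w] for w in words if w in embeddings], axis=0)

def association(doc_tokens, dictionary, embeddings):
    """Continuous measure: cosine similarity between document and dictionary centroids."""
    d, c = avg_vector(doc_tokens, embeddings), avg_vector(dictionary, embeddings)
    return d @ c / (np.linalg.norm(d) * np.linalg.norm(c))

# Placeholder embeddings standing in for a pre-trained model.
rng = np.random.default_rng(1)
embeddings = {w: rng.normal(size=50)
              for w in ["efficiency", "incentives", "costs", "contract", "ruling"]}

print(association("the ruling weighed costs and incentives".split(),
                  ["efficiency", "incentives", "costs"], embeddings))
```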
One application of this approach is to measure the use of economics language by judges. The authors compute the similarity between embedded representations of the text of individual judges and a lexicon of economics-related phrases. They find that judges who attend economics training use more economics language.
The last approach to algorithmic concept detection discussed in the paper is machine prediction based on human annotation. Here, humans with domain expertise generate labels on a subset of data, which an algorithm then learns from to detect concepts. This can then be scaled up out-of-sample, effectively taking the role of a human.
In a recent paper, BERT-like models are shown to achieve outstanding performance for predicting human labels. Attention-based classifiers are far better than sequence embedding models at labelling relevant concepts because they model how words in language interrelate to generate meaning beyond word counts, associations, and syntactic patterns.
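A minimal sketch of the annotate-then-predict workflow, using a tf-idf plus logistic-regression pipeline as a simple stand-in for the BERT-style classifiers discussed in the paper; the labelled sentences are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A small hand-labelled subset (placeholder texts and annotations).
texts = ["rates were raised to curb inflation",
         "the committee left policy unchanged",
         "further tightening may be warranted",
         "no change to the policy stance"]
labels = ["hawkish", "neutral", "hawkish", "neutral"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Scale up: label unseen documents out-of-sample.
print(clf.predict(["additional rate hikes are likely"]))
```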
How Concepts Relate
The third problem discussed in the paper is how concepts are related in a corpus. For example, how positive or negative sentiment is associated with economic conditions.
As we saw above, concepts can be detected using dictionaries. The simplest approach to measuring how concepts are related is to tabulate the number of times terms from each dictionary co-occur within a local window.
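A minimal sketch of this co-occurrence counting, with made-up dictionaries and an arbitrary five-token window:

```python
sentiment = {"weak", "strong"}                      # illustrative dictionary A
economy = {"growth", "inflation", "unemployment"}   # illustrative dictionary B

def cooccurrences(tokens, dict_a, dict_b, window=5):
    """Count dict_b terms appearing within `window` tokens of a dict_a term."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok in dict_a:
            neighbourhood = tokens[max(0, i - window): i + window + 1]
            count += sum(t in dict_b for t in neighbourhood)
    return count

print(cooccurrences("growth was weak while inflation stayed strong".split(),
                    sentiment, economy))
```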
Associations can also be captured by word embeddings, which can be tested using a word embedding association test (WEAT). These embedding-based measurements of connections between concepts are based on local co-occurrence of words.
It begins with sets of attribute words A and B that denote opposite ends of a conceptual spectrum. For example, A (B) might contain words reflecting positive (negative) sentiment. Then any other word, or set of words, can be projected into the conceptual space by measuring its relative position between A and B with cosine similarity.
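A rough sketch of this projection: score a word by its average cosine similarity to the A words minus its average similarity to the B words (one common WEAT-style statistic); positive values place it nearer A, negative values nearer B. The random embeddings are placeholders for a trained model.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def spectrum_position(word, A, B, embeddings):
    """Average similarity to A minus average similarity to B: >0 leans A, <0 leans B."""
    sims_a = [cosine(embeddings[word], embeddings[a]) for a in A]
    sims_b = [cosine(embeddings[word], embeddings[b]) for b in B]
    return np.mean(sims_a) - np.mean(sims_b)

# Random vectors standing in for trained word embeddings.
rng = np.random.default_rng(2)
embeddings = {w: rng.normal(size=50)
              for w in ["good", "great", "bad", "poor", "expansion"]}

print(spectrum_position("expansion", A=["good", "great"], B=["bad", "poor"],
                        embeddings=embeddings))
```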
An example of this would be to locate various terms into separate conceptual dimensions. The two dimensions could be Class and Politics, and terms from the corpus could show the connection between the two (Chart 1).
Missing in this approach is the direction of connections. To address this, one can use linguistic annotations to construct and quantify such directions. This is what Ash et al. (2023) do for US Congressional speeches.
Associating Text With Metadata
The last use-case for NLP in economics mentioned in the paper is linking text data to outcome variables. For example, predicting salaries from texts of job postings.
The appropriate tool for tackling this problem is supervised learning, as the goal is to maximise goodness-of-fit on new documents. In Part I, we discussed using random forests and gradient boosting to make text-based predictions. Alternatively, one can use BERT or GPT models.
We recently covered an interesting example of how to use text data for predictions. The authors used multinomial inverse regression to create ML-only dictionaries. This involved running regressions linking word frequencies to stock price changes to determine whether a word has a positive, neutral, or negative impact on stock prices.
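As a stylised example of the salary-prediction task, the sketch below fits a gradient-boosting regressor on tf-idf features of a few invented job postings; a fine-tuned BERT or GPT model could take the place of the tf-idf representation.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up postings and salaries purely for illustration.
postings = ["senior data scientist with machine learning experience",
            "junior administrative assistant, part time",
            "experienced software engineer, cloud infrastructure",
            "entry level retail sales associate"]
salaries = [120_000, 35_000, 110_000, 28_000]

model = make_pipeline(TfidfVectorizer(), GradientBoostingRegressor(random_state=0))
model.fit(postings, salaries)

print(model.predict(["machine learning engineer with cloud experience"]))
```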
NLP Limitations: #1 Validation
We have covered several algorithms for tackling the core empirical applications involving text. It is often unclear which one to use. Yet modelling choices matter.
To show this, the authors put the methods we discussed in Part I to the test on annual 10-K filings from 4,033 firms. They want to use these filings to measure the degree to which firms are competitors. To do this, they use the following ten approaches to construct document vectors:
- Bag-of-words-based term counts: (1) raw counts; (2) tf-idf-weighted term counts.
- Average word embeddings based on: (3) pre-trained GloVe; (4) GloVe estimated on the Risk Factors corpus; (5) same as (4), but using tf-idf weights to compute the averages; (6) Word2Vec estimated on the Risk Factors corpus; (7) same as (5), but for Word2Vec.
- Dimensionality reduction of the document-term matrix: (8) LSA; (9) NMF; (10) LDA.
Once they have constructed the document vectors, the authors compute the pairwise cosine similarities between them. Then – the true test – they measure the level of agreement between the ten approaches (Chart 2).
They find that the average agreement rate is 0.78, where 0.5 represents an independent ranking and 1 is a perfect overlap. Clearly, while some of the embedding-based approaches show high agreement with each other, in general there is large divergence across methods. In other words, not all measures agree on which firms are competitors.
Not only this, but the choice of algorithm also matters for downstream inference. The authors look at how the degree of competition between firms (as estimated from the documents) depends on key firm characteristics, such as the correlation of their daily stock returns and the size of the firm (Chart 3).
While most estimated effects go in the expected direction, point estimates and confidence intervals differ greatly, and methods disagree on which covariate is most associated with textual similarity.
So, even downstream, where regression outputs are subjected to plenty of robustness checks, we can see how important it is to pay just as much attention to upstream modelling choices. Yet, according to the authors, economics currently lacks objective benchmarks against which to validate the choice of algorithm.
It is crucial, therefore, that word embeddings capture economics-specific word relationships, so that they can accurately predict missing words in economic contexts. This requires expert annotation, which can be subjective, costly, or both.
NLP Limitations: #2 Interpretability
So far, we have seen that Transformer-based classifiers are the best in terms of performance (out-of-sample word prediction). However, they are notoriously opaque, making them less desirable if interpretability is needed.
Arguably even more than other machine learning approaches in economics, the field of NLP is moving to ever more complex models that favour prediction over interpretability. However, interpretability remains important for economic applications.
This is because textual analysis can easily create spurious correlations. The authors use the example of 'Texas'. The word may be an accurate predictor of right-wing ideology, but the term itself is not related to a belief system. Words on immigration, religion, or politics would be better predictors, but because they often co-occur with 'Texas', an algorithm may latch onto 'Texas' and produce spurious predictions.
Interpretation is also important for LDA, and more broadly in unsupervised learning settings. In LDA, human judgement is required to choose the number of topics, and it has been shown that the number of topics that maximises human interpretability does not match the number that maximises goodness-of-fit.
Moreover, economic data is subject to more noise and structural breaks than the environments in which modern NLP algorithms are developed. Interpretability of an algorithm's classification logic will therefore be important for predictive performance in new domains.
So, the authors recommend using simple approaches in NLP analysis. And they suggest using model explanation methods to provide interpretable diagnostics on the features that an algorithm is relying on.
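As a simple example of such a diagnostic, the sketch below inspects the highest-weighted terms of a linear text classifier; richer explanation tools (e.g. SHAP or LIME) serve the same purpose for more complex models. The texts and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["rates were raised to curb inflation",
         "the committee left policy unchanged",
         "further tightening may be warranted",
         "no change to the policy stance"]
labels = [1, 0, 1, 0]  # 1 = hawkish, 0 = neutral (placeholder annotations)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Rank terms by the absolute size of their coefficients to see what drives predictions.
terms = vec.get_feature_names_out()
top = sorted(zip(clf.coef_[0], terms), key=lambda p: abs(p[0]), reverse=True)
print(top[:5])
```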
Bottom Line
New applications of NLP in economics are coming through at an incredible pace. It is exciting, but daunting. The frontier is being expanded on so many fronts that it is hard to know where to begin. Hopefully, this two-part series, summarising the work of Hansen and co-authors, provides some structure to your thoughts.
There are key areas that still need attention: validation tasks that help researchers systematically choose the right models; combining text and numeric data at an upstream stage to overcome inference problems down the line; and, once the core measurement problems are addressed, the incorporation of causal inference.
Nevertheless, interesting times lie ahead, especially with the emergence of large language models (LLMs). They open the door to multilingual text analysis, better data labelling and improved validation. Reassuringly, the authors also believe these models will always require domain expertise, so they should complement, not replace, most jobs.
Sam van de Schootbrugge is a Macro Research Analyst at Macro Hive, currently completing his PhD in Economics. He has a master's degree in economic research from the University of Cambridge and has worked in research roles for over 3 years in both the public and private sector. His research expertise is in international finance, macroeconomics and fiscal policy.
Photo Credit: depositphotos.com