Summary
- A new CEPR working paper, co-authored by UCL Professor Stephen Hansen, provides an overview of the methods used for algorithmic text analysis in economics.
- The paper covers the fundamentals, explaining how semantic meaning embedded in words can be captured by algorithms like BERT and GPT. We cover this in Part I.
- In Part II, we will summarise how such data representations of text can be used to solve four common economic problems, and the challenges faced in doing so.
Introduction
The rapid development of Natural Language Processing (NLP) has fostered a diverse methodological frontier. While this is exciting, especially given the emergence of a new generation of deep neural network models known as Transformers, there remains little guidance for researchers on how best to deploy these new techniques.
This lack of structure means there is no common framework or even vocabulary for analysing text. In an attempt to bridge the gap, a new CEPR working paper provides a conceptual overview of the methods that now form the basic building blocks of algorithmic text analysis in economics.
The Building Blocks of Text Analysis
Textual analysis begins with a ‘document’. Documents may be easily machine-readable texts (e.g., Word files or PDFs), but they can also come in more challenging formats, such as markup languages (e.g., HTML or XML) or scanned image files (e.g., PDFs of historical books). To extract these texts in Python, researchers typically use the following software packages:
- Beautiful Soup (HTML/XML parsing).
- Layout Parser (optical character recognition).
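To make this concrete, here is a minimal extraction sketch using Beautiful Soup on an HTML file; the file name filing.html is a hypothetical placeholder.

```python
# A minimal sketch: extract the visible text from an HTML document.
from bs4 import BeautifulSoup

with open("filing.html", encoding="utf-8") as f:   # hypothetical input file
    soup = BeautifulSoup(f, "html.parser")

# Drop script and style tags so only human-readable text remains.
for tag in soup(["script", "style"]):
    tag.decompose()

raw_text = soup.get_text(separator=" ", strip=True)
print(raw_text[:500])                              # first 500 characters of the extracted document
```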
Once extracted and organised, raw documents are then converted into sequences of linguistic features by (i) splitting sentences on whitespace/punctuation (tokenising); (ii) dropping non-letter characters; (iii) dropping common stop words, like ‘the’/’to’/’is’; (iv) converting letters to lowercase; and (v) stemming words to remove suffixes (the Porter stemmer is a common default).
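A minimal sketch of this pipeline, using NLTK and its Porter stemmer, could look as follows; the example sentence is purely illustrative.

```python
# A minimal pre-processing sketch with NLTK.
# Requires: nltk.download('punkt') and nltk.download('stopwords').
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(document: str) -> list[str]:
    tokens = word_tokenize(document.lower())             # (i) tokenise, (iv) lowercase
    tokens = [t for t in tokens if t.isalpha()]          # (ii) drop non-letter tokens
    tokens = [t for t in tokens if t not in stop_words]  # (iii) drop stop words
    return [stemmer.stem(t) for t in tokens]             # (v) stem suffixes

print(preprocess("The economy is growing, but inflation remains elevated."))
```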
In economics, this standard pre-processing approach represents documents as lists of words, typically reduced to some root form. One such representation is the bag-of-words model, where each unique vocabulary term is assigned an index value from 1 through to V. Each term can then be counted by document and stored in a document-term matrix.
This matrix stores the counts of each vocabulary term by document: each column corresponds to a vocabulary term, and each row to a document (Example 1). For example, ‘growth’ may show up 100 times in document 1 and zero times in document 20. Meanwhile, ‘recession’ does not show up in document 1 but has a high frequency in document 20.
Generally, there could be tens of thousands of columns in this high-dimensional matrix. The matrix is also sparse, in that any given document contains only a small fraction of the vocabulary (so most entries are zero).
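As an illustration, a document-term matrix can be built in a few lines with scikit-learn’s CountVectorizer; the two toy documents below stand in for a real corpus.

```python
# A minimal sketch: build a sparse document-term matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "growth growth outlook remains strong",             # toy stand-in for document 1
    "recession risks deepen as recession fears mount",  # toy stand-in for document 20
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)          # sparse matrix: rows = documents, columns = terms

print(vectorizer.get_feature_names_out())     # the vocabulary (column labels)
print(dtm.toarray())                          # dense view of the term counts
```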
From PCA to LDA
The next step is to add meaning to the words. This involves reducing the dimension of our ‘document-term’ space into a more helpful ‘meaning’ space.
For economists, you can think of this as a factor analysis designed to capture structure in high-dimensional economic data. One of the most common dimensionality reduction techniques is a principal component analysis (PCA).
Say we wanted to know the main driver of G4 headline inflation. In example 1, we would have four columns – US, EU, UK, and Japanese inflation data – and 120 rows – monthly observations over 10 years.
PCA would reduce this 120×4 matrix into a 4×1 vector, leaving just one column and four rows. The column is our principal component, the variable that explains most of the variation in our four inflation series. The rows represent how much this principal component explains headline inflation in each of our areas.
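A sketch of this exercise with scikit-learn, using randomly generated placeholder data in place of actual inflation series, could look like this.

```python
# A minimal PCA sketch; 'inflation' is a hypothetical 120 x 4 array of
# monthly inflation rates for the US, EU, UK and Japan.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
inflation = rng.normal(size=(120, 4))         # placeholder data for illustration

pca = PCA(n_components=1)
scores = pca.fit_transform(inflation)         # 120 x 1: the common component over time

print(pca.components_)                        # 1 x 4: each series' loading on the component
print(pca.explained_variance_ratio_)          # share of total variation explained
```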
In this example, we find the degree to which inflation across multiple countries is influenced by a common macroeconomic driver. In textual analysis, we want to uncover the extent to which words across multiple documents are driven by common themes. For this, we use Latent Dirichlet Allocation (LDA).
LDA looks for a workable thematic summary of words in our document-term matrix. Each theme, or topic, can be found by searching for groups of words that frequently occur together in documents across our body of texts. Each term within a topic is assigned a probability, and those assigned especially high probabilities govern the topic’s ‘theme.’
For example, high probabilities assigned to words like ‘quake’ or ‘tsunami’ are likely to imply that the topic they belong to is ‘natural disasters.’ Then, in turn, if the ‘natural disasters’ topic is given a high probability relative to other topics uncovered in the text corpus, the document is likely to be about, say, global warming rather than FOMC meetings (Example 2).
More formally, in LDA topic modelling, documents are probability distributions over latent topics and the topics themselves are probability distributions over words. Just like PCA, labelling the common components or themes is up to the end user and requires some domain expertise.
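For illustration, here is a minimal LDA sketch with scikit-learn on a toy corpus; the documents, topic count and candidate themes are purely illustrative.

```python
# A minimal LDA sketch: documents as distributions over topics,
# topics as distributions over words.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "earthquake tsunami damage disaster relief",
    "rate hike inflation committee policy meeting",
    "tsunami warning disaster evacuation",
    "inflation policy rate decision",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)            # each row: a document's weights over the 2 topics

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):   # each row: a topic's (unnormalised) word weights
    top_terms = [terms[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}:", top_terms)           # labelling the themes is left to the researcher
```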
Semantic Meaning in a Local Context
Standard LDA models elicit meaning at a global level. They impute information from word frequencies independently of where they occur in our texts. However, semantic meaning is largely contained in the local context: a word’s meaning will depend on either its immediate or longer-range neighbours.
While the bag-of-words model can be extended locally by tabulating n-grams, an influential line of work in NLP reframes the global analysis as a local one by measuring each term’s local co-occurrence with other terms. These so-called word embedding models compress the high-dimensional word lists into relatively low-dimensional vectors based on co-occurrence patterns, thereby leveraging information in the local context.
For co-occurrence at a local level, Word2Vec is perhaps the most influential word embedding model (see GloVe for global co-occurrence). Using individual words and a sliding window of context words around them, the algorithm either predicts the current word from the surrounding context or vice versa.
Intuitively, the algorithm gives similar representations to words that appear in similar contexts across documents. If researchers have a lot of text data, the algorithm can estimate bespoke embeddings to capture word meanings specific to the application – this is self-supervised learning.
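As an illustration, bespoke embeddings can be estimated with gensim’s Word2Vec implementation; the three toy sentences below stand in for a researcher’s own tokenised corpus.

```python
# A minimal Word2Vec sketch with gensim (self-supervised learning).
from gensim.models import Word2Vec

sentences = [
    ["inflation", "rose", "sharply", "last", "quarter"],
    ["prices", "rose", "as", "inflation", "accelerated"],
    ["the", "recession", "deepened", "as", "output", "fell"],
]

model = Word2Vec(
    sentences,
    vector_size=50,    # dimension of the embedding space
    window=3,          # size of the sliding context window
    min_count=1,       # keep every word in this tiny toy corpus
    sg=1,              # skip-gram: predict context words from the current word
)

print(model.wv["inflation"][:5])                   # first entries of the learned vector
print(model.wv.most_similar("inflation", topn=2))  # nearest words in the embedding space
```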
With smaller datasets, one can use pre-trained embeddings estimated on a large, auxiliary corpus (like Wikipedia) and port them to a new application. This strategy is an application of transfer learning, which is a methodology in machine learning that focuses on applying the knowledge gained from solving one task to a related task. This approach is not often used in economics, because generic embeddings may not produce the most useful word representations for economic tasks.
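For completeness, here is a sketch of the transfer-learning route, loading pre-trained GloVe vectors through gensim’s downloader; the vectors are fetched from the gensim-data repository on first use.

```python
# A minimal transfer-learning sketch: reuse embeddings estimated on a large
# auxiliary corpus (Wikipedia + Gigaword) instead of training our own.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")     # downloads the pre-trained vectors on first use

print(glove["recession"][:5])                   # pre-trained vector, no in-domain training needed
print(glove.most_similar("recession", topn=3))  # nearest words according to the generic corpus
```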
Transformers in NLP
ChatGPT is a transformer-based, pre-trained language model. To see how it works, imagine the following two sentences, where [MASK] refers to an omitted word:
As a leading firm in the [MASK] sector, we hire highly skilled software engineers.
As a leading firm in the [MASK] sector, we hire highly skilled petroleum engineers.
Humans intuitively know which key words to focus on to predict omitted words. In the example, both sentences are the same, except for the words ‘software’ and ‘petroleum.’ These allow us to infer that the omitted words are likely to be ‘IT’ in the first sentence and ‘energy’ in the second.
Word embedding algorithms cannot do this. They weight all words in the context window equally when constructing embeddings. A recent breakthrough in NLP has been to train algorithms to pay attention to relevant features for prediction problems in a context-specific manner.
Self-attention, as this is known, takes a sequence of initial token embeddings (from, say, Word2Vec) and outputs new token embeddings that allow the initial embeddings to interact. Each new embedding is a weighted average of the initial ones, and the weights determine which pairs of tokens interact to form a context-sensitive word embedding.
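A stripped-down numpy sketch of single-head self-attention illustrates the idea; the projection matrices are drawn at random here purely for illustration, whereas in practice they are learned.

```python
# A minimal self-attention sketch: each output embedding is a weighted
# average of all input embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 8                            # e.g. a 6-token sentence, 8-dim embeddings
X = rng.normal(size=(n_tokens, d))            # initial token embeddings (from, say, Word2Vec)

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # random stand-ins for learned weights
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)                 # pairwise interaction scores between tokens
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True) # softmax: each token's weights sum to 1

contextual = weights @ V                      # context-sensitive embeddings: weighted averages
print(weights.round(2))                       # which tokens attend to which
```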
The attention weights in the self-attention function are estimated by Transformers – large neural networks – to successfully perform masked-word prediction, like in the example above. Unlike Recurrent Neural Networks (RNNs) before them, they process the entire textual input all at once, increasing parallelisation and reducing training times.
Generative pre-trained transformers (GPT) are a family of models pre-trained to perform next-token prediction on large corpora of generic text (e.g., Wikipedia, Common Crawl, etc.). Another family of well-known models – Bidirectional Encoder Representations from Transformers (BERT) – instead perform masked-token prediction.
Masked-token prediction is effectively what we tried above. DistilBERT produced the following list of words most likely to fit the masked words for the two example sentences (Table 1). As we can see, it does a good job of identifying important information, even when it lies several tokens away from masked words.
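The same exercise can be reproduced with the Hugging Face transformers library’s fill-mask pipeline; this is a sketch rather than the authors’ exact setup.

```python
# A minimal masked-token prediction sketch with DistilBERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

sentence = ("As a leading firm in the [MASK] sector, "
            "we hire highly skilled software engineers.")

for pred in fill_mask(sentence)[:3]:          # three most likely fillers for the masked word
    print(pred["token_str"], round(pred["score"], 3))
```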
Modern NLP models have made large strides forward in understanding semantic meaning in an everyday context. They can, however, also be fine-tuned for supervised learning tasks – that is, updated for prediction in specific contexts. And, because Transformer models have a good general understanding of diverse texts, fine-tuning achieves good performance even with relatively few labelled training samples.
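A heavily condensed fine-tuning sketch with the Hugging Face Trainer might look like the following; the two labelled sentences and the hawkish/dovish labels are hypothetical placeholders for a researcher’s own training sample.

```python
# A minimal fine-tuning sketch: adapt a pre-trained Transformer to a
# text-classification task with a small labelled sample.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["growth is accelerating", "a deep recession looms"]  # hypothetical labelled sample
labels = [1, 0]                                               # e.g. 1 = hawkish, 0 = dovish

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True)

class TinyDataset(torch.utils.data.Dataset):
    """Wraps the tokenised texts and labels for the Trainer."""
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1),
    train_dataset=TinyDataset(),
)
trainer.train()                               # updates the pre-trained weights on the new task
```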
Finishing this section, the authors point out that these models have downsides. Transformers lack transparency, making it impossible to replicate the full estimation pipeline. They also require vast hardware resources, meaning most researchers must begin by downloading previously fitted models and updating them.
Moreover, Transformer models only operate on relatively short documents. This works well for sentences or paragraphs, but not for longer documents such as speeches or corporate filings. For longer documents, it is usually better to use non-Transformer-based alternatives like gradient boosting.
Bottom Line
Text algorithms are unlocking many interesting research questions for economists. The first of this two-part summary on NLP in Economics provides some structure on how to leverage information in texts. From inputting texts to reducing high-dimensional matrices, and from equally weighted word embeddings to trained attention weights, the paper helpfully merges the basics with frontier NLP research. I hope you find it helpful…
Sam van de Schootbrugge is a Macro Research Analyst at Macro Hive, currently completing his PhD in Economics. He has a master’s degree in economic research from the University of Cambridge and has worked in research roles for over 3 years in both the public and private sectors. His research expertise is in international finance, macroeconomics and fiscal policy.
Photo Credit: depositphotos.com