Summary
- A new FEDS working paper uses Twitter chatter to build a measure of credit and financial market sentiment.
- The authors use keyword clustering alongside FinBERT to assign sentiment scores to tweets. They then average over tweets to create sentiment indices at different frequencies.
- Their monthly index correlates highly with corporate bond spreads and other price- and survey-based measures of financial conditions.
- At higher frequencies, the index can predict next-day stock market returns and forecast changes in the US monetary policy stance.
Introduction
We recently published a two-part series on natural language processing (NLP) applications in economics. The first covers the building blocks of textual analysis and the popular methods used by researchers. The second provides examples of how to apply these methods to tackle common problems in economics.
One such problem is detecting concepts in economically relevant texts. This is what a new FEDS working paper does with tweets. The authors use keyword clustering to identify tweets with financial market semantics. Then, they apply the state-of-the-art BERT model to create sentiment indices from these tweets.
Clustering
In Part II of our ‘NLP in Economics’ series, we explained how to use clustering to measure similarities between texts. Here is how this paper applies it.
First, the authors collect 60 keywords (or word roots) related to financial discourse. Then, they use Wiki2Vec, a word embedding model that captures a word's meaning from its local context, to assess the similarity of the keywords based on their co-occurrence in Wikipedia texts.
This ‘assessment’ is a measure of the cosine distance between the paper’s keywords and semantically similar Wikipedia words. The authors then take these calculated similarities to determine optimal keyword groups, or clusters.
In total, they choose three cluster groups of words with similar meanings (Chart 1). The three clusters loosely map into financial contracts (Group 1), entities (Group 2), and actions or contractual features (Group 3).
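For readers who want to experiment, here is a minimal sketch of this clustering step. The keywords, the random embedding stand-in, and the pipeline are ours for illustration – the paper uses Wiki2Vec vectors and its own 60 word roots – but the mechanics (pairwise cosine distance plus a three-cluster grouping) follow the description above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative keywords; the paper uses ~60 financial word roots.
keywords = ["loan", "bond", "mortgage", "bank", "lender", "default", "repay"]

# Stand-in for Wiki2Vec: random vectors purely for illustration.
rng = np.random.default_rng(0)
vectors = np.vstack([rng.normal(size=100) for _ in keywords])

# Cosine distance = 1 - cosine similarity, computed pairwise.
dist = 1.0 - cosine_similarity(vectors)

# Group the keywords into three clusters, as in the paper
# (requires scikit-learn >= 1.2 for the `metric` argument).
labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(dist)

for k in range(3):
    print(f"Group {k + 1}:", [w for w, l in zip(keywords, labels) if l == k])
```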
Pre-Processing Tweets
Having identified three clusters, the authors collect tweets that contain at least one word from each. They eliminate tweets that refer to advertising (e.g., ones that contain ‘social security’ or ‘credit card’ phrases), cryptocurrency (e.g., ‘crypto’), and decentralised financial assets (e.g., ‘NFT’).
Next, the authors pre-process the remaining tweets by removing excess white space, tags, hyperlinks, and information that is not part of the tweet's text body. They shift timestamps to Eastern Standard Time and store the date and time of each tweet.
They filter out retweets and any duplicates or near replicas of tweets (although information on retweet counts and other engagement counts is kept). Retweets make up roughly 2mn of the 7.1mn tweets collected, leaving 5.1mn tweets. The removal of full or near replicas brings the total down to 4.4mn tweets from 2007 to April 2023 (Chart 2).
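A stylised version of these filtering and cleaning steps might look like the following. The cluster word lists, blocked phrases, and column names are illustrative rather than taken from the paper, and near-replica detection would need fuzzier matching in practice.

```python
import re
import pandas as pd

# Illustrative cluster word roots and blocked phrases; the paper's
# actual groups are shown in Chart 1.
clusters = [
    {"loan", "bond", "mortgage"},    # Group 1: financial contracts
    {"bank", "lender", "borrower"},  # Group 2: entities
    {"default", "repay", "issu"},    # Group 3: actions / contractual features
]
blocklist = ("social security", "credit card", "crypto", "nft")

def keep(text: str) -> bool:
    """Keep tweets with at least one root from each cluster and no blocked phrase."""
    t = text.lower()
    return (all(any(root in t for root in c) for c in clusters)
            and not any(b in t for b in blocklist))

def clean(text: str) -> str:
    """Strip hyperlinks, tags, and excess white space."""
    text = re.sub(r"https?://\S+", " ", text)  # hyperlinks
    text = re.sub(r"[@#]\w+", " ", text)       # user tags and hashtags
    return re.sub(r"\s+", " ", text).strip()   # excess white space

# Toy sample; the column names are ours, not the paper's.
df = pd.DataFrame({
    "text": [
        "Banks tighten loan standards as borrowers default https://t.co/x",
        "Banks offer loan deals if borrowers default on a credit card",
    ],
    "created_at": pd.to_datetime(["2023-03-10 14:05", "2023-03-10 15:00"], utc=True),
    "is_retweet": [False, False],
})

df = df[~df["is_retweet"] & df["text"].map(keep)].copy()
df["text"] = df["text"].map(clean)
df["created_at"] = df["created_at"].dt.tz_convert("US/Eastern")
df = df.drop_duplicates(subset="text")  # full replicas; near-replicas need fuzzy matching
print(df["text"].tolist())
```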
Creating a Sentiment Index
Large language models, such as Bidirectional Encoder Representations from Transformers (BERT) or Generative Pre-trained Transformers (GPT), are large neural networks trained to predict words from their context (masked words in BERT's case, the next word in GPT's).
They take a sequence of token embeddings and learn how much attention (i.e., what weight) to give each word based on its pairwise interactions with the other words in the sequence.
They represent the frontier in capturing semantic meaning because they go beyond word counts, associations, and syntactic patterns. As a result, they outperform other ML- and dictionary-based models in tasks like sentiment classification.
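For intuition, the attention mechanism described above boils down to the textbook scaled dot-product formula. A generic sketch, not code from the paper:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: each token's output is a weighted
    average of all value vectors, weighted by query-key similarity."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # pairwise interaction weights
    return weights @ V

# Toy example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = attention(x, x, x)  # self-attention over the token sequence
print(out.shape)          # (4, 8)
```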
Here, the authors use FinBERT, a version of BERT fine-tuned to better understand financial jargon. It calculates sentiment scores for each sentence in a text based on the probability of it being positive, negative, or neutral.
For tweets, which often contain multiple sentences, the authors replace full stops with semicolons so that FinBERT treats each tweet as a single sentence. FinBERT then provides one sentiment score between -1 and +1 per tweet in the sample.
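A minimal sketch of this scoring step, using the publicly available ProsusAI/finbert checkpoint on Hugging Face (we assume this or a similar checkpoint; the paper's exact setup may differ). The score is the model's probability of a tweet being positive minus its probability of being negative:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Publicly available FinBERT checkpoint (an assumption on our part).
name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def score(tweet: str) -> tuple[float, float]:
    """Return (sentiment in [-1, 1], P(neutral)) for one tweet."""
    text = tweet.replace(".", ";")  # treat the tweet as a single sentence
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
    id2label = model.config.id2label
    p = {id2label[i].lower(): probs[i].item() for i in range(len(probs))}
    return p["positive"] - p["negative"], p["neutral"]

print(score("Credit conditions are tightening. Bond spreads widen."))
```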
The authors drop tweets with the highest probability of being neutral. They then sum the sentiment scores of the remaining tweets and divide by the total number of tweets to get sentiment values at different frequencies.
The result is a Twitter Financial Sentiment Index (TFSI) presented at a daily (Chart 3), weekly or monthly frequency. The index is orientated so that higher values indicate a deterioration in sentiment. Key deteriorations match up with the Taper Tantrum in 2013, the Covid-19 outbreak in 2020, and the beginning of Federal Reserve (Fed) tightening in September 2021.
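Aggregating the tweet-level scores into an index is then straightforward. Here is a sketch under the recipe described above – sum the non-neutral scores, divide by the total tweet count, and flip the sign so higher values indicate worse sentiment – with illustrative column names and data:

```python
import pandas as pd

# Toy tweet-level scores; in practice this holds millions of rows.
scored = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2020-03-09 10:00", "2020-03-09 15:30", "2020-03-10 09:00"]
    ),
    "sentiment": [-0.80, -0.40, 0.05],  # P(positive) - P(negative) per tweet
    "p_pos": [0.05, 0.20, 0.10],
    "p_neg": [0.85, 0.60, 0.05],
    "p_neutral": [0.10, 0.20, 0.85],
})

# Flag tweets whose most likely label is neutral; these are dropped from the sum.
is_neutral = scored[["p_pos", "p_neg", "p_neutral"]].idxmax(axis=1) == "p_neutral"

# Zeroing out neutral tweets makes the period mean equal to
# sum(non-neutral scores) / total tweet count.
scored["contrib"] = scored["sentiment"].where(~is_neutral, 0.0)

# Daily index, sign-flipped so higher values indicate worse sentiment.
tfsi_daily = -scored.set_index("created_at")["contrib"].resample("D").mean()
print(tfsi_daily)
```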
The authors also find that the variation in the index over time is mostly explained by the share of users posting tweets with positive or negative sentiment, rather than by the intensity of the tweeted sentiment. In other words, what drives the index is whether a tweet is classified as positive or negative, not how high its probability of being so is.
Results
To begin, the authors show that their monthly TFSI broadly correlates with other common metrics of economic and financial conditions: (i) the Baa corporate bond spread; (ii) the excess bond premium; and (iii) the University of Michigan Consumer Sentiment Index (Chart 4).
Next, they show that the overnight TFSI, measured from all tweets posted between 4pm and 9am, can predict next-day S&P 500 returns. The lower the overnight sentiment, the lower stock returns are the following day. In numbers, a one-standard-deviation increase in the TFSI (i.e., a deterioration in sentiment, given the index's orientation) predicts a 6bps fall in stock returns the following day.
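A back-of-the-envelope version of this predictive exercise could look as follows, with simulated data calibrated to the paper's headline number (the variable names and data are ours):

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Simulated daily data: overnight TFSI (4pm-9am) and next-day S&P 500 returns,
# embedding the paper's finding of -6bps per one s.d. of the TFSI.
rng = np.random.default_rng(1)
n = 500
tfsi_overnight = pd.Series(rng.normal(size=n))
ret_next_day = -0.0006 * tfsi_overnight + rng.normal(scale=0.01, size=n)

# Standardise the predictor so the slope reads as 'per one standard deviation'.
z = (tfsi_overnight - tfsi_overnight.mean()) / tfsi_overnight.std()
model = sm.OLS(ret_next_day, sm.add_constant(z)).fit()
print(model.params)  # slope should be near -0.0006, i.e., -6bps
```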
Lastly, the authors find evidence that tweets relate strongly to Fed communications in and around FOMC days. Fed-related tweets account for about 25% of financial discourse on FOMC days, a figure that remains above average for up to five days after (Chart 5).
These discussions, measured between 4pm the day before and 2pm on the day of FOMC meetings, can also help predict the monetary policy stance. According to the authors, larger contractionary monetary policy shocks are associated with souring sentiment (Chart 6). The reverse does not hold, however: expansionary shocks do not lift sentiment comparably, suggesting financial Twitter ('FinTwit') has a negative bias.
Bottom Line
This paper is a good example of how advancements in large language models (LLMs) can be leveraged to build simple sentiment indices for prediction. But remember that LLMs, while better at capturing context than dictionary-based approaches, are less transparent, making them harder to interpret.
It is important, therefore, to validate the choice of LLM for sentiment indices. The authors compare their results against VADER – a dictionary-based sentiment analysis tool designed specifically to measure sentiment in social media. The results vary considerably, with a correlation of just 0.68 between the two seemingly equivalent measures. This choice will surely have had downstream implications.
Sam van de Schootbrugge is a Macro Research Analyst at Macro Hive, currently completing his PhD in Economics. He has a master's degree in economic research from the University of Cambridge and has worked in research roles for over 3 years in both the public and private sectors. His research expertise is in international finance, macroeconomics and fiscal policy.
Photo Credit: depositphotos.com