This is an edited transcript of our podcast episode with Stefan Jansen, published 19 May 2023. Stefan is the author of the widely read ‘Machine Learning For Algorithmic Trading’. He is the founder and Lead Data Scientist at Applied AI. He advises Fortune 500 companies, investment firms and startups across industries on data & AI strategy and developing machine learning solutions. Before his current venture, he was a partner at Infusive, an international investment firm, where he built the predictive analytics and investment research practice. He was also a senior executive at Rev Worldwide, a global fintech company focused on payments. Earlier, he advised central banks in emerging markets and consulted for the World Bank. In this podcast, we discuss the benefits machine learning brings that other techniques don’t, the challenge of using machine learning in finance, what ChatGPT is and the underlying tech of LLMs, and much more. While we have tried to make the transcript as accurate as possible, if you do notice any errors, let me know by email.
Introduction
Welcome to Macro Hive Conversations With Bilal Hafeez. Macro Hive helps educate investors and provide investment insights for all markets, from equities to bonds to FX. For our latest views, visit macrohive.com. Before I start my conversation, I have three requests. First, please make sure to subscribe to this podcast show on Apple, Spotify, or wherever you listen to podcasts. Leave some nice feedback and let your friends know about the show. My second request is that you sign up to our free weekly newsletter that contains market insights and unlocked content. You can sign up for that at macrohive.com/free. All of these make a big difference to me and make the effort of putting these podcasts together well worth the time. Finally, my third request is that if you are a professional, family office or institutional investor, do get in touch with me. We have a very high-octane research and analytics offering that includes access to our world-class research team, model portfolio, trade ideas, machine learning models, and much, much more. You can email me at [email protected] or you can message me on Bloomberg for more details.
Now, onto this episode guest, Stefan Jansen. Stefan is the author of the widely read, Machine Learning for Algorithmic Trading. He’s the founder and lead scientist at Applied AI. He advises Fortune 500 companies, investment firms and startups across industries on data and AI strategy and developing machine learning solutions. Before his current venture, he was partner at Infusive, an international investment firm where he built the predictive analytics and investment research practice. He was also a senior executive at Rev Worldwide, a global fintech company focused on payments. And earlier, he advised central banks in emerging markets and consulted for the World Bank. Now, onto our conversation.
So greetings and welcome Stefan to the podcast show. I’ve been looking forward to having this conversation with you.
Stefan Jansen (01:45):
My pleasure. Thanks so much for having me, Bilal.
Bilal Hafeez (01:47):
Now, before we go into the heart of our conversation, I do always like to ask my guests something about their origin story. So why don’t you tell us something about what you studied at university, how you got into machine learning and quantitative finance and something about your career journey until now.
Stefan Jansen (02:03):
Yeah, so I think coming to machine learning, I have a maybe slightly unusual background. I’m an economist by training and I was always very much attracted to the quantitative and especially data sort of driven side of things because I always found it very kind of necessary to take all these various models to the data and see what works and what doesn’t. So tinkering with the whole econometric machinery at the time, I found that really interesting and using computers to do that, very convenient. You can do things at scale and they do the work for you. So I liked that a lot and I did like actually finance from a theoretical perspective back then. But I also always had a very different interest, which was working in international development because I had travelled early on quite a bit abroad.
So I had lived in Guatemala for a year after high school and then I studied in Brazil for a year, and I always looked at these countries very much from a development economics and social development perspective. And so after doing a masters in economics back in Germany where I’m from, in the early 2000s, I was sitting on the fence between doing a PhD in economics and doing something in the field. And I was heavily gravitating towards the academic or the PhD side because I just liked the technical work of it, but I’m not an academic at heart at all. So the very specific nature of the work or the topics just seemed so limited, partly because the data were small at the time.
I did a master thesis on early warning systems for financial crises, which used the IMF’s International Financial Statistics data set, where at the time you still got the entire CD-ROM. And I used all this data to replicate some IMF system, which used, I don’t know, maybe 200,000 observations or so, which in economics was large at the time, but most of the other things were tiny. Looking at a data set with 2,000 observations to figure out something about industrial organisation seemed like, I don’t know if that was going to move the world. But on the other side there was an option to go to Asia.
So I worked in Indonesia for three years at the Central Bank as a policy advisor after the financial crisis. And that was more appealing because you are in the thick of it and you see how policy impacts things on the ground, and you are also in a very different cultural environment. So all of these things together always fascinated me. So I did that for some time and then eventually decided that public sector work wasn’t as dynamic as I might like, I’m a little more entrepreneurial by nature. So I did another masters at Harvard, sort of an international development masters that was also fairly technical, but with the idea to actually get out of this field.
So after that I joined a fairly advanced fintech startup in Austin, and that got me back into data and machine learning because we worked in payments technology. And in payments, suddenly you have actual data at scale and you can actually do things with it. And that was the time when machine learning suddenly became a topic. Andrew Ng launched his courses, and looking at those, I literally saw in the first lesson the technique that I had used in my master thesis, which was logistic regression compared to some non-parametric method; it was the first thing they discussed. And I was like, what? This is machine learning? And now in an industry setting, suddenly this seemed relevant. There was no holding back because I realised that a lot of my training and interests suddenly actually had a use, and together with those early interests that kind of took me back to this. So that was a little over 10 years ago and I’ve been doing it ever since.
What Benefits Does Machine Learning Bring That Other Techniques Do Not Have
Bilal Hafeez (05:50):
And I have to say you’ve written this magnum opus of a book, Machine Learning for Algorithmic Trading. It’s close to 800 pages, but it’s a book that is part of the standard reading for everyone in our team who does machine learning. Everybody uses it like a bible. So you’ve definitely had an impact, certainly on the Macro Hive team, and I am sure on lots of other people as well. And we’ll talk more about that later as well.
Machine learning and artificial intelligence, it’s a big buzz term at the moment, AI in general, but I think what’s useful is to demystify some of these concepts. So maybe if we start with what you touched on at the beginning, econometrics. Most people have done economics courses or finance courses, and in their classes they learn about econometrics, where you have a number of variables, you run your regression and it estimates the output. So coefficient A times one variable plus coefficient B times another equals Y, and you run the regression. You can do it in Excel, you can run a package like EViews, or you can see a chart with scatter plots and fit a nice linear regression, and you have some idea of the relationship between two different variables. And it’s a workhorse for lots of people in economics and in finance; people love their charts, two lines moving together. So I guess the first question I have for you is, what does machine learning bring to the table that econometrics can’t do?
Stefan Jansen (07:13):
Well in a nutshell, high dimensionality and non-linearity. It’s sort of econometrics on steroids in a way. One way to look at it, that’s at least how I discovered it initially, is that of course the things that you look at in econometrics would also generally be considered part of machine learning. But very often the focus of the application shifts away from trying to measure; the ‘metrics’ comes from measuring things. So econometrics originally, with Tinbergen and all the forefathers, is a little bit about figuring out how you can reliably measure relationships between economic variables, precisely to show, as I said earlier, whether implications of models actually hold up to the empirical reality. So you have an entire statistical machinery around that and the like, and machine learning, because it’s driven by different disciplines, comes much more with a predictive angle to things.
So that means they might be using some techniques that you find in econometrics; the overlap is more at the baseline, at the foundation, and it’s not just linear regression. But as you go into the machine learning literature or the sources there, you suddenly have linear regressions that are designed to use large numbers of variables, and then they add shrinkage to deal with some of the problems that come when you operate in high dimensions, where you get a lot of variance in your predictions because parameters can be so influenced by what you see during training. So you literally have a variation of linear regression that gets these ridge or lasso kind of approaches embedded, with some penalty term that aims to deal with the high dimensionality.
Bilal Hafeez (09:02):
And when you say higher dimensionality, what do you mean by that?
Stefan Jansen (09:05):
Well, I mean if somebody who’s watching has done linear regression in sort of an academic context, you’ll typically have single digit numbers or maybe a few dozen predictors in there. I mean sure, maybe with some dummy variables et cetera, it blows up a little bit, but you’re rarely going to run regressions where you have hundreds of thousands of variables or an even larger number. Something like logistic regression could run at Google in the context of predicting click-through rates, and there could be thousands and thousands of predictors if not more, right? And it would be used because it’s fast; you get low latency in these forecasts even though it’s linear. You may be losing a little bit because it’s a simple model, but you gain a lot because it’s fast. So something like this could run in a production context at a tech company, but with numbers of variables that would be orders of magnitude larger than what you would do in an academic context with a traditional OLS, ordinary least squares, type of regression.
So that’s kind of going from a modest number of dimensions, where you really try to zoom in on variables that you care about and test hypotheses to figure out what works there, to something where you say, I’m going to use everything that is potentially useful to predict my outcome variable, and then I’m going to try to work around the issues of having so many different variables that might be highly correlated and so forth. You address the variance that comes with the noise that you inevitably have in your training data by adding some additional bells and whistles, which are these shrinkage parameters, which are simply a penalty. That means the optimisation algorithm is only going to increase, in absolute terms, the value of a parameter, which would make it more responsive to certain data points when you’re actually trying to predict, if that increase contributes to a significant improvement in your in-sample fit.
So you have to overcome some hurdles. And this is the bias-variance way of looking at it, where you say that in order to predict better with lower variance, which means the predictions don’t swing so much with the different test data that you predict on, you may be happy to incorporate or swallow some bias, which means you’re slightly off in your prediction.
So that’s one way the focus shifts to a different application, which in machine learning tends to be prediction, as opposed to explaining why variables X1, X2, X3 impact the outcome Y through their various coefficients and then coming up with statistics that measure whether the impact is statistically relevant or just subject to chance. So this is how you go from explanation and measurement to prediction. And then you have the entire other dimension that machine learning brings to the table, which is non-linearity, which in econometrics you always sidestep by saying, well, we only care about a tiny little area on the modelling side.
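To make the shrinkage idea concrete for readers following along in code, here is a minimal Python sketch on synthetic data (our illustration, not an example from the episode or the book): with far more predictors than observations, plain OLS fits the noise, while the ridge and lasso penalties rein the coefficients in.

```python
# Minimal sketch of shrinkage in high dimensions (synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_samples, n_features = 200, 500          # more predictors than observations
X = rng.standard_normal((n_samples, n_features))
true_beta = np.zeros(n_features)
true_beta[:10] = rng.standard_normal(10)  # only 10 predictors actually matter
y = X @ true_beta + rng.standard_normal(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),   # L2 penalty shrinks all coefficients
                    ("Lasso", Lasso(alpha=0.1))]:   # L1 penalty pushes many to exactly zero
    model.fit(X_tr, y_tr)
    print(name, "out-of-sample R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
```

The penalty weight `alpha` is the hurdle Stefan describes: a coefficient only grows in absolute terms if it improves the fit by enough to outweigh the penalty.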
The Challenge of Using Machine Learning in Finance
Bilal Hafeez (12:20):
Yeah, yeah, econometrics you are heavily constrained on, it’s generally linear. You can have quadratic functions which kind of introduce some kind of nonlinearity, but it’s very constrained. I mean it’s very simplistic in many ways. So machine learning allows you to work with larger data sets, allows you to uncover non-linear relationships. It’s more explicitly about prediction rather than explanation. So it sounds ideal for finance. And so the question I have to you then, why doesn’t everyone use it in finance today?
Stefan Jansen (12:51):
Well my book has 800 pages, so it takes a little while to get through it. Yeah, no, just kidding. So well, obviously finance has its own constraints. Machine learning has taken off; I work a lot in different industries as well, where the applications more resemble low-hanging fruit. In finance you have its special nature: first of all, many people are trying to predict the same thing. So to have an edge is generally harder. Prices may incorporate a lot of the information that’s out there because so many people are at work trying to interpret that information, both on the longer term as well as on very short horizons. So you’re simply already picking a much harder game in terms of predicting things. And that also implies that the signal, a signal meaning any type of information in the data that you have at this point in time that may speak towards the future, any of that signal may already have been exploited by someone in some way.
So what the data represents is largely non-signal, which we also like to call noise. So that is hard. And there of course you have the challenge that these potentially complex models in machine learning can pick up the noise just as well as the signal. So there’s a huge risk that instead of extracting the signal, the models at least also extract the noise, if not only the noise, which means they become useless for prediction or very faulty. So that clearly is a big challenge. Then you also have the problem that machine learning on the one hand likes high dimensions, which in finance you may have: you have at least the resemblance of a lot of potential inputs, lots of securities are trading, lots of variables you can compute from the price history, so much other data you could bring to the table.
But one thing that you don’t necessarily have is a very long history of this data. So YouTube may also not have such a long history, but they do have a lot of data for the short amount of history that they have. So if you want to build machine learning models that are able to recognise cats and dogs, they stand on a different foundation in data terms than somebody doing the same thing in finance where you have maybe 10, 20, 30 years of stock data where it’s not entirely clear that going from daily data to tick data really enlarges the information that you get in the data.
Bilal Hafeez (15:24):
Yeah, no, I see what you mean. So in finance, you can get lots of variables on one side of the equation, you can get millions and millions of potential predictive variables. But on the other side, the dependent variable, the stock market, there’s only one US stock market that goes back 20, 30 years, and so you don’t get the big data on that side. Whereas with YouTube videos you get both sides: you get billions of cats and cat videos as well as billions of potential features within that. So in finance, I mean, where do you think machine learning is used most commonly? Which areas do you find it most heavily used?
Stefan Jansen (16:02):
I mean I think there are applications sort of across the board; people use it for higher frequency trading because it is useful to extract information. The general precondition, and this is also where I find the application of machine learning productive with the clients I work with, is that you operate in an environment where there are already quantitative investment strategies that may be run by humans at relatively high frequency. So we’re talking a few seconds, minutes or longer term, where you can then use machines to explore similar quantitative information in a more efficient way.
So if you have signals, as opposed to human interpretation of news, so not so much people that read newspapers, listen to and look at Bloomberg news tickers, read in-depth research reports or visit company executives and grill them on certain items, but people that look at some type of quantitative source of data, signals and the like. If you have something like this, especially if it has already been demonstrated to be useful and profitable, you can very often translate that to a machine learning setting in a quite straightforward way. So this is sort of on the quantitative side. So if you ask where is it used, it’s obviously-
Bilal Hafeez (17:20):
So an example there would be if somebody has a momentum model or carry model or something like that, and it demonstrably does make money over time, you can then have a machine learning overlay which will enhance the returns because it will use data more efficiently to predict the next day’s returns of that strategy. So you can basically risk manage it using the ML signal.
Stefan Jansen (17:43):
Yeah. Very often these strategies use some signals and some rules to translate those signals into entry or exit orders. So the entire thresholding that happens there can often be done more effectively or more efficiently by a machine learning model, because it is a little more sensitive; it can just process these things a little better. So that’s a useful application. I think this is what’s closest to home, because the people that operate in this space also tend to be closest to this kind of methodology, as opposed to the purely fundamental investors that build on their deep insights into an industry. Naturally, they ask, how can a model capture my judgement? And for good reasons, especially those that do well, they certainly have good reasons to ask these questions; it’s not as straightforward.
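As a purely hypothetical sketch of that overlay idea (the feature names, data and thresholds below are made up for illustration, not anything Stefan or his clients use), a classifier can learn when a simple momentum signal tends to pay off and gate the positions accordingly:

```python
# Hypothetical sketch: learn when a momentum signal pays off the next day (illustrative only).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 1500
# Stand-ins for real inputs: a momentum signal plus a couple of context features.
features = pd.DataFrame({
    "momentum_12m": rng.standard_normal(n),
    "realised_vol": rng.gamma(2.0, 0.1, n),
    "carry": rng.standard_normal(n),
})
next_day_return = rng.standard_normal(n) * 0.01          # placeholder strategy returns
label = (next_day_return > 0).astype(int)                 # 1 = the signal paid off

split = int(n * 0.7)                                       # respect time order: no shuffling
model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
model.fit(features.iloc[:split], label[:split])

# Use predicted probabilities as a position filter or sizing input.
prob_up = model.predict_proba(features.iloc[split:])[:, 1]
position = np.where(prob_up > 0.55, 1.0, 0.0)              # only trade when confident
```

In practice the features would be real signals and the labels the strategy's realised next-day outcomes; the point is only to show the thresholding being learned rather than hand-set.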
But then you hear somebody like Sam Altman, who runs OpenAI, say, when they ask him what he thinks is the real killer application that is going to come out of things like ChatGPT, that he is looking forward to the person that builds the killer investment machine à la RenTech using large language models. So moving from the purely quantitatively driven to something that actually starts processing much more diverse information that’s out there in the world. You know-
Bilal Hafeez (19:09):
You mentioned RenTech, you mean Renaissance Technologies, which is perhaps the most successful fund in the world.
Stefan Jansen (19:13):
Exactly.
Bilal Hafeez (19:13):
That’s highly quantitative works at the high frequency level, but I guess Sam Altman’s arguing that if you incorporate what we call unstructured data, text, audio, video on top of structured data, then you’ll have this amazing investment tool potentially in the future.
Stefan Jansen (19:30):
You could.
Bilal Hafeez (19:30):
But who knows.
Stefan Jansen (19:32):
Potentially. The capabilities are sort of emerging. The thing with these large distances, the ones that we have to travel to get from the current state to something like general artificial intelligence, is that we have to live with the paradox that we are a lot closer to the target but we’re still very far away. So there is a bit of a tension there, and personally, working with these models, I think what they can do is fantastic. The performance is amazing and you see glimpses of something that goes beyond text summarization towards reasoning and these things, but there are also the cases where they fail. So we’re still wondering to what extent we are observing something impressive. Geoffrey Hinton, for instance, goes around and tells of a case where he was struck by the common sense knowledge of a large language model that was able to solve certain simple tasks that previous AI models were unable to understand.
I can tell you in a minute what it was about, but the point is that there are too many other cases where it doesn’t work quite yet. So the question is, is this a statistical fluke? Are we seeing random cases of things working? And with all the biases that we have in perceiving news, those stand out and people don’t report the other 80% where it doesn’t quite work. I personally find in my own usage that ChatGPT is sometimes a little less useful than I’d like it to be. It’s of course fantastic for many common tasks where the quality standards are maybe not that high, writing a random email and the like. But if you really want to formulate a text according to some concept you have in mind, sometimes it does take quite a bit of work to get it there.
Bilal Hafeez (21:19):
I agree as well. I mean, we’re using it a lot in our company, so I’m getting different people with different approaches to use it, and we haven’t quite found the killer approach for everyone. And so I think the way you write the question or the prompt, and hence the whole industry of prompt engineering, is very important. So it’s not as straightforward as it seems. There’s a lot of work you have to do to make it work really well; you have to really understand the nuances of prompting and things like that. Yeah.
Stefan Jansen (21:47):
We’re at a tricky point here. I tend to be a sceptic by nature, maybe it’s my German roots, I don’t know, but I’m curious to see what we will say about this in three years. Some people think the technology will plateau unless we find a different kind of paradigm beyond the transformer-led representation of text. Others are very bullish about it. I have an open mind, I think anything is possible here, but right now we need a little more to get to these new Renaissance Technologies based on LLMs in the near future.
Understanding Neural Networks
Bilal Hafeez (22:26):
Maybe we can talk a bit more about LLMs, but perhaps we should start with neural networks. So this is one family of machine learning algorithms, and neural networks are the basis upon which LLMs, large language models like ChatGPT, are built. So can you kind of describe, explain neural networks like I’m a five-year-old, as people would write in ChatGPT?
Stefan Jansen (22:46):
Yeah, I mean it really helps if you’re familiar with the concept of a linear, and especially a logistic, regression, because a vanilla, baseline, densely connected neural network is literally multiple or many logistic regressions connected to each other, where the output of one logistic regression flows into the follow-up, the next logistic regression. So it’s like chained functions. So the important thing about neural networks is that you have these neurons or nodes in there that receive inputs. At the base layer, they literally just receive the input data. This can be pixel values from images or it can be any numbers from financial data or so. And these neurons where these data or input values arrive, they pass these values on. You can draw it like a graph where you have these nodes and these edges that connect them.
So the data flows from one node or neuron over this edge to a new node. And as it flows over this edge, it typically gets multiplied by some weight, which is simply a coefficient, hence the similarity to a linear regression, where you have your input data point X and you have some beta coefficient that multiplies it by three or by 0.3 or by -10, and then you have a new value. And then you have these neurons at the second layer, so neural networks tend to have multiple layers. You have this input layer where data comes in, and these input nodes or neurons are connected to multiple neurons above, which also means that each neuron on the level above is connected to multiple input neurons. So what they essentially do is they receive these various input signals, data points that have been multiplied by some coefficients, they just sum those up, again very much like a linear regression, and then have some sort of intermediate value there.
And then comes the important point. This intermediate value passes through an activation function. And the activation function can take many different forms. So whatever the value is originally, it gets squished to a value between zero and one, or if it’s like a tanh function, then it’s between minus one and one. So there are many different variations of these activation functions. The important part is that they transform whatever they get as input in a non-linear fashion into some output value. You also have these functions called rectified linear units that look like a call option. So they’re sort of linear, but they’re cut off, so they’re flat at some point. So whatever the input value is up to a certain point gets a zero output, and after that it goes up linearly.
And now you have to imagine that what happens in one of these neurons happens in all of the parallel neurons in a given layer, and these output values, these activation values, which may be between zero and one or however the activation function operates, then continue to flow to the subsequent layers. So you suddenly get a lot of complexity here, because you now have all these various neurons that operate in parallel, they do similar operations, and this can happen throughout quite a few different layers until eventually they arrive at some output layer, which typically is designed to fit some task. So you may have a regression neural network that simply outputs one value, which can maybe range from minus infinity to plus infinity. Or you can have a classification neural network that maybe outputs 10 different values, where each value represents the probability that the object that you’re trying to classify is in one of 10 classes. And then you would say, oh, where the probability is highest, that’s the predicted class, maybe number three. So maybe it’s a cat and not a dog or a dolphin or so.
So that’s essentially, at a very high level, how this works conceptually. The idea is that these are all nested functions, so you can write out how the output value ultimately depends on all these different inputs, because the data flows through this computational graph from the input through layer 1, 2, 3, however many you have, until the output. And that is the forward pass, the forward flow of the data from the input layer to the output. So that’s how the vanilla networks work, and maybe you get a glimpse already: because it’s so complex, there are many, many different ways to design those networks, and that’s where it gets really interesting.
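A minimal numpy sketch of the forward pass Stefan just walked through, for a toy two-layer classifier (purely illustrative; the sizes and weights are arbitrary):

```python
# Forward pass of a tiny densely connected network (toy example).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                 # input layer: 4 features

W1 = rng.standard_normal((8, 4)) * 0.1     # weights on the edges into 8 hidden neurons
b1 = np.zeros(8)
W2 = rng.standard_normal((3, 8)) * 0.1     # weights into 3 output neurons (3 classes)
b2 = np.zeros(3)

def relu(z):                               # the "call option"-shaped activation
    return np.maximum(0.0, z)

def softmax(z):                            # turn output scores into class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

hidden = relu(W1 @ x + b1)                 # each hidden neuron: weighted sum, then squashed
probs = softmax(W2 @ hidden + b2)          # e.g. P(cat), P(dog), P(dolphin)
print(probs, "-> predicted class:", int(probs.argmax()))
```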
Bilal Hafeez (27:25):
Yeah. And I guess originally, when neural networks were first developed, I think it was in the fifties or the sixties, the idea was to kind of mimic the human brain in some ways, and hence the terms, I suppose, neurons and such. But the idea-
Stefan Jansen (27:39):
Yeah, they were-
Bilal Hafeez (27:40):
To have these chains.
Stefan Jansen (27:44):
Exactly. So even earlier models go back to the twenties or so, but what they didn’t have at the time were learning algorithms. People have always tried to figure out, oh, what do we have between our ears and how does it work? And so there were these mathematical models, but only in the fifties did they have for the first time this kind of learning algorithm. Because one thing is that you can draw this kind of thing and you can write all these equations, but the interesting piece is that initially these weights, these coefficients, don’t have a value. You can maybe randomly initialise them, but how do these weights actually arrive at some useful value? In linear regression you have some algorithm to do the matrix algebra; you can solve your equation there. That doesn’t quite work with neural networks, because they’re a little too complex and non-linear.
So you have to have a learning algorithm. You have some training data where you have all these features, all these variables that provide the values for these input neurons, and you also have an output value that tells the model, well, given these 10 or a hundred or a thousand input values, this is the value that you should ideally predict. And you don’t have just one sample of those but a lot of them, and the model should get all of those right, or as right as possible. And for that, the weights need some actual values. So how do you get those? That was something that literally took decades and decades to solve. They had a solution, a simple one, at the time, but to arrive at the actual backpropagation algorithm that could do this not just for a single layer but for multiple layers, and do so in practice, took quite a bit of work. And of course Geoffrey Hinton is the man most closely associated with that innovation.
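For readers who want to see what that learning algorithm looks like in a modern library, here is a minimal PyTorch sketch (our illustration, not something from the episode): the `loss.backward()` call is backpropagation computing a gradient for every weight, and the optimiser then nudges each weight a little.

```python
# Minimal training loop: backpropagation via autograd (toy regression, illustrative only).
import torch

torch.manual_seed(0)
X = torch.randn(256, 10)                   # 256 samples, 10 input features
y = X[:, :3].sum(dim=1, keepdim=True)      # a simple target the network can learn

model = torch.nn.Sequential(
    torch.nn.Linear(10, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
optimiser = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

for step in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                        # backpropagation: gradients flow output -> input
    optimiser.step()                       # adjust every weight a little
print("final loss:", float(loss))
```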
The 2017 Google Breakthrough That Led to ChatGPT
Bilal Hafeez (29:30):
So we have these neural networks which have all these layers and these neurons and some ability to learn, and hence neural networks are part of the deep learning family, I suppose you could say. And I suppose one challenge always with neural network models was dealing with sequences of information: when you have one event followed by another event, followed by another event, it would always struggle a bit with sequences. Whereas something like an image is just one thing; but when you have a sequence, like a series of words in a sentence, that would always be a challenge. And so there was a development in neural networks, what are called transformers, developed I think by Google in 2017, which was the big breakthrough, I suppose, for large language models. So can you talk a bit more about what that innovation was?
Stefan Jansen (30:25):
Yeah, so I mean the interesting thing about neural networks, and we’ll see this in a minute when we talk about applications in finance, is that in various areas, domain-specific architectures tend to emerge. You mentioned images. So there are these convolutional networks that take into account that in an image, the value of a neighbouring pixel does matter; it doesn’t matter as much what the value is of a pixel that’s like 50 centimetres away in a picture, like the one behind me. And then you have sequences. The vanilla network treats each data point as drawn from a distribution that’s independent and identically distributed. In a sequence you don’t have that. So there were these recurrent neural networks that came up that were specifically designed to keep a memory of what happened before, which is already progress. And this was kind of the workhorse model also for language, which of course is a sequence.
Now in time series you may be very concerned with the previous value. In language, you may or may not be concerned with the previous word. Depending on the sentence, you may also care about words that are actually five words away or that follow later, that together form a unit and help you interpret or make sense of, sort of recover, the semantics of what’s being spoken. So the transformer was a sequence model; the original paper was called ‘Attention Is All You Need’. They come up with this attention mechanism where the model learns from numerical representations of the words. So an intermediate step was to discover that you could represent words as vectors. It was a longstanding problem in language: how can you take words and convert them into something that a computer can understand that still retains some of the meaning?
So there was this string of research that came up with this idea of embeddings where individual words or parts of words are represented by a vector that maybe has a hundred real numbers in them. That turned out to be very useful because the way these numbers behave, especially relative to each other actually retains some information about what that word actually means. So these transformer models, they operate on these vectors and they learn given a lot of input data, very large amounts of input data, so the entire internet essentially, which words, which other vectors to pay attention to when interpreting text. And they learn to interpret text by filling in or predicting missing words. So one way the technique or one way training at this scale was possible without having humans label data and say, oh, we do machine translation, so let’s take a large body of English text and then let somebody translate this into French.
That of course you can’t do with the entire internet. But you can learn with, they call this semi-supervised learning, sorry, self-supervised learning, where you take the entire input data and you just randomly remove words, and then you train the model to figure out what the missing word is. So in order to do that well, to predict the missing tokens, the model learns which other tokens in the surrounding context, and the context can be fairly large, it should actually pay attention to, and that’s why it’s called an attention mechanism, to predict the missing word.
And that was a major breakthrough, because suddenly you could build on the inherent structure of language, which, as I described earlier, may be the next word or the previous word, but it may also be something that was said in the previous sentence or a little earlier, a little later, and if all this is taken together, it suddenly becomes very clear what the missing word is. Whereas the earlier models only looked at the previous or the next, say, three words, so you might be missing a lot of that. So that was really kind of a quantum leap in the capabilities of language models.
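A very stripped-down numpy sketch of that attention computation, using random toy embeddings (real transformers add learned query/key/value projections, multiple heads and positional information; this shows only the core idea):

```python
# Toy scaled dot-product attention over word embeddings (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 16                          # 6 tokens, each a 16-dimensional embedding
embeddings = rng.standard_normal((seq_len, d))

# In a real transformer, queries, keys and values come from learned projections;
# here we reuse the raw embeddings to keep the mechanism visible.
Q = K = V = embeddings

scores = Q @ K.T / np.sqrt(d)               # how much each token relates to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
context = weights @ V                       # each token becomes a weighted mix of the others

print(weights.round(2))                     # rows sum to 1: the "attention" each token pays
```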
Uses for ChatGPT and LLMs
Bilal Hafeez (34:32):
Yeah. And so you’ve mentioned a few terms there. Vectors, which are very important: the innovation there was to find a way of translating words or parts of words into something that is more understandable by computers, and vector embeddings is a big term people use now for translating sentences and language into something that computers, or these language models, can understand. And you also mentioned tokens as well. So people using ChatGPT now, depending on which version you use, or if you use the APIs, you’ll realise you get charged for the number of tokens that are used. So tokens are kind of parts of words, I suppose. And the more tokens you have in an input, the more compute power is needed, and so you’ll be charged more. And so it kind of works as you’d expect: the more compute power is needed, the more you’ll get charged, and it’s done at the unit of tokens. And so ChatGPT very cleverly uses all of this technology.
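For readers curious what those tokens look like in practice, here is a tiny sketch using OpenAI’s tiktoken library (our illustration, not something discussed in the episode):

```python
# Count the tokens a prompt would be billed for (illustrative).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Summarise the key risks in this earnings call transcript."
tokens = enc.encode(prompt)

print(len(tokens), "tokens")       # billing scales roughly with this number
print(enc.decode(tokens[:3]))      # tokens are often fragments of words
```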
Now in terms of applications, so you’ve looked at ChatGPT and LLMs, I mean, what does your gut say about how we can use this in finance today? I mean, is it summarization? Which of course is useful, that’s helpful. Is it building new time series based on sentiment analysis more effectively, so using these LLM models to interpret the sentiment of corporate earnings transcripts, going back historically? I mean, what do you think is the best way to use this now?
Stefan Jansen (36:09):
I mean there are some applications that are very obvious. You mentioned them already. I mean, again, Sam Altman himself says he doesn’t really use a lot of tools, and the one thing he uses ChatGPT for daily is summarization, right? Be that the entire set of incoming emails, Slack channels, what have you, or other knowledge bases, documents, et cetera. I mean, that’s very obvious. And personally I think that in the area of finance, the biggest immediate use case for language models is going to be support. And that is being used already, of course, by all these people that ingest a lot of information. I mean, they exist, there are so many analysts, researchers, et cetera, et cetera. So it’s pretty clear that this is going to condense some of the information, but that’s more like a generic use case. Of course it’s just generating text.
I mean, from investor relations to research reports, some of these things have already been automated; all of this is going to get better and faster, but that’s not really exciting in the sense that it opens new investment possibilities, it just makes people more productive at what they’re doing already. So for all these new kinds of applications, I think we are in an experimentation phase. One thing that comes up a lot with these large language models, when it comes to how can we take this to the next level now that we have already exhausted the internet to train them, is how useful is synthetic data going to be, right? Because of course these generative models can generate a lot of new data. So the current frontier of research is: let us now use output produced by these models to then train better models.
That is really interesting. Is that going to work? This is unproven. So OpenAI is very hopeful on that front, and I mean it’s a reasonable thing to assume that that might work, but it might also fail, because, well, the philosophers ask, how can you create new information from things that aren’t there? So I think you can debate this endlessly, whether it might or might not work. And synthetic data of course is something that’s also interesting for finance. There’s a chapter in the book that deals with this a little bit, that talks about the possibilities; it was based on a paper that came out in 2018, ’19. I think the capabilities at that point were very limited. There is hope that you may be able to produce better synthetic data in finance as well, maybe to train models. It’s already happening certainly in other domains like risk management or consumer finance, et cetera.
So generating synthetic data could, of course, and this harks back to one of the earlier things we discussed, maybe deal with the limited data histories that we have when applying machine learning to finance. So there is an opportunity there which is certainly actively explored, but I think it is far from solved whether that’s viable. That’s something I personally find would be a major game changer if it were possible. But again, it’s somewhat early days. The general idea, if you think about how ChatGPT and models of this nature are trained, is that they use one of these large language models as the main input, which uses the transformer slash attention kind of technology and other features. And then you have a second layer that uses reinforcement learning that builds on human feedback. The models that we’re seeing right now are using somewhat generic templates.
They may be sophisticated. So it becomes more and more an issue to train these models using smart experts that produce smart templates that are good sort of examples for the model to guide their output. Because the goal of these templates is to make the model more usable because the original large language model produces all sorts of stuff. It may or may not be useful. So you have to catalyse that and align this with the user’s goals.
Now these are generic templates. So you could start thinking about what kind of outputs of such a model could actually be interesting specifically for my finance use case. If it’s larger, it becomes easier to add this reinforcement learning layer. So there are already software tools being open-sourced that permit this; Microsoft open-sourced something not too long ago that allows firms and companies to start building their own ChatGPT versions tailored to their use case, which may be closer to: make suggestions, pick, out of the hundred stocks in our universe, the 20 that we should invest in based on the following. And then you have the relevant inputs in the context window. So there’s scope, but there’s work to do as well to get there.
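Coming back to Bilal’s earlier suggestion of scoring the sentiment of earnings-call language, here is a minimal sketch using the Hugging Face transformers pipeline with a finance-tuned checkpoint (the model name is just one commonly cited option, not a recommendation from the episode):

```python
# Hypothetical sketch: score the sentiment of earnings-call sentences (illustrative only).
from transformers import pipeline

classifier = pipeline("text-classification", model="ProsusAI/finbert")

sentences = [
    "We are raising full-year guidance on strong demand.",
    "Margins deteriorated due to unexpected input cost inflation.",
]
for s in sentences:
    result = classifier(s)[0]
    print(f"{result['label']:>8}  {result['score']:.2f}  {s}")
```

Aggregating such scores over transcripts and time would give the kind of sentiment time series Bilal describes.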
How to Use Decision Trees, Random Forests and Gradient Boosting
Bilal Hafeez (40:55):
No, no, that’s great. Now I want to sort of pivot towards some other topics as well. I mean, we talked about neural networks, but we didn’t talk about decision trees, which is again one of the workhorses of machine learning, and the spinoffs from that, random forests and gradient boosting models. So can you talk a bit about decision trees within machine learning? Because that’s really one of the foundational algorithms that allows you to think about machine learning in a good way.
Stefan Jansen (41:23):
Yeah, I mean, a decision tree as an algorithm works a little bit like the game of 20 questions: starting from knowing not very much, you try to isolate some solution by always dividing off or isolating a subset of your samples or your potential solutions. So as an algorithm, say you have a data set that has 10 features and you have positive and negative outcomes. You try to figure out which feature to take next to split your observations, by whether that feature is above or below a certain level. And now I have my first split in the tree, and I have some data points on the right and some data points on the left, depending on the criterion that I chose. And then I simply continue, sort of recursively, again and again operating on the remaining data until ideally I only have nodes at the bottom where I only have cases with a positive outcome or a negative outcome, or, if it’s a regression, data points that are as close as possible to each other.
You may realise you could of course build this decision tree ad infinitum, so that you only have a single case in each node at the end, at which point you will always have a perfect outcome. And that may be great for the data set you just worked with, but may actually not work as well for another thousand data points that are randomly collected, because you have sort of overfit on the data. So decision trees are prone to do that, which is why we have these ensemble algorithms you might want to mention next, right?
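Here is a tiny scikit-learn sketch of that ‘20 questions’ splitting on a bundled toy data set (our illustration, not code from the book):

```python
# Fit a small decision tree and print the splits it learned (toy example).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each line is one "question": is this feature above or below a threshold?
print(export_text(tree, feature_names=load_iris().feature_names))
```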
Bilal Hafeez (42:54):
Yeah, sure.
And within statistics it kind of looks like a binomial sort of distribution when you draw the trees when you’re doing kind of the broad maths of this all. And the reason I say that is a lot of people get scared by machine learning, but in many ways the principles are, if you are familiar with statistics and econometrics, it’s actually not that hard to switch from one to the other.
Stefan Jansen (43:19):
Yeah, exactly.
Bilal Hafeez (43:20):
People shouldn’t be afraid to make that leap.
Stefan Jansen (43:23):
Yeah, a lot of it is jargon, to put it negatively. Once you realise that people have tried to solve somewhat similar problems coming from a different angle, the solutions will somewhat overlap. They may be called differently, but as you get a little closer to the methodologies, you re-encounter a lot of the patterns I tried to explain earlier with the neural networks, where you sort of have logistic regressions as building blocks. And once you realise that, oh, my training from econometrics might actually take me quite some way; it might be a very good foundation and it might cover a little more of machine learning than I think it does.
Bilal Hafeez (43:59):
Yeah. And so you said with decision trees the basic version, the risk is you over fit, you become so precise on the training set that it just doesn’t work on a different sort of data set. So what are some of the common ways you can overcome that?
Stefan Jansen (44:15):
Well, I mean clearly you don’t want to build a decision tree as deep and fine-grained as possible. So you want to restrict yourself at some point, which is a little bit like these penalised, shrinkage versions of linear regression. In that case you want to limit the parameters from becoming too extreme and reacting too strongly to the data that comes in. A decision tree is a little bit similar in the sense that in order to create a new node, you want to overcome some hurdle: it really has to reduce your loss function, the one that you’re trying to optimise, by enough to really make it worthwhile. Because if you don’t impose any constraint, you just keep adding nodes to the tree and you’re back in the situation where you classify each data point separately, and that doesn’t work. So you have to find some middle ground between zero nodes in your tree and as many as you have data points.
And then, because maybe a decision tree that is much smaller is actually not such a great predictor, because it’s very simple and the reality of what happens between the input data and the output is a little more complex than that, you have the ability to actually combine a bunch of different trees. And there are different algorithms that allow you to do that. And these types of algorithms, for your standard tabular data set, tend to be the workhorses and the best go-to solutions to get decent predictions relatively quickly.
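To see the overfitting problem and the ‘hurdle’ fix side by side, here is a minimal sketch on synthetic, deliberately noisy data (illustrative only; the thresholds are arbitrary):

```python
# Unconstrained vs constrained decision tree: in-sample vs out-of-sample accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)        # deliberately noisy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # grows until leaves are pure
shallow = DecisionTreeClassifier(max_depth=4,                  # cap the depth
                                 min_impurity_decrease=0.001,  # each split must clear a hurdle
                                 random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", deep), ("constrained", shallow)]:
    print(f"{name:>14}  train={model.score(X_tr, y_tr):.2f}  test={model.score(X_te, y_te):.2f}")
```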
Bilal Hafeez (45:45):
An example would you say is random forest where you basically have a whole series of these decision trees and it’s like what’s called an ensemble model. So you’re I guess averaging all of them together to get these sort of results so it’s less overfitting.
Stefan Jansen (45:59):
Yeah, exactly.
Bilal Hafeez (46:00):
And it tends to work really quite well. Like random forest, I find at least it’s almost like the benchmark to some extent that you need to beat.
Stefan Jansen (46:08):
Yeah, random forests are great. Literally, they operate on randomly sampled versions of your data set. It’s called bootstrapping: you recreate your data set by sampling with replacement, which you may remember from statistics. So basically, if you have 10,000 observations, each tree is built on a different version of these 10,000 observations, which may have some original data points repeated and others missing. And then you just build a tree. You may have other constraints in there, but it’s also fast, because if you want to combine a hundred trees, you can just build all of those in parallel in the background. You just do your sampling and then you build your hundred trees, and voilà, you have your ensemble model. And then each tree predicts something, and, as you say, with regression you average those, or you have them vote for whether it should be class zero or class one, et cetera.
And there’s gradient boosting, which has also become extremely popular. I find myself using it a ton. It works sequentially: it builds new trees against the results of the previous trees, trying to minimise the errors that the previous trees produced, so you can’t run it in parallel. So that’s a bit of a cost computationally, in how fast this all works, but it often tends to produce slightly better results than random forests, though they really tend to be very close. It really depends a little on the data what works better. But if you look at Kaggle, the platform that runs all these data science competitions, gradient boosting is certainly very popular, and it has been in finance as well.
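A minimal sketch of the two ensembles side by side on synthetic data (our illustration; in practice dedicated libraries such as LightGBM or XGBoost are popular choices for gradient boosting):

```python
# Random forest (parallel, bootstrapped trees) vs gradient boosting (sequential trees).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,      # trees built in parallel,
                            random_state=0).fit(X_tr, y_tr)   # each on a bootstrap sample
gb = GradientBoostingClassifier(n_estimators=100,             # trees built one after another,
                                learning_rate=0.1,            # each correcting the last
                                random_state=0).fit(X_tr, y_tr)

print("random forest    test accuracy:", round(rf.score(X_te, y_te), 3))
print("gradient boosting test accuracy:", round(gb.score(X_te, y_te), 3))
```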
But one thing I just wanted to mention, since we talked about neural networks: if you look at finance, I think you realise that finance research is now producing much more domain-specific neural networks that take into account that you have a lot of noise in the data, but often only a limited number of underlying factors that drive investment returns, the entire idea of factor investing. So you have more networks that are designed, from an architectural point of view, to extract these factors sort of implicitly and use those to make predictions. You really see how the field is advancing from these initial kitchen sink approaches, where you say, well, now we have these high dimensional models, let’s just throw everything at them because they can shrink it and deal with the noise to some extent, to actually trying to incorporate what we know about the domain into the model, so that the model becomes more efficient because maybe it needs fewer parameters. Sometimes you can substitute knowledge for parameter complexity, and these models tend to perform a lot better.
And this is fairly recent. Progress has been slower in finance, for sure, than in other areas, at least in the published research, because in finance, publications of really interesting things tend to be a little less forthcoming than, say, in computer vision. So that’s an important thing to keep in mind. But you clearly see a similar trend: like I mentioned earlier, in image recognition you now have domain-specific convolutional networks, in language you have the transformers, and in finance you now also have neural networks that clearly aim to, and successfully do, incorporate what we have learned about the domain of stock prices and how different pieces of information may drive those returns.
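One common way to make that ‘implicit factor extraction’ idea concrete (our illustration, not a specific architecture Stefan mentions) is an autoencoder whose narrow bottleneck forces a panel of returns through a handful of latent factors:

```python
# Toy autoencoder: compress a panel of asset returns into a few latent "factors".
import torch

torch.manual_seed(0)
n_days, n_assets, n_factors = 500, 50, 3
true_factors = torch.randn(n_days, n_factors)
loadings = torch.randn(n_factors, n_assets)
returns = true_factors @ loadings + 0.1 * torch.randn(n_days, n_assets)  # factors plus noise

encoder = torch.nn.Linear(n_assets, n_factors)     # bottleneck: 50 assets -> 3 factors
decoder = torch.nn.Linear(n_factors, n_assets)     # reconstruct returns from the factors
optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for step in range(500):
    optimiser.zero_grad()
    latent = encoder(returns)                       # the model's implicit factor estimates
    loss = torch.nn.functional.mse_loss(decoder(latent), returns)
    loss.backward()
    optimiser.step()

print("reconstruction error:", float(loss))         # low error: 3 factors explain most variation
```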
Using Neural Networks in Finance
Bilal Hafeez (49:38):
And right at the beginning we talked about who in the financial sector is most amenable to using machine learning. And obviously the quants, people who do systematic, factor-type investing, might be open to it. But talking about fundamental-type people who are more qualitative or discretionary, I mean, what would you say to them about machine learning and this whole field? Should they use machine learning in some way? What would you say to them?
Stefan Jansen (50:10):
I think what’s really important, and you learn this more as you work in different industries or different contexts, is that machine learning is just a tool that you can use to support certain elements of a process. In the investment context it’s very often framed as, should I use it to run my entire hedge fund or my entire investment decision process end to end, or not at all? Whereas even if you are a fundamental investor, you may very well be predicting certain inputs into your decision process, be they corporate earnings or what have you, right? I mean, you work on the macro front. So people are predicting things on the macro front, and maybe machine learning can help with certain pieces of that process. So it helps to understand that machine learning gives you a pretty colourful toolbox to do things, mostly on the predictive front in the form of supervised learning. But there’s also unsupervised learning and various other things that you could potentially leverage to automate certain things and process information more efficiently. Then you can start screening.
That’s what happens when I work with a company that has nothing to do with investment: we literally start screening their processes and see where machine learning could actually be utilised. And that doesn’t mean we’re going to close the company down and have one machine learning model run the company instead; but here’s somebody who might just run some model once a day or once a month to get some forecast for something, or we automate some process entirely. So there’s a whole spectrum of applications. It really depends on the context, and this is what I think people are learning now: a much more nuanced, much more context-specific approach is required to learn how to use it.
And then you don’t have any more of these, oh, we have these 100 use cases; it’s, well, a tool that everybody uses to their specific liking, some more, some less, but most people will use it to some extent at some point, I’m pretty sure. There’s this overestimation in the short run, underestimation in the long run kind of tendency with these new technologies, and I think this is a prime example, Bilal. You see this in some industries where the effect has already taken hold very much. Others are lagging a little bit. But eventually, and I’m talking 10 years, you will see this much more commonly. People will be familiar with this. You come out of university, you had better have a good idea conceptually of what all of that means; there will be less need to code because we have ChatGPTs and the like. So not everybody has to sit there and do Python, but conceptually this is something that will be part of our day-to-day, like many other technologies that we have absorbed.
Bilal Hafeez (52:54):
Yeah. Now I did want to round off with a couple of personal questions. You did mention university students. So one question I have for you is if you’re a young person who’s about to leave university to enter the job market, start their career, what advice would you give them?
Stefan Jansen (53:08):
I mean, for starters, digitization is not going back. Data will be part of all job processes everywhere. So just learn about how data is useful in your environment. These predictive, data science type of techniques, understand them conceptually. Not everybody has the interest and the passion to spend a lot of time coding and the like, and you don’t have to, absolutely not. There are many other ways. What’s really important is to know when to use which tool and where it can create value. That is something that you should really understand. It’s not that hard, and it’s really useful, because by missing out on that opportunity, you put yourself at a disadvantage. There was a time when there was this, oh, everybody should learn how to programme and the like, and I’m a little sceptical about that. I think it has also waned a little bit, but understanding conceptually what’s happening, what exists and what these tools do is definitely important. I think it’s happening naturally much more today already than just 10 years ago.
Bilal Hafeez (54:11):
And another question, what’s the best investment advice you’ve ever received from anyone?
Stefan Jansen (54:15):
If it sounds too good to be true, then it probably is. Just the last few years, once again, testament to that. Yeah.
Bilal Hafeez (54:29):
No, that’s a good one. And then on books, yeah, we had a conversation earlier about this, but it’d be good to hear your thoughts. I mean, I always ask this, the guest, what’s the best book you read recently or books that influenced you, but you have a particular take on this question.
Stefan Jansen (54:40):
Yeah, so we were discussing earlier that my reaction was a little bit ah books, maybe a little overrated. I was a big reader. I loved reading when I was young. I did a lot of it. I realised at some point I don’t get the same joy out of it, which naturally is because you just don’t have this immersive experience on the novel side. Like reading novels, used to devour them in one sitting. These things are somehow not as easily compatible with the day-to-day anymore. So I switched to audio books, getting a little bit of fun there. But also on the non-fiction side, it seems that there is so much shorter form content that is very good, there’s a very solid competition for these 200, 300, 400-page books that try to sell some idea or some piece of information. It’s often just too much of a good thing. So you said there’s an 800 page book I wrote, so who am I to sit in the glass house and throw the stones? But generally, and I, so that book of course is totally exempt here from my…
Bilal Hafeez (55:43):
I would say make a distinction with your book it’s more, as we were saying earlier, I mean it’s a cookbook almost, like a reference book, a manual. I think there is a value in that where you basically need this kind of reference, give you structure so that you can deploy something. I think these non-fiction books, which have one big idea about the world, like de-globalisation or geopolitics or AI is going to kill everybody. Those sorts of things. I often feel like you could just write a thousand-word essay, probably that’ll cover it.
Stefan Jansen (56:14):
Yeah, very often. Exactly. That’s why I just find myself, even if I try to read some, I just put them away after some time because the interest just almost, you have to force yourself to say, oh, I read this book to the very end, but clearly the joy, the information, the sort of the marginal gain kind of goes down quite a bit very often. But there’s so many other, just look at Twitter, right? We said earlier, yeah. How much more value is a 300-page book compared to a Twitter thread that has, I don’t know, 1500 characters? So I mean, I think it’s somewhat obvious, but yeah, demonstrated by what we do most of the time, because the books are still there, but the readership is dwindling.
Bilal Hafeez (56:56):
Yeah. And let’s not talk about TikTok. That’s what probably most people spend their time on these days. One final question is if people wanted to follow your work, reach out to you, work with you in some capacity, what’s the best way for them to do that?
Stefan Jansen (57:08):
I mean, it’s obviously LinkedIn generally for a professional context; it’s easy to get in touch. Specifically on the book, there’s a GitHub repository that has all of the code. If you just look up my name on GitHub, it should be easy to find; it has almost 8,000 stars, so it’s become fairly popular. There is a third edition in the works that I hope to make some progress on, which hopefully will rekindle interest in this even more. There’s also a community: there’s a website I have around the book with a community on it where people can exchange ideas around the subject. So if you just Google my name and the book, that comes up pretty quickly.
Bilal Hafeez (57:42):
Yeah, well I have links to all of those things, by the way, in the show notes.
Stefan Jansen (57:46):
Awesome.
Bilal Hafeez (57:46):
So it’s easy for people to access.
Stefan Jansen (57:48):
Awesome.
Bilal Hafeez (57:49):
So great, so excellent. So thanks a lot. I learned a lot as usual whenever I speak to you, so it’s great speaking to you and good luck with everything, including the third edition.
Stefan Jansen (57:57):
Yeah, thank you. I hope the audience found it useful and is still around at this point. Either way, I thank you very much for having me on, and I look forward to talking to you again soon.
Bilal Hafeez (58:09):
Great. Thanks. Thanks for listening to the episode. Please subscribe to the podcast show on Apple, Spotify, or wherever you listen to podcasts. Leave a five-star rating, a nice comment, and let other people know about the show. We’ll be very, very grateful. Finally, sign up for our free newsletter at macrohive.com/free. We’ll be back soon. So tune in then.