Skip to content

Large Language Models and operational risk


In recent months there has been a huge amount of coverage on the advancements made in Large Language Models (LLMs), such as ChatGPT (OpenAI) or Bard (Google). In this blog we attempt to pull out the significant points for operational risk, particularly in the context of our banking and insurance members. 

What do LLMs do?

At a basic level, LLMs are doing nothing more complicated than predicting the most suitable next word in a sentence when given some context. But, because they appear to do this so well, they are adept at creating realistic text, summarising and translating documents, and even writing computer code. 

One thing to first understand is the difference between discriminative and generative AI models. Discriminative AI models are used to discriminate between categories or labels, while generative AI models can be used to generate new examples within a category.

Discriminative AI could be used to alert fraudulent activity, whereas generative AI would be used to create examples of what a fraud may look like, or even commit it. LLMs are generative, and it is this creative element that leads to provocative headlines such as “AI could replace equivalent of 300 million jobs." From a risk perspective, it is this generative ability, for example the ability to automatically generate convincing content or to create malware, that brings new threats or threats at a scale never seen before.   

Neural networks

LLMs are complex, underpinned by massive neural networks. Although neural networks are not new, it is only relatively recently that a mixture of modern architectures and training algorithms, computing power, and available data have collided to rapidly advance LLMs. From a risk perspective, this complexity makes their behaviour near impossible to fully understand. This is in contrast to so-called “explainable AI” models such as those based on decision trees.

Training LLMs

Training of LLMs is largely unsupervised, taking huge corpora of textual data. It is then fine-tuned using a more deliberate, often human-generated, set of information. This can be thought of as taking a pre-existing model of how language works, often referred to as a foundational model, and applying it to a specific problem (the tuning). This blend means that LLMs bring with them any stereotypical bias present in the mass of foundational training data. LLMs also use “human in the loop” reinforcement learning methods to improve their outputs, presenting human testers with multiple responses, so they can choose the best and provide feedback.

It is worth re-capping that the goal of the model is to predict the next word in a sentence, to be convincing, not necessarily correct. This promotes stereotypical, rather than logical, outcomes.

What are the opportunities?

Large Language Models have clear applications in the support and automation of text-heavy tasks, such as customer service, loan applications, onboarding and claim assessments. You can also expect to see easier personalisation where the products and interactions are more easily customised in real-time.

In a very different application, internal Federal Reserve research set out to understand how good LLMs are at identifying the sentiment of regulatory announcements – so called Fedspeak, the “term used to describe the technical language used by the Federal Reserve to communicate on monetary policy decisions.” This gave impressive results: “The analysis presented in this paper shows that GPT models demonstrate a strong performance in classifying Fedspeak sentences, especially when fine-tuned.”

What are the risks of LLMs?

As with all new technology, there are always risks. A useful taxonomy of the risks posed by LLMs has been created by leaders in the field. These are shown in ORX Reference Taxonomy language in the download below. In some cases, the concepts are overlapping, and often they are more of a cause than the risk event itself, but they offer a good starting set of concepts to consider.



A more detailed look at the risks

In this section, we expand on some of the risks, with reference to real-life examples.

Model risks

LLMs do get things wrong in very convincing ways, and in an attempt to sound convincing can respond poorly to leading questions. As with any AI model, they can transmit any bias present in training data. LLMs can also be manipulated into creating toxic content, for example through ‘injection attacks’ where training data is manipulated such as to produce malicious outputs. LLMs are typically developed with safety layers to prevent undesirable outputs, but prompt injection attacks can be used to manipulate the model to circumvent safety protocols.

A recent example of false output is the case of a public official in Australia suing OpenAI following an incorrect ChatGPT claim that he was imprisoned for bribery (the opposite was in fact true, he was this whistle-blower).

Data management

Given that information security is an unwavering top concern in the operational risk space, firms are understandably troubled by the potential for exposing sensitive data. Use of LLMs has skyrocketed, with ChatGPT setting records for the fastest growing user base, reaching 1 million users in just two months

Some organisations have already taken steps to pre-emptively ban access to these tools. There have been some notable examples of incidents already. For example, a recent case where Samsung leaked sensitive product data, and Italy’s privacy watchdog banning ChatGPT from accessing user data.

Reputational impacts

There are also reputational impacts to be considered when using LLMs. As previously mentioned, models may reinforce and amplify any bias present in training data, leading to potentially unethical outcomes. Debate is also ongoing around the training data used to create these models, often scraped from the web, potentially including copyrighted material.

Training such a model also requires significant compute resources, resulting in a huge carbon footprint. Public perception of LLMs isn’t entirely positive, with members of the public alarmed by headlines calculating the number of jobs at risk due to their adoption.

Concentration and contagion risks

Given the resources needed to train and operate foundational models, it is unlikely that many individual institutions would be able to create their own, leaving everyone to rely on a small number of market leaders. Reliance brings obvious concentration risks, but also an element of contagion if those foundational models are found to be deficient or are compromised. For this reason, the UKs competition and markets authority has launched an initial review into the “competition and consumer protection considerations in the development and use of AI foundation models." 

Existential risks

There is also a lot of discussion about the risks that giant AI models (e.g., LLMs) pose to the society we operate in and even humanity itself. Perhaps most well known was a call via an open letter to pause giant AI experiments until the risks are better understood.

Furthermore, one of the most prominent AI researchers, has voiced their concerns of the use by "bad actors," citing the ability for digital systems to distribute and replicate means they will be able to exceed human knowledge. This is a debate that is set to run for a long time and will likely influence regulation in the advancement and use of AI.

What do LLMs say?

Any blog on large language models would not be complete without asking the models themselves what they think. In this download, we show the responses you get if you ask ChatGPT and Bard:

What are the operational risks of using large language models in banking and insurance?




ORX Membership

See how your firm can benefit from ORX Membership.

Find out more