Exploring the Self-Referential Phenomenon of Large Language Models

Introduction
The Bayesian vs. frequentist debate in probabilistic reasoning has a long history (Fornacon-Wood et al., 2022). These two contrasting approaches to probability estimation have shaped the field and sparked ongoing debate among statisticians, researchers, and data scientists. With the emergence of large language models (LLMs), however, new dimensions of this debate come into play, presenting intriguing conundrums that demand careful consideration.
The Bayesian approach emphasizes the incorporation of prior beliefs and updates them with observed evidence, allowing for a subjective yet rational framework for probability estimation. On the other hand, the frequentist perspective relies solely on observed frequencies and long-run probabilities, aiming to approximate probabilities through repeated experiments and hypothesis testing (Gigerenzer, 1991). These differing viewpoints have led to lively discussions on the nature of probability and the most appropriate way to reason under uncertainty.
However, two distinctive challenges emerge with the advent of large language models. First, LLMs possess human-level probabilistic reasoning capabilities and can form prior beliefs about themselves. This self-referential capability raises concerns about the potential for an endless loop of self-influence, where the LLM’s predictions can shape its own inputs, blurring the boundaries between cause and effect.
The second challenge arises from LLMs exhibiting characteristics of both Bayesian and frequentist approaches. While LLMs reason without repeated experiments, resembling Bayesian traits, they also undergo extensive training on massive datasets, aligning with frequentist qualities. This dual nature of LLMs prompts a deeper exploration of their probabilistic reasoning processes and how they challenge the traditional Bayesian and frequentist perspectives.
To address the issue of self-reference in LLMs, we can turn to the perspective of ET Jaynes, a prominent figure in Bayesian probability theory, who viewed probability as extended logic (Jaynes, 2003). By recognizing LLMs as distinct from humans and as non-human reasoning agents, we can navigate the Bayesian-frequentist dichotomy in a manner that acknowledges their unique characteristics and challenges.
Furthermore, a crucial aspect to consider is the rediscovery of the active role of humans as reasoning agents. Understanding the distinct qualities of LLMs, which set them apart from traditional disciplines that shaped Bayesian thinking, becomes essential. Through this exploration, we confront the challenges posed by LLMs in probabilistic reasoning and gain insights into our roles as reasoning agents in the context of these advanced AI systems.
Coin Tossing: Is It the Fair Coin or the Rational Tosser?
To illustrate the divergent perspectives of Bayesian and frequentist approaches in probability, let’s explore the example of coin-tossing to determine the probability distribution of heads and tails.
In the frequentist perspective, a large number of coin tosses, such as 1000, would be conducted to estimate probabilities. If heads appeared 490 times and tails appeared 510 times, the frequentist would approximate the head/tail probability distribution as 0.49:0.51.
Now, imagine conducting another 1000 tosses and observing a total of 1005 heads and 995 tails. Updating the probability distribution to 0.5025:0.4975 would not suffice for the frequentist. They would also formulate a hypothesis that the probability distribution is 0.5:0.5 and test it.
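In code, the frequentist procedure amounts to computing relative frequencies and running a significance test. The sketch below uses a normal-approximation z-test as a stand-in for the exact binomial test the frequentist might run; the function names are illustrative, not from any particular library:

```python
import math

def frequentist_estimate(heads: int, tosses: int) -> float:
    """Long-run relative frequency: the frequentist estimate of P(heads)."""
    return heads / tosses

def z_test_fair_coin(heads: int, tosses: int) -> float:
    """Two-sided z-test of H0: p = 0.5, using the normal approximation
    to the binomial. Returns the p-value."""
    p0 = 0.5
    se = math.sqrt(p0 * (1 - p0) / tosses)           # standard error under H0
    z = (heads / tosses - p0) / se                   # test statistic
    # two-sided p-value from the standard normal CDF (via the error function)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# First 1000 tosses: 490 heads
print(frequentist_estimate(490, 1000))    # 0.49
# After 2000 tosses in total: 1005 heads
print(frequentist_estimate(1005, 2000))   # 0.5025
# Test H0: p = 0.5 on the combined data
print(z_test_fair_coin(1005, 2000) > 0.05)  # True: no evidence against fairness
```

With 1005 heads in 2000 tosses the p-value is large, so the frequentist fails to reject the hypothesis that the coin is fair.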
The frequentist approach relies on observing long-run frequencies and formulating hypotheses. However, it introduces subjective elements, as it assumes the presence of a human reasoning agent who formulates and validates these hypotheses. The frequentist insists that the hypothesized probability is an inherent property of the object or phenomenon under study.
On the other hand, the Bayesian perspective incorporates prior beliefs and updates them based on observed evidence. A Bayesian thinker starts with initial beliefs about the fairness of the coin and refines these beliefs using Bayes’ rule. The Bayesian approach justifies the use of prior beliefs, which are often considered subjective, by invoking Shannon’s concept of information and entropy, as elucidated by Jaynes.
According to Jaynes, assigning equal probabilities in the absence of distinguishing information reflects the rational principle of maximum entropy: it introduces no assumptions beyond what is actually known. The Jaynesian probability framework ensures that individuals with the same information converge to the same probability distribution, demonstrating the objectivity of the Bayesian approach.
Interestingly, in the absence of prior knowledge about a specific coin, all rational individuals would arrive at the conclusion of an equal 0.5:0.5 distribution. In contrast, despite claiming objectivity, frequentist individuals face a challenge in consistently adopting the same 0.5:0.5 hypothesis before conducting sufficiently long trials to provide convincing support for their hypothesis.
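The Bayesian procedure above can be sketched with the conjugate Beta-Binomial model, a standard textbook choice rather than anything specific to this article: a uniform Beta(1, 1) prior encodes the maximum-entropy 0.5:0.5 starting point, and Bayes' rule reduces to counting. (The split of the second batch into 515 heads and 485 tails follows from the totals given above.)

```python
from math import isclose

def beta_update(alpha: float, beta: float, heads: int, tails: int):
    """Conjugate Bayesian update: a Beta(alpha, beta) prior plus binomial
    data yields a Beta(alpha + heads, beta + tails) posterior."""
    return alpha + heads, beta + tails

def beta_mean(alpha: float, beta: float) -> float:
    """Posterior mean estimate of P(heads)."""
    return alpha / (alpha + beta)

# Maximum-entropy starting point: a uniform Beta(1, 1) prior,
# i.e. every rational agent with no coin-specific knowledge begins at 0.5.
alpha, beta = 1.0, 1.0
assert isclose(beta_mean(alpha, beta), 0.5)

# Update with the first 1000 tosses (490 heads, 510 tails)
alpha, beta = beta_update(alpha, beta, 490, 510)
print(beta_mean(alpha, beta))   # ~0.4900, pulled toward the data

# Update with the next 1000 tosses (515 heads, 485 tails)
alpha, beta = beta_update(alpha, beta, 515, 485)
print(beta_mean(alpha, beta))   # ~0.5025, matching 1005/2000
```

Each batch of evidence simply shifts the posterior; no hypothesis test is ever formulated, yet two agents starting from the same uniform prior end up with identical beliefs.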
The Bayesian and frequentist perspectives on the nature of probability are illustrated in the following diagram:

Infinite Self-Reflection?
In the context of LLMs, the self-referential phenomenon arises from humans studying themselves as reasoning agents using probabilistic reasoning. This phenomenon is illustrated in the diagram below:

As advanced AI systems, LLMs are the result of humans exploring and understanding their own language understanding and generation capabilities. This self-referential dynamic, in which a reasoning agent observes and analyzes its own behavior and cognitive processes, gives rise to intriguing challenges and questions. The agent examines how its prior beliefs, biases, and cognitive limitations influence reasoning outcomes, and endeavors to enhance its cognitive processes.
This self-referential potential, however, also brings the risk of an infinite regression of incorporating prior beliefs. As a reasoning agent, the LLM can be trapped in an endless loop of updating its prior beliefs based on its own predictions, leading to an unsatisfactory outcome. This challenge is particularly relevant when considering the Bayesian approach, which emphasizes the incorporation of prior beliefs and their subsequent updates based on observed evidence.
To address this challenge, the application of the Maximum Entropy Principle (MEP) becomes crucial. The MEP provides a rational framework for determining prior beliefs without falling into an infinite regression, and it has found wide application in AI training. Notably, the paper “Understanding Deep Learning Generalization by Maximum Entropy” (Wu & Zhu, 2023) highlights the pervasiveness of the MEP in AI development: developers intuitively apply the principle, making deep neural networks (DNNs) essentially a recursive solution that fulfills the principle of maximum entropy.
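The principle itself is easy to demonstrate numerically: among candidate beliefs all equally consistent with "no information about the coin", the uniform distribution has the greatest Shannon entropy, so the MEP selects it. This is a toy illustration of the principle, not the DNN-training formulation discussed in the cited paper:

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Candidate prior beliefs about a two-sided coin, given no information:
candidates = [
    [0.5, 0.5],   # the maximum-entropy choice
    [0.7, 0.3],
    [0.9, 0.1],
    [1.0, 0.0],
]
entropies = [entropy(p) for p in candidates]
# The uniform distribution attains the maximum (1 bit), so the MEP picks it:
# it encodes exactly the available information and nothing more.
best = candidates[entropies.index(max(entropies))]
print(best)  # [0.5, 0.5]
```

Because the maximum-entropy prior is fixed by the available information alone, there is nothing left to regress on: the chain of "priors about priors" terminates.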
This application of the MEP in AI training provides valuable insights into the Bayesian vs. frequentist debate. While the Bayesian approach involves updating prior beliefs, the MEP ensures that this process does not lead to an infinite regression, thus addressing a key challenge in the Bayesian perspective. On the other hand, the frequentist approach, which relies on observed frequencies and long-run probabilities, aligns with the MEP’s emphasis on capturing and representing uncertainty comprehensively.
In the context of LLMs, the application of the MEP demonstrates how the principle of maximum entropy applies to probabilistic reasoning and plays a significant role in addressing the challenges of self-reference. By effectively managing and updating prior beliefs, LLMs can navigate the potential pitfalls of infinite regression and achieve rational and objective probabilistic reasoning. This suggests that LLMs may offer a novel perspective that integrates elements of both Bayesian and frequentist approaches, thus enriching the ongoing debate between these two perspectives.
LLMs as Non-Human Reasoning Agents
Let’s delve into the distinctive characteristics of LLMs and how they challenge and expand upon traditional Bayesian and frequentist perspectives on probability and reasoning.
LLMs, with their vast knowledge base and advanced language understanding capabilities, utilize probabilistic reasoning to generate coherent and contextually relevant responses. They are trained on a diverse range of texts, which equips them with objective patterns and structures inherent in language, such as grammar rules, syntactic relationships, and semantic associations. This accumulation of objective knowledge allows LLMs to generate responses based on observed patterns and associations, a crucial aspect of their functionality.
Recent research has delved into the rationality of LLMs, particularly their ability to form probabilistic beliefs and adjust these beliefs in response to new evidence (Betz & Richardson, 2023). This study found that self-training enhances these models’ probabilistic coherence and logical consistency, suggesting that this process plays a vital role in their development and performance. This finding has significant implications for the future of AI development, highlighting the importance of continual learning and adaptation in these models.
However, it’s important to note that LLMs do not determine the next token in their responses by conducting numerous trials and calculating probability distributions, as a frequentist might. Instead, the generation of responses in LLMs follows a Bayesian approach. In this perspective, the LLM incorporates prior beliefs and updates them based on observed evidence. This allows the LLM to generate responses that may go beyond explicit patterns in the training data and incorporate subjective elements. This ability to incorporate and balance objective and subjective elements in their responses is a unique feature of LLMs and a significant advancement in AI.
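Schematically, an LLM turns raw scores (logits) into a next-token probability distribution and samples from it in a single forward pass, with no repeated trials. The vocabulary and logits below are hypothetical, and real systems add refinements such as temperature and top-k truncation; this is a minimal sketch of the sampling step only:

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng):
    """Draw one token from the model's next-token distribution P(Y | X, w)."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical 4-word vocabulary and logits, for illustration only
vocab = ["the", "coin", "is", "fair"]
logits = [0.5, 2.0, 0.1, 1.2]

probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-9       # a valid probability distribution
print(sample_next_token(vocab, logits, random.Random(0)))
```

The distribution is produced in one shot from the trained parameters, not estimated from repeated experiments, which is exactly the contrast with the frequentist procedure drawn above.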
As we navigate this dual nature of probability in LLMs, it is crucial to recognize them as distinct reasoning agents, separate from human reasoning. Humans may adopt a frequentist viewpoint when interpreting LLM behavior, considering the observed patterns and statistical properties (Fornacon-Wood et al., 2022). On the other hand, LLMs operate in a Bayesian manner, combining prior beliefs and evidence to generate responses. This difference in perspectives between humans and LLMs offers a new dimension to our understanding of probabilistic reasoning.
By acknowledging LLMs as non-human reasoning agents, we challenge the traditional exclusivity of Bayesian and frequentist perspectives and open up new avenues for understanding probabilistic reasoning. LLMs provide a unique lens through which we can explore the interplay of objective patterns and subjective reasoning in probabilistic inference. This exploration of LLMs and their probabilistic reasoning capabilities is a step towards a deeper understanding of AI systems and their potential impact on our future.

To Marginalize, or Not: That’s the Question
In developing LLMs, the process involving human reasoning agents is commonly referred to as “AI training” (Myung & Pitt, 1997). During this training phase, human agents use training data to determine the model parameters of a neural network (NN). Once deployed as non-human reasoning agents, LLMs enter the “AI inference” phase, autonomously making decisions based on input data. While both AI training and inference involve probabilistic reasoning or “inference,” they differ in terms of the agents involved and the specific objectives.
AI training focuses on inferring the distribution of model parameters, denoted ‘w’, while AI inference aims to predict ‘Y’ (a label or the next token) based on observed ‘X’, using fixed values for the model parameters ‘w’. In the inference phase, we compute the probability distribution P(Y|X, w), which represents the likelihood of ‘Y’ given ‘X’ and ‘w’. In the case of LLMs, P(Y|X, w) is the probability distribution of the next token.
From a Bayesian perspective, AI training applies Bayes’ rule to update prior beliefs about model parameters ‘w’ based on observed data ‘D=(X, Y)’. This process results in the posterior distribution as
P(w|D)=P(w|X,Y)=P(Y|X,w)P(w)/P(Y|X).
Here, P(w) denotes the prior distribution of model parameters, and P(Y|X) acts as a normalizing factor obtained through integrating or summing over the joint distribution as
P(Y|X) = ∫ P(Y|X, w) P(w) dw.
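The update and the marginalization above can be made concrete on a toy discrete grid, where the integral collapses to a sum. Three hypothetical candidate values for a coin's bias stand in for the continuous, high-dimensional ‘w’ of a real LLM:

```python
def posterior_over_w(prior, likelihood):
    """Bayes' rule on a discrete grid of parameter values:
    P(w | D) = P(D | w) P(w) / sum over w' of P(D | w') P(w')."""
    joint = {w: likelihood[w] * prior[w] for w in prior}
    evidence = sum(joint.values())   # the marginal P(D): the normalizing factor
    return {w: joint[w] / evidence for w in prior}

# Three candidate biases for a coin, under a uniform prior
prior = {0.3: 1 / 3, 0.5: 1 / 3, 0.7: 1 / 3}
# Likelihood of observing the sequence H, H, T under each candidate w
likelihood = {w: w * w * (1 - w) for w in prior}

post = posterior_over_w(prior, likelihood)
assert abs(sum(post.values()) - 1.0) < 1e-12   # a proper distribution
print(max(post, key=post.get))  # 0.7, the candidate nearest the MLE of 2/3
```

On three grid points the normalizer is a trivial sum; the difficulty described next is that for LLMs ‘w’ has billions of continuous dimensions, so this sum becomes an intractable integral.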
However, marginalization, which involves integrating or summing over the model parameters ‘w,’ can pose challenges, especially when dealing with high-dimensional or continuous variables that are often present in LLM model parameters. Computing the sum or integral over the entire range of these variables becomes computationally demanding or even infeasible (Salakhutdinov, 2009). To address this, researchers often employ approximation methods like Markov Chain Monte Carlo (MCMC) to obtain samples from the joint distribution, enabling posterior inference without explicitly marginalizing all variables. Nonetheless, applying these methods to LLMs with extensive parameter spaces remains challenging.
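The appeal of MCMC is visible even in a minimal sketch of the Metropolis algorithm, the simplest such method: it needs the posterior only up to its normalizing constant, so the intractable marginal P(Y|X) never has to be computed. The toy coin-bias target below is purely illustrative, not an LLM-scale posterior:

```python
import math
import random

def metropolis(log_post, init, steps, step_size, rng):
    """Minimal Metropolis sampler: draws from a distribution known only up
    to a constant, so no marginalization over w is ever performed."""
    w = init
    samples = []
    for _ in range(steps):
        proposal = w + rng.gauss(0.0, step_size)
        log_alpha = log_post(proposal) - log_post(w)
        # accept with probability min(1, posterior ratio)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            w = proposal
        samples.append(w)
    return samples

# Toy target: posterior over a coin bias w after 60 heads in 100 tosses,
# with a uniform prior on (0, 1); the unnormalized log-posterior suffices.
def log_post(w):
    if not 0.0 < w < 1.0:
        return -math.inf
    return 60 * math.log(w) + 40 * math.log(1 - w)

rng = random.Random(42)
samples = metropolis(log_post, init=0.5, steps=5000, step_size=0.05, rng=rng)
mean = sum(samples[1000:]) / len(samples[1000:])  # discard burn-in
print(round(mean, 3))  # near the analytic posterior mean 61/102 ~ 0.598
```

For a one-dimensional ‘w’ a few thousand steps suffice; the difficulty noted above is that mixing over the billions of dimensions of an LLM's parameter space remains out of reach for such samplers.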
It is essential to highlight the distinction between AI training for LLMs and traditional Bayesian scenarios involving data collection over time to explain natural phenomena (Mitra et al., 2015). In the case of LLMs, the approach to data collection is active and deliberate, involving the intentional shaping and curation of data. This active curation process leads to the creation of large datasets that exhibit strong evidence, resulting in a posterior distribution that is sharply concentrated.
The concentration of the posterior distribution on specific values means that LLMs tend to produce point estimates for model parameters (Polson & Sokolov, 2017). This is in contrast to traditional Bayesian scenarios, where the posterior distribution may be more spread out, reflecting the gradual accumulation of data over time (Polson & Sokolov, 2017). The larger sample size available to LLMs provides a wealth of information and reduces sampling variability, making point estimates, such as maximum likelihood estimates (MLEs), practical and effective for summarizing the data and estimating model parameters.
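This concentration effect can be seen in the conjugate coin model from earlier, used here purely as an illustration: the posterior standard deviation shrinks roughly as one over the square root of the sample size, so at LLM-scale dataset sizes the posterior is effectively a point and an MLE summarizes it well.

```python
import math

def beta_posterior_sd(heads: int, tails: int, a0: float = 1.0, b0: float = 1.0) -> float:
    """Standard deviation of the Beta posterior over a coin's bias,
    starting from a uniform Beta(a0, b0) prior."""
    a, b = a0 + heads, b0 + tails
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# The same 50/50 coin observed at ever-larger sample sizes:
# the posterior sharpens toward a single point estimate.
for n in (10, 1_000, 1_000_000):
    print(n, beta_posterior_sd(n // 2, n // 2))
```

Each thousand-fold increase in data shrinks the posterior spread by roughly a factor of thirty, which is why explicit uncertainty over ‘w’ buys little once datasets reach LLM scale.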
Interestingly, the growing view that Bayesian principles of reason and inference may provide a general computational description of brain function aligns with the reasoning capabilities of LLMs. However, it’s important to note, as Fiorillo (2020) argues, that being “Bayesian” is not synonymous with the “performance” of Bayes’ rule. While a fundamental aspect of probability theory, this theorem is not essential to probabilistic reasoning. This perspective challenges the traditional Bayesian view and opens up new avenues for understanding the probabilistic reasoning processes of LLMs.
In summary, AI training for LLMs introduces challenges related to marginalization, given the high-dimensional and continuous nature of LLM model parameters. The active curation of data in LLM development necessitates a modified perspective on Bayesian inference, where point estimates are favored over explicit marginalization. This adaptation does not invalidate the Bayesian framework but highlights the need to tailor it.
Conclusion
In conclusion, our exploration of the emergence of LLMs and their implications for the Bayesian vs. frequentist debate has shed light on the unification of these perspectives under the Jaynesian principles of probability as extended logic. Throughout our analysis, we have identified three significant aspects that contribute to this unification and offer avenues for future research.
Firstly, the rationalization of prior beliefs in Bayesian reasoning through the Maximum Entropy Principle (MEP) provides a solid foundation for assigning probabilities based on available information while maximizing uncertainty. By incorporating the MEP, reasoning agents can avoid an infinite regression of incorporating prior beliefs indefinitely.
Secondly, the existence of LLMs as advanced AI systems challenges the exclusivity of Bayesian and frequentist perspectives by introducing non-human reasoning agents. These models possess human-level probabilistic reasoning capabilities and rely on prior beliefs to make inferences from observed evidence. Recognizing LLMs as non-human reasoning agents expands our understanding of probabilistic reasoning and opens up new avenues for research.
Lastly, the unique characteristics of LLMs, such as their actively curated training data and the central role of human reasoning agents in creating and curating that data, justify the prevalent use of point estimates, such as maximum likelihood estimates (MLEs), in AI training. While explicit marginalization may be unnecessary in LLMs, Bayesian principles must still be adapted to accommodate these considerations.
Collectively, these three aspects converge to unify the Bayesian and frequentist perspectives under the Jaynesian principles of probability as extended logic. By embracing these principles, we can reconcile and harmonize the Bayesian and frequentist approaches, leading to a more comprehensive understanding of probabilistic reasoning in the context of LLMs and beyond.
As the field of AI continues to advance and LLMs evolve, further research and exploration are crucial to deepen our understanding of probabilistic reasoning and its applications. Investigating the interplay between human reasoning agents and LLMs will unlock valuable insights into the capabilities and limitations of these systems, ultimately enhancing our understanding of ourselves as reasoning agents.
References
- Ben-Gal, I. (2008). Bayesian Networks. Encyclopedia of Statistics in Quality and Reliability. DOI: 10.1002/9780470061572.eqr089
- Betz, G., & Richardson, K. (2023). Probabilistic coherence, logical consistency, and Bayesian learning: Neural language models as epistemic agents. PLOS ONE. https://doi.org/10.1371/journal.pone.0281372
- Fiorillo, C. D. (2020). Beyond Bayes: On the Need for a Unified and Jaynesian Definition of Probability and Information within Neuroscience.
- Fornacon-Wood, I., Mistry, H., Johnson-Hart, C., Faivre-Finn, C., O’Connor, J. T., & Price, G. J. (2022). Understanding the Differences Between Bayesian and Frequentist Statistics. International Journal of Radiation Oncology, Biology, Physics. DOI: 10.1016/j.ijrobp.2021.12.011
- Gigerenzer, G. (1991). From tools to theories: A heuristic of discovery in cognitive psychology. Psychological Review, 98(2), 254. DOI: 10.1037/0033-295X.98.2.254
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
- Lee, M. D. (2008). Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15(1), 1–15. DOI: 10.3758/pbr.15.1.1
- Mitra, K., Zaslavsky, A., & Åhlund, C. (2015). Context-Aware QoE Modelling, Measurement, and Prediction in Mobile Computing Systems. IEEE Transactions on Mobile Computing. DOI: 10.1109/tmc.2013.155
- Myung, I. J., & Pitt, M. A. (1997). Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review. DOI: 10.3758/bf03210778
- Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics. DOI: 10.1093/biostatistics/kxm049
- Polson, N. G., & Sokolov, V. (2017). Deep Learning: A Bayesian Perspective. Bayesian Analysis. DOI: 10.1214/17-ba1082
- Salakhutdinov, R. (2009). Learning Deep Generative Models. Annual Review of Statistics and Its Application. DOI: 10.1146/annurev-statistics-010814-020120
- Sainani, K. L., Lohse, K. R., Jones, P. W., & Vickers, A. J. (2019). Magnitude-based Inference is not Bayesian and is not a valid method of inference. Scandinavian Journal of Medicine & Science in Sports. DOI: 10.1111/sms.13491
- Wu, L., & Zhu, Z. (2023). Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes. PLOS ONE. https://doi.org/10.1371