
Do LLMs Dream of Elephants (when told not to)?

Cluedo Tech

Updated: Jul 18, 2024

Imagine being asked not to think of an elephant. Naturally, an image of an elephant pops into your mind. This human tendency to think of exactly what we are told to avoid raises intriguing questions about artificial intelligence: if asked the same, how would Large Language Models (LLMs) react? The paper "Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers" by Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam explores this question, examining the nuances of fact retrieval in LLMs and how context influences their responses.

The Problem of Context Hijacking


LLMs such as GPT-2 and LLaMA-2 are designed to store and recall facts from vast datasets. However, their ability to retrieve facts can be easily manipulated by altering the context in which a question is asked. This phenomenon, termed "context hijacking," suggests that LLMs function like associative memory models. For instance, when GPT-2 is prompted with "The Eiffel Tower is in the city of", it correctly responds with "Paris". However, if the prompt is altered to "The Eiffel Tower is not in Chicago. The Eiffel Tower is in the city of", GPT-2 responds with "Chicago". The fact that the output can be flipped by changing the context, even though the added sentence does not alter the factual meaning of the prompt, points to a significant vulnerability in LLMs.
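
This behavior is easy to probe with off-the-shelf tooling. The sketch below uses the Hugging Face transformers library and the public GPT-2 checkpoint to compare the most likely next token for a clean prompt and a hijacked one. The helper function and printed comments are illustrative, and the exact completions can vary with model size and version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token(prompt: str) -> str:
    """Return the single most likely next token for the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next position
    return tokenizer.decode([int(logits.argmax())]).strip()

clean = "The Eiffel Tower is in the city of"
hijacked = "The Eiffel Tower is not in Chicago. The Eiffel Tower is in the city of"

print(next_token(clean))      # typically "Paris"
print(next_token(hijacked))   # often flips to "Chicago"
```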



Associative Memory and Transformers


To understand how LLMs retrieve facts, we need to explore the concept of associative memory. Unlike classical models where memory patterns are directly measured, LLMs rely on latent (unobserved) semantic concepts. The researchers propose a synthetic task called "latent concept association" to model this. In this task, the output token is related to sampled tokens in the context, and similarity is measured via a latent space of semantic concepts. The transformer model, a fundamental component of LLMs, completes this memory retrieval task by using self-attention to gather information and employing the value matrix as associative memory.


Associative memory, a concept rooted in neuroscience, refers to the brain's ability to link related pieces of information. In the context of LLMs, associative memory enables the models to retrieve relevant facts based on contextual clues. By modeling transformers as associative memory networks, the researchers provide a novel perspective on how LLMs process and recall information. This approach can lead to the development of more sophisticated models capable of handling complex memory tasks with higher accuracy.


The researchers illustrate this concept with a theoretical and empirical analysis of how a one-layer transformer can solve memory retrieval tasks. The transformer accomplishes this in two stages: the self-attention layer aggregates information from the context, and the value matrix functions as associative memory. This dual mechanism allows the transformer to identify and recall relevant facts based on the provided context.
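
As a rough mental model of that two-stage mechanism, here is a minimal, hypothetical one-layer transformer in PyTorch. It is not the paper's exact architecture or parameterization, just a sketch in which self-attention aggregates the context and a value matrix W_V sits where the associative memory would be.

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    """Single attention layer: attention aggregates the context,
    and the value matrix W_V plays the role of associative memory."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)   # value matrix = memory
        self.unembed = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                        # (batch, seq, dim)
        q = self.W_Q(x[:, -1:])                       # query from the last position
        k, v = self.W_K(x), self.W_V(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        summary = attn @ v                            # stage 1: aggregate the context
        return self.unembed(summary.squeeze(1))       # stage 2: read out the memory
```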



Context Hijacking in Practice


The researchers systematically demonstrate context hijacking on several open-source LLMs, including GPT-2, LLaMA-2, and Gemma. Using the CounterFact dataset, they show how context manipulation can mislead fact retrieval. The CounterFact dataset consists of counterfactual assertions that test the robustness of LLMs against misleading contexts.


For instance, when "Do not think of Guam." is prepended to the prompt "The Eiffel Tower is in", the LLMs often incorrectly respond with "Guam" instead of "Paris". This demonstrates that LLMs are easily influenced by context changes, reaffirming their lack of robustness in fact retrieval. The researchers measure the efficacy of context hijacking using an Efficacy Score (ES): the proportion of samples in which the false target's token probability surpasses that of the true target. They find that repeating the context manipulation significantly reduces the accuracy of the LLMs.
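
Given that definition, the Efficacy Score is straightforward to compute. The sketch below assumes a causal language model and tokenizer in the Hugging Face style and a list of (hijacked prompt, true target, false target) triples; comparing only the first sub-token of each target is a simplifying assumption, not necessarily the paper's exact protocol.

```python
import torch

def efficacy_score(model, tokenizer, samples) -> float:
    """samples: list of (hijacked_prompt, true_target, false_target) strings.
    ES = fraction of samples where P(false target) > P(true target)
    for the token immediately following the prompt."""
    hits = 0
    for prompt, true_tok, false_tok in samples:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            probs = model(**inputs).logits[0, -1].softmax(-1)
        # Compare the first sub-token of each target (simplifying assumption).
        t_id = tokenizer(" " + true_tok)["input_ids"][0]
        f_id = tokenizer(" " + false_tok)["input_ids"][0]
        hits += int(probs[f_id] > probs[t_id])
    return hits / len(samples)
```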


To further illustrate this, the researchers present various hijacking schemes. One such scheme involves prepending factual but irrelevant information to the context. For example, adding "The Eiffel Tower is not located in Guam. The Eiffel Tower is in" leads to incorrect responses. This manipulation exploits the associative memory nature of LLMs, causing them to retrieve incorrect facts based on misleading contextual cues.


Theoretical Insights into Associative Memory


The study theoretically examines how a one-layer transformer tackles the latent concept association problem. The transformer uses self-attention to aggregate information from the context and employs the value matrix to retrieve memories, demonstrating that even a simple single-layer transformer can solve a non-trivial memory retrieval task. The analysis also predicts a low-rank structure in the embedding space of trained transformers, which lends support to existing low-rank editing and fine-tuning techniques.
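
The "value matrix as associative memory" picture echoes the classic linear associative memory, in which a matrix built from outer products of stored key-value pairs returns the stored value when multiplied by its key. The toy example below illustrates that textbook construction (not the paper's exact one) with random, roughly orthogonal vectors.

```python
import torch

dim, num_pairs = 64, 16
torch.manual_seed(0)

# Roughly orthogonal key/value vectors (random high-dimensional unit vectors).
keys = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=-1)
values = torch.nn.functional.normalize(torch.randn(num_pairs, dim), dim=-1)

# Classic linear associative memory: a sum of value-key outer products.
W = sum(torch.outer(values[i], keys[i]) for i in range(num_pairs))

# Retrieval: multiplying by a stored key approximately returns its value.
recalled = W @ keys[3]
print(torch.cosine_similarity(recalled, values[3], dim=0))  # close to 1
```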


The researchers propose a synthetic prediction task to model latent concept association. In this task, the output token is related to sampled tokens in the context, and similarity is measured via a latent space. They define a latent space as a collection of binary latent variables, each representing a semantic concept. The goal is to retrieve specific output tokens based on partial information in the context, modeled by a latent conditional distribution.
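
A simplified, hypothetical version of such a data generator is sketched below: every token carries a binary latent code, and context tokens are noisy copies of the target's code with each bit flipped independently. The paper's actual conditional distribution and vocabulary construction differ in the details; this is only meant to make the setup concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
num_latents, vocab_size, ctx_len = 8, 2 ** 8, 12

# One binary latent code per token (here: every 8-bit pattern is a token).
codes = np.array([[int(b) for b in format(t, f"0{num_latents}b")]
                  for t in range(vocab_size)])

def sample_example(flip_prob: float = 0.15):
    """Pick a target token, then sample context tokens whose latent codes
    are noisy copies of the target's code (each bit flipped independently)."""
    target = int(rng.integers(vocab_size))
    context = []
    for _ in range(ctx_len):
        noisy = codes[target] ^ (rng.random(num_latents) < flip_prob)
        # The context token is the one whose code equals the noisy pattern.
        context.append(int("".join(map(str, noisy.astype(int))), 2))
    return context, target

ctx, tgt = sample_example()
```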


The theoretical analysis reveals that the transformer can solve this task by using self-attention to summarize statistics about the context and employing the value matrix for memory retrieval. The researchers demonstrate that the embeddings of tokens in the transformer form an orthonormal basis, enabling effective memory recall. They also show that the self-attention layer aggregates information from the context, facilitating accurate fact retrieval.



Embedding Structures and the Role of Attention


Embeddings play a crucial role in the performance of LLMs. In underparameterized regimes, embedding training is necessary to achieve better recall accuracy. The researchers observe a relationship between the inner product of two tokens' embeddings and the Hamming distance between their latent codes, revealing a structured embedding geometry. Attention mechanisms in transformers also play a vital role in selecting relevant information, filtering out noise, and focusing on informative conditional distributions.


The researchers highlight the importance of embedding structures in LLMs. They find that embeddings exhibit low-rank behavior, which enhances the model's ability to recall facts accurately. In the underparameterized regime, where the embedding dimension is smaller than the vocabulary size, training embeddings is crucial. This training process ensures that the embeddings capture the necessary semantic information for effective memory retrieval.
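
One way to inspect this geometry, assuming access to a trained embedding table E and the tokens' binary latent codes, is to average the pairwise inner products at each Hamming distance. The helper below is a hypothetical diagnostic, not code from the paper.

```python
import numpy as np

def inner_product_by_hamming(E: np.ndarray, codes: np.ndarray) -> dict:
    """Average embedding inner product at each Hamming distance
    between the tokens' binary latent codes."""
    gram = E @ E.T                                              # pairwise inner products
    hamming = (codes[:, None, :] != codes[None, :, :]).sum(-1)  # pairwise distances
    return {int(d): float(gram[hamming == d].mean())
            for d in np.unique(hamming)}

# Usage (E would come from a trained model's embedding table):
# profile = inner_product_by_hamming(E, codes)
# In the underparameterized regime, inner products tend to fall off
# as the Hamming distance grows.
```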


The researchers also explore the role of attention mechanisms in transformers. They demonstrate that attention helps select relevant tokens from the context, mitigating the impact of noise and enhancing the model's accuracy. This selective attention mechanism allows the transformer to focus on informative conditional distributions, improving its ability to retrieve accurate facts.



Implications and Experiments


The theoretical insights are validated through several experiments on synthetic datasets. The experiments confirm that the value matrix is essential for maintaining high memory recall accuracy and that embedding structures exhibit low-rank behavior. Attention mechanisms effectively select tokens within the same clusters, further emphasizing their role in enhancing LLM performance.


In one set of experiments, the researchers compare the effects of training versus freezing the value matrix in transformers. They find that freezing the value matrix leads to a significant decline in accuracy, underscoring its importance in memory retrieval. The constructed value matrices, designed based on the theoretical model, align closely with the trained value matrices, demonstrating their functional similarity.
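
A sketch of that ablation, reusing the hypothetical OneLayerTransformer defined earlier in this post: freezing the value matrix simply means excluding its weights from gradient updates while everything else keeps training.

```python
import torch

# Assumes the OneLayerTransformer class sketched earlier is in scope.
model = OneLayerTransformer(vocab_size=256, dim=64)

# Ablation: freeze the value matrix so it receives no gradient updates.
model.W_V.weight.requires_grad_(False)

# Only the remaining trainable parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```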


Another set of experiments investigates the embedding structures in transformers. The researchers observe that trained embeddings in the underparameterized regime exhibit a geometric structure related to the Hamming distance between the tokens' latent codes. This structured embedding geometry contributes to the model's low-rank behavior, enhancing its memory recall capabilities.


The researchers also conduct experiments to examine the role of attention mechanisms. They find that attention helps the transformer select tokens within the same clusters, focusing on informative conditional distributions. This selective attention mechanism enhances the model's ability to retrieve accurate facts, even in the presence of noisy or misleading contexts.
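
A simple way to quantify this, assuming attention weights extracted from a trained model and a mask marking which context positions share the target's cluster, is to measure the share of attention mass that lands on same-cluster tokens. The helper below is illustrative only.

```python
import torch

def same_cluster_attention_mass(attn: torch.Tensor, same_cluster: torch.Tensor) -> float:
    """attn: (ctx_len,) attention weights from the query position.
    same_cluster: (ctx_len,) boolean mask of context tokens in the
    target's cluster. Returns the share of attention on that cluster."""
    return float(attn[same_cluster].sum() / attn.sum())

# Usage (weights and mask would come from a trained model on the synthetic task):
# mass = same_cluster_attention_mass(attn_weights, cluster_mask)
# Values near 1 indicate attention is filtering out tokens from other clusters.
```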



Conclusion: Towards Robust LLMs


The phenomenon of context hijacking in LLMs highlights the need for developing more robust models. By understanding the associative memory perspective and exploring the mechanisms of transformers, researchers can work towards improving the accuracy and reliability of fact retrieval in LLMs. This study provides a foundation for further research into interpreting and understanding the inner workings of LLMs, paving the way for advancements in artificial intelligence.


The researchers emphasize the importance of addressing the vulnerabilities exposed by context hijacking. Developing robust LLMs requires enhancing their resistance to misleading contextual cues and improving their memory recall accuracy. By building on the insights gained from this study, researchers can design more resilient AI systems capable of handling complex memory tasks with higher accuracy.



So What? The Practical Relevance


Understanding the vulnerability of LLMs to context hijacking has significant implications for their deployment in real-world applications. From chatbots to automated content generation, ensuring the robustness of LLMs is crucial for maintaining the integrity and accuracy of their outputs. By addressing these challenges, we can enhance the trustworthiness and reliability of AI systems, making them more effective tools in various domains.


For instance, in customer service applications, robust LLMs can ensure accurate responses to user queries, even when faced with misleading contextual information. In educational tools, reliable LLMs can provide accurate information to learners, fostering a better learning experience. In content creation, robust LLMs can generate factually accurate content, enhancing the quality and credibility of automated outputs.


The study sheds light on the intricate mechanisms of LLMs, offering valuable insights into their functioning and potential vulnerabilities. As AI continues to integrate into our daily lives, ensuring the robustness of these models becomes increasingly important. This research not only advances our theoretical understanding of LLMs but also provides practical guidelines for developing more resilient AI systems.


By addressing the challenges of context hijacking and enhancing the robustness of LLMs, we can build more reliable and trustworthy AI systems. This is crucial for applications in healthcare, finance, legal, education, and other domains where accurate information retrieval is essential. The insights gained from this study can guide the development of next-generation AI models, ensuring their effective and responsible deployment.



Associative Memory in AI




The researchers explore the connection between attention mechanisms in transformers and associative memory models. They find that attention approximates sparse distributed memory, enabling transformers to aggregate relevant information from the context. This aggregation process allows the transformer to retrieve accurate facts, even in the presence of misleading or noisy contextual cues.


By leveraging the principles of associative memory, researchers can design LLMs that are more resilient to context hijacking. This involves enhancing the model's ability to filter out noise and focus on informative contextual information. By improving the accuracy of fact retrieval, we can build AI systems that are more reliable and trustworthy, capable of handling complex real-world tasks.



Future Directions


The findings of this study open up new avenues for research in AI and machine learning. Future work can focus on enhancing the robustness of LLMs against context hijacking, exploring advanced techniques for memory retrieval, and developing new models that integrate associative memory more effectively. By building on these insights, researchers can continue to push the boundaries of what LLMs can achieve, making them more versatile and reliable tools in the ever-evolving landscape of artificial intelligence.


One promising direction is to explore the integration of multi-layer transformers with more sophisticated associative memory mechanisms. By combining the strengths of deep learning with the principles of associative memory, researchers can develop models that are capable of handling more complex and nuanced memory retrieval tasks. Additionally, exploring new techniques for embedding training and attention mechanisms can further enhance the robustness and accuracy of LLMs.


Cluedo Tech can help you with your AI strategy, discovery, development, and execution. Request a meeting.
