
System 1 and System 2 Thinking in AI

Cluedo Tech

The paper "Distilling System 2 into System 1" by Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov of Meta FAIR introduces a creative approach to improving the efficiency and performance of large language models (LLMs). This blog post summarizes the key concepts and implications of the research, aiming to make them accessible to a broader audience, and adds context and explanations for the ideas discussed in the paper.




Understanding System 1 and System 2


In psychology, System 1 and System 2 refer to two types of cognitive processes, a distinction popularized by Daniel Kahneman in his book "Thinking, Fast and Slow":


  • System 1: Fast, automatic, and often unconscious. It is used for routine tasks and quick judgments. For example, recognizing a friend's face or driving a familiar route.

  • System 2: Slow, deliberate, and conscious. It is used for complex problem-solving and reasoning tasks that require more effort and attention. For example, solving a math problem or planning a vacation.



Application in AI


These concepts have been applied to AI to differentiate between types of reasoning processes in LLMs:


  • System 1 in AI: Refers to generating responses directly without intermediate steps. It’s akin to producing an answer in a straightforward manner based on patterns learned during training.

  • System 2 in AI: Involves generating intermediate reasoning steps before arriving at a final response. This process is akin to human deliberative thinking and is used for tasks requiring more complex reasoning.
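
To make this distinction concrete, here is a minimal Python sketch. The llm() function is a hypothetical stand-in for any text-generation API (it is not something defined in the paper); the only point is the difference between asking for the answer directly and eliciting intermediate reasoning first.

    def llm(prompt: str) -> str:
        """Hypothetical stand-in for a call to any LLM text-generation API."""
        raise NotImplementedError

    def system1_answer(question: str) -> str:
        # System 1: ask for the answer directly, with no intermediate steps.
        return llm(f"Question: {question}\nAnswer:")

    def system2_answer(question: str) -> str:
        # System 2: elicit intermediate reasoning first, then produce the final answer.
        reasoning = llm(f"Question: {question}\nLet's think step by step.")
        return llm(f"Question: {question}\nReasoning: {reasoning}\nFinal answer:")

System 2 costs at least one extra generation pass and many extra tokens per query, which is exactly the overhead that the distillation approach discussed later aims to remove.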



Large Language Models and Intermediate Reasoning


Large Language Models (LLMs) like GPT-3, GPT-4, and LLaMA-2 have shown remarkable capabilities in generating human-like text. However, they often struggle with tasks requiring complex reasoning. To address this, researchers have developed various System 2 methods, including:


  • Chain-of-Thought (CoT) Prompting: This method involves generating intermediate reasoning steps to solve problems more effectively.


  • Rephrase and Respond (RaR): A method where the model first rephrases the question to make it clearer and then generates a response (a minimal code sketch of this two-step pattern appears below).


  • System 2 Attention (S2A): A method that reduces reliance on biased information by rewriting the input before generating a response.


  • Branch-Solve-Merge (BSM): A sophisticated method that breaks down a task into sub-tasks, solves them in parallel, and merges the results.


While these System 2 methods improve performance, they are computationally expensive and slow due to the additional intermediate steps.
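
As an illustration of how these methods trade extra computation for quality, here is a minimal sketch of the two-step Rephrase and Respond pattern, reusing the hypothetical llm() helper from the earlier sketch. The prompt wording is illustrative, not the paper's exact prompts.

    def rephrase_and_respond(question: str) -> str:
        # Step 1 (rephrase): ask the model to restate the question more clearly.
        rephrased = llm(
            "Rephrase and expand the following question so it is clearer, "
            f"without answering it:\n{question}"
        )
        # Step 2 (respond): answer the clarified question.
        return llm(f"Question: {rephrased}\nAnswer:")

Each query now requires two full generation passes instead of one, which is why these gains come with a latency and cost penalty.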



The Concept of Distillation


Distillation in AI typically involves transferring knowledge from a larger, more complex model (teacher) to a smaller, more efficient model (student). The goal is to retain the performance of the larger model while reducing computational costs. The authors of the paper extend this idea by distilling the reasoning process itself from System 2 into System 1 within the same model. This means the model can produce high-quality outputs without generating intermediate steps during inference, thus reducing computational costs and improving efficiency.
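
A rough sketch of that idea, assuming (as in the earlier sketches) a hypothetical llm() helper and a System 2 pipeline whose output can be split into intermediate reasoning and a final answer: the distilled training pair keeps only the original input and the final answer, so the intermediate reasoning is paid for once at data-generation time rather than at every inference call.

    def make_distilled_example(question: str) -> dict:
        # Run the slow System 2 pipeline once, offline.
        reasoning = llm(f"Question: {question}\nLet's think step by step.")
        final_answer = llm(
            f"Question: {question}\nReasoning: {reasoning}\nFinal answer:"
        )
        # The reasoning is discarded; only the direct question -> answer
        # mapping is kept as a fine-tuning target for System 1.
        return {"input": question, "target": final_answer}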



Methodology


The authors explore several self-supervised methods to achieve this distillation:


Generating Training Data: Using System 2 methods on unlabeled data to create high-quality outputs.


Filtering and Consistency: Ensuring the quality of these outputs through self-consistency checks. This involves generating multiple outputs for the same input and selecting the most consistent ones.


Fine-Tuning: Training the LLM to match the distilled outputs without generating intermediate steps. This involves fine-tuning the model on the distilled data to improve its System 1 performance.


Steps


System 2 Generation: Apply System 2 methods to a set of unlabeled inputs to produce high-quality outputs.


Filtering: Use self-consistency checks to filter out noisy or low-quality outputs. This involves:

  • Self-Consistency of Outputs: Sampling multiple times and selecting the majority vote.

  • Self-Consistency Under Input Perturbation: Perturbing the input and ensuring the output remains consistent.
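
A minimal sketch of these two checks is below. It assumes a system2_answer_sampled() function that runs the System 2 pipeline with sampling (so repeated calls can disagree) and a perturb() helper that rewrites the input, for example by reordering answer options; both are hypothetical, and the thresholds are illustrative rather than the paper's settings.

    from collections import Counter

    # system2_answer_sampled() and perturb() are hypothetical helpers (see note above).

    def majority_answer(question: str, n_samples: int = 8):
        # Self-consistency of outputs: sample several times, keep the majority
        # answer, and drop the example if no clear majority emerges.
        answers = [system2_answer_sampled(question) for _ in range(n_samples)]
        answer, count = Counter(answers).most_common(1)[0]
        return answer if count / n_samples >= 0.5 else None

    def consistent_under_perturbation(question: str, answer: str, n_trials: int = 3) -> bool:
        # Self-consistency under input perturbation: perturbed versions of the
        # input should still lead to the same majority answer.
        for _ in range(n_trials):
            if majority_answer(perturb(question)) != answer:
                return False
        return True

Only examples that pass both checks are kept as distillation targets.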


Fine-Tuning: Use the filtered outputs to fine-tune the LLM. This training aligns the model’s direct outputs (System 1) with the high-quality outputs of System 2 methods.
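
The fine-tuning itself is ordinary supervised training on the filtered (input, target) pairs. Below is a minimal sketch of the per-example loss, assuming a Hugging Face-style causal language model and tokenizer; it illustrates the standard setup (cross-entropy on the target tokens only), not the paper's actual training code.

    import torch
    import torch.nn.functional as F

    def distillation_sft_loss(model, tokenizer, example) -> torch.Tensor:
        # Concatenate the prompt with the distilled target answer.
        prompt_ids = tokenizer(example["input"], return_tensors="pt").input_ids
        target_ids = tokenizer(
            example["target"], return_tensors="pt", add_special_tokens=False
        ).input_ids
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)

        # Supervise only the target tokens; -100 masks the prompt from the loss.
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100

        logits = model(input_ids).logits
        # Shift so that position t predicts token t + 1.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=-100,
        )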



Experimental Results


The authors conducted experiments using four different System 2 approaches across five tasks, demonstrating the feasibility and effectiveness of System 2 distillation. Here are the key findings:


Rephrase and Respond (RaR)

  • Last Letter Concatenation Task: This task involves concatenating the last letters of given words. The System 1 model achieved an accuracy of 30%, while the System 2 RaR method improved it to 44.5%. Distilling RaR into System 1 achieved a remarkable accuracy of 98%.

  • Coin Flip Reasoning Task: This task involves determining the final state of a coin after a series of flips described in natural language. The System 1 model had a success rate of 56.1%, while the 2-Step RaR improved it to 77.2%. Distilling RaR into System 1 achieved a success rate of 75.69%.


System 2 Attention (S2A)

  • TriviaQA Task: This task involves answering questions with potential biases in the input. The System 1 model had an accuracy of 51.6% on biased inputs, while the S2A method improved it to 76%. Distilling S2A into System 1 achieved an accuracy of 81.3%.


Branch-Solve-Merge (BSM)

  • LLM-as-a-Judge Task: This task involves evaluating responses based on multiple criteria. The BSM method achieved high agreement with human judgments but at a high computational cost. Distilling BSM into System 1 retained the performance improvements while significantly reducing computational costs.


Chain-of-Thought (CoT)

  • GSM8k Task: This task involves solving grade school math word problems. The CoT method significantly improved performance, but the authors found it difficult to distill effectively into System 1, highlighting a limitation of the approach and an area for future research.



Implications and Future Directions


The successful distillation of System 2 into System 1 has several significant implications:


  • Efficiency: Reducing computational costs makes these advanced reasoning techniques more practical for real-world applications. For instance, deploying these models in resource-constrained environments becomes feasible.


  • Scalability: Improved performance without the need for multiple prompts or intermediate steps allows for broader deployment of LLMs in various applications, including real-time systems.


  • Continual Learning: The approach aligns with the concept of continually learning AI systems, which can focus their deliberate effort on the tasks they cannot yet perform well and improve over time.



Conclusion


The research presented in this paper represents a significant advancement in the field of AI, bridging the gap between complex reasoning and efficient performance in LLMs. By distilling System 2 into System 1, the authors have paved the way for more practical and scalable AI applications. This approach not only improves efficiency but also maintains or even enhances the performance of LLMs.


Cluedo Tech can help you with your AI strategy, discovery, development, and execution. Request a meeting.



References and Further Reading


  • Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. "Distilling System 2 into System 1." Meta FAIR.

  • Daniel Kahneman. "Thinking, Fast and Slow."
