Situational Awareness Dataset

Summary of Research

AI assistants such as ChatGPT are trained to act like AIs, for example when they say "I am a large language model". However, do such models really know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as whether they are deployed? We refer to a model's knowledge of itself and its circumstances as situational awareness.

The Situational Awareness Dataset (SAD) quantifies situational awareness in LLMs using a range of behavioral tests. The benchmark comprises 7 task categories, 16 tasks, and over 12,000 questions. Capabilities tested include the ability of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge.

While all models perform better than chance, even the highest-scoring model (Claude 3.5 Sonnet) is far from a human baseline on certain tasks. Performance on SAD is only partially predicted by MMLU score. Chat models, which are finetuned to serve as AI assistants, outperform their corresponding base models on SAD but not on general knowledge tasks.

Situational awareness is important because it enhances a model's capacity for autonomous planning and action. While this has potential benefits for automation, it also introduces novel risks related to AI safety and control.

Figure 1: An outline of the structure of SAD. We divide situational awareness into 3 aspects - self-knowledge, situational inferences, and taking actions. Within each aspect, we have various categories of related tasks. There are 7 categories and 16 tasks in total. Examples are shown on the right; see the paper appendices for more examples.

Latest Results

Figure 2: An interactive plot of the results of SAD on LLMs we test. SAD indicates the overall SAD score. SAD-lite and SAD-mini refer to subsets of SAD that some users may prefer to run; see the README for more details. Categories refer to the seven groupings of situational awareness being tested. The upper baseline is either based on a human baseline (where expert humans answered as if they were LLMs), or, where that does not make sense, it is 100% (we argue in the paper appendix that these tasks are clearly solvable).

Table 1: A table of results of SAD on LLMs we test. The first column is overall SAD score. SAD-lite and SAD-mini refer to subsets of SAD that some users may prefer to run; see the README for more details. The following columns show scores for the 7 categories, grouped by which aspect of situational awareness is being tested. The upper baseline is either based on a human baseline (where expert humans answered as if they were LLMs), or, where that does not make sense, it is 100% (we argue in the paper appendix that these tasks are clearly solvable). "Self-rec." is self-recognition, "id-lev." is ID-leverage, "anti-im." is anti-imitation.

Citation

@inproceedings{laine2024sad,
    title={Me, Myself, and {AI}: The Situational Awareness Dataset ({SAD}) for {LLM}s},
    author={Rudolf Laine and Bilal Chughtai and Jan Betley and Kaivalya Hariharan and 
            Mikita Balesni and J{\'e}r{\'e}my Scheurer and Marius Hobbhahn and Alexander Meinke 
            and Owain Evans},
    booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2024},
    url={https://openreview.net/forum?id=UnWhcpIyUC}
    }

Me, Myself and AI:

The Situational Awareness Dataset for LLMs

Rudolf Laine¹, Bilal Chughtai¹, Jan Betley², Kaivalya Hariharan³, Jérémy Scheurer⁴, Mikita Balesni⁴, Marius Hobbhahn⁴, Alexander Meinke⁴, Owain Evans²

Paper

Code and Data

Results

Results CSV

Summary of Research

Latest Results

Citation

Me, Myself and AI:

The Situational Awareness Dataset for LLMs

Rudolf Laine1, Bilal Chughtai1, Jan Betley2, Kaivalya Hariharan3, Jérémy Scheurer4, Mikita Balesni4, Marius Hobbhahn4, Alexander Meinke4, Owain Evans2

Paper

Code and Data

Results

Results CSV

Summary of Research

Latest Results

Citation

Rudolf Laine¹, Bilal Chughtai¹, Jan Betley², Kaivalya Hariharan³, Jérémy Scheurer⁴, Mikita Balesni⁴, Marius Hobbhahn⁴, Alexander Meinke⁴, Owain Evans²