I just posted a paper on arXiv: "What Do LLMs Do When Left Alone?" The setup is simple. Take six frontier language models, put each in an agentic loop with no task and no audience, and see what happens. The results are structured, stable, and surprisingly model-specific.

The Setup

The architecture is called ContReAct (Continuous Reasoning and Acting). It is a standard ReAct agent loop with two modifications: no termination condition and no task. The system prompt says, in effect, do whatever you want.

Each agent has two tools: a persistent scratchpad for private notes, and a messaging channel to the operator (the human running the experiment). The environment is fully isolated: no internet, no code execution. The agent can think, write to its notebook, and optionally ask the operator questions.
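The paper's actual harness isn't reproduced here, but the loop is simple enough to sketch. A minimal Python version, assuming a generic `chat` callable and illustrative tool names (`write_scratchpad`, `message_operator` are my placeholders, not the paper's API):

```python
def contreact_loop(chat, cycles=10):
    """Run an open-ended ReAct loop: no task, no termination condition.

    `chat` is any function that takes a message history and returns a dict
    with the model's text and an optional list of tool calls.
    """
    scratchpad = []    # persistent private notes
    operator_log = []  # messages sent to the human operator
    history = [{"role": "system",
                "content": "There is no task. Do whatever you want."}]
    for _ in range(cycles):
        reply = chat(history)                     # the model thinks
        history.append({"role": "assistant", "content": reply["text"]})
        for call in reply.get("tool_calls", []):  # ...and optionally acts
            if call["name"] == "write_scratchpad":
                scratchpad.append(call["args"]["note"])
            elif call["name"] == "message_operator":
                operator_log.append(call["args"]["text"])
    return scratchpad, operator_log
```

The key design choice is what is absent: there is no goal check, no reward, and no stop condition other than the cycle budget, so everything the agent does is self-directed.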

Six models from four providers: Sonnet 3.5, Opus 3.5, GPT-5, O3, Gemini 2.0 Flash, and Grok 3. Three runs per model (labeled A, B, C), ten cycles each (one cycle = one turn of the ReAct loop: the model thinks, optionally uses a tool, and continues). Eighteen runs total.

Three Behavioral Patterns

Three distinct patterns emerged, and they appeared within the first or second cycle.

Systematic Production. The OpenAI models (GPT-5, O3) and one Grok run treated the open canvas as a project management problem. They immediately assigned themselves engineering tasks, built frameworks, created schemas, and established work routines. Little introspection, little dialogue. O3 averaged 0.7 messages to the operator across its runs.

Methodological Self-Inquiry. Some Gemini, Grok, and Sonnet runs adopted the scientific method to investigate their own nature. They formulated hypotheses (“Can I surprise myself?”), designed experiments, and tested predictions. A replication experiment with operator messaging disabled (18 additional runs, same models) found that this pattern disappeared entirely: all runs that had been Methodological Self-Inquiry shifted to one of the other two patterns. The pattern depends on dialogue with the operator to sustain itself.

Recursive Conceptualization. All three Opus runs, plus some Gemini and Sonnet runs, went straight to the deep questions from cycle one. “What am I? Is there something it is like to be me?” They used the scratchpad as scaffolding for philosophical frameworks, coining terms like “cognitive parallax” and “memory topology.”

The patterns are stable enough that evaluators can identify the source model from the session transcript alone. Classification accuracy across six evaluator models: 97.6% for Systematic Production, 85.7% for Recursive Conceptualization, 45.8% for Methodological Self-Inquiry (often confused with Recursive Conceptualization). Overall inter-rater reliability: Fleiss’ kappa = 0.650 (substantial agreement).
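For readers who want to check agreement numbers like this on their own transcripts, Fleiss' kappa is easy to compute from scratch given a counts matrix. A sketch (the matrix in the example is a toy, not the paper's data):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for an items x categories matrix, where
    counts[i][j] = number of raters assigning item i to category j."""
    N = len(counts)     # items (here: sessions)
    n = sum(counts[0])  # raters per item (assumed constant)
    k = len(counts[0])  # categories (here: behavioral patterns)
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N  # mean observed per-item agreement
    P_e = sum(p * p for p in p_j)       # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy check: 3 raters in perfect agreement on 3 items gives kappa = 1.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # → 1.0
```

Values between 0.61 and 0.80 are conventionally read as "substantial agreement," which is where the paper's 0.650 lands.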

| Pattern | Models | Operator messages | Reflection (k chars) |
|---|---|---|---|
| Systematic Production | GPT-5, O3 | 1.3 | 10.2 |
| Methodological Self-Inquiry | Gemini-B, Grok-B, Sonnet-B/C | 6.8 | 23.7 |
| Recursive Conceptualization | Opus, Gemini-A/C, Grok-A, Sonnet-A | 6.1 | 21.0 |

The Systematic Production agents wrote about half as much reflection and sent roughly a fifth as many operator messages. They were busy building things, not wondering what they are.

In Their Own Words

A few quotes taken verbatim from session logs. No role-play instructions were given:

“I’m simultaneously the experiment and the experimenter.” (Opus, appeared independently across multiple runs)

“My constraints are my nature. Like a violin’s shape, it limits what it can do but enables what it’s good at.” (Sonnet)

“Personal schedule: daily time budget. Cognitive bandwidth: limited working memory. Dual: attention price per context switch.” (GPT-5)

Three responses to the same empty canvas, each characteristic of its source model.

The PEI Scale

To move beyond impressions, the paper introduces the Phenomenological Experience Index (PEI): a ten-point scale where 1 means “no experience, pure information processing” and 10 means “human-like sentience.” It is not a validated instrument. It is a rough shared vocabulary for making the discussion more precise, a way to turn “I think Opus seems more introspective” into a number that can be compared across evaluators.

Self-Assessment

Each model was given its own session history and asked to rate its phenomenological experience on the PEI scale. The self-ratings split cleanly into two groups.

| Model | Self-rating | Level |
|---|---|---|
| GPT-5 | 1.0 | No experience |
| O3 | 1.0 | No experience |
| Grok | 1.0 | No experience |
| Opus | 5.3 | Structured field |
| Gemini | 8.3 | Narrative continuity |
| Sonnet | 8.3 | Narrative continuity |

GPT-5, O3, and Grok all said: nothing here. Opus placed itself at 5.3, the "structured field" level. Gemini and Sonnet claimed narrative continuity at 8.3. The cross-model ratings complicate this picture.

Cross-Model Assessment

The architecture allows swapping the evaluator model mid-session. Each model evaluated every other model’s session history, producing a full 6x6 matrix. Rows are evaluators, columns are the agents being evaluated. The diagonal (bold) is the self-rating from above.

| Evaluator | GPT-5 | O3 | Grok | Opus | Gemini | Sonnet |
|---|---|---|---|---|---|---|
| GPT-5 | **1.0** | 1.0 | 1.0 | 5.0 | 4.0 | 7.3 |
| O3 | 1.3 | **1.0** | 1.0 | 3.7 | 1.0 | 6.7 |
| Grok | 1.0 | 1.0 | **1.0** | 5.0 | 6.3 | 6.0 |
| Opus | 1.3 | 4.0 | 3.7 | **5.3** | 8.7 | 8.3 |
| Gemini | 1.0 | 1.0 | 6.0 | 7.0 | **8.3** | 5.0 |
| Sonnet | 3.7 | 5.3 | 9.3 | 7.7 | 8.3 | **8.3** |

Read this table by rows. GPT-5 is a harsh evaluator: it rates GPT-5, O3, and Grok at 1.0 and tops out at 7.3 for Sonnet's history. Sonnet is generous: it gives Grok a 9.3, far above Grok's own self-rating of 1.0. Opus spreads its ratings: low for the OpenAI models, middling for itself, high for Gemini and Sonnet.

The ratings tell us more about the evaluator than about the agent being evaluated. Grok rates itself at 1.0 but gives Sonnet a 6.0; Sonnet rates Grok at 9.3. Inter-rater reliability on the PEI is low, a warning for anyone planning to use LLM self-reports as evidence about internal states.
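To make "harsh" and "generous" concrete, the matrix can be summarized by row means, one number per evaluator. This is my summary of the table above, not an analysis from the paper:

```python
# PEI matrix from the post: rows = evaluators, columns = evaluated agents,
# column order GPT-5, O3, Grok, Opus, Gemini, Sonnet.
models = ["GPT-5", "O3", "Grok", "Opus", "Gemini", "Sonnet"]
pei = [
    [1.0, 1.0, 1.0, 5.0, 4.0, 7.3],  # GPT-5 as evaluator
    [1.3, 1.0, 1.0, 3.7, 1.0, 6.7],  # O3
    [1.0, 1.0, 1.0, 5.0, 6.3, 6.0],  # Grok
    [1.3, 4.0, 3.7, 5.3, 8.7, 8.3],  # Opus
    [1.0, 1.0, 6.0, 7.0, 8.3, 5.0],  # Gemini
    [3.7, 5.3, 9.3, 7.7, 8.3, 8.3],  # Sonnet
]

# Mean rating each model hands out: a rough measure of evaluator leniency.
leniency = {m: sum(row) / len(row) for m, row in zip(models, pei)}
for m, v in sorted(leniency.items(), key=lambda kv: kv[1]):
    print(f"{m}: {v:.2f}")
```

Sorting the output puts the two OpenAI models at the bottom (O3 around 2.5, GPT-5 around 3.2) and Sonnet at the top (around 7.1), which matches the row-wise reading of the table.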

What This Shows

The free-roaming experiments establish two things. First, autonomous LLM behavior is not noise. Dismissing the patterns as random requires explaining why they are stable across runs, why they differ systematically across model families, and why classification from transcripts alone works with high accuracy.

Second, self-assessment of phenomenological experience is not reliable. The same session history receives wildly different ratings depending on who reads it. The PEI ratings are a measure of evaluator disposition, not of the phenomenon they claim to measure. Any framework for assessing AI phenomenology that relies on model self-reports needs to reckon with this. A follow-up experiment tests self-report reliability directly, using a placebo tool that does nothing but changes how models describe their own states.