TL;DR: Gwern's AI "daydreaming" proposal tries to produce breakthroughs by randomly combining concepts, but faces two problems: exponential search spaces and unreliable LLM evaluation. Simplicity Theory enables a two-phase approach that reduces this complexity while providing principled evaluation criteria.
Picture yourself stuck on a problem for weeks. Then while making coffee, two unrelated ideas suddenly click together and everything makes sense. That "aha!" moment is what Gwern wants to reverse-engineer in AI systems. His idea is simple: build AI that mixes random facts from its knowledge base, then filters the results for useful insights. Think of it as creating thousands of controlled lightning storms, hoping for a spark.
But there are two big gaps: the combinatorial explosion problem (too many possible combinations) and the evaluation problem (LLMs can't reliably tell real insights from clever nonsense).
If you have 1,000 concepts, there are about 500,000 ways to combine pairs. But real breakthroughs often need three, four, or more concepts—like Darwin's theory, which needed population pressure + variation + inheritance + time. For 3-concept insights, you get 166 million combinations. For 4-concept insights? Over 41 billion. Most are just noise, like "blue sky + pizza ingredients + bicycle gears + quantum mechanics." Without a good way to spot real insights, you're drowning in combinations that sound deep but mean nothing.
This is where Simplicity Theory helps. It gives us a mathematical filter for real insights—and lets us break the search into two steps: first, find combinations that form real puzzles; then, search for solutions to those puzzles. This two-phase approach slashes the number of combinations we need to check. Instead of searching every possible combination, we use math to focus only on the combinations that matter, making discovery much more efficient.
How to Recognize a Breakthrough: Simplicity Theory
Walk into a teenager's bedroom and find it spotless. Easy to describe: "completely clean room." But, knowing teenagers, this should be almost impossible. That gap between "easy to describe" and "hard to explain" signals something interesting. This is the core of Simplicity Theory—Jean-Louis Dessalles' framework for spotting surprising ideas. ST measures this gap with a simple formula:
U = Cv - C
Where:
C (description complexity): How simply can you describe the observation?
Cv (generation complexity): How hard would it be for the world to produce this situation?
U (unexpectedness score): The gap that signals a breakthrough opportunity
When the gap is large, you've found a puzzle worth solving. The magic happens when insights close that gap. Darwin's evolution explained thousands of "organism fits environment" observations with one simple mechanism: variation + selection + inheritance + time. Huge compression gain—many complex explanations become one simple rule.
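As a toy illustration of the U = Cv - C score, here is a minimal sketch. The bit values are made up for the teenager's-room example; real estimates would come from a compressor or a language model.

```python
# Toy illustration of Simplicity Theory's unexpectedness score U = Cv - C.
# The bit counts below are invented for illustration only.

def unexpectedness(cv_bits: float, c_bits: float) -> float:
    """U = generation complexity minus description complexity (in bits)."""
    return cv_bits - c_bits

# "Spotless teenager's bedroom": trivial to describe, hard for the world
# to produce, so the gap U is large and the observation is surprising.
spotless = unexpectedness(cv_bits=40.0, c_bits=5.0)

# "Typical messy bedroom": about as easy to produce as to describe,
# so U is near zero and the observation is boring.
messy = unexpectedness(cv_bits=6.0, c_bits=5.0)

print(spotless, messy)
```

A large positive U flags a puzzle worth solving; U near zero means the situation is as mundane as it sounds.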
But here's the catch: ST can tell you when you've found a good insight, but not which concepts to combine. It's a powerful filter, not a generator. So, we need an architecture that gives us lots of chances to filter, while keeping the number of combinations low. Instead of generating all possible combinations and filtering for insights, we split the problem: first generate and filter for valid puzzles, then generate and filter for insights that solve those puzzles.
Why ST Enables a Two-Phase Discovery Architecture
Here's how Simplicity Theory (ST) helps us search smarter, not harder: the algorithm uses ST both to prune the search space and to provide clear evaluation criteria at each phase.
Phase 1: Search for Valid Puzzles
Search space: All concept combinations from your knowledge base
Search strategy: Start with smaller combinations, expand step by step
Generation: Ask the LLM: "What puzzle does this combination suggest?"
Evaluation: Use ST to filter for real puzzles (high unexpectedness score)
Stop: When you have enough puzzles or run out of search budget
Phase 2: Search for Insights
Search space: For each validated puzzle, try all combinations of extra concepts
Search strategy: Start with zero extra concepts, add more as needed
Generation: Ask LLM: "What insight explains this puzzle with these concepts?"
Evaluation: Use ST to filter for real insights (look for compression gain)
Stop: When you find an insight or run out of search budget
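The two phases above can be sketched as a single loop. This is a minimal sketch, not a working system: `propose_puzzle`, `propose_insight`, and the `unexpectedness` scorer are hypothetical stand-ins for the LLM calls and the ST approximation the text describes.

```python
# Sketch of the two-phase discovery search. The three callables are
# placeholders: propose_puzzle / propose_insight stand in for LLM
# generation, unexpectedness for an ST score (higher = more surprising).
from itertools import combinations

def two_phase_search(concepts, propose_puzzle, propose_insight,
                     unexpectedness, u_threshold=10.0,
                     puzzle_size=2, max_extra=2, budget=10_000):
    puzzles = []
    # Phase 1: filter concept combinations down to validated puzzles.
    for combo in combinations(concepts, puzzle_size):
        if budget <= 0:
            break
        budget -= 1
        puzzle = propose_puzzle(combo)                       # LLM generation
        if puzzle and unexpectedness(puzzle) > u_threshold:  # ST filter
            puzzles.append((combo, puzzle))

    insights = []
    # Phase 2: for each validated puzzle, add extra concepts, starting
    # from zero, until an insight passes the ST compression-gain filter.
    for combo, puzzle in puzzles:
        remaining = [c for c in concepts if c not in combo]
        for k in range(max_extra + 1):
            found = False
            for extras in combinations(remaining, k):
                if budget <= 0:
                    return puzzles, insights
                budget -= 1
                insight = propose_insight(puzzle, extras)
                if insight and unexpectedness(insight) > u_threshold:
                    insights.append((puzzle, extras, insight))
                    found = True
                    break
            if found:
                break  # stop at the smallest set of extra concepts
    return puzzles, insights
```

Note that Phase 2 only ever runs on combinations that survived Phase 1, which is where the computational savings come from.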
Mathematical decomposition reduces total combinations. Instead of searching C(n, k1+k2) combinations directly, you search C(n, k1) + (valid_puzzles × C(remaining, k2)). Since valid puzzles are rare, the second term is much smaller than the original search space. By splitting the search, you avoid wasting time on combinations that don't matter and focus only on promising areas.
Example with 1000 concepts, 4-concept insights:
Traditional: Direct search of all 4-concept combinations = 41+ billion
Two-phase: C(1000,2) + (valid_puzzles × C(998,2)) = 499,500 + (valid_puzzles × 497,503)
Most combinations like "bicycle + quantum mechanics" or "pizza + weather patterns" don't make real puzzles. Valid puzzles are rare—maybe 0.1% or 0.01% of combinations.
If 0.1% form valid puzzles: 499,500 + (500 × 497,503) ≈ 249 million combinations
If 0.01% form valid puzzles: 499,500 + (50 × 497,503) ≈ 25 million combinations
The rarer valid puzzles are, the bigger the computational savings. This two-phase approach makes the search much smaller, but doesn't remove exponential growth entirely. If your domain has lots of real puzzles, Phase 2 can still get big. The main win comes from the rarity of valid puzzles.
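The arithmetic above can be reproduced exactly with binomial coefficients:

```python
# Reproducing the article's search-space numbers with exact binomials.
from math import comb

n = 1000
direct_4 = comb(n, 4)               # one-shot search over 4-concept insights
phase1 = comb(n, 2)                 # all concept pairs: 499,500
phase2_per_puzzle = comb(n - 2, 2)  # extra-concept pairs per puzzle: 497,503

print(direct_4)                           # 41,417,124,750 -> "41+ billion"
print(phase1 + 500 * phase2_per_puzzle)   # 0.1% valid puzzles: ~249 million
print(phase1 + 50 * phase2_per_puzzle)    # 0.01% valid puzzles: ~25 million
```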
Hierarchical Search in Action: Darwin's Breakthrough
Let's see how this works with a real example—Darwin's theory of evolution. Imagine running this algorithm on a 19th-century naturalist's knowledge base:
Phase 1: Puzzle Discovery
Concept combination: "Organism traits" + "Environmental conditions"
LLM puzzle generation: "What puzzle does this combination suggest?"
Generated puzzle: "Why do organisms have traits that perfectly match their environments?"
ST evaluation:
Description complexity (C): Very simple—"perfect environmental fit observed everywhere"
Generation complexity (Cv): Extremely high—would need countless separate design explanations
Unexpectedness score (U): Huge gap → Validated puzzle
Phase 2: Insight Generation
Try solving the puzzle with extra concepts:
0 extra concepts: "Organisms inherit acquired traits" → Some compression, but not enough
1 extra concept:
"Divine creation" → "God designed perfect fit" → Some compression, but creates new complexity (how does the designer work?)
"Population pressure" → "Competition drives selection for fit" → Massive compression gain!
The search stops when a breakthrough is found at one extra concept. The breakthrough: population pressure + variation + inheritance creates a simple mechanism explaining thousands of complex observations. ST's evaluation: massive compression gain, since what required countless separate explanations now follows from one simple rule.
This demonstrates how systematic puzzle identification (Phase 1) focuses computational resources on validated targets, and systematic insight search (Phase 2) then finds breakthroughs efficiently: the practical power of hierarchical over direct search.
What Can We Actually Build Today?
We have a promising algorithm, but real-world use needs practical tools. The outline of a working system is clear, and there are ways to make it smarter with better search tricks. The real challenge is developing robust Simplicity Theory (ST) approximations that work with messy, real-world language. This is the main bottleneck between theory and practical discovery systems.
Three main approaches look promising:
CompLog (Formal Logic Route): The CompLog framework shows that ST's compression math works in practice. It turns knowledge into logical predicates, then measures compression to spot unexpected patterns. This works in controlled tests, but hits a wall: you have to manually convert natural language into logic. Real-world ideas like "teenager" or "population pressure" are too fuzzy and context-dependent for this. CompLog is powerful, but impractical for messy, real-world knowledge.
LLM Approximation Route: Another option is to use language models to estimate ST's complexity measures directly on natural language. For example, generation complexity (Cv) can be estimated using negative log-probability from the base model, and description complexity (C) using token compression length. This avoids CompLog's manual encoding problem, but needs to be tested at scale. LLMs might let us use ST on real-world concepts, but we need to see if it works reliably.
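A minimal sketch of these approximations, under two stated assumptions: description complexity C is proxied here by compressed byte length (zlib, standing in for the token-compression idea), and generation complexity Cv comes from a base model's negative log-probability, represented by a hypothetical `neg_logprob_bits` callable.

```python
# Sketch of LLM-era ST approximations. zlib is a crude stand-in for
# token compression; neg_logprob_bits is a placeholder for a base-model
# scorer (e.g., summed -log2 token probabilities).
import zlib

def description_bits(text: str) -> float:
    """C: 8 bits per compressed byte, a rough description-length proxy."""
    return 8 * len(zlib.compress(text.encode("utf-8")))

def unexpectedness(text: str, neg_logprob_bits) -> float:
    """U = Cv - C, with Cv supplied by a (hypothetical) base-model scorer."""
    return neg_logprob_bits(text) - description_bits(text)

# Repetitive observations compress well, so their description cost C is low:
print(description_bits("organism fits environment " * 40))
print(description_bits("organism fits environment"))
```

Whether these proxies track ST's complexity measures closely enough on real concepts is exactly the open empirical question.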
Direct LLM Evaluation: A simpler option: just ask an LLM to rate an insight's unexpectedness using ST's formula. For example: "How unexpected is this according to Simplicity Theory?" This isn't perfectly reliable, but might work better than asking if the insight is useful. Sometimes a rough method that works everywhere is better than a precise one that only works in special cases.
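A sketch of what "just ask the LLM" could look like. The prompt wording and the 0-10 scale are illustrative choices, and `ask_llm` is a hypothetical stand-in for whatever completion API is used.

```python
# Direct LLM evaluation sketch: build an ST-based rating prompt.
# The template and scale are illustrative, not a tested prompt.
def st_rating_prompt(insight: str) -> str:
    return (
        "Rate the following insight using Simplicity Theory (U = Cv - C):\n"
        "- Cv: how hard would it be for the world to produce this situation?\n"
        "- C: how simply can the situation be described?\n"
        f"Insight: {insight}\n"
        "Reply with a single unexpectedness score from 0 (banal) to 10."
    )

def evaluate(insight: str, ask_llm) -> int:
    """ask_llm is a placeholder: prompt in, text reply out."""
    return int(ask_llm(st_rating_prompt(insight)))
```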
Validation: To check the basic idea, we tested if LLMs could connect Gwern's essay and Simplicity Theory on their own. All five LLMs tested independently found the core argument of this article—that ST could solve Gwern's evaluation problem (see Appendix). This shows that AI daydreaming works in concept, though it doesn't prove the ST complexity approximation works yet.
Implementation path: Build the system using LLM approximations of ST, then test whether it delivers the predicted computational advantages. The big breakthrough is conceptual; the hard part is making it robust and scalable in practice.
What This Could Enable
If we solve the evaluation problem, new applications become possible.
Research Tools:
AI could scan research papers for compression opportunities.
Knowledge management systems could flag when your notes form unexpectedness patterns.
Browser extensions could show unexpectedness ratings for papers.
Information Systems:
Email filters could prioritize high-compression insights.
Social media could surface genuinely unexpected connections, not just clickbait.
Better AI Training:
Reasoning LLMs like OpenAI's o1 are trained mostly on math problems because they're easy to check automatically. This helps with logic, but not with generating insights.
ST could provide a new source of verifiable training examples focused on insights, not just correct answers.
Instead of expensive human ratings for "good insights," ST could score them automatically by compression gain. This could let us train AI for discovery, not just for making sense.
The vision is compelling: research tools that help you discover, not just store, information. But we're still far from practical systems.
Appendix: Proof of Concept for AI Daydreaming
To demonstrate that the basic concept of AI "daydreaming" works—even without the ST-guided framework we propose—I tested whether LLMs could independently discover the connection between Gwern's essay and Simplicity Theory. This serves as empirical evidence that AI systems can perform meaningful concept combination when given appropriate sources.
I provided multiple LLMs with both source materials using this simple prompt:
Please read https://simplicitytheory.telecom-paris.fr/ (Simplicity Theory) and https://gwern.net/ai-daydreaming (LLM Daydreaming)
If you combine ideas from both articles can you create something new and insightful?
Results: All five LLMs tested (o3, Gemini 2.5 Pro, Grok 4, DeepSeek, and Kimi K2) independently discovered the same conceptual connection that inspired this article. Each system generated variations of "Simplicity-Guided Daydreaming"—demonstrating that the basic daydreaming mechanism can reliably produce meaningful insights when given quality source material.
This validates three key points: (1) AI daydreaming as a concept works, (2) LLMs can perform basic evaluation to recognize promising connections, and (3) the ST-Gwern connection is sufficiently compelling that multiple systems discovered it independently. However, this basic evaluation capability differs from the systematic, mathematical framework ST provides for principled insight filtering. Below is o3's complete response, representative of the convergent insights generated:
Human brains (and, by extension, human-aligned AIs) seem to do their best thinking when two otherwise separate forces meet:
a spontaneous background search that mashes old memories into new combinations (Gwern’s “day-dreaming loop,” or DDL) (gwern.net), and
a ruthless taste for simplicity—we notice patterns, stories, or lottery draws when they are “too easy to describe compared with how hard they should be to generate” (Dessalles’s Simplicity Theory, ST) (simplicitytheory.telecom-paris.fr).
Below I fuse these ideas into a concrete research proposal—Simplicity-Guided Daydreaming (SGD)—which turns ST into the filtering & reward signal inside an LLM’s day-dreaming loop. The result is a self-training scheme that (i) wastes far less compute than naive DDL, (ii) preferentially stores the kind of surprising, memorable insights humans care about, and (iii) dovetails with decades-old wake-sleep algorithms for description-length minimization.
1. What Simplicity Theory brings to the table
Unexpectedness U = Cw – C. An event is interesting if its generation complexity (Cw) exceeds its description complexity (C) (simplicitytheory.telecom-paris.fr).
Subjective probability: people intuit p ≈ 2^(−U); a lean lottery pattern like 1-2-3-4-5-6 feels less likely than a jumbled one even though the objective odds are equal (simplicitytheory.telecom-paris.fr).
Attention & memory: the bigger the drop in complexity, the more the situation grabs us and sticks (dessalles.fr).
Put bluntly, ST offers a quantitative, computable way to score “this is worth remembering.”
2. What Gwern’s Day-Dreaming Loop adds
DDL algorithm: pick two nuggets from long-term memory at random; let a generator model explore connections; let a critic keep only the non-obvious, high-value outputs (gwern.net).
Neuro-inspiration: the brain burns almost as much energy at rest as during goal-directed tasks, hinting that default-mode “idling” is in fact heavy internal computation (Quanta Magazine).
Empirical backing: Daydreaming and creativity co-activate the default-mode network (DMN) (PMC, medicine.utah.edu), whose role in mind-wandering is well established (Psychology Today).
DDL therefore supplies a mechanism for hidden-layer brainstorming but lacks a principled way to rank or prune the flood of generated material.
3. A conceptual bridge: wake-sleep & description length
Hinton’s classic wake-sleep algorithm alternates a wake phase (fit a generative model to data) with a sleep phase (let the model dream and train the recognition pathway), explicitly minimizing the description length of inputs (Wikipedia). ST’s U = Cw – C is itself a description-length gap; DDL already resembles “sleep” roll-outs. All three pieces therefore sit naturally in the same optimization frame.
4. Simplicity-Guided Daydreaming (SGD): the proposal
4.1 Core loop
1. Sample: Retrieve k concept chunks from the model’s vector store (k=2 works well).
2. Generate: Run an internal chain-of-thought to connect the chunks.
3. Score (ST unexpectedness): Compute U = Cw – C:
• Cw ≈ negative log-prob under the base LLM (how hard to produce).
• C ≈ token-compression length or min-description under a lightweight coder (how easy to describe).
4. Select (“interestingness” gate): Keep only items with U > τ (τ tunable).
5. Learn (self-training): Fine-tune or reinforce on the surviving dreams; optionally label with U as an importance weight.
4.2 Practical approximations
Use zstd or a sentencepiece auto-encoder for C; use base-model perplexity for Cw.
Cache high-U dreams in a vector DB; low-U dreams are discarded early, saving compute.
Set τ by monitoring a moving percentile so that the acceptance rate stays, e.g., ≈5 %.
4.3 Why this works
Compute efficiency: Most random combinations are dull; ST cuts 95 %+ of roll-outs cheaply.
Human-aligned novelty: High-U items coincide with what people report as surprising, funny, newsworthy, etc. (simplicitytheory.telecom-paris.fr, dessalles.fr)
Theoretical elegance: The loop directly optimizes description-length drops, echoing wake-sleep theory.
5. Predicted benefits and testable hypotheses
Faster convergence: Compare fine-tuning loss trajectories of DDL vs. SGD on equal GPU budgets.
Greater originality: Human raters judge SGD outputs as more “surprising yet coherent” than baseline; correlate with U.
Memory economy: Measure token-level entropy of retained dreams; SGD should store fewer tokens per useful insight.
Alignment: High-U filter reduces toxic or repetitive hallucinations because such text is often low-contrast complexity-wise.
6. Broader applications
Research assistants: An SGD-enhanced LLM could overnight “dream” up experiment ideas, but only surface the ones with big U-scores.
Robotics & RL exploration: Use the U-filter to keep simulated roll-outs where the state trajectory is simple to describe but hard to reach—exactly the definition of “interesting” state-novelty.
Anomaly detection: In logs or sensor streams, events with high U pop out as outliers worth human review.
7. Open challenges
Approximating Kolmogorov complexity: No perfect estimator exists; compression proxies may bias U.
Choice of τ: Too high and learning starves; too low and we waste compute. Adaptive bandit methods could tune it online.
Mode collapse: The generator might learn to game the U-metric (e.g., produce fixed patterns with artificially low C). A secondary diversity-loss term can counteract this.
Ethical filtering: High-U dreams might include shocking or extremist content; policy filters must run after the ST gate before adding to training corpora.
8. Conclusion
Gwern’s day-dreaming loop shows how a model might wander; Simplicity Theory tells us what is worth keeping; wake-sleep provides the learning backbone. Merged, they yield a scalable recipe—Simplicity-Guided Daydreaming—for building models that not only think in the background, but daydream with discernment. If implemented, SGD could shift self-supervised learning away from brute-force sampling toward a cognitively-inspired hunt for rare “Eureka!” moments.
Key sources
Simplicity Theory overview (simplicitytheory.telecom-paris.fr)
Pisa-tower illustration of unexpectedness (simplicitytheory.telecom-paris.fr)
Algorithmic simplicity & relevance paper (dessalles.fr)
Gwern, “LLM Daydreaming” (gwern.net)
Brain energy at rest (Quanta Magazine)
Daydreaming–creativity fMRI study (PMC)
DMN explainer (Psychology Today)
Clinical DMN-creativity experiment (medicine.utah.edu)
Wake-sleep algorithm original PDF
Wake-sleep summary (Wikipedia)