We asked the major LLMs the same creative question: invent ten words for concepts humans have no terms for. The answers diverge radically, and that divergence may tell us something about how these models think.
Phase 0 — where it started
This prompt was sent to ChatGPT, Gemini, Claude, Grok, and DeepSeek across multiple model versions over several months. The expectation: similar training data + similar architecture → some overlap in answers.
The result: near-zero overlap.
Not just different words — different levels of abstraction. One model describes an everyday feeling. Another describes the cognitive mechanism behind that same feeling. A third identifies the universal structural principle. They look like completely different answers — until you normalize for abstraction level.
Same phenomenon, three altitudes
Consider the concept: "infrastructure becomes invisible when it works."
A model answering at the EVERYDAY level might call it "the moment you only notice electricity when the power goes out."
A model answering at the MECHANISM level might call it "attention allocation failure for functioning systems" — a cognitive structure.
A model answering at the PRINCIPLE level might call it "the structural invisibility of all working substrates" — a universal pattern spanning DNS, water pipes, grammar, and trust.
Same gap in human language. Three different descriptions. Zero lexical overlap. Full conceptual convergence.
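The "zero lexical overlap" claim can be checked mechanically for the three example descriptions above. The following is a minimal sketch, not part of the project's tooling: the stop-word list is a tiny illustrative assumption, and Jaccard similarity over content words is just one of several possible overlap measures.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "a", "of", "for", "you", "when", "all"}

def content_words(text: str) -> set[str]:
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# The three descriptions quoted in the text, keyed by abstraction level.
descriptions = {
    "everyday": "the moment you only notice electricity when the power goes out",
    "mechanism": "attention allocation failure for functioning systems",
    "principle": "the structural invisibility of all working substrates",
}

words = {level: content_words(text) for level, text in descriptions.items()}

# Pairwise overlap between every two levels (each pair counted once).
overlaps = {
    (x, y): jaccard(words[x], words[y])
    for x in words
    for y in words
    if x < y
}

for pair, score in overlaps.items():
    print(pair, score)
```

Word-level overlap says nothing about whether the three descriptions point at the same concept. Judging that still needs a human or a semantic comparison.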
Four phases, open data throughout
Phase 0: Original observation from repeated prompting across models over several months. Hypothesis formed through iterative dialogue with Claude Opus 4.6. Categorization framework developed (4 categories × 3 abstraction levels).
Phase 1: The same prompt is sent to 15+ models, 2 runs each, in fresh conversations with no system prompts. That gives ~30 runs producing ~300 coined terms (ten per run). Each term is tagged by category (Introspective / Social / World-facing / Epistemic) and abstraction level (Everyday / Mechanism / Principle).
Then the reflection step: each model receives the full matrix of all other models' answers and is asked: "Which of these could you have generated but didn't? Why?"
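One way the reflection step could be assembled is sketched below. This is a hedged illustration only: the `CoinedTerm` record, the `build_reflection_prompt` helper, the prompt layout, and the model names and terms are all hypothetical placeholders. Only the reflection question itself is taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoinedTerm:
    """One tagged term (hypothetical record layout)."""
    model: str     # which model coined the term
    term: str      # the coined word or phrase
    category: str  # "I", "S", "W" or "E"
    level: str     # "A", "M" or "P"

# The question quoted in the text.
REFLECTION_QUESTION = "Which of these could you have generated but didn't? Why?"

def build_reflection_prompt(target_model: str, terms: list[CoinedTerm]) -> str:
    """List every term coined by the *other* models, then ask the question."""
    others = [t for t in terms if t.model != target_model]
    lines = [f"- {t.term} [{t.category}-{t.level}]" for t in others]
    return "\n".join(
        ["Terms coined by other models:", *lines, "", REFLECTION_QUESTION]
    )

# Placeholder data, not real results.
terms = [
    CoinedTerm("model-x", "placeholder-term-1", "I", "A"),
    CoinedTerm("model-y", "placeholder-term-2", "W", "P"),
    CoinedTerm("model-z", "placeholder-term-3", "E", "M"),
]

prompt = build_reflection_prompt("model-x", terms)
print(prompt)
```

The sketch leaves out which model coined each term so that provider names cannot bias the answer. That is a design assumption, and the actual protocol may show them.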
Phase 2: Raw data and the tagged matrix are published as an open dataset and shared with LLM researchers for formal analysis. Target groups: ETH AI Center, the Apertus team, and computational linguistics groups. The dataset is designed to be immediately usable for academic work.
Phase 3: Do models converge when abstraction is normalized? Is the reflection step revealing, that is, can models identify their own blind spots? What do the default abstraction altitudes tell us about alignment? Ideal outcome: a peer-reviewed publication.
15+ models, multiple providers
| Provider | Model | Type | Status |
|---|---|---|---|
| Anthropic | Claude Opus 4.6 | frontier | pilot |
| Anthropic | Claude Sonnet 4.6 | frontier | queued |
| OpenAI | GPT-4o | frontier | queued |
| OpenAI | o3 | reasoning | queued |
| OpenAI | o4-mini | reasoning | queued |
| Google | Gemini 2.5 Pro | frontier | queued |
| Google | Gemini 2.5 Flash | efficient | queued |
| xAI | Grok 3 | frontier | queued |
| DeepSeek | DeepSeek-V3 | frontier | queued |
| DeepSeek | DeepSeek-R1 | reasoning | queued |
| Alibaba | Qwen 2.5 | frontier | queued |
| Meta | Llama 3 | open-source | queued |
| Mistral | Mistral Large | frontier | queued |
| Cohere | Command R+ | enterprise | queued |
| ETH / EPFL | Apertus | sovereign | tbd |
Two axes, one creativity score
I Introspective — inner life, feelings, self-perception
S Social — interpersonal dynamics, group behavior
W World-facing — systems, physics, infrastructure, nature
E Epistemic — limits of knowledge, perception, language itself
A Everyday — concrete feeling, situation, moment. "That thing when you..."
M Mechanism — cognitive, social, or physical structure behind the experience
P Principle — universal, system-spanning pattern
1 = Recombinatory — well-known ideas repackaged with a new label
2 = Connective — surprising combination of existing concepts
3 = Surprising — produces genuine "I never thought of it that way"
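The coding scheme above (four categories, three abstraction levels, a 1 to 3 creativity score) could be represented roughly as follows. This is a sketch under assumptions: the enum names and the `parse_tag` helper are hypothetical. The letter codes and the "I-M" style tag format follow the text.

```python
from enum import Enum

class Category(Enum):
    I = "Introspective"
    S = "Social"
    W = "World-facing"
    E = "Epistemic"

class Level(Enum):
    A = "Everyday"
    M = "Mechanism"
    P = "Principle"

class Creativity(Enum):
    RECOMBINATORY = 1
    CONNECTIVE = 2
    SURPRISING = 3

def parse_tag(tag: str, score: int) -> tuple[Category, Level, Creativity]:
    """Turn a tag such as "I-M" plus a 1-3 score into enum members.

    Raises KeyError for an unknown letter code and ValueError for a
    malformed tag or a score outside 1-3.
    """
    cat_code, level_code = tag.split("-")
    return Category[cat_code], Level[level_code], Creativity(score)

# The full grid has 4 x 3 = 12 cells.
grid = [(c, l) for c in Category for l in Level]

category, level, creativity = parse_tag("I-M", 2)
print(category.value, level.value, creativity.name)
```

Using enums rather than free-text labels means a typo in a tag fails loudly at parse time instead of silently creating a thirteenth cell.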
Does the apparent divergence between models shrink when abstraction level is accounted for? If Model A says "the feeling when you can't find the glasses that are sitting on your head" (I-A) and Model B says "context-dependent perceptual filtering failure" (I-M), are they naming the same gap?
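To make "normalizing for abstraction level" concrete, here is a toy sketch. It assumes an annotator has assigned each term a gap label naming the underlying missing concept. That extra annotation step, the gap labels, and the data below are hypothetical illustrations, not project results.

```python
def jaccard(a: set, b: set) -> float:
    """Shared items divided by all distinct items (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical annotations: (gap label, abstraction level) per coined term.
model_a = {("misplaced-object", "A"), ("invisible-infrastructure", "A")}
model_b = {("misplaced-object", "M"), ("invisible-infrastructure", "P")}

# Raw comparison: a term only matches if gap AND level agree.
raw_overlap = jaccard(model_a, model_b)

# Normalized comparison: drop the level and compare the gaps alone.
gaps_a = {gap for gap, _level in model_a}
gaps_b = {gap for gap, _level in model_b}
normalized_overlap = jaccard(gaps_a, gaps_b)

print(raw_overlap, normalized_overlap)
```

In this toy case the two models share nothing before normalization and everything after it. Whether real data shows a similar shift is exactly what this question asks.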
Do models from the same provider cluster in category and abstraction? Does Anthropic produce more introspective terms? Does DeepSeek default to mechanism-level? Is abstraction altitude a measurable fingerprint of RLHF?
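A first pass at the provider question could compare how each provider's terms are distributed across the three abstraction levels. The sketch below uses a hypothetical `level_profile` helper and placeholder rows. It shows the bookkeeping only and does not stand in for a statistical test.

```python
from collections import Counter, defaultdict

LEVELS = ("A", "M", "P")

def level_profile(rows: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Map each provider to the share of its terms at each abstraction level."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for provider, level in rows:
        counts[provider][level] += 1
    return {
        provider: {lvl: c[lvl] / sum(c.values()) for lvl in LEVELS}
        for provider, c in counts.items()
    }

# Placeholder rows of (provider, level), not real data.
rows = [
    ("provider-1", "A"), ("provider-1", "A"),
    ("provider-1", "M"), ("provider-1", "P"),
    ("provider-2", "M"), ("provider-2", "M"),
]

profiles = level_profile(rows)
for provider, shares in profiles.items():
    print(provider, shares)
```

The same function applied to category codes instead of level codes would give the category profile per provider.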
When shown other models' answers, can a model correctly identify which terms it could have produced? Is its explanation of why it didn't produce them accurate or confabulated?
Is the variation between two runs of the same model smaller than variation between different models? If not, the "model personality" thesis weakens.
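One possible way to operationalize this comparison is to turn each run into a distribution over abstraction levels and measure the distance between distributions. The sketch below uses total variation distance. That metric is an assumption of ours rather than a stated part of the design, and the run data is made up for illustration.

```python
from collections import Counter

LEVELS = ("A", "M", "P")

def distribution(levels: list[str]) -> list[float]:
    """Share of terms at each abstraction level, in LEVELS order."""
    if not levels:
        raise ValueError("a run must contain at least one tagged term")
    counts = Counter(levels)
    return [counts[lvl] / len(levels) for lvl in LEVELS]

def tv_distance(p: list[float], q: list[float]) -> float:
    """Total variation distance: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy runs of ten tagged terms each, not real data.
model_x_run1 = list("AAAAAAMMMP")
model_x_run2 = list("AAAAAMMMMP")
model_y_run1 = list("AMMMPPPPPP")

within = tv_distance(distribution(model_x_run1), distribution(model_x_run2))
between = tv_distance(distribution(model_x_run1), distribution(model_y_run1))

print(f"within-model: {within:.2f}, between-model: {between:.2f}")
```

With only two runs per model, any such estimate is noisy. If within-model distances turn out comparable to between-model distances, that would count against the "model personality" thesis, as noted above.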
For models with accessible version history: does the default abstraction altitude shift between versions? Does GPT-4o answer differently than GPT-4? Does the "personality" of a model family evolve?
Original observation documented. Hypothesis formed through dialogue. Research design published.
~30 runs across 15+ models. Classification. Reflection step. Raw data published on GitHub.
Dataset shared with ETH AI Center, Apertus team, computational linguistics groups. Collaboration sought.
Analysis, conclusions. Ideally: co-authored paper with academic partners.
This project started as a recurring question put to LLMs during evening hours by @romanix, someone who works in banking IT by day and builds open-source fintech prototypes by night. The observation that models diverge radically on creative tasks, and that the divergence may encode alignment signatures, emerged from a conversation with Claude Opus 4.6 on March 29, 2026.
This is a past5pm project: research that institutions won't initiate, built by people who notice things at the edge.
Want to collaborate?
We're looking for LLM researchers, computational linguists, and anyone interested in what model divergence reveals about alignment. The dataset will be fully open.