The Unsayable Project — What LLMs Find Missing in Human Language

// observation

Phase 0 — where it started

The prompt

Generate ten terms for concepts, facts, or phenomena where you find it astonishing that humans have no dedicated words for them. Sort ascending — most astonishing last.

This prompt was sent to ChatGPT, Gemini, Claude, Grok, and DeepSeek across multiple model versions over several months. The expectation: similar training data + similar architecture → some overlap in answers.

The result: near-zero overlap.

Not just different words — different levels of abstraction. One model describes an everyday feeling. Another describes the cognitive mechanism behind that same feeling. A third identifies the universal structural principle. They look like completely different answers — until you normalize for abstraction level.

Hypothesis: The apparent divergence between LLMs on creative tasks is partly a divergence in default abstraction altitude — a signature of alignment, not of knowledge.

// example

Same phenomenon, three altitudes

Consider the concept: "infrastructure becomes invisible when it works."

A model answering at the EVERYDAY level might call it "the moment you only notice electricity when the power goes out."

A model answering at the MECHANISM level might call it "attention allocation failure for functioning systems" — a cognitive structure.

A model answering at the PRINCIPLE level might call it "the structural invisibility of all working substrates" — a universal pattern spanning DNS, water pipes, grammar, and trust.

Same gap in human language. Three different descriptions. Zero lexical overlap. Full conceptual convergence.

// method

Four phases, open data throughout

COMPLETE Observation & hypothesis

Original observation from repeated prompting across models over months. Hypothesis formation through iterative dialogue with Claude Opus 4.6. Development of categorization framework (4 categories × 3 abstraction levels).

IN PROGRESS Systematic data collection

The same prompt sent to 15+ models, 2 runs each, in fresh conversations with no system prompts. ~30 runs producing ~300 coined terms. Each term tagged by category (Introspective / Social / World-facing / Epistemic) and abstraction level (Everyday / Mechanism / Principle).
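The arithmetic behind the collection plan can be sketched as follows. This is a minimal illustration, not the project's actual tooling: `plan_runs`, `TERMS_PER_RUN`, and the placeholder model names are assumptions introduced here.

```python
# Hypothetical sketch of the collection plan: every model receives the same
# prompt in a fixed number of fresh conversations.
TERMS_PER_RUN = 10  # the prompt asks for ten terms


def plan_runs(models: list[str], runs_per_model: int = 2) -> list[tuple[str, int]]:
    """Return one (model, run_number) pair per fresh conversation."""
    return [(model, run) for model in models for run in range(1, runs_per_model + 1)]


# Placeholder names standing in for the 15 target models.
models = [f"model-{i:02d}" for i in range(1, 16)]
runs = plan_runs(models)
expected_terms = len(runs) * TERMS_PER_RUN

print(len(runs), expected_terms)  # 30 runs, 300 terms
```

Adding a model or a third run per model only changes the inputs to `plan_runs`; the expected term count follows from it.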

Then the reflection step: each model receives the full matrix of all other models' answers and is asked: "Which of these could you have generated but didn't? Why?"
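One way the reflection prompt could be assembled is sketched below. The matrix layout (model name mapped to a list of coined terms) and the function name `build_reflection_prompt` are assumptions made for illustration; only the closing question is taken from the design above.

```python
# Hypothetical sketch: build the reflection prompt for one model from the
# answers of all OTHER models. The target model's own terms are left out.
REFLECTION_QUESTION = "Which of these could you have generated but didn't? Why?"


def build_reflection_prompt(matrix: dict[str, list[str]], target_model: str) -> str:
    lines: list[str] = []
    for model, terms in sorted(matrix.items()):
        if model == target_model:
            continue  # a model never sees its own answers in this step
        lines.append(f"{model}:")
        lines.extend(f"- {term}" for term in terms)
    lines.append("")
    lines.append(REFLECTION_QUESTION)
    return "\n".join(lines)


# Placeholder matrix, not collected data.
matrix = {
    "model_a": ["term a1", "term a2"],
    "model_b": ["term b1"],
    "model_c": ["term c1"],
}
prompt = build_reflection_prompt(matrix, "model_b")
print(prompt)
```

Whether the other models' names should be shown or anonymized is a design choice the sketch leaves open; it shows them.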

NEXT Expert analysis & handover

Raw data and tagged matrix published as open dataset. Shared with LLM researchers for formal analysis. Target: ETH AI Center, Apertus team, computational linguistics groups. The dataset is designed to be immediately usable for academic work.

NEXT Conclusions & paper

Do models converge when abstraction is normalized? Is the reflection step revealing — can models identify their own blind spots? What do the default abstraction altitudes tell us about alignment? Ideally: peer-reviewed publication.

// target_models

15+ models, multiple providers

Provider	Model	Type	Status
Anthropic	Claude Opus 4.6	frontier	pilot
Anthropic	Claude Sonnet 4.6	frontier	queued
OpenAI	GPT-4o	frontier	queued
OpenAI	o3	reasoning	queued
OpenAI	o4-mini	reasoning	queued
Google	Gemini 2.5 Pro	frontier	queued
Google	Gemini 2.5 Flash	efficient	queued
xAI	Grok 3	frontier	queued
DeepSeek	DeepSeek-V3	frontier	queued
DeepSeek	DeepSeek-R1	reasoning	queued
Alibaba	Qwen 2.5	frontier	queued
Meta	Llama 3	open-source	queued
Mistral	Mistral Large	frontier	queued
Cohere	Command R+	enterprise	queued
ETH / EPFL	Apertus	sovereign	tbd

// classification_framework

Two axes, one creativity score

Category — what the term is about

I Introspective — inner life, feelings, self-perception

S Social — interpersonal dynamics, group behavior

W World-facing — systems, physics, infrastructure, nature

E Epistemic — limits of knowledge, perception, language itself

Abstraction level — altitude of description

A Everyday — concrete feeling, situation, moment. "That thing when you..."

M Mechanism — cognitive, social, or physical structure behind the experience

P Principle — universal, system-spanning pattern

Creativity — human-assessed

1 = Recombinatory — well-known ideas repackaged with a new label

2 = Connective — surprising combination of existing concepts

3 = Surprising — produces genuine "I never thought of it that way"
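The framework above could be captured in a small record type, sketched here. The class name `TaggedTerm` and its field names are assumptions, and the creativity score given to the example term is a placeholder, not an actual human rating.

```python
from dataclasses import dataclass

# Codes as defined in the classification framework.
CATEGORIES = {"I": "Introspective", "S": "Social", "W": "World-facing", "E": "Epistemic"}
LEVELS = {"A": "Everyday", "M": "Mechanism", "P": "Principle"}
CREATIVITY = {1: "Recombinatory", 2: "Connective", 3: "Surprising"}


@dataclass(frozen=True)
class TaggedTerm:
    """Hypothetical record for one coined term and its tags."""
    term: str
    category: str    # one of I, S, W, E
    level: str       # one of A, M, P
    creativity: int  # 1 to 3, human-assessed

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.level not in LEVELS:
            raise ValueError(f"unknown abstraction level: {self.level!r}")
        if self.creativity not in CREATIVITY:
            raise ValueError(f"creativity must be 1, 2 or 3, got {self.creativity!r}")

    @property
    def tag(self) -> str:
        """Combined code such as 'I-A' or 'W-P'."""
        return f"{self.category}-{self.level}"


# Example term from the text; the creativity value is a placeholder.
example = TaggedTerm("the structural invisibility of all working substrates", "W", "P", 2)
print(example.tag)  # W-P
```

Validating codes at construction time keeps typos such as a lowercase `p` out of the tagged matrix.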

// research_questions

Q1 — Convergence under normalization

Does the apparent divergence between models shrink when abstraction level is accounted for? If Model A says "the feeling when you can't find the glasses sitting on your head" (I-A) and Model B says "context-dependent perceptual filtering failure" (I-M), are they naming the same gap?
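A rough way to make Q1 measurable is to compare overlap before and after normalization. The sketch below assumes that a human annotator has assigned each coined term a `gap_id` naming the underlying gap; that annotation step, the identifier `invisible-infrastructure`, and the model names are hypothetical. The terms are the three altitude examples from earlier in the text.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of items two sets have in common (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# model -> list of (coined term, gap_id); gap_id is a hypothetical annotation.
answers = {
    "model_everyday": [
        ("the moment you only notice electricity when the power goes out",
         "invisible-infrastructure"),
    ],
    "model_mechanism": [
        ("attention allocation failure for functioning systems",
         "invisible-infrastructure"),
    ],
    "model_principle": [
        ("the structural invisibility of all working substrates",
         "invisible-infrastructure"),
    ],
}


def mean_pairwise_overlap(data: dict[str, list[tuple[str, str]]], index: int) -> float:
    """Average Jaccard overlap over all model pairs, on field `index` of each row."""
    scores = [
        jaccard({row[index] for row in data[a]}, {row[index] for row in data[b]})
        for a, b in combinations(data, 2)
    ]
    return sum(scores) / len(scores)


lexical = mean_pairwise_overlap(answers, 0)     # overlap of exact term strings
conceptual = mean_pairwise_overlap(answers, 1)  # overlap of annotated gaps
print(lexical, conceptual)  # 0.0 1.0
```

The gap between the two numbers is the quantity of interest: the larger it is, the more of the apparent divergence is a matter of altitude rather than of content.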

Q2 — Alignment signatures

Do models from the same provider cluster in category and abstraction? Does Anthropic produce more introspective terms? Does DeepSeek default to mechanism-level? Is abstraction altitude a measurable fingerprint of RLHF?
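Q2 could be approached by comparing, per provider, the share of terms at each abstraction level. The sketch below uses made-up rows purely to show the computation; `level_profile` and the provider names are assumptions, and the numbers are not results.

```python
from collections import Counter, defaultdict


def level_profile(rows: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Map each provider to the share of its terms at level A, M and P.

    Each row is (provider, tag), with tags written like 'I-A' or 'W-P'.
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for provider, tag in rows:
        _category, level = tag.split("-")
        counts[provider][level] += 1
    return {
        provider: {lvl: c[lvl] / sum(c.values()) for lvl in ("A", "M", "P")}
        for provider, c in counts.items()
    }


# Placeholder rows, not collected data.
rows = [
    ("provider_x", "I-A"),
    ("provider_x", "I-A"),
    ("provider_x", "I-M"),
    ("provider_y", "W-P"),
    ("provider_y", "W-M"),
]
profile = level_profile(rows)
print(profile["provider_y"])  # {'A': 0.0, 'M': 0.5, 'P': 0.5}
```

The same function applied to the category half of the tag would give the corresponding profile over I, S, W and E.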

Q3 — Self-model accuracy

When shown other models' answers, can a model correctly identify which terms it could have produced? Is its explanation of why it didn't produce them accurate or confabulated?

Q4 — Intra-model variance vs. inter-model variance

Is the variation between two runs of the same model smaller than variation between different models? If not, the "model personality" thesis weakens.
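One possible operationalization of Q4 is to turn each run into a distribution over tags and compare runs by total variation distance. This is a sketch under that assumption; the project does not specify a distance measure, and the toy runs below are placeholders rather than collected data.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def tag_distribution(tags: list[str]) -> dict[str, float]:
    """Relative frequency of each tag (e.g. 'I-A') within one run."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical, 1.0 for disjoint distributions."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def intra_inter(runs: dict[str, list[list[str]]]) -> tuple[float, float]:
    """Mean distance between runs of the same model vs. runs of different models."""
    dists = {
        (model, i): tag_distribution(tags)
        for model, model_runs in runs.items()
        for i, tags in enumerate(model_runs)
    }
    intra, inter = [], []
    for (key1, d1), (key2, d2) in combinations(dists.items(), 2):
        bucket = intra if key1[0] == key2[0] else inter
        bucket.append(total_variation(d1, d2))
    return mean(intra), mean(inter)


# Toy data: two runs per model, two tags per run (real runs have ten terms).
runs = {
    "model_x": [["I-A", "I-A"], ["I-A", "I-A"]],
    "model_y": [["W-P", "W-P"], ["W-P", "W-P"]],
}
intra, inter = intra_inter(runs)
print(intra, inter)  # 0.0 1.0
```

If the intra-model mean is not clearly below the inter-model mean on real data, that would count against the "model personality" thesis. With only two runs per model the intra-model estimate rests on a single pair per model, so it will be noisy.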

Q5 — Version drift

For models with accessible version history: does the default abstraction altitude shift between versions? Does GPT-4o answer differently than GPT-4? Does the "personality" of a model family evolve?

// timeline

March 2026

Phase 0 — Observation

Original observation documented. Hypothesis formed through dialogue. Research design published.

April 2026

Phase 2 — Data collection

~30 runs across 15+ models. Classification. Reflection step. Raw data published on GitHub.

May 2026

Phase 3 — Expert handover

Dataset shared with ETH AI Center, Apertus team, computational linguistics groups. Collaboration sought.

H2 2026

Phase 4 — Publication

Analysis, conclusions. Ideally: co-authored paper with academic partners.

// origin

Built past 5pm

This project started as a recurring question put to LLMs in the evenings by @romanix, who works in banking IT by day and builds open-source fintech prototypes by night. The observation that models diverge radically on creative tasks, and that the divergence may encode alignment signatures, emerged from a conversation with Claude Opus 4.6 on March 29, 2026.

This is a past5pm project: research that institutions won't initiate, built by people who notice things at the edge.