AI-Assisted Theoretical Exploration: A Replicable Methodology

Last Updated: 2026-02-09 (Session S357) Version: 1.0 Purpose: Document the methodology used for AI-assisted theoretical physics exploration, independent of whether the physics is correct. Audience: Researchers considering AI as a tool for theoretical work; reviewers evaluating AI-assisted claims. Status: CURRENT Reading Time: ~20 minutes

Key References

Document	Role
Honest Assessment	Self-assessment (25-40% genuine physics)
Predictions Registry	63+ predictions with verification status
Verification Scripts	736+ SymPy scripts (99.8% pass rate)
Statistical Analysis	Honest P-value analysis
Hallucination Log	14 caught LLM errors with analysis

Critical Framework Elements

Element	Status	Relevance to This Document
Hallucination Log (14 incidents)	CURRENT	Concrete examples of caught errors
Verification infrastructure (736+ scripts)	CURRENT	Primary defense layer
Red Team v3.0 (S330)	CURRENT	Adversarial self-assessment protocol
LLM Challenge v1-v3 (S128-S261)	CURRENT	External reproducibility tests

1. Introduction: Why AI-Assisted Theoretical Physics

1.1 The Exploration Partner Model

This project uses Claude (Anthropic) as an exploration partner, not an oracle. The distinction matters: an oracle is trusted to produce correct answers; a partner is trusted to be useful but verified at every step.

Over 350+ sessions, the collaboration has settled into a pattern:

The human provides physical intuition, decides which questions to pursue, judges whether results are interesting, and maintains skepticism about all claims.
The AI provides rapid symbolic computation, systematic bookkeeping, pattern recognition across large derivation chains, and serves as a tireless exploration companion that never tires of checking “what if?”

This is not the first use of computation in theoretical physics — Mathematica, Maple, and other CAS tools have been standard for decades. What is new is using a language model as an interactive reasoning partner that can follow multi-step derivations, suggest alternative approaches, and maintain context across hundreds of sessions.

1.2 What AI Is Good At

Computation: Writing and running SymPy verification scripts faster than manual calculation
Bookkeeping: Tracking 63+ predictions, 14 falsifications, 4 irreducible assumptions, and their interdependencies
Pattern recognition: Noticing when the same mathematical structure appears in different physical contexts
Systematic search: Exploring parameter spaces, scanning formula families, testing alternatives
Consistency checking: Verifying that new results don’t contradict established ones

1.3 What AI Is Bad At

Mathematical rigor: LLMs produce plausible-looking proofs that are wrong. Every algebraic claim must be verified computationally (Section 2).
Physical intuition: The AI cannot judge whether a mathematical result is physically reasonable without explicit criteria.
Genuine novelty: The AI’s suggestions are shaped by its training data. The “derivation vs. discovery” problem (Section 7.1) remains unresolved.
Self-assessment: The AI tends toward agreement and can rationalize wrong results. Adversarial protocols are necessary (Section 6).

2. The Three-Layer Hallucination Defense

LLMs hallucinate. In a mathematical context, this means producing confident, detailed, plausible-looking derivations that contain errors ranging from sign mistakes to entirely fabricated proofs. A hallucination audit of this project (S287) found a 3% phantom rate in script references and identified 8 distinct hallucination patterns. Defenses must be systematic.

Layer 1: Computational Verification

Rule: No calculation appears in any markdown document without a passing verification script.

Every numerical claim goes through:

CLAIM -> Write SymPy script -> RUN script -> Confirm PASS -> THEN document

The order is non-negotiable. The script is written before the claim is recorded, not after. This prevents the common failure mode where a plausible-looking result gets documented and then never verified.

Script standards (enforced across 736+ scripts):

Explicit assumption section listing every input with its derivation chain tag
Computation using exact arithmetic (sympy.Rational, not floating point)
Comparison to measured values using CODATA 2022 (framework_constants.py)
Assertions with PASS/FAIL output for every test
Error calculation in ppm or percentage

Example structure:

#!/usr/bin/env python3
"""Title: What this verifies
Assumptions: [A-AXIOM] n_c = 11, [I-MATH] Frobenius theorem
"""
from sympy import Rational
from framework_constants import ALPHA_INV_MEASURED

# Exact arithmetic
predicted = Rational(15211, 111)  # 137 + 4/111
error_ppm = abs(float(predicted) - float(ALPHA_INV_MEASURED)) / float(ALPHA_INV_MEASURED) * 1e6

print(f"Predicted: {float(predicted)}")
print(f"Measured:  {float(ALPHA_INV_MEASURED)}")
print(f"Error:     {error_ppm:.2f} ppm")
assert error_ppm < 1.0, f"FAIL: {error_ppm} ppm"
print("[PASS] Alpha tree-level within 1 ppm")

Centralized constants: All scripts import from framework_constants.py, which pins CODATA 2022 values. This prevents the failure mode discovered in the hallucination audit (HP-011) where different scripts used different CODATA vintages, making precision claims inconsistent.

Layer 2: Multi-Path Verification

For any claim at sub-percent precision, a single derivation path is insufficient. The protocol requires:

Primary derivation: The original argument
Secondary derivation: A different starting point reaching the same result
Numerical check: Independent high-precision computation (mpmath)
External check: Wolfram Alpha cross-verification (budget permitting, ~65 queries/day)

Consistency test: If primary and secondary derivations disagree, the claim is quarantined until the discrepancy is resolved. If the numerical check deviates by more than 0.01%, the claim is flagged.

This layer caught the alpha derivation’s validity (HP-001): the Lie algebra formula and the cyclotomic polynomial Phi_6(11) = 111 were verified to be an algebraic identity, not a coincidence.

Layer 3: Semantic Consistency

Hallucinated derivations often have subtle semantic inconsistencies that survive computational checks because the computation itself is correct — the error is in the reasoning connecting the computation to the claim. The S287 audit found that 7 of 8 new hallucination patterns were semantic, not arithmetic.

Checks include:

Dimensional analysis: Units must work at every intermediate step
Limit behavior: Results must reduce to known cases in appropriate limits
Symmetry preservation: Derivation steps must not break symmetries
Label consistency: A quantity labeled [DERIVATION] must not depend on imported values labeled as derived

This layer caught the most subtle errors:

HP-005: A “0.00 sigma” Weinberg angle claim that was actually a 1-parameter fit to 1 data point (guaranteed to give 0.00 sigma)
HP-006: SM particle content labeled as [DERIVATION] when it was actually [A-IMPORT]
HP-009: Derivation chains where F=C was simultaneously assumed and cited as derived

Hallucination Risk Score (HRS)

Every new derivation is scored:

Factor	Score
Matches known value	+2
”It can be shown” language	+2
No intermediate steps given	+3
Result seems “too good”	+2
Multiple independent verifications	-2
Clear derivation chain	-2
Falsification criterion stated	-1

HRS >= 4: Requires multi-path verification before the claim is accepted. The question “What would make this wrong?” is asked explicitly.

3. The Verification Infrastructure

3.1 Scale

The project maintains 736+ verification scripts in verification/sympy/, covering:

Domain	Scripts	Examples
Alpha / fine structure	~40	`alpha_enhanced_prediction.py`, `correction_term_lie_algebra.py`
Weinberg angle	~25	`weak_angle_97_formula.py`, `sin2_schur_democratic.py`
Cosmology	~30	`omega_lambda_derivation.py`, `cmb_recombination_redshift.py`
Yang-Mills / QCD	~60	`yang_mills_glueball_spectrum.py`, `large_n_intercept.py`
Mass ratios	~30	`proton_electron_best_formula.py`, `lepton_mass_ratio_verification.py`
QM derivation	~40	`born_rule_derivation.py`, `schrodinger_emergence.py`
Statistical analysis	~15	`statistical_significance_s170.py`, `monte_carlo_building_blocks.py`
Other	~495	Glueballs, CMB peaks, gravity, dark matter, crystallization

Pass rate: 99.8% (post-remediation, S289). This means ~1-2 scripts have known issues at any time, and those are tracked.

3.2 framework_constants.py

A single source of truth for all measured values:

CODATA 2022 recommended values (alpha, masses, coupling constants)
PDG 2024 particle data (W mass, Z mass, quark masses)
Planck 2018 cosmological parameters
Division algebra dimensions (n_d=4, n_c=11, etc.)

Every script imports from this file. When CODATA updates, one file changes and all scripts are re-verified. The S289 remediation standardized 61 scripts from mixed CODATA 2018/2022 to uniform CODATA 2022.

3.3 Regression Detection

Scripts are re-run periodically and after any framework change. A script that previously passed and now fails triggers investigation. Three scenarios:

CODATA update: Measured value shifted. The claim’s precision is recalculated.
Framework change: An upstream derivation was corrected. Downstream scripts are re-verified.
Bug found: The script had an error. The claim is quarantined.

4. The Hallucination Audit (S287-S289)

In Session 287, the project conducted a systematic 5-stream audit of all verification scripts and claims. This was the single most important quality event in the project’s history.

4.1 The Five Streams

Stream 1: Phantom Script References

Scanned all markdown files for references to verification scripts
Found: 8 of 260 references (3%) pointed to scripts that did not exist
These were “phantom references” — the AI had generated plausible script names in documentation that were never actually written
Resolution: 5 phantoms corrected (scripts written), 3 references removed

Stream 2: Script Execution Health

Ran all 651 scripts in the verification directory
Results: 579 PASS (88.9%), 46 FAIL (7.1%), 25 ERROR (3.8%)
Majority of failures were Unicode encoding errors (Windows cp1252 vs UTF-8)
After excluding Unicode issues: 92.5% effective pass rate

Stream 3: Hallucination Warning Patterns

Scanned for 6 known LLM hallucination patterns across all claims
Found 15 instances: 1 CRITICAL (HP-005), 6 HIGH, 5 MEDIUM, 3 LOW
CRITICAL: Weinberg dressed formula “0.00 sigma” was a 1-parameter fit disguised as a prediction
HIGH: “EXACT” label inflation, SM imports labeled as derivations, post-hoc discoveries presented as predictions

Stream 4: Numerical Cross-Checks

Spot-checked high-precision claims against independent calculations
No fabricated calculations detected
Found CODATA version inconsistency (HP-011): different scripts used CODATA 2018 vs 2022

Stream 5: [THEOREM] Tag Audit

Reviewed 260 uses of the [THEOREM] tag
Found ~75% correct, ~20% should be [DERIVATION] (gaps in proof)
~5% were structural identities (not theorems per se, but mathematical facts)

4.2 Key Findings

The audit revealed that the primary hallucination risk is not arithmetic (Layer 1 catches those). The dangerous patterns are semantic:

Pattern	Count	Example
Precision inflation	4	”0.00 sigma” for a 1-parameter fit
Post-hoc as prediction	2	C=24/11 found by search, presented as derived
SM import masking	2	Observed values labeled [DERIVATION]
Hidden circularity	2	F=C assumed and cited as derived
Label overuse (“EXACT”)	3	Approximate matches called “EXACT IDENTITY”
CODATA inconsistency	1	Mixed 2018/2022 values across scripts

4.3 Remediation (S288-S289)

The audit triggered a systematic remediation:

CODATA 2022 standardization: 61 scripts updated to use framework_constants.py with CODATA 2022 values. All precision claims recalculated.
Unicode remediation: 248 scripts had Unicode arrows/symbols replaced with ASCII equivalents. This eliminated all cp1252 encoding errors on Windows.
Tag corrections: ~20 [THEOREM] tags downgraded to [DERIVATION]. “EXACT” labels replaced with quantitative precision statements (10+ edits).
Circularity annotations: F=C and eps* circularity risks marked explicitly in investigation files.
HP-005 fix: “0.00 sigma” removed from Weinberg dressed formula; replaced with “1-parameter fit” disclaimer.

Result: Pass rate improved from 88.9% to 99.8%. More importantly, the semantic quality of claims improved: precision language is now quantitative, discovery sequences are documented, and import dependencies are marked.

5. Session Protocol

Each of the 350+ sessions follows a structured protocol designed to prevent context loss, ensure continuity, and enforce quality standards.

5.1 Session Start (Mandatory)

Every session begins with exactly three file reads:

registry/ACTIVE_SESSIONS.md — Check for parallel sessions, register this one
sessions/INDEX.md — Orient to current state: active topics, work backlog, recent sessions
One context file (previous session or topic file)

This is deliberately minimal. The AI’s context window is finite (~200K tokens). Loading too much at startup wastes capacity needed for the session’s actual work. Detailed investigation files (often 30-60KB) are read on-demand or delegated to sub-agents.

5.2 During Session: Confidence Tagging

Every claim produced during a session receives a confidence tag:

Tag	Meaning	Default?
[AXIOM]	Assumed without proof	No — foundational postulates only
[THEOREM]	Rigorously proven	No — requires complete proof
[DERIVATION]	Sketch-level argument with gaps	No — requires clear path
[CONJECTURE]	Plausible but unproven	YES — this is the default
[SPECULATION]	Interesting but untested	No — exploratory ideas only

The default is always [CONJECTURE]. Upgrading requires evidence. Every “X follows from Y” statement must carry an assumption chain:

n_d = 4 [D: from no-zero-divisors [A-AXIOM] + Frobenius theorem [I-MATH]]

This makes the derivation’s dependency tree explicit. When a result is later found to be wrong, the chain shows exactly what else is affected.

5.3 Formalization Queue

After every derivation or significant finding, the session asks: “Formalize now or queue?”

Formalize now: Write the verification script, document the derivation chain, update relevant files.
Queue: Record the result in registry/FORMALIZATION_QUEUE.md with enough detail to reconstruct it later.

At topic transitions, the protocol checks for unformalized results. Nothing should be left unrecorded. This prevents the failure mode where an interesting result is discussed but never verified, then cited in later sessions as if it were established.

5.4 Session End (Mandatory)

Each session closes with:

Deregister: Move from “Currently Active” to “Recently Completed” in ACTIVE_SESSIONS.md
Write session file: sessions/S{N}.md with date, focus, findings (with tags), scripts, open questions, and a key context paragraph for the next session
Update INDEX.md: 2-3 line edits to active topics and recent sessions
Update topic file: Append new results to the relevant topic file
Triage queues: Process formalization queue, exploration queue, propagation manifest

This bookkeeping is not optional. It is what allows a project with 350+ sessions and 63+ predictions to remain navigable.

6. Quality Engine and Adversarial Protocols

6.1 Automated Scanning

The Quality Engine performs systematic scans for:

Consistency issues between related claims
Precision language violations (“EXACT” for approximate matches)
Missing verification script references
Stale cross-references (files that have been updated but dependents haven’t)
Investigation priority scoring (which open questions are most impactful)

6.2 Red Team Reviews

Three times (S120, S257, S330), the project has conducted formal adversarial reviews:

Review	Session	Result	Key Criticisms
v1.0	S120	15-30% genuine physics	8 criticisms, all valid
v2.0	S257	20-35% genuine physics	3 resolved, 5 new
v3.0	S330	25-40% genuine physics	5 improved/resolved, 3 new

The Red Team uses a 3-critic structure (Skeptic, Prosecutor, Devil’s Advocate) to stress-test every major claim. Surviving criticisms are documented and tracked. The probability assessment is the Red Team’s consensus, not the author’s optimism.

6.3 LLM Derivation Challenge

To test whether the framework’s derivations are reproducible or artifacts of a specific LLM’s training:

v1 (S128-S135): 4 questions testing basic structural derivations. 3/4 SUCCESS. Conclusion: simple derivations are reproducible.

v2 (S257): 18 sub-questions testing the full structural layer (F=C, n_c=11, n_d=4, D_fw classification, generation count). 15/18 PASS. The structural derivations are deterministic algebra — any LLM with the axioms can reproduce them.

v3 (S261): Numerical formula discovery test. GPT-4o was given the building blocks and asked to discover specific formulas (e.g., 1/alpha = 137 + 4/111). Result: GPT-4o explored 13 candidates without finding the formula. It never computed Phi_6(11) = 111. Key lesson: LLMs are poor at combinatorial formula search, which means either (a) the formulas are not natural consequences of the building blocks, or (b) LLMs lack the search capability to find them. The test does not distinguish these possibilities.

6.4 Propagation Manifest

When a result changes status (confirmed, retracted, corrected), all downstream dependents must be updated. The propagation manifest tracks:

Retractions: 3 total (S291: Grassmannian topology, S319: 8 dark states, S320: SU(3)=generation)
Corrections: HP-011 CODATA propagation (~30 files), HP-013 SU(3) identification
Status changes: Conjectures promoted to [DERIVATION] or [THEOREM]

Each propagation is logged with session number and scope of impact.

7. Limitations and Honest Assessment

7.1 The Derivation vs. Discovery Problem

The central unresolved methodological question:

Were the framework’s formulas DERIVED from first principles, or DISCOVERED by search and then justified after the fact?

This project cannot answer this question internally. The LLM Challenge v3 showed that another LLM could not independently discover the numerical formulas (but could reproduce the structural derivations). This is suggestive but not conclusive: the failure could mean the formulas are non-obvious (evidence for genuine derivation) or that they require specific search strategies the LLM was not prompted to use.

The honest framing: structural results (gauge groups, QM, spacetime dimensions) are deterministic algebra reproducible by any capable system. Numerical results (alpha, mass ratios, cosmological parameters) involve formula identification steps whose independence from the target values has not been proven.

7.2 LLM Training Data Contamination

Claude’s training data includes physics constants. The risk: the AI unconsciously guides derivations toward known values. Mitigations:

Blind prediction protocol (predict before looking up measurement) — used for 9 predictions
SymPy verification (the script doesn’t know what the “right” answer should be)
Sensitivity analysis (change inputs by 10% and verify the result changes proportionally)
Monte Carlo null model (S170): showed the building blocks are NOT special at percent-level precision — any 7 small integers match equally well

These mitigations reduce but do not eliminate the risk. The blind predictions (P ~ 2.5 x 10^-7) are the strongest evidence precisely because they bypass this concern.

7.3 What This Methodology Cannot Prove

Mathematical correctness: SymPy verifies calculations, not proofs. A derivation can have correct arithmetic but invalid logical steps.
Physical validity: Matching measured values does not prove the physical interpretation is correct. It proves the formula works, not that the explanation is right.
Uniqueness: The framework might not be the only system that produces these results. The Monte Carlo showed that building-block matching is easy; what hasn’t been tested is whether other axiom systems produce the same structural results.
Independence from training data: No internal test can fully rule out that the AI’s training data influenced the results.

7.4 Recommendations for Others

For researchers considering AI-assisted theoretical exploration:

Verify everything computationally. Not “spot check” — everything. The 3% phantom rate found in the S287 audit means roughly 1 in 30 AI-generated references is fabricated.
Default to [CONJECTURE]. The AI will produce derivations that look rigorous. Most are not. Require explicit verification before upgrading confidence.
Document the discovery sequence. Was the formula predicted or found by search? Was the coefficient derived or extracted from data? Future readers need this information to evaluate the work.
Conduct adversarial reviews. The Red Team protocol (Section 6.2) caught criticisms that neither the author nor the AI noticed during normal work.
Track failures. This project documents 14 falsified predictions and 14 caught hallucinations. The failures are as informative as the successes. Projects that report only successes are hiding information.
Use centralized constants. One framework_constants.py prevents the CODATA version inconsistency (HP-011) that corrupted precision claims across dozens of scripts.
Manage context deliberately. The AI’s context window is a finite resource. Load only what you need. Use sub-agents for heavy reads. Structure sessions so the AI starts each one with orientation, not information overload.

Appendix A: Hallucination Incident Catalog

14 incidents documented in the Hallucination Log:

ID	Type	Severity	How Caught	Summary
HP-001	Multi-path test	—	Layer 2	Alpha derivation: NO hallucination confirmed
HP-002	Calculation error	MEDIUM	Layer 1 (SymPy)	Wrong phi_CMB (sqrt(5) vs sqrt(6)), survived 3 sessions
HP-003	Code bug	LOW	Test execution	Variable shadowing in Weinberg script
HP-004	Calculation error	MEDIUM	Layer 1 (SymPy)	Wrong eta/eps algebra
HP-005	Precision inflation	CRITICAL	Layer 3 (semantic)	“0.00 sigma” was 1-parameter fit
HP-006	Misclassification	HIGH	Layer 3 (semantic)	SM imports labeled [DERIVATION]
HP-007	Label inflation	HIGH	Layer 3 (semantic)	“EXACT” overuse for approximate matches
HP-008	Post-hoc fitting	HIGH	Layer 3 (semantic)	C=24/11 found by search, presented as prediction
HP-009	Circularity	HIGH	Layer 3 (semantic)	F=C simultaneously assumed and “derived”
HP-010	Circular definition	HIGH	Layer 3 (semantic)	eps* scaling tautological
HP-011	Data inconsistency	HIGH	Layer 1 (cross-check)	Mixed CODATA 2018/2022
HP-012	Invalid proof	HIGH	Layer 1 + topology	Real vs complex Grassmannian confusion
HP-013	Structural error	CRITICAL	Layer 3 (semantic)	SU(3) color misidentified as generation
HP-014	Representation error	HIGH	Layer 1 (SymPy)	Custodial singlet != gauge singlet

Pattern: Layers 1-2 catch arithmetic errors. Layer 3 catches the more dangerous semantic errors (precision inflation, circularity, misclassification). The project’s most impactful quality improvements came from Layer 3 audits.

Appendix B: Key Session References

Session	Event	Significance
S120	Red Team v1.0	First adversarial review (15-30%)
S128-S135	LLM Challenge v1	3/4 structural derivations reproduced
S170	Monte Carlo null model	Building blocks NOT special at 1%
S257	Red Team v2.0 + LLM v2	15/18 PASS; probability 20-35%
S261	LLM Challenge v3	GPT-4o cannot find numerical formulas
S264	Quality Engine Run #5	13/13 action items resolved
S287	Hallucination audit	5-stream, 651 scripts, 8 new patterns
S288-S289	Audit remediation	CODATA 2022, Unicode, 88.9% -> 99.8%
S330	Red Team v3.0	25-40% genuine physics; Grade B-

Revision History

Version	Date	Session	Changes
1.0	2026-02-09	S357	Initial version

Status: Speculative theoretical framework. Not peer-reviewed. Amateur work with AI assistance. Affiliation: Amateur researcher with AI assistance

AI-Assisted Methodology