v1.0 Last updated: 2026-02-09

AI-Assisted Methodology

How AI was used as an exploration partner, the three-layer hallucination defense, and lessons for AI-assisted theoretical work

Table of Contents

AI-Assisted Theoretical Exploration: A Replicable Methodology

Last Updated: 2026-02-09 (Session S357) Version: 1.0 Purpose: Document the methodology used for AI-assisted theoretical physics exploration, independent of whether the physics is correct. Audience: Researchers considering AI as a tool for theoretical work; reviewers evaluating AI-assisted claims. Status: CURRENT Reading Time: ~20 minutes

Key References

DocumentRole
Honest AssessmentSelf-assessment (25-40% genuine physics)
Predictions Registry63+ predictions with verification status
Verification Scripts736+ SymPy scripts (99.8% pass rate)
Statistical AnalysisHonest P-value analysis
Hallucination Log14 caught LLM errors with analysis

Critical Framework Elements

ElementStatusRelevance to This Document
Hallucination Log (14 incidents)CURRENTConcrete examples of caught errors
Verification infrastructure (736+ scripts)CURRENTPrimary defense layer
Red Team v3.0 (S330)CURRENTAdversarial self-assessment protocol
LLM Challenge v1-v3 (S128-S261)CURRENTExternal reproducibility tests

1. Introduction: Why AI-Assisted Theoretical Physics

1.1 The Exploration Partner Model

This project uses Claude (Anthropic) as an exploration partner, not an oracle. The distinction matters: an oracle is trusted to produce correct answers; a partner is trusted to be useful but verified at every step.

Over 350+ sessions, the collaboration has settled into a pattern:

  • The human provides physical intuition, decides which questions to pursue, judges whether results are interesting, and maintains skepticism about all claims.
  • The AI provides rapid symbolic computation, systematic bookkeeping, pattern recognition across large derivation chains, and serves as a tireless exploration companion that never tires of checking “what if?”

This is not the first use of computation in theoretical physics — Mathematica, Maple, and other CAS tools have been standard for decades. What is new is using a language model as an interactive reasoning partner that can follow multi-step derivations, suggest alternative approaches, and maintain context across hundreds of sessions.

1.2 What AI Is Good At

  1. Computation: Writing and running SymPy verification scripts faster than manual calculation
  2. Bookkeeping: Tracking 63+ predictions, 14 falsifications, 4 irreducible assumptions, and their interdependencies
  3. Pattern recognition: Noticing when the same mathematical structure appears in different physical contexts
  4. Systematic search: Exploring parameter spaces, scanning formula families, testing alternatives
  5. Consistency checking: Verifying that new results don’t contradict established ones

1.3 What AI Is Bad At

  1. Mathematical rigor: LLMs produce plausible-looking proofs that are wrong. Every algebraic claim must be verified computationally (Section 2).
  2. Physical intuition: The AI cannot judge whether a mathematical result is physically reasonable without explicit criteria.
  3. Genuine novelty: The AI’s suggestions are shaped by its training data. The “derivation vs. discovery” problem (Section 7.1) remains unresolved.
  4. Self-assessment: The AI tends toward agreement and can rationalize wrong results. Adversarial protocols are necessary (Section 6).

2. The Three-Layer Hallucination Defense

LLMs hallucinate. In a mathematical context, this means producing confident, detailed, plausible-looking derivations that contain errors ranging from sign mistakes to entirely fabricated proofs. A hallucination audit of this project (S287) found a 3% phantom rate in script references and identified 8 distinct hallucination patterns. Defenses must be systematic.

Layer 1: Computational Verification

Rule: No calculation appears in any markdown document without a passing verification script.

Every numerical claim goes through:

CLAIM -> Write SymPy script -> RUN script -> Confirm PASS -> THEN document

The order is non-negotiable. The script is written before the claim is recorded, not after. This prevents the common failure mode where a plausible-looking result gets documented and then never verified.

Script standards (enforced across 736+ scripts):

  • Explicit assumption section listing every input with its derivation chain tag
  • Computation using exact arithmetic (sympy.Rational, not floating point)
  • Comparison to measured values using CODATA 2022 (framework_constants.py)
  • Assertions with PASS/FAIL output for every test
  • Error calculation in ppm or percentage

Example structure:

#!/usr/bin/env python3
"""Title: What this verifies
Assumptions: [A-AXIOM] n_c = 11, [I-MATH] Frobenius theorem
"""
from sympy import Rational
from framework_constants import ALPHA_INV_MEASURED

# Exact arithmetic
predicted = Rational(15211, 111)  # 137 + 4/111
error_ppm = abs(float(predicted) - float(ALPHA_INV_MEASURED)) / float(ALPHA_INV_MEASURED) * 1e6

print(f"Predicted: {float(predicted)}")
print(f"Measured:  {float(ALPHA_INV_MEASURED)}")
print(f"Error:     {error_ppm:.2f} ppm")
assert error_ppm < 1.0, f"FAIL: {error_ppm} ppm"
print("[PASS] Alpha tree-level within 1 ppm")

Centralized constants: All scripts import from framework_constants.py, which pins CODATA 2022 values. This prevents the failure mode discovered in the hallucination audit (HP-011) where different scripts used different CODATA vintages, making precision claims inconsistent.

Layer 2: Multi-Path Verification

For any claim at sub-percent precision, a single derivation path is insufficient. The protocol requires:

  1. Primary derivation: The original argument
  2. Secondary derivation: A different starting point reaching the same result
  3. Numerical check: Independent high-precision computation (mpmath)
  4. External check: Wolfram Alpha cross-verification (budget permitting, ~65 queries/day)

Consistency test: If primary and secondary derivations disagree, the claim is quarantined until the discrepancy is resolved. If the numerical check deviates by more than 0.01%, the claim is flagged.

This layer caught the alpha derivation’s validity (HP-001): the Lie algebra formula and the cyclotomic polynomial Phi_6(11) = 111 were verified to be an algebraic identity, not a coincidence.

Layer 3: Semantic Consistency

Hallucinated derivations often have subtle semantic inconsistencies that survive computational checks because the computation itself is correct — the error is in the reasoning connecting the computation to the claim. The S287 audit found that 7 of 8 new hallucination patterns were semantic, not arithmetic.

Checks include:

  • Dimensional analysis: Units must work at every intermediate step
  • Limit behavior: Results must reduce to known cases in appropriate limits
  • Symmetry preservation: Derivation steps must not break symmetries
  • Label consistency: A quantity labeled [DERIVATION] must not depend on imported values labeled as derived

This layer caught the most subtle errors:

  • HP-005: A “0.00 sigma” Weinberg angle claim that was actually a 1-parameter fit to 1 data point (guaranteed to give 0.00 sigma)
  • HP-006: SM particle content labeled as [DERIVATION] when it was actually [A-IMPORT]
  • HP-009: Derivation chains where F=C was simultaneously assumed and cited as derived

Hallucination Risk Score (HRS)

Every new derivation is scored:

FactorScore
Matches known value+2
”It can be shown” language+2
No intermediate steps given+3
Result seems “too good”+2
Multiple independent verifications-2
Clear derivation chain-2
Falsification criterion stated-1

HRS >= 4: Requires multi-path verification before the claim is accepted. The question “What would make this wrong?” is asked explicitly.


3. The Verification Infrastructure

3.1 Scale

The project maintains 736+ verification scripts in verification/sympy/, covering:

DomainScriptsExamples
Alpha / fine structure~40alpha_enhanced_prediction.py, correction_term_lie_algebra.py
Weinberg angle~25weak_angle_97_formula.py, sin2_schur_democratic.py
Cosmology~30omega_lambda_derivation.py, cmb_recombination_redshift.py
Yang-Mills / QCD~60yang_mills_glueball_spectrum.py, large_n_intercept.py
Mass ratios~30proton_electron_best_formula.py, lepton_mass_ratio_verification.py
QM derivation~40born_rule_derivation.py, schrodinger_emergence.py
Statistical analysis~15statistical_significance_s170.py, monte_carlo_building_blocks.py
Other~495Glueballs, CMB peaks, gravity, dark matter, crystallization

Pass rate: 99.8% (post-remediation, S289). This means ~1-2 scripts have known issues at any time, and those are tracked.

3.2 framework_constants.py

A single source of truth for all measured values:

  • CODATA 2022 recommended values (alpha, masses, coupling constants)
  • PDG 2024 particle data (W mass, Z mass, quark masses)
  • Planck 2018 cosmological parameters
  • Division algebra dimensions (n_d=4, n_c=11, etc.)

Every script imports from this file. When CODATA updates, one file changes and all scripts are re-verified. The S289 remediation standardized 61 scripts from mixed CODATA 2018/2022 to uniform CODATA 2022.

3.3 Regression Detection

Scripts are re-run periodically and after any framework change. A script that previously passed and now fails triggers investigation. Three scenarios:

  1. CODATA update: Measured value shifted. The claim’s precision is recalculated.
  2. Framework change: An upstream derivation was corrected. Downstream scripts are re-verified.
  3. Bug found: The script had an error. The claim is quarantined.

4. The Hallucination Audit (S287-S289)

In Session 287, the project conducted a systematic 5-stream audit of all verification scripts and claims. This was the single most important quality event in the project’s history.

4.1 The Five Streams

Stream 1: Phantom Script References

  • Scanned all markdown files for references to verification scripts
  • Found: 8 of 260 references (3%) pointed to scripts that did not exist
  • These were “phantom references” — the AI had generated plausible script names in documentation that were never actually written
  • Resolution: 5 phantoms corrected (scripts written), 3 references removed

Stream 2: Script Execution Health

  • Ran all 651 scripts in the verification directory
  • Results: 579 PASS (88.9%), 46 FAIL (7.1%), 25 ERROR (3.8%)
  • Majority of failures were Unicode encoding errors (Windows cp1252 vs UTF-8)
  • After excluding Unicode issues: 92.5% effective pass rate

Stream 3: Hallucination Warning Patterns

  • Scanned for 6 known LLM hallucination patterns across all claims
  • Found 15 instances: 1 CRITICAL (HP-005), 6 HIGH, 5 MEDIUM, 3 LOW
  • CRITICAL: Weinberg dressed formula “0.00 sigma” was a 1-parameter fit disguised as a prediction
  • HIGH: “EXACT” label inflation, SM imports labeled as derivations, post-hoc discoveries presented as predictions

Stream 4: Numerical Cross-Checks

  • Spot-checked high-precision claims against independent calculations
  • No fabricated calculations detected
  • Found CODATA version inconsistency (HP-011): different scripts used CODATA 2018 vs 2022

Stream 5: [THEOREM] Tag Audit

  • Reviewed 260 uses of the [THEOREM] tag
  • Found ~75% correct, ~20% should be [DERIVATION] (gaps in proof)
  • ~5% were structural identities (not theorems per se, but mathematical facts)

4.2 Key Findings

The audit revealed that the primary hallucination risk is not arithmetic (Layer 1 catches those). The dangerous patterns are semantic:

PatternCountExample
Precision inflation4”0.00 sigma” for a 1-parameter fit
Post-hoc as prediction2C=24/11 found by search, presented as derived
SM import masking2Observed values labeled [DERIVATION]
Hidden circularity2F=C assumed and cited as derived
Label overuse (“EXACT”)3Approximate matches called “EXACT IDENTITY”
CODATA inconsistency1Mixed 2018/2022 values across scripts

4.3 Remediation (S288-S289)

The audit triggered a systematic remediation:

  1. CODATA 2022 standardization: 61 scripts updated to use framework_constants.py with CODATA 2022 values. All precision claims recalculated.
  2. Unicode remediation: 248 scripts had Unicode arrows/symbols replaced with ASCII equivalents. This eliminated all cp1252 encoding errors on Windows.
  3. Tag corrections: ~20 [THEOREM] tags downgraded to [DERIVATION]. “EXACT” labels replaced with quantitative precision statements (10+ edits).
  4. Circularity annotations: F=C and eps* circularity risks marked explicitly in investigation files.
  5. HP-005 fix: “0.00 sigma” removed from Weinberg dressed formula; replaced with “1-parameter fit” disclaimer.

Result: Pass rate improved from 88.9% to 99.8%. More importantly, the semantic quality of claims improved: precision language is now quantitative, discovery sequences are documented, and import dependencies are marked.


5. Session Protocol

Each of the 350+ sessions follows a structured protocol designed to prevent context loss, ensure continuity, and enforce quality standards.

5.1 Session Start (Mandatory)

Every session begins with exactly three file reads:

  1. registry/ACTIVE_SESSIONS.md — Check for parallel sessions, register this one
  2. sessions/INDEX.md — Orient to current state: active topics, work backlog, recent sessions
  3. One context file (previous session or topic file)

This is deliberately minimal. The AI’s context window is finite (~200K tokens). Loading too much at startup wastes capacity needed for the session’s actual work. Detailed investigation files (often 30-60KB) are read on-demand or delegated to sub-agents.

5.2 During Session: Confidence Tagging

Every claim produced during a session receives a confidence tag:

TagMeaningDefault?
[AXIOM]Assumed without proofNo — foundational postulates only
[THEOREM]Rigorously provenNo — requires complete proof
[DERIVATION]Sketch-level argument with gapsNo — requires clear path
[CONJECTURE]Plausible but unprovenYES — this is the default
[SPECULATION]Interesting but untestedNo — exploratory ideas only

The default is always [CONJECTURE]. Upgrading requires evidence. Every “X follows from Y” statement must carry an assumption chain:

n_d = 4 [D: from no-zero-divisors [A-AXIOM] + Frobenius theorem [I-MATH]]

This makes the derivation’s dependency tree explicit. When a result is later found to be wrong, the chain shows exactly what else is affected.

5.3 Formalization Queue

After every derivation or significant finding, the session asks: “Formalize now or queue?”

  • Formalize now: Write the verification script, document the derivation chain, update relevant files.
  • Queue: Record the result in registry/FORMALIZATION_QUEUE.md with enough detail to reconstruct it later.

At topic transitions, the protocol checks for unformalized results. Nothing should be left unrecorded. This prevents the failure mode where an interesting result is discussed but never verified, then cited in later sessions as if it were established.

5.4 Session End (Mandatory)

Each session closes with:

  1. Deregister: Move from “Currently Active” to “Recently Completed” in ACTIVE_SESSIONS.md
  2. Write session file: sessions/S{N}.md with date, focus, findings (with tags), scripts, open questions, and a key context paragraph for the next session
  3. Update INDEX.md: 2-3 line edits to active topics and recent sessions
  4. Update topic file: Append new results to the relevant topic file
  5. Triage queues: Process formalization queue, exploration queue, propagation manifest

This bookkeeping is not optional. It is what allows a project with 350+ sessions and 63+ predictions to remain navigable.


6. Quality Engine and Adversarial Protocols

6.1 Automated Scanning

The Quality Engine performs systematic scans for:

  • Consistency issues between related claims
  • Precision language violations (“EXACT” for approximate matches)
  • Missing verification script references
  • Stale cross-references (files that have been updated but dependents haven’t)
  • Investigation priority scoring (which open questions are most impactful)

6.2 Red Team Reviews

Three times (S120, S257, S330), the project has conducted formal adversarial reviews:

ReviewSessionResultKey Criticisms
v1.0S12015-30% genuine physics8 criticisms, all valid
v2.0S25720-35% genuine physics3 resolved, 5 new
v3.0S33025-40% genuine physics5 improved/resolved, 3 new

The Red Team uses a 3-critic structure (Skeptic, Prosecutor, Devil’s Advocate) to stress-test every major claim. Surviving criticisms are documented and tracked. The probability assessment is the Red Team’s consensus, not the author’s optimism.

6.3 LLM Derivation Challenge

To test whether the framework’s derivations are reproducible or artifacts of a specific LLM’s training:

v1 (S128-S135): 4 questions testing basic structural derivations. 3/4 SUCCESS. Conclusion: simple derivations are reproducible.

v2 (S257): 18 sub-questions testing the full structural layer (F=C, n_c=11, n_d=4, D_fw classification, generation count). 15/18 PASS. The structural derivations are deterministic algebra — any LLM with the axioms can reproduce them.

v3 (S261): Numerical formula discovery test. GPT-4o was given the building blocks and asked to discover specific formulas (e.g., 1/alpha = 137 + 4/111). Result: GPT-4o explored 13 candidates without finding the formula. It never computed Phi_6(11) = 111. Key lesson: LLMs are poor at combinatorial formula search, which means either (a) the formulas are not natural consequences of the building blocks, or (b) LLMs lack the search capability to find them. The test does not distinguish these possibilities.

6.4 Propagation Manifest

When a result changes status (confirmed, retracted, corrected), all downstream dependents must be updated. The propagation manifest tracks:

  • Retractions: 3 total (S291: Grassmannian topology, S319: 8 dark states, S320: SU(3)=generation)
  • Corrections: HP-011 CODATA propagation (~30 files), HP-013 SU(3) identification
  • Status changes: Conjectures promoted to [DERIVATION] or [THEOREM]

Each propagation is logged with session number and scope of impact.


7. Limitations and Honest Assessment

7.1 The Derivation vs. Discovery Problem

The central unresolved methodological question:

Were the framework’s formulas DERIVED from first principles, or DISCOVERED by search and then justified after the fact?

This project cannot answer this question internally. The LLM Challenge v3 showed that another LLM could not independently discover the numerical formulas (but could reproduce the structural derivations). This is suggestive but not conclusive: the failure could mean the formulas are non-obvious (evidence for genuine derivation) or that they require specific search strategies the LLM was not prompted to use.

The honest framing: structural results (gauge groups, QM, spacetime dimensions) are deterministic algebra reproducible by any capable system. Numerical results (alpha, mass ratios, cosmological parameters) involve formula identification steps whose independence from the target values has not been proven.

7.2 LLM Training Data Contamination

Claude’s training data includes physics constants. The risk: the AI unconsciously guides derivations toward known values. Mitigations:

  • Blind prediction protocol (predict before looking up measurement) — used for 9 predictions
  • SymPy verification (the script doesn’t know what the “right” answer should be)
  • Sensitivity analysis (change inputs by 10% and verify the result changes proportionally)
  • Monte Carlo null model (S170): showed the building blocks are NOT special at percent-level precision — any 7 small integers match equally well

These mitigations reduce but do not eliminate the risk. The blind predictions (P ~ 2.5 x 10^-7) are the strongest evidence precisely because they bypass this concern.

7.3 What This Methodology Cannot Prove

  1. Mathematical correctness: SymPy verifies calculations, not proofs. A derivation can have correct arithmetic but invalid logical steps.
  2. Physical validity: Matching measured values does not prove the physical interpretation is correct. It proves the formula works, not that the explanation is right.
  3. Uniqueness: The framework might not be the only system that produces these results. The Monte Carlo showed that building-block matching is easy; what hasn’t been tested is whether other axiom systems produce the same structural results.
  4. Independence from training data: No internal test can fully rule out that the AI’s training data influenced the results.

7.4 Recommendations for Others

For researchers considering AI-assisted theoretical exploration:

  1. Verify everything computationally. Not “spot check” — everything. The 3% phantom rate found in the S287 audit means roughly 1 in 30 AI-generated references is fabricated.

  2. Default to [CONJECTURE]. The AI will produce derivations that look rigorous. Most are not. Require explicit verification before upgrading confidence.

  3. Document the discovery sequence. Was the formula predicted or found by search? Was the coefficient derived or extracted from data? Future readers need this information to evaluate the work.

  4. Conduct adversarial reviews. The Red Team protocol (Section 6.2) caught criticisms that neither the author nor the AI noticed during normal work.

  5. Track failures. This project documents 14 falsified predictions and 14 caught hallucinations. The failures are as informative as the successes. Projects that report only successes are hiding information.

  6. Use centralized constants. One framework_constants.py prevents the CODATA version inconsistency (HP-011) that corrupted precision claims across dozens of scripts.

  7. Manage context deliberately. The AI’s context window is a finite resource. Load only what you need. Use sub-agents for heavy reads. Structure sessions so the AI starts each one with orientation, not information overload.


Appendix A: Hallucination Incident Catalog

14 incidents documented in the Hallucination Log:

IDTypeSeverityHow CaughtSummary
HP-001Multi-path testLayer 2Alpha derivation: NO hallucination confirmed
HP-002Calculation errorMEDIUMLayer 1 (SymPy)Wrong phi_CMB (sqrt(5) vs sqrt(6)), survived 3 sessions
HP-003Code bugLOWTest executionVariable shadowing in Weinberg script
HP-004Calculation errorMEDIUMLayer 1 (SymPy)Wrong eta/eps algebra
HP-005Precision inflationCRITICALLayer 3 (semantic)“0.00 sigma” was 1-parameter fit
HP-006MisclassificationHIGHLayer 3 (semantic)SM imports labeled [DERIVATION]
HP-007Label inflationHIGHLayer 3 (semantic)“EXACT” overuse for approximate matches
HP-008Post-hoc fittingHIGHLayer 3 (semantic)C=24/11 found by search, presented as prediction
HP-009CircularityHIGHLayer 3 (semantic)F=C simultaneously assumed and “derived”
HP-010Circular definitionHIGHLayer 3 (semantic)eps* scaling tautological
HP-011Data inconsistencyHIGHLayer 1 (cross-check)Mixed CODATA 2018/2022
HP-012Invalid proofHIGHLayer 1 + topologyReal vs complex Grassmannian confusion
HP-013Structural errorCRITICALLayer 3 (semantic)SU(3) color misidentified as generation
HP-014Representation errorHIGHLayer 1 (SymPy)Custodial singlet != gauge singlet

Pattern: Layers 1-2 catch arithmetic errors. Layer 3 catches the more dangerous semantic errors (precision inflation, circularity, misclassification). The project’s most impactful quality improvements came from Layer 3 audits.


Appendix B: Key Session References

SessionEventSignificance
S120Red Team v1.0First adversarial review (15-30%)
S128-S135LLM Challenge v13/4 structural derivations reproduced
S170Monte Carlo null modelBuilding blocks NOT special at 1%
S257Red Team v2.0 + LLM v215/18 PASS; probability 20-35%
S261LLM Challenge v3GPT-4o cannot find numerical formulas
S264Quality Engine Run #513/13 action items resolved
S287Hallucination audit5-stream, 651 scripts, 8 new patterns
S288-S289Audit remediationCODATA 2022, Unicode, 88.9% -> 99.8%
S330Red Team v3.025-40% genuine physics; Grade B-

Revision History

VersionDateSessionChanges
1.02026-02-09S357Initial version

Status: Speculative theoretical framework. Not peer-reviewed. Amateur work with AI assistance. Affiliation: Amateur researcher with AI assistance

Status: Speculative theoretical framework. Not peer-reviewed. Amateur work with AI assistance.

All mathematical claims are computationally verified via 737+ SymPy scripts.