AI-Assisted Methodology
How AI was used as an exploration partner, the three-layer hallucination defense, and lessons for AI-assisted theoretical work
Table of Contents
AI-Assisted Theoretical Exploration: A Replicable Methodology
Last Updated: 2026-02-09 (Session S357) Version: 1.0 Purpose: Document the methodology used for AI-assisted theoretical physics exploration, independent of whether the physics is correct. Audience: Researchers considering AI as a tool for theoretical work; reviewers evaluating AI-assisted claims. Status: CURRENT Reading Time: ~20 minutes
Key References
| Document | Role |
|---|---|
| Honest Assessment | Self-assessment (25-40% genuine physics) |
| Predictions Registry | 63+ predictions with verification status |
| Verification Scripts | 736+ SymPy scripts (99.8% pass rate) |
| Statistical Analysis | Honest P-value analysis |
| Hallucination Log | 14 caught LLM errors with analysis |
Critical Framework Elements
| Element | Status | Relevance to This Document |
|---|---|---|
| Hallucination Log (14 incidents) | CURRENT | Concrete examples of caught errors |
| Verification infrastructure (736+ scripts) | CURRENT | Primary defense layer |
| Red Team v3.0 (S330) | CURRENT | Adversarial self-assessment protocol |
| LLM Challenge v1-v3 (S128-S261) | CURRENT | External reproducibility tests |
1. Introduction: Why AI-Assisted Theoretical Physics
1.1 The Exploration Partner Model
This project uses Claude (Anthropic) as an exploration partner, not an oracle. The distinction matters: an oracle is trusted to produce correct answers; a partner is trusted to be useful but verified at every step.
Over 350+ sessions, the collaboration has settled into a pattern:
- The human provides physical intuition, decides which questions to pursue, judges whether results are interesting, and maintains skepticism about all claims.
- The AI provides rapid symbolic computation, systematic bookkeeping, pattern recognition across large derivation chains, and serves as a tireless exploration companion that never tires of checking “what if?”
This is not the first use of computation in theoretical physics — Mathematica, Maple, and other CAS tools have been standard for decades. What is new is using a language model as an interactive reasoning partner that can follow multi-step derivations, suggest alternative approaches, and maintain context across hundreds of sessions.
1.2 What AI Is Good At
- Computation: Writing and running SymPy verification scripts faster than manual calculation
- Bookkeeping: Tracking 63+ predictions, 14 falsifications, 4 irreducible assumptions, and their interdependencies
- Pattern recognition: Noticing when the same mathematical structure appears in different physical contexts
- Systematic search: Exploring parameter spaces, scanning formula families, testing alternatives
- Consistency checking: Verifying that new results don’t contradict established ones
1.3 What AI Is Bad At
- Mathematical rigor: LLMs produce plausible-looking proofs that are wrong. Every algebraic claim must be verified computationally (Section 2).
- Physical intuition: The AI cannot judge whether a mathematical result is physically reasonable without explicit criteria.
- Genuine novelty: The AI’s suggestions are shaped by its training data. The “derivation vs. discovery” problem (Section 7.1) remains unresolved.
- Self-assessment: The AI tends toward agreement and can rationalize wrong results. Adversarial protocols are necessary (Section 6).
2. The Three-Layer Hallucination Defense
LLMs hallucinate. In a mathematical context, this means producing confident, detailed, plausible-looking derivations that contain errors ranging from sign mistakes to entirely fabricated proofs. A hallucination audit of this project (S287) found a 3% phantom rate in script references and identified 8 distinct hallucination patterns. Defenses must be systematic.
Layer 1: Computational Verification
Rule: No calculation appears in any markdown document without a passing verification script.
Every numerical claim goes through:
CLAIM -> Write SymPy script -> RUN script -> Confirm PASS -> THEN document
The order is non-negotiable. The script is written before the claim is recorded, not after. This prevents the common failure mode where a plausible-looking result gets documented and then never verified.
Script standards (enforced across 736+ scripts):
- Explicit assumption section listing every input with its derivation chain tag
- Computation using exact arithmetic (
sympy.Rational, not floating point) - Comparison to measured values using CODATA 2022 (
framework_constants.py) - Assertions with PASS/FAIL output for every test
- Error calculation in ppm or percentage
Example structure:
#!/usr/bin/env python3
"""Title: What this verifies
Assumptions: [A-AXIOM] n_c = 11, [I-MATH] Frobenius theorem
"""
from sympy import Rational
from framework_constants import ALPHA_INV_MEASURED
# Exact arithmetic
predicted = Rational(15211, 111) # 137 + 4/111
error_ppm = abs(float(predicted) - float(ALPHA_INV_MEASURED)) / float(ALPHA_INV_MEASURED) * 1e6
print(f"Predicted: {float(predicted)}")
print(f"Measured: {float(ALPHA_INV_MEASURED)}")
print(f"Error: {error_ppm:.2f} ppm")
assert error_ppm < 1.0, f"FAIL: {error_ppm} ppm"
print("[PASS] Alpha tree-level within 1 ppm")
Centralized constants: All scripts import from framework_constants.py, which pins CODATA 2022 values. This prevents the failure mode discovered in the hallucination audit (HP-011) where different scripts used different CODATA vintages, making precision claims inconsistent.
Layer 2: Multi-Path Verification
For any claim at sub-percent precision, a single derivation path is insufficient. The protocol requires:
- Primary derivation: The original argument
- Secondary derivation: A different starting point reaching the same result
- Numerical check: Independent high-precision computation (mpmath)
- External check: Wolfram Alpha cross-verification (budget permitting, ~65 queries/day)
Consistency test: If primary and secondary derivations disagree, the claim is quarantined until the discrepancy is resolved. If the numerical check deviates by more than 0.01%, the claim is flagged.
This layer caught the alpha derivation’s validity (HP-001): the Lie algebra formula and the cyclotomic polynomial Phi_6(11) = 111 were verified to be an algebraic identity, not a coincidence.
Layer 3: Semantic Consistency
Hallucinated derivations often have subtle semantic inconsistencies that survive computational checks because the computation itself is correct — the error is in the reasoning connecting the computation to the claim. The S287 audit found that 7 of 8 new hallucination patterns were semantic, not arithmetic.
Checks include:
- Dimensional analysis: Units must work at every intermediate step
- Limit behavior: Results must reduce to known cases in appropriate limits
- Symmetry preservation: Derivation steps must not break symmetries
- Label consistency: A quantity labeled [DERIVATION] must not depend on imported values labeled as derived
This layer caught the most subtle errors:
- HP-005: A “0.00 sigma” Weinberg angle claim that was actually a 1-parameter fit to 1 data point (guaranteed to give 0.00 sigma)
- HP-006: SM particle content labeled as [DERIVATION] when it was actually [A-IMPORT]
- HP-009: Derivation chains where F=C was simultaneously assumed and cited as derived
Hallucination Risk Score (HRS)
Every new derivation is scored:
| Factor | Score |
|---|---|
| Matches known value | +2 |
| ”It can be shown” language | +2 |
| No intermediate steps given | +3 |
| Result seems “too good” | +2 |
| Multiple independent verifications | -2 |
| Clear derivation chain | -2 |
| Falsification criterion stated | -1 |
HRS >= 4: Requires multi-path verification before the claim is accepted. The question “What would make this wrong?” is asked explicitly.
3. The Verification Infrastructure
3.1 Scale
The project maintains 736+ verification scripts in verification/sympy/, covering:
| Domain | Scripts | Examples |
|---|---|---|
| Alpha / fine structure | ~40 | alpha_enhanced_prediction.py, correction_term_lie_algebra.py |
| Weinberg angle | ~25 | weak_angle_97_formula.py, sin2_schur_democratic.py |
| Cosmology | ~30 | omega_lambda_derivation.py, cmb_recombination_redshift.py |
| Yang-Mills / QCD | ~60 | yang_mills_glueball_spectrum.py, large_n_intercept.py |
| Mass ratios | ~30 | proton_electron_best_formula.py, lepton_mass_ratio_verification.py |
| QM derivation | ~40 | born_rule_derivation.py, schrodinger_emergence.py |
| Statistical analysis | ~15 | statistical_significance_s170.py, monte_carlo_building_blocks.py |
| Other | ~495 | Glueballs, CMB peaks, gravity, dark matter, crystallization |
Pass rate: 99.8% (post-remediation, S289). This means ~1-2 scripts have known issues at any time, and those are tracked.
3.2 framework_constants.py
A single source of truth for all measured values:
- CODATA 2022 recommended values (alpha, masses, coupling constants)
- PDG 2024 particle data (W mass, Z mass, quark masses)
- Planck 2018 cosmological parameters
- Division algebra dimensions (n_d=4, n_c=11, etc.)
Every script imports from this file. When CODATA updates, one file changes and all scripts are re-verified. The S289 remediation standardized 61 scripts from mixed CODATA 2018/2022 to uniform CODATA 2022.
3.3 Regression Detection
Scripts are re-run periodically and after any framework change. A script that previously passed and now fails triggers investigation. Three scenarios:
- CODATA update: Measured value shifted. The claim’s precision is recalculated.
- Framework change: An upstream derivation was corrected. Downstream scripts are re-verified.
- Bug found: The script had an error. The claim is quarantined.
4. The Hallucination Audit (S287-S289)
In Session 287, the project conducted a systematic 5-stream audit of all verification scripts and claims. This was the single most important quality event in the project’s history.
4.1 The Five Streams
Stream 1: Phantom Script References
- Scanned all markdown files for references to verification scripts
- Found: 8 of 260 references (3%) pointed to scripts that did not exist
- These were “phantom references” — the AI had generated plausible script names in documentation that were never actually written
- Resolution: 5 phantoms corrected (scripts written), 3 references removed
Stream 2: Script Execution Health
- Ran all 651 scripts in the verification directory
- Results: 579 PASS (88.9%), 46 FAIL (7.1%), 25 ERROR (3.8%)
- Majority of failures were Unicode encoding errors (Windows cp1252 vs UTF-8)
- After excluding Unicode issues: 92.5% effective pass rate
Stream 3: Hallucination Warning Patterns
- Scanned for 6 known LLM hallucination patterns across all claims
- Found 15 instances: 1 CRITICAL (HP-005), 6 HIGH, 5 MEDIUM, 3 LOW
- CRITICAL: Weinberg dressed formula “0.00 sigma” was a 1-parameter fit disguised as a prediction
- HIGH: “EXACT” label inflation, SM imports labeled as derivations, post-hoc discoveries presented as predictions
Stream 4: Numerical Cross-Checks
- Spot-checked high-precision claims against independent calculations
- No fabricated calculations detected
- Found CODATA version inconsistency (HP-011): different scripts used CODATA 2018 vs 2022
Stream 5: [THEOREM] Tag Audit
- Reviewed 260 uses of the [THEOREM] tag
- Found ~75% correct, ~20% should be [DERIVATION] (gaps in proof)
- ~5% were structural identities (not theorems per se, but mathematical facts)
4.2 Key Findings
The audit revealed that the primary hallucination risk is not arithmetic (Layer 1 catches those). The dangerous patterns are semantic:
| Pattern | Count | Example |
|---|---|---|
| Precision inflation | 4 | ”0.00 sigma” for a 1-parameter fit |
| Post-hoc as prediction | 2 | C=24/11 found by search, presented as derived |
| SM import masking | 2 | Observed values labeled [DERIVATION] |
| Hidden circularity | 2 | F=C assumed and cited as derived |
| Label overuse (“EXACT”) | 3 | Approximate matches called “EXACT IDENTITY” |
| CODATA inconsistency | 1 | Mixed 2018/2022 values across scripts |
4.3 Remediation (S288-S289)
The audit triggered a systematic remediation:
- CODATA 2022 standardization: 61 scripts updated to use
framework_constants.pywith CODATA 2022 values. All precision claims recalculated. - Unicode remediation: 248 scripts had Unicode arrows/symbols replaced with ASCII equivalents. This eliminated all cp1252 encoding errors on Windows.
- Tag corrections: ~20 [THEOREM] tags downgraded to [DERIVATION]. “EXACT” labels replaced with quantitative precision statements (10+ edits).
- Circularity annotations: F=C and eps* circularity risks marked explicitly in investigation files.
- HP-005 fix: “0.00 sigma” removed from Weinberg dressed formula; replaced with “1-parameter fit” disclaimer.
Result: Pass rate improved from 88.9% to 99.8%. More importantly, the semantic quality of claims improved: precision language is now quantitative, discovery sequences are documented, and import dependencies are marked.
5. Session Protocol
Each of the 350+ sessions follows a structured protocol designed to prevent context loss, ensure continuity, and enforce quality standards.
5.1 Session Start (Mandatory)
Every session begins with exactly three file reads:
registry/ACTIVE_SESSIONS.md— Check for parallel sessions, register this onesessions/INDEX.md— Orient to current state: active topics, work backlog, recent sessions- One context file (previous session or topic file)
This is deliberately minimal. The AI’s context window is finite (~200K tokens). Loading too much at startup wastes capacity needed for the session’s actual work. Detailed investigation files (often 30-60KB) are read on-demand or delegated to sub-agents.
5.2 During Session: Confidence Tagging
Every claim produced during a session receives a confidence tag:
| Tag | Meaning | Default? |
|---|---|---|
| [AXIOM] | Assumed without proof | No — foundational postulates only |
| [THEOREM] | Rigorously proven | No — requires complete proof |
| [DERIVATION] | Sketch-level argument with gaps | No — requires clear path |
| [CONJECTURE] | Plausible but unproven | YES — this is the default |
| [SPECULATION] | Interesting but untested | No — exploratory ideas only |
The default is always [CONJECTURE]. Upgrading requires evidence. Every “X follows from Y” statement must carry an assumption chain:
n_d = 4 [D: from no-zero-divisors [A-AXIOM] + Frobenius theorem [I-MATH]]
This makes the derivation’s dependency tree explicit. When a result is later found to be wrong, the chain shows exactly what else is affected.
5.3 Formalization Queue
After every derivation or significant finding, the session asks: “Formalize now or queue?”
- Formalize now: Write the verification script, document the derivation chain, update relevant files.
- Queue: Record the result in
registry/FORMALIZATION_QUEUE.mdwith enough detail to reconstruct it later.
At topic transitions, the protocol checks for unformalized results. Nothing should be left unrecorded. This prevents the failure mode where an interesting result is discussed but never verified, then cited in later sessions as if it were established.
5.4 Session End (Mandatory)
Each session closes with:
- Deregister: Move from “Currently Active” to “Recently Completed” in
ACTIVE_SESSIONS.md - Write session file:
sessions/S{N}.mdwith date, focus, findings (with tags), scripts, open questions, and a key context paragraph for the next session - Update INDEX.md: 2-3 line edits to active topics and recent sessions
- Update topic file: Append new results to the relevant topic file
- Triage queues: Process formalization queue, exploration queue, propagation manifest
This bookkeeping is not optional. It is what allows a project with 350+ sessions and 63+ predictions to remain navigable.
6. Quality Engine and Adversarial Protocols
6.1 Automated Scanning
The Quality Engine performs systematic scans for:
- Consistency issues between related claims
- Precision language violations (“EXACT” for approximate matches)
- Missing verification script references
- Stale cross-references (files that have been updated but dependents haven’t)
- Investigation priority scoring (which open questions are most impactful)
6.2 Red Team Reviews
Three times (S120, S257, S330), the project has conducted formal adversarial reviews:
| Review | Session | Result | Key Criticisms |
|---|---|---|---|
| v1.0 | S120 | 15-30% genuine physics | 8 criticisms, all valid |
| v2.0 | S257 | 20-35% genuine physics | 3 resolved, 5 new |
| v3.0 | S330 | 25-40% genuine physics | 5 improved/resolved, 3 new |
The Red Team uses a 3-critic structure (Skeptic, Prosecutor, Devil’s Advocate) to stress-test every major claim. Surviving criticisms are documented and tracked. The probability assessment is the Red Team’s consensus, not the author’s optimism.
6.3 LLM Derivation Challenge
To test whether the framework’s derivations are reproducible or artifacts of a specific LLM’s training:
v1 (S128-S135): 4 questions testing basic structural derivations. 3/4 SUCCESS. Conclusion: simple derivations are reproducible.
v2 (S257): 18 sub-questions testing the full structural layer (F=C, n_c=11, n_d=4, D_fw classification, generation count). 15/18 PASS. The structural derivations are deterministic algebra — any LLM with the axioms can reproduce them.
v3 (S261): Numerical formula discovery test. GPT-4o was given the building blocks and asked to discover specific formulas (e.g., 1/alpha = 137 + 4/111). Result: GPT-4o explored 13 candidates without finding the formula. It never computed Phi_6(11) = 111. Key lesson: LLMs are poor at combinatorial formula search, which means either (a) the formulas are not natural consequences of the building blocks, or (b) LLMs lack the search capability to find them. The test does not distinguish these possibilities.
6.4 Propagation Manifest
When a result changes status (confirmed, retracted, corrected), all downstream dependents must be updated. The propagation manifest tracks:
- Retractions: 3 total (S291: Grassmannian topology, S319: 8 dark states, S320: SU(3)=generation)
- Corrections: HP-011 CODATA propagation (~30 files), HP-013 SU(3) identification
- Status changes: Conjectures promoted to [DERIVATION] or [THEOREM]
Each propagation is logged with session number and scope of impact.
7. Limitations and Honest Assessment
7.1 The Derivation vs. Discovery Problem
The central unresolved methodological question:
Were the framework’s formulas DERIVED from first principles, or DISCOVERED by search and then justified after the fact?
This project cannot answer this question internally. The LLM Challenge v3 showed that another LLM could not independently discover the numerical formulas (but could reproduce the structural derivations). This is suggestive but not conclusive: the failure could mean the formulas are non-obvious (evidence for genuine derivation) or that they require specific search strategies the LLM was not prompted to use.
The honest framing: structural results (gauge groups, QM, spacetime dimensions) are deterministic algebra reproducible by any capable system. Numerical results (alpha, mass ratios, cosmological parameters) involve formula identification steps whose independence from the target values has not been proven.
7.2 LLM Training Data Contamination
Claude’s training data includes physics constants. The risk: the AI unconsciously guides derivations toward known values. Mitigations:
- Blind prediction protocol (predict before looking up measurement) — used for 9 predictions
- SymPy verification (the script doesn’t know what the “right” answer should be)
- Sensitivity analysis (change inputs by 10% and verify the result changes proportionally)
- Monte Carlo null model (S170): showed the building blocks are NOT special at percent-level precision — any 7 small integers match equally well
These mitigations reduce but do not eliminate the risk. The blind predictions (P ~ 2.5 x 10^-7) are the strongest evidence precisely because they bypass this concern.
7.3 What This Methodology Cannot Prove
- Mathematical correctness: SymPy verifies calculations, not proofs. A derivation can have correct arithmetic but invalid logical steps.
- Physical validity: Matching measured values does not prove the physical interpretation is correct. It proves the formula works, not that the explanation is right.
- Uniqueness: The framework might not be the only system that produces these results. The Monte Carlo showed that building-block matching is easy; what hasn’t been tested is whether other axiom systems produce the same structural results.
- Independence from training data: No internal test can fully rule out that the AI’s training data influenced the results.
7.4 Recommendations for Others
For researchers considering AI-assisted theoretical exploration:
-
Verify everything computationally. Not “spot check” — everything. The 3% phantom rate found in the S287 audit means roughly 1 in 30 AI-generated references is fabricated.
-
Default to [CONJECTURE]. The AI will produce derivations that look rigorous. Most are not. Require explicit verification before upgrading confidence.
-
Document the discovery sequence. Was the formula predicted or found by search? Was the coefficient derived or extracted from data? Future readers need this information to evaluate the work.
-
Conduct adversarial reviews. The Red Team protocol (Section 6.2) caught criticisms that neither the author nor the AI noticed during normal work.
-
Track failures. This project documents 14 falsified predictions and 14 caught hallucinations. The failures are as informative as the successes. Projects that report only successes are hiding information.
-
Use centralized constants. One
framework_constants.pyprevents the CODATA version inconsistency (HP-011) that corrupted precision claims across dozens of scripts. -
Manage context deliberately. The AI’s context window is a finite resource. Load only what you need. Use sub-agents for heavy reads. Structure sessions so the AI starts each one with orientation, not information overload.
Appendix A: Hallucination Incident Catalog
14 incidents documented in the Hallucination Log:
| ID | Type | Severity | How Caught | Summary |
|---|---|---|---|---|
| HP-001 | Multi-path test | — | Layer 2 | Alpha derivation: NO hallucination confirmed |
| HP-002 | Calculation error | MEDIUM | Layer 1 (SymPy) | Wrong phi_CMB (sqrt(5) vs sqrt(6)), survived 3 sessions |
| HP-003 | Code bug | LOW | Test execution | Variable shadowing in Weinberg script |
| HP-004 | Calculation error | MEDIUM | Layer 1 (SymPy) | Wrong eta/eps algebra |
| HP-005 | Precision inflation | CRITICAL | Layer 3 (semantic) | “0.00 sigma” was 1-parameter fit |
| HP-006 | Misclassification | HIGH | Layer 3 (semantic) | SM imports labeled [DERIVATION] |
| HP-007 | Label inflation | HIGH | Layer 3 (semantic) | “EXACT” overuse for approximate matches |
| HP-008 | Post-hoc fitting | HIGH | Layer 3 (semantic) | C=24/11 found by search, presented as prediction |
| HP-009 | Circularity | HIGH | Layer 3 (semantic) | F=C simultaneously assumed and “derived” |
| HP-010 | Circular definition | HIGH | Layer 3 (semantic) | eps* scaling tautological |
| HP-011 | Data inconsistency | HIGH | Layer 1 (cross-check) | Mixed CODATA 2018/2022 |
| HP-012 | Invalid proof | HIGH | Layer 1 + topology | Real vs complex Grassmannian confusion |
| HP-013 | Structural error | CRITICAL | Layer 3 (semantic) | SU(3) color misidentified as generation |
| HP-014 | Representation error | HIGH | Layer 1 (SymPy) | Custodial singlet != gauge singlet |
Pattern: Layers 1-2 catch arithmetic errors. Layer 3 catches the more dangerous semantic errors (precision inflation, circularity, misclassification). The project’s most impactful quality improvements came from Layer 3 audits.
Appendix B: Key Session References
| Session | Event | Significance |
|---|---|---|
| S120 | Red Team v1.0 | First adversarial review (15-30%) |
| S128-S135 | LLM Challenge v1 | 3/4 structural derivations reproduced |
| S170 | Monte Carlo null model | Building blocks NOT special at 1% |
| S257 | Red Team v2.0 + LLM v2 | 15/18 PASS; probability 20-35% |
| S261 | LLM Challenge v3 | GPT-4o cannot find numerical formulas |
| S264 | Quality Engine Run #5 | 13/13 action items resolved |
| S287 | Hallucination audit | 5-stream, 651 scripts, 8 new patterns |
| S288-S289 | Audit remediation | CODATA 2022, Unicode, 88.9% -> 99.8% |
| S330 | Red Team v3.0 | 25-40% genuine physics; Grade B- |
Revision History
| Version | Date | Session | Changes |
|---|---|---|---|
| 1.0 | 2026-02-09 | S357 | Initial version |
Status: Speculative theoretical framework. Not peer-reviewed. Amateur work with AI assistance. Affiliation: Amateur researcher with AI assistance
Status: Speculative theoretical framework. Not peer-reviewed. Amateur work with AI assistance.
All mathematical claims are computationally verified via 737+ SymPy scripts.