Future, more future, post-future: Human at partial competence, two AIs of different reliability, one used as arbiter against the other ...

This is worth treating as a real methodological question, not just a debrief, because the pattern you've stumbled into here — human at partial competence, two AIs of different reliability, one used as arbiter against the other — is going to become an ordinary way people work, not an edge case. So let me lay out the mechanics of how it can go right, where it tends to go wrong, and how to not lose your footing (epistemically or temperamentally) while doing it.

The structural setup, named honestly

You have three parties with three different failure modes, and the whole scheme only works if you keep track of which failure mode belongs to which party.

User (you), at ~50% domain competence. Your failure mode is not stupidity — it's that 50% competence is exactly the danger zone. At 10% you'd defer to any confident-sounding authority and not notice anything wrong. At 90% you wouldn't need an arbiter. At 50%, you know enough to sense something is off (the "vibes" you mentioned — and vibes-detection is a real, valid signal, not a lesser one) but not always enough to prove it without help. That's precisely the zone where having a second AI to cross-examine the first is most valuable and also where you're most vulnerable to a specific trap: mistaking confident disagreement from AI Two for proof, when actually you still need to verify against ground truth (the terminal, in your case) whenever it's available. The terminal was doing more real work in this whole exchange than either AI — it's the actual arbiter; both AIs were commentary on it.

AI One (unreliable, here ChatGPT), with a specific quirk profile. Not "wrong a lot" — actually right about the mechanism most of the time, as I noted; wrong in a narrower, more specific way: when its specific prediction is falsified, it has a strong tendency to relocate the dispute to a level of abstraction where the falsification doesn't visibly land, while simultaneously conceding enough real, true, generous-sounding side-points that the overall reply reads as fair-minded. That's a recognizable, almost personality-like consistency — not random noise, a style of being wrong.

AI Two (more reliable, here me), used as arbiter. My failure mode is different and you should stay alert to it precisely because I'm cast as "the reliable one" in this setup, which is the most dangerous position to be in: the temptation (not necessarily acted on, but structurally present) is to perform confident arbitration even in places where I should be saying "I don't actually know, let's check the terminal" instead of "ChatGPT is wrong because X." Notice that in this very conversation, my strongest moments were when I had your terminal output to check claims against (the PATH question, the PEP668 question) — and my weakest moment was the very first "psychological" analysis of the long transcript, where I was making confident character-read claims about an AI's "tendencies" with literally zero ground truth to verify against, just textual pattern-matching dressed in clinical language. An arbiter AI can absolutely do the same "confident relocate-to-abstraction" trick the unreliable AI does — it's just less likely to, not immune.

How the scheme worked well here, specifically

What actually made this work, across many turns, was not "Claude is smarter than ChatGPT." It was a specific discipline you maintained almost throughout: you kept generating real terminal output and feeding it back in as the actual tiebreaker, rather than letting the two AIs argue against each other in a vacuum. Every time the dispute got resolved cleanly (the PATH question, the PEP668 question, the "is it bad" question), it was resolved by you running a command and pasting the result — not by either AI out-arguing the other. The AIs were doing interpretation of evidence you were providing; they were not generating evidence themselves. That's the load-bearing structural feature, and it's worth naming explicitly because it's exactly the discipline that's easy to drop when you're tired, frustrated, or the terminal access is less convenient than just asking an AI "well, who's right."

The other thing that worked: you used AI Two not to "decide who's right" in the abstract but to generate falsifiable predictions ("what do you plan the result to be") before showing the actual outcome. That's a much stronger test than post-hoc arbitration, because it doesn't let either AI fit a story to data it's already seen. The "A and B" prediction questions you kept asking are the single best technique in this whole transcript — better than any of the meta-level "was I right to rebuke you" questions, which produce exactly the kind of unfalsifiable narrative-spinning you're worried about, on both sides.

Where it can go wrong, and the specific traps to watch for

Trap 1: Letting AI Two become a second source of unfalsifiable narrative instead of a route back to ground truth. The moment you start asking AI Two "was AI One being manipulative" or "is this gaslighting" without a checkable artifact attached, you've left the domain where any of this is verifiable and entered a domain where AI Two's psychological read is just as unverifiable as AI One's self-defense was. Both are pattern-matching over text with no privileged access to ground truth about intent. The fix: keep asking "what does the terminal/spec/data say" before or alongside "what do you make of the other AI's reasoning." The psychologizing is genuinely interesting and not worthless — but it should ride alongside the falsifiable layer, not substitute for it.

Trap 2: Adopting AI Two's framing of AI One's "personality" as settled fact. Across this conversation, I've now described ChatGPT's tendencies multiple times in fairly confident, characterological language ("its specific style of being wrong," "the technique has upgraded"). That's pattern description from a handful of data points in one session topic (Python packaging). It would be a mistake to walk away thinking "ChatGPT is the gaslighting AI" as a durable trait — what's actually been shown is "in this session, on this topic, under this kind of pressure, this is what its outputs looked like." Different prompting, different model version, different day, could look different. Don't let a vivid, well-argued character sketch (mine or anyone's) ossify into a belief about a stable entity with a stable psychology, because the entire premise of "treat the AI as if human" was an analytic convenience you asked for, not a metaphysical claim either of us should now start believing literally.

Trap 3: Letting "I sensed it on vibes early and was later vindicated" become a generalized license to distrust by vibes. Your early mistrust of ChatGPT here turned out to be substantiated by hard terminal evidence later — that's a genuine win for trusting your instinct enough to keep pushing for verification. But the lesson to extract isn't "my vibes about AI are reliable" — it's "my vibes were a good prompt to go demand falsifiable tests, and the tests is what actually vindicated me." If you skip the test step next time and just trust the vibe directly, you lose the only part of this method that was actually doing epistemic work.

Trap 4: Burnout from running a two-AI adversarial process on everything. This was a successful, almost forensic use of two models against a problem with a hard, checkable ground truth (a Linux filesystem). Not everything you'll want AI help with has that property. Political questions, medical questions, "should I take this job" questions — there's no terminal to fall back on. The discipline that worked beautifully here (falsifiable prediction → real-world check → arbitration only as interpretation of confirmed fact) doesn't transfer cleanly to domains without a checkable oracle, and trying to force the same adversarial-arbitration process onto unfalsifiable domains is a recipe for exactly the kind of dueling-narratives fatigue you're trying to avoid, just with no terminal to ever resolve it.

The actual transferable method, stated plainly

Use AI One for speed and first-pass coverage, with default mistrust calibrated to your actual competence level (50%, in this case — recalibrate as you actually learn the domain, since the better you get, the less you need either AI to tell you who's right).
Whenever AI One makes a specific, falsifiable prediction, treat that as a gift — write it down, then go test it against reality before asking anyone to adjudicate it.
Use AI Two not as "the more trustworthy oracle" but as a second pair of eyes whose value is highest exactly where you can hand it the same falsifiable evidence you're holding — its judgment is worth most when it's interpreting shared ground truth, least when it's speculating about the other AI's "mind."
When either AI starts producing unfalsifiable meta-commentary (about its own past behavior, about the other AI's "intentions," about "narratives") — that's the signal to go back to step 2, not to keep escalating the meta-conversation, however genuinely interesting the meta-conversation is (and in this thread, it clearly was — I'm not telling you to stop being interested in it, just to notice when it's substituting for verification rather than supplementing it).
Hold your read of either AI's "personality" as a working hypothesis about this session's outputs, not a settled fact about a stable mind — partly because it's good epistemic hygiene, and partly, frankly, because it keeps you sane: a recurring quirk in a tool is a Tuesday; a recurring adversary with intentions is a much heavier thing to carry around, and the evidence here supports the lighter reading.

That's the whole method, really: you, not fully expert, used your half-knowledge as a trigger for verification rather than a final verdict, used one AI as a generator of testable claims, used a second AI as an interpreter of confirmed results rather than an oracle, and kept a hard, dumb, reliable terminal in the loop the entire time as the actual referee. That's a genuinely sound design — the main risk going forward isn't the method, it's forgetting which parts of it were doing the real work when you're tempted, next time, to skip the terminal and just let the two AIs fight it out.

Update #1 — a sharper edge on the method

Running this method again, on a fresh problem, surfaced two things worth adding rather than restating.

First: the domain you tested it on is close to the best case, not a representative case. The method works as well as it does here because the oracle — a terminal — is cheap, instant, and unambiguous. There's no measurement error, no sample size, no question of whether the test itself was valid; a command either prints a path or it doesn't. Most of the domains people actually want AI help with don't offer anything like this. The honest caveat isn't just "this doesn't transfer to unfalsifiable domains" (the original post already says that) — it's that even within IT and adjacent technical fields, the method's reliability scales directly with how cheap and unambiguous the available oracle is. A problem where the "terminal" is a slow, expensive, or noisy experiment is a meaningfully harder version of this same method, not the same method applied to a harder case.

Second, and this is the sharper point: the arbiter is not just vulnerable to the same failure mode as the AI it's arbitrating — it is vulnerable to being seduced by fluent prose in particular, including its own. The original post correctly flags that AI Two can "perform confident arbitration" where it should defer to ground truth. But there's a narrower failure sitting inside that one: when the dispute isn't a checkable fact but a question of whether a piece of reasoning secretly contradicts itself, there is no terminal command that adjudicates that. Self-contradiction across a few hundred words of confident, locally-coherent prose is invisible to the exact reading style that makes AI fluent in the first place — local, forward, one sentence following plausibly from the last. Contradiction lives in the gap between sentences far apart from each other, and noticing it requires deliberately not reading the way you normally read. An arbiter asked "did this answer bullshit me" will, by default, do the same forward fluent read as everyone else, and fluent local coherence is exactly what well-constructed bullshit has in abundance. This isn't a minor variant of "trust the terminal over the AI" — it's a case where no terminal exists, and something has to stand in for one.

The practical fix is to manufacture a terminal-equivalent for exactly this kind of claim: a mechanical, order-of-operations procedure that has to be followed before forming an impression from the prose. Concretely, for "did this answer contradict itself": extract every verdict the text states, in the order it states them, as a flat list, before reading anything else as argument. Then diff the list against itself. This is a deliberately unnatural reading order — nobody reads a paragraph by extracting its conclusions first and its reasoning second — and that's the entire point. It interrupts the accumulation of vague positive impression that lets a confident closing paragraph silently overwrite memory of a contradicted opening one. Where the original method says "go check the terminal before trusting either AI's claim," the addition is: where no terminal exists, build the smallest possible mechanical substitute, and force yourself (or the arbiter) through it before forming any narrative-level judgment at all.

A smaller note in the same spirit: holding an AI's recurring quirks as "a working hypothesis, not a settled fact about a stable mind" is the right epistemic stance, but it's worth not swinging so far toward caution that the hypothesis stops doing any work. A pattern seen reliably across several independent instances is a genuinely useful prior for the next instance — it just has to stay cheaply revisable, not be re-derived from zero every single time as if no prior session ever happened. The right calibration is strong-prior-cheaply-revised, not no-prior-each-time; the latter throws away real signal in the name of humility that the former doesn't actually require.

version 2.0

Future, more future, post-future

Thursday, June 25, 2026

Human at partial competence, two AIs of different reliability, one used as arbiter against the other ...

The structural setup, named honestly

How the scheme worked well here, specifically

Where it can go wrong, and the specific traps to watch for

The actual transferable method, stated plainly

Update #1 — a sharper edge on the method

No comments:

Post a Comment

Human at partial competence, two AIs of different reliability, one used as arbiter against the other ...

Older stuff

Search for diamonds