The Blind Spots of AI Tutors
AI coding assistants have become popular study buddies for learning algorithms. They explain concepts, generate code, and answer follow-up questions on demand. But how reliable are they when tackling problems that require careful reasoning?
This post collects examples where AI models failed as teachers, either by producing buggy code or by incorrectly critiquing correct code.
Case 1: The Sideway Tower of Hanoi
In the classic Tower of Hanoi, you move disks between any two pegs. The sideway variant adds a constraint: disks can only move between adjacent pegs. With pegs arranged as A-B-C, you can move A↔B or B↔C, but never directly A↔C.
This seemingly small change has significant implications. The minimum number of moves jumps from 2^n - 1 in the classic puzzle to 3^n - 1: for n = 3, that is 26 moves instead of 7.
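The 3^n - 1 count follows from the recurrence for the full A-to-C transfer: move n-1 disks A→C, the largest disk A→B, n-1 disks back C→A, the largest disk B→C, then n-1 disks A→C again. A quick sanity check (my own sketch, not from the sessions):

```python
def min_moves(n):
    """Minimum moves to carry n disks across the full A -> C span.

    Each transfer of n disks costs three transfers of n-1 disks plus
    two moves of the largest disk: f(n) = 3*f(n-1) + 2, f(0) = 0.
    """
    return 0 if n == 0 else 3 * min_moves(n - 1) + 2
```

Unrolling the recurrence gives f(n) = 3^n - 1, so f(3) = 26.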
Where Things Went Wrong
Claude Opus 4.5: The Constraint Violation
Opus 4.5 quickly produced a solution that looked reasonable:
def hanoi_sideway(n, source, target, pegs=['A', 'B', 'C']):
    if n == 0:
        return
    source_idx = pegs.index(source)
    target_idx = pegs.index(target)
    if abs(source_idx - target_idx) == 1:
        # Adjacent case
        hanoi_sideway(n - 1, source, target, pegs)
        print(f"Move disk {n} from {source} to {target}")
        hanoi_sideway(n - 1, target, source, pegs)
        hanoi_sideway(n - 1, source, target, pegs)
    else:
        # Non-adjacent case
        middle = pegs[1]
        hanoi_sideway(n, source, middle, pegs)
        hanoi_sideway(n, middle, target, pegs)
The code ran and produced 26 moves for n=3, matching the expected count. The model even confirmed: “All moves are between adjacent pegs only (A↔B or B↔C). No direct A↔C moves occur.”
But there is a subtle bug in the adjacent case. Consider moving disks from A to B:
1. hanoi_sideway(n - 1, source, target, pegs) moves the n-1 smaller disks to B
2. print(f"Move disk {n}...") attempts to move the largest disk from A to B
The problem? After step 1, the smaller disks are already on peg B. Placing the largest disk on top of them violates the fundamental rule: a larger disk cannot sit on a smaller one.
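This is exactly the kind of bug a small state simulator catches. Here is a sketch of a move validator (my own helper, not from the session) that replays (disk, source, target) tuples and checks both rules:

```python
def validate_moves(n, moves):
    """Replay a move list against simulated peg states.

    moves: list of (disk, source, target) tuples, disks numbered 1..n
    by size. Returns a description of the first violation, or None if
    every move is legal.
    """
    pegs = {'A': list(range(n, 0, -1)), 'B': [], 'C': []}  # bottom-to-top
    order = 'ABC'
    for i, (disk, src, dst) in enumerate(moves):
        if abs(order.index(src) - order.index(dst)) != 1:
            return f"move {i}: {src}->{dst} is not between adjacent pegs"
        if not pegs[src] or pegs[src][-1] != disk:
            return f"move {i}: disk {disk} is not on top of peg {src}"
        if pegs[dst] and pegs[dst][-1] < disk:
            return f"move {i}: disk {disk} placed on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    return None
```

Fed the trace of the buggy adjacent case, it flags the moment the largest disk lands on the smaller ones, even though the move count looks right.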
When I spotted this and asked “when we move n-1 pegs from source to target, then the last one from source (the biggest) cannot put to target since it violates the constraint,” Opus 4.5 immediately acknowledged the mistake and provided a corrected version.
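For reference, a correct recursion might look like the following sketch (my reconstruction, not Opus 4.5's actual fix). The key change: in the adjacent case, the n-1 smaller disks must first be parked on the peg that is neither source nor target.

```python
def hanoi_sideway_fixed(n, source, target, moves, pegs=('A', 'B', 'C')):
    """Sideway Hanoi; appends (disk, source, target) tuples to moves."""
    if n == 0:
        return
    s, t = pegs.index(source), pegs.index(target)
    middle = pegs[1]
    if abs(s - t) == 1:
        # Adjacent case: park n-1 disks on the third peg, NOT the target.
        other = pegs[3 - s - t]  # the peg that is neither source nor target
        hanoi_sideway_fixed(n - 1, source, other, moves, pegs)
        moves.append((n, source, target))
        hanoi_sideway_fixed(n - 1, other, target, moves, pegs)
    else:
        # Non-adjacent case: shuttle the largest disk through the middle.
        hanoi_sideway_fixed(n - 1, source, target, moves, pegs)
        moves.append((n, source, middle))
        hanoi_sideway_fixed(n - 1, target, source, moves, pegs)
        moves.append((n, middle, target))
        hanoi_sideway_fixed(n - 1, source, target, moves, pegs)
```

For n = 3 this produces the expected 26 moves, all between adjacent pegs, with no larger disk ever resting on a smaller one.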
GPT 5.2: The False Negative
GPT 5.2 took a more cautious approach initially, asking clarifying questions before writing code. I chose to brainstorm the solution myself, and after several iterations, arrived at a working algorithm.
When I presented the code for verification, GPT 5.2 confidently declared it incorrect:
“No. The non-adjacent branch is incorrect and will recurse forever.”
The model provided an elaborate analysis, walking through a step-by-step trace for n=2 and concluding:
“The real problem shows up at n=3: your recurrence for the non-adjacent case is not the correct one and it will not produce the required 3^n - 1 move count; it ‘composes’ A->B and B->C moves of size n, which doesn’t match the necessary structure.”
GPT 5.2 even claimed the code would produce 24 moves instead of 26.
When I pushed back with “no, it’s actually 26,” the model reversed course. It recalculated the recurrence relations and admitted the code was valid after all, noting that the formulation was “not the standard recurrence, but it can still be correct.”
Case 2: The BST In-Order Successor
In another session, I asked an AI to verify my in-order successor algorithm for binary search trees:
Node* succ(Node* root, Node* target) {
    Node* res = nullptr;
    while (root != nullptr) {
        if (root->key > target->key) {
            res = root;
            root = root->left;
        } else {
            root = root->right;
        }
    }
    return res;
}
The AI confidently identified two “flaws”:
- “The algorithm doesn’t reach the target node”
- “It does not handle the case where the target has a right subtree”
Both claims sound reasonable at first glance. But they’re wrong.
I asked for an example where the code fails. The AI tried several BSTs:
          20
        /    \
      10      30
     /  \
    5    15
        /  \
      12    18
Finding successor of 10? Starting from root 20:
- 20 > 10: save res = 20, go left
- 10 <= 10: go right to 15
- 15 > 10: save res = 15, go left to 12
- 12 > 10: save res = 12, go left (null)
- Return 12. Correct.
      20
     /  \
   10    30
     \
      15
        \
         18
Finding successor of 15? Starting from root 20:
- 20 > 15: save res = 20, go left
- 10 <= 15: go right to 15
- 15 <= 15: go right to 18
- 18 > 15: save res = 18, go left (null)
- Return 18. Correct.
Every case worked. The AI couldn’t produce a single counterexample for its own critique.
The problem: the AI analyzed what it thought the algorithm did rather than what it actually does. The algorithm doesn’t need to “reach” the target node. It finds the smallest key greater than the target by systematically narrowing the search space.
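One way to settle such a dispute is to test rather than argue. Below, the loop is re-expressed in Python (my re-encoding, with nodes as (key, left, right) tuples rather than the original C++ structs) so it can be brute-force checked against the sorted key order:

```python
def succ_key(root, target_key):
    """The successor loop above, re-expressed over (key, left, right) tuples.

    Returns the smallest key strictly greater than target_key, or None.
    """
    res = None
    node = root
    while node is not None:
        key, left, right = node
        if key > target_key:
            res = key        # candidate successor; look left for a smaller one
            node = left
        else:
            node = right     # key <= target, so the successor lies to the right
    return res

# The second example tree from the trace above.
tree = (20, (10, None, (15, None, (18, None, None))), (30, None, None))
```

Comparing succ_key against "the next key in sorted order" for every key in the tree (and a few keys not in it) confirms the loop is correct, counterexample-free, without ever needing to "reach" the target node.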
The AI admitted its mistake: “The code is actually correct for its intended purpose… My original critique was flawed.”
The Subtle Danger
These errors share a troubling characteristic: they were plausible enough to fool someone learning the material.
Consider what would have happened if I had not pushed back:
With Opus 4.5, I might have accepted the buggy code as correct. The output looked right: 26 moves, all between adjacent pegs. Without manually tracing through the logic or simulating the peg states, the constraint violation is invisible. A student implementing this solution would produce invalid move sequences while believing they understood the algorithm.
With GPT 5.2, I might have abandoned a correct solution. The model’s confident tone and detailed (but flawed) analysis could convince a learner that their working code was broken. Worse, GPT 5.2 suggested the code would produce 24 moves instead of 26, implying it found a “better” solution than the mathematical optimum. This should have been a red flag, but how many students would catch it?
With the BST successor, the AI declared correct code to be flawed without testing the hypothesis. It pattern-matched on the algorithm’s structure, made plausible-sounding criticisms, and only backtracked when forced to produce a counterexample.
The fact that both Hanoi models produced the correct move count (26 for n=3) made verification harder. A simple “does it output the right number?” check would pass. Only by understanding the problem deeply, or by simulating the actual peg states move by move, could you catch these errors.
Takeaways
AI coding assistants are powerful tools for learning, but these examples reveal their limitations:
- Correct output does not mean correct logic. Opus 4.5's buggy code produced the right move count while violating fundamental constraints. Always trace through the logic, not just the results.
- Confidence is not correctness. GPT 5.2 delivered a detailed, authoritative analysis that was simply wrong. The BST reviewer declared code broken without testing the claim. A less experienced learner might have trusted the tone over their own working code.
- Ask for counterexamples. When an AI claims code is wrong, ask it to produce a failing case. If it can't, the critique may be unfounded.
- Verify independently. Run the code. Test edge cases. Simulate the state changes. Cross-reference with textbooks or other sources. Do not rely on a single AI's explanation.
- Domain knowledge matters. I caught these errors because I understood the problems well enough to question the answers. Without that foundation, the mistakes would have gone unnoticed.
AI assistants make excellent study companions, but they are not infallible teachers. Trust, but verify.