The Blind Spots of AI Tutors
AI coding assistants have become popular study buddies for learning algorithms. They explain concepts, generate code, and answer follow-up questions on demand. But how reliable are they when tackling problems that require careful reasoning?
This post collects examples where AI models failed as teachers, either by producing buggy code or by incorrectly critiquing correct code.
Case 1: The Sideway Tower of Hanoi
In the classic Tower of Hanoi, you move disks between any two pegs. The sideway variant adds a constraint: disks can only move between adjacent pegs. With pegs arranged as A-B-C, you can move A↔B or B↔C, but never directly A↔C.
This seemingly small change has significant implications. The minimum number of moves jumps from 2^n - 1 in the classic puzzle to 3^n - 1: for n = 3, that is 26 moves instead of 7.
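The 3^n - 1 count follows from the recurrence for the full A-to-C transfer: move n-1 disks A→C, the largest disk A→B, n-1 disks back C→A, the largest disk B→C, then n-1 disks A→C again. A quick sanity check (my own sketch, not from the sessions):

```python
def min_moves(n):
    """Minimum moves to carry n disks across the full A -> C span.

    Each transfer of n disks costs three transfers of n-1 disks plus
    two moves of the largest disk: f(n) = 3*f(n-1) + 2, f(0) = 0.
    """
    return 0 if n == 0 else 3 * min_moves(n - 1) + 2
```

Unrolling the recurrence gives f(n) = 3^n - 1, so f(3) = 26.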
Where Things Went Wrong
Claude Opus 4.5: The Constraint Violation
Opus 4.5 quickly produced a solution that looked reasonable:
def hanoi_sideway(n, source, target, pegs=['A', 'B', 'C']):
    if n == 0:
        return
    source_idx = pegs.index(source)
    target_idx = pegs.index(target)
    if abs(source_idx - target_idx) == 1:
        # Adjacent case
        hanoi_sideway(n - 1, source, target, pegs)
        print(f"Move disk {n} from {source} to {target}")
        hanoi_sideway(n - 1, target, source, pegs)
        hanoi_sideway(n - 1, source, target, pegs)
    else:
        # Non-adjacent case
        middle = pegs[1]
        hanoi_sideway(n, source, middle, pegs)
        hanoi_sideway(n, middle, target, pegs)
The code ran and produced 26 moves for n=3, matching the expected count. The model even confirmed: “All moves are between adjacent pegs only (A↔B or B↔C). No direct A↔C moves occur.”
But there is a subtle bug in the adjacent case. Consider moving disks from A to B:
1. hanoi_sideway(n - 1, source, target, pegs) moves the n-1 smaller disks to B
2. print(f"Move disk {n}...") attempts to move the largest disk from A to B
The problem? After step 1, the smaller disks are already on peg B. Placing the largest disk on top of them violates the fundamental rule: a larger disk cannot sit on a smaller one.
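This is exactly the kind of bug a small state simulator catches. Here is a sketch of a move validator (my own helper, not from the session) that replays (disk, source, target) tuples and checks both rules:

```python
def validate_moves(n, moves):
    """Replay a move list against simulated peg states.

    moves: list of (disk, source, target) tuples, disks numbered 1..n
    by size. Returns a description of the first violation, or None if
    every move is legal.
    """
    pegs = {'A': list(range(n, 0, -1)), 'B': [], 'C': []}  # bottom-to-top
    order = 'ABC'
    for i, (disk, src, dst) in enumerate(moves):
        if abs(order.index(src) - order.index(dst)) != 1:
            return f"move {i}: {src}->{dst} is not between adjacent pegs"
        if not pegs[src] or pegs[src][-1] != disk:
            return f"move {i}: disk {disk} is not on top of peg {src}"
        if pegs[dst] and pegs[dst][-1] < disk:
            return f"move {i}: disk {disk} placed on smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    return None
```

Fed the trace of the buggy adjacent case, it flags the moment the largest disk lands on the smaller ones, even though the move count looks right.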
When I spotted this and asked “when we move n-1 pegs from source to target, then the last one from source (the biggest) cannot put to target since it violates the constraint,” Opus 4.5 immediately acknowledged the mistake and provided a corrected version.
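For reference, a correct recursion might look like the following sketch (my reconstruction, not Opus 4.5's actual fix). The key change: in the adjacent case, the n-1 smaller disks must first be parked on the peg that is neither source nor target.

```python
def hanoi_sideway_fixed(n, source, target, moves, pegs=('A', 'B', 'C')):
    """Sideway Hanoi; appends (disk, source, target) tuples to moves."""
    if n == 0:
        return
    s, t = pegs.index(source), pegs.index(target)
    middle = pegs[1]
    if abs(s - t) == 1:
        # Adjacent case: park n-1 disks on the third peg, NOT the target.
        other = pegs[3 - s - t]  # the peg that is neither source nor target
        hanoi_sideway_fixed(n - 1, source, other, moves, pegs)
        moves.append((n, source, target))
        hanoi_sideway_fixed(n - 1, other, target, moves, pegs)
    else:
        # Non-adjacent case: shuttle the largest disk through the middle.
        hanoi_sideway_fixed(n - 1, source, target, moves, pegs)
        moves.append((n, source, middle))
        hanoi_sideway_fixed(n - 1, target, source, moves, pegs)
        moves.append((n, middle, target))
        hanoi_sideway_fixed(n - 1, source, target, moves, pegs)
```

For n = 3 this produces the expected 26 moves, all between adjacent pegs, with no larger disk ever resting on a smaller one.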
GPT 5.2: The False Negative
GPT 5.2 took a more cautious approach initially, asking clarifying questions before writing code. I chose to brainstorm the solution myself, and after several iterations, arrived at a working algorithm.
When I presented the code for verification, GPT 5.2 confidently declared it incorrect:
“No. The non-adjacent branch is incorrect and will recurse forever.”
The model provided an elaborate analysis, walking through a step-by-step trace for n=2 and concluding:
“The real problem shows up at n=3: your recurrence for the non-adjacent case is not the correct one and it will not produce the required 3^n - 1 move count; it ‘composes’ A->B and B->C moves of size n, which doesn’t match the necessary structure.”
GPT 5.2 even claimed the code would produce 24 moves instead of 26.
When I pushed back with “no, it’s actually 26,” the model reversed course. It recalculated the recurrence relations and admitted the code was valid after all, noting that the formulation was “not the standard recurrence, but it can still be correct.”
Case 2: The BST In-Order Successor
In another session, I asked an AI to verify my in-order successor algorithm for binary search trees:
Node* succ(Node* root, Node* target) {
    Node* res = nullptr;
    while (root != nullptr) {
        if (root->key > target->key) {
            res = root;
            root = root->left;
        } else {
            root = root->right;
        }
    }
    return res;
}
The AI confidently identified two “flaws”:
- “The algorithm doesn’t reach the target node”
- “It does not handle the case where the target has a right subtree”
Both claims sound reasonable at first glance. But they’re wrong.
I asked for an example where the code fails. The AI tried several BSTs:
          20
        /    \
      10      30
     /  \
    5    15
        /  \
      12    18
Finding successor of 10? Starting from root 20:
- 20 > 10: save res = 20, go left
- 10 <= 10: go right to 15
- 15 > 10: save res = 15, go left to 12
- 12 > 10: save res = 12, go left (null)
- Return 12. Correct.
      20
     /  \
   10    30
     \
      15
        \
         18
Finding successor of 15? Starting from root 20:
- 20 > 15: save res = 20, go left
- 10 <= 15: go right to 15
- 15 <= 15: go right to 18
- 18 > 15: save res = 18, go left (null)
- Return 18. Correct.
Every case worked. The AI couldn’t produce a single counterexample for its own critique.
The problem: the AI analyzed what it thought the algorithm did rather than what it actually does. The algorithm doesn’t need to “reach” the target node. It finds the smallest key greater than the target by systematically narrowing the search space.
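One way to settle such a dispute is to test rather than argue. Below, the loop is re-expressed in Python (my re-encoding, with nodes as (key, left, right) tuples rather than the original C++ structs) so it can be brute-force checked against the sorted key order:

```python
def succ_key(root, target_key):
    """The successor loop above, re-expressed over (key, left, right) tuples.

    Returns the smallest key strictly greater than target_key, or None.
    """
    res = None
    node = root
    while node is not None:
        key, left, right = node
        if key > target_key:
            res = key        # candidate successor; look left for a smaller one
            node = left
        else:
            node = right     # key <= target, so the successor lies to the right
    return res

# The second example tree from the trace above.
tree = (20, (10, None, (15, None, (18, None, None))), (30, None, None))
```

Comparing succ_key against "the next key in sorted order" for every key in the tree (and a few keys not in it) confirms the loop is correct, counterexample-free, without ever needing to "reach" the target node.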
The AI admitted its mistake: “The code is actually correct for its intended purpose… My original critique was flawed.”
The Subtle Danger
These errors share a troubling characteristic: they were plausible enough to fool someone learning the material.
Consider what would have happened if I had not pushed back:
With Opus 4.5, I might have accepted the buggy code as correct. The output looked right: 26 moves, all between adjacent pegs. Without manually tracing through the logic or simulating the peg states, the constraint violation is invisible. A student implementing this solution would produce invalid move sequences while believing they understood the algorithm.
With GPT 5.2, I might have abandoned a correct solution. The model’s confident tone and detailed (but flawed) analysis could convince a learner that their working code was broken. Worse, GPT 5.2 suggested the code would produce 24 moves instead of 26, implying it found a “better” solution than the mathematical optimum. This should have been a red flag, but how many students would catch it?
With the BST successor, the AI declared correct code to be flawed without testing the hypothesis. It pattern-matched on the algorithm’s structure, made plausible-sounding criticisms, and only backtracked when forced to produce a counterexample.
The fact that both Hanoi models produced the correct move count (26 for n=3) made verification harder. A simple “does it output the right number?” check would pass. Only by understanding the problem deeply, or by simulating the actual peg states move by move, could you catch these errors.
Takeaways
AI coding assistants are powerful tools for learning, but these examples reveal their limitations:
- Correct output does not mean correct logic. Opus 4.5's buggy code produced the right move count while violating fundamental constraints. Always trace through the logic, not just the results.
- Confidence is not correctness. GPT 5.2 delivered a detailed, authoritative analysis that was simply wrong. The BST reviewer declared code broken without testing the claim. A less experienced learner might have trusted the tone over their own working code.
- Ask for counterexamples. When an AI claims code is wrong, ask it to produce a failing case. If it can't, the critique may be unfounded.
- Verify independently. Run the code. Test edge cases. Simulate the state changes. Cross-reference with textbooks or other sources. Do not rely on a single AI's explanation.
- Domain knowledge matters. I caught these errors because I understood the problems well enough to question the answers. Without that foundation, the mistakes would have gone unnoticed.
AI assistants make excellent study companions, but they are not infallible teachers. Trust, but verify.