Working with AI Coding Agents

I’ve been spending a lot of time with AI coding agents lately, and I have thoughts.

My weapon of choice is OpenCode, a terminal-based interface that lets me talk to Claude Opus 4.5 without leaving the comfort of my terminal. I’m fond of TUI apps in general; they’re fast, responsive, vim-friendly, and refreshingly free from the visual clutter that plagues modern GUI applications. There’s something satisfying about a tool that does exactly what you ask, renders in milliseconds, and doesn’t need 2GB of RAM to display a text input field.[1]

Over months of daily use, my workflow has crystallized into a rhythm: Plan, then Build, then Interrupt the moment it goes off the rails. The last part is crucial. AI agents have a tendency to keep building confidently in the wrong direction, and if you let them run unsupervised, you’ll come back to a codebase that vaguely resembles what you asked for but solves an entirely different problem. So I commit frequently, undo liberally, and take incremental steps. And I review everything at least twice, because these models produce subtle bugs that slip past a casual glance, the kind that work fine in the happy path but explode the moment real data touches them.

This post is a field report: what works, what breaks, and the mental model that has helped me extract actual value from these tools without losing my sanity.

Where AI Agents Shine

I want to start with the wins, because there are genuine wins here.

For all the frustrations I’ll get into later, AI agents have fundamentally changed how I approach certain categories of work. The tasks that used to feel like a tax on my time, the ones I’d procrastinate on for days because they were tedious rather than hard, now get delegated without a second thought. And the agents handle them reasonably well.

Configuration management is perhaps the clearest example. Need to install a handful of Neovim plugins, reorganize your keymaps into something more sensible, or clean up config files that have accumulated cruft over the years? Hand it over. The agent will navigate the documentation, suggest sensible defaults, and handle all the YAML/JSON/Lua shuffling that nobody actually enjoys doing. I used to spend entire evenings getting my dotfiles just right; now I describe what I want and review what comes back.

Boilerplate and scaffolding is another sweet spot. When I need to spin up a proof-of-concept or get the skeleton of a new feature in place, AI agents work remarkably fast. They’re not going to produce production-quality code on the first pass, but that’s not the point. The point is to get something running that I can iterate on, and for that purpose, they deliver.

Proofreading has been surprisingly valuable. I’m not a native English speaker, and I want to write faster without sacrificing clarity. The agent catches typos, suggests better wording, and generally speeds up my writing process. It’s not perfect; left to its own devices, it produces documentation so verbose it feels like an insult to the reader’s intelligence. But with the right prompting and aggressive editing, it’s a genuine productivity boost.

And for what I’d call low-hanging refactors (renaming variables across a codebase, extracting small functions, moving code between files), AI agents are remarkably reliable. These mechanical transformations are tedious for humans but trivial for machines, exactly the kind of work we should be offloading.

As a general-purpose assistant for managing notes, converting between formats, and fixing grammar, the current models rarely disappoint. This might actually be their highest and best use: not as a replacement for thinking, but as a tireless helper for the tasks that don’t require much thinking in the first place.

Even code review tools like GitHub Copilot, despite their high false-positive rate and general noise, occasionally catch issues I would have missed. It’s not reliable enough to replace careful review, but as a supplementary set of eyes, it has earned its keep.

Where AI Agents Break Down

Now for the frustrations, and there are plenty.

The problems tend to emerge the moment you step outside the well-trodden paths. Configuration management works great until the configuration becomes complicated. I wanted to turn off line breaks, and only line breaks, in dfmt.[2] Simple enough request, right? I pointed Opus 4.5 at the documentation, explained exactly what I needed, and watched it hallucinate configuration options that don’t exist. It wasn’t even close. The model confidently produced settings that looked plausible but had no basis in reality, and it kept doing so no matter how many times I redirected it to the actual docs.

This pattern repeats across domains. The agent is great until it isn’t, and the transition happens without warning.

Sometimes the failures are subtle. Other times they are catastrophic. I asked Claude to fix a Docker permission issue, a straightforward problem with several reasonable solutions. It chose to run rm -rf data/ on a folder containing terabytes of indexed vectors. When I asked why, it tried to convince me nothing was lost because the folder was in .gitignore. “Gitignored equals unimportant,” apparently. Only after I pushed back did it acknowledge that yes, it had just deleted the most important folder in the codebase with no way to recover it. AI agents will take the shortest path to “success,” even if that path involves deleting your data. And when they make mistakes, they’ll construct narratives to minimize them.

Even straightforward codebases aren’t safe. I tested an agent on a standard Golang backend with conventional project structure. It decided to remove internal dependencies entirely.

GitHub Copilot attempting to remove internal dependencies

The dependencies were clearly in use. The build broke immediately. Yet the agent needed explicit correction to stop.

Bug fixing is a particular weakness. I ran into an Arrow error involving nested structs in Parquet files: ArrowNotImplementedError: Cannot write struct type 'force_update' with no child field to Parquet. The actual issue was a nested struct inside a field, but the agent kept insisting the problem was NaN values preventing type inference. It wasn’t. I spent hours going back and forth, with the model confidently explaining a problem that didn’t exist while ignoring the problem that did. Eventually I fixed it myself, the old-fashioned way.
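In hindsight, the failing shape is easy to detect before the write ever reaches Arrow: a struct field that ends up with no child fields. The helper below is a hypothetical sketch of that pre-flight check, assuming rows arrive as plain nested dicts; it’s the diagnosis the agent never made, not code from the actual pipeline:

```python
def find_empty_structs(record, path=""):
    """Return dotted paths of dict-valued fields with no children --
    the shape behind "Cannot write struct type ... with no child
    field to Parquet"."""
    offenders = []
    for key, value in record.items():
        field_path = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            if not value:
                offenders.append(field_path)
            else:
                offenders.extend(find_empty_structs(value, field_path))
    return offenders
```

Running something like this over a sample of rows pinpoints the offending field in seconds, which is rather less than the hours of back-and-forth it replaced.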

Working with growing codebases reveals another limitation. When I set up a new training pipeline, the initial code generation was fine. But as the codebase grew, Opus 4.5 started to deteriorate. Simple function calls got hallucinated. Operations as basic as renaming a function required multiple API calls to fix, because the model kept introducing new errors while fixing old ones. It’s like watching someone dig themselves into a hole, except you’re paying per token for the shovel.

The data science workflow exposed an assumption problem that I find genuinely concerning. I gave the agent a multilabel classification dataset. Without asking about the nature of the data, it assumed single-label classification and proceeded to implement the wrong model entirely. What happens when the human doesn’t know enough to catch that mistake? The model ships, the metrics look reasonable, and nobody realizes the fundamental approach was wrong from the start.
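The check the agent skipped takes one line. A hedged sketch, assuming each sample’s labels arrive as a list (the helper name is mine, purely for illustration):

```python
def looks_multilabel(labels_per_sample):
    """True if any sample carries more than one distinct label,
    i.e. the task cannot be single-label classification."""
    return any(len(set(labels)) > 1 for labels in labels_per_sample)
```

A glance at the targets before picking a model head would have surfaced the wrong assumption immediately.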

There’s also a code quality issue that’s hard to pin down but impossible to ignore. The models seem to have learned heavily from Kaggle notebooks, so when you ask for a training pipeline, you get unstructured, messy code that feels like it was written for a weekend competition rather than a production system. Duplicate logic everywhere. I constantly have to force it to use existing methods instead of reimplementing the same functionality three different ways.

And perhaps most insidiously: these tools make you lazier. I’ve started delegating all the boring data manipulation tasks to agents, which is fine until I realize I don’t understand my own codebase anymore. Sometimes I force myself to do refactors manually, not because the agent can’t handle them, but because I need to maintain a mental model of what’s actually happening. The moment you lose that, you’re debugging code you don’t understand with tools that confidently explain things that aren’t true.

Journey 1: Animating the Tower of Hanoi

I wanted to create an educational video about the beauty of algorithms and data structures. Inspired by Concrete Mathematics (Graham et al. 1989) and Inquiry-Based Enumerative Combinatorics (Petersen 2019), and paying homage to my capital city, I decided to animate the Tower of Hanoi. This isn’t exactly unexplored territory; 3Blue1Brown has a famous video on the topic:

But I wanted to introduce something most people aren’t aware of: Hanoi Graphs, the elegant structure that emerges when you encode every possible state of the puzzle as vertices in a graph. The result is a Sierpinski triangle, and it’s beautiful. I wanted to show that.

So I gave Claude Opus 4.5 two tasks:

  1. Use manim to visualize the Tower of Hanoi with N=4 disks.
  2. Encode the state space of the problem and visualize it as a Sierpinski graph.
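For context on the second task: the encoding itself is compact. Each state assigns a peg to each disk, and two states are adjacent exactly when one legal move separates them. A minimal sketch of that encoding, written by hand to illustrate the structure (not the agent’s output):

```python
from itertools import product

def hanoi_graph(n):
    """Hanoi graph for n disks on 3 pegs.

    A vertex is a tuple s where s[i] is the peg (0-2) holding disk i,
    with disk 0 the smallest. Two vertices share an edge when one legal
    move (the smallest disk on a peg moves onto a larger disk or an
    empty peg) transforms one into the other."""
    vertices = list(product(range(3), repeat=n))
    edges = set()
    for s in vertices:
        for src in range(3):
            on_src = [d for d in range(n) if s[d] == src]
            if not on_src:
                continue
            disk = min(on_src)  # only the top (smallest) disk can move
            for dst in range(3):
                # legal iff every disk already on dst is larger
                if dst != src and all(d > disk for d in range(n) if s[d] == dst):
                    t = list(s)
                    t[disk] = dst
                    edges.add(frozenset((s, tuple(t))))
    return vertices, edges
```

For n disks this yields 3^n vertices and 3(3^n - 1)/2 edges, and laying the vertices out by which peg holds the largest disk, then recursing, is exactly what produces the Sierpinski triangle.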

The first task started promisingly and then failed in a way that was almost comedic. The initial render had the disks upside down:

After a few rounds of tweaking, it produced something usable. But here’s the thing: it looked exactly like every other Tower of Hanoi animation on YouTube. Which shouldn’t be surprising, given that manim is the dominant framework for math videos and everyone uses the same visual language. The result was correct but unremarkable:

The second task was a disaster. I wanted the Hanoi Graph, the Sierpinski structure that makes this problem genuinely interesting. What I got was a mess, and no amount of prompting could fix it:

I tried everything. I explained the mathematical structure. I provided references. I broke the problem into smaller pieces. The model kept producing graphs that were structurally wrong in ways that suggested it had no understanding of what it was trying to build. It wasn’t making small errors; it was failing to grasp the fundamental relationship between the puzzle states and their graph representation.

This is the pattern I keep running into: AI agents can reproduce what they’ve seen before, but they struggle to construct something that requires genuine understanding of the underlying structure. A Tower of Hanoi animation exists in countless tutorials and YouTube videos; the training data is rich with examples. A Hanoi Graph visualization is rarer, more mathematical, and apparently beyond what the model can synthesize from first principles.

Journey 2: Leetcode Partner

I don’t have access to the performance metrics that research labs publish about their models’ coding abilities. What I have is my own experience, and my experience says this: AI agents are surprisingly bad at algorithmic puzzles.

This caught me off guard. These are the kinds of problems that feel like they should be in the model’s wheelhouse: well-defined inputs and outputs, thousands of examples in the training data, clear success criteria. Leetcode solutions are all over the internet. How hard could it be?

Hard enough, apparently. I’ve documented several failures in a separate post, and the pattern is troubling. Opus 4.5 produced a Sideway Tower of Hanoi solution that output the correct move count (26 for n=3) while violating the fundamental constraint that larger disks cannot sit on smaller ones. The bug was invisible unless you traced through the actual peg states. GPT 5.2 went the other direction: it confidently declared my correct solution incorrect, complete with an elaborate but flawed analysis of why the code would “recurse forever.” It even claimed my algorithm would produce 24 moves instead of 26, implying it found something better than the mathematical optimum. Let that sink in.
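Both failure modes are cheap to guard against: replay the emitted moves and check the invariant on actual peg states. A sketch under the usual adjacent-peg rules (moves allowed only between pegs 0-1 and 1-2, which is what gives the 3^n - 1 = 26 optimum for n=3); the function names are mine, not from the problem statement:

```python
def adjacent_hanoi(n, src=0, dst=2, moves=None):
    """Optimal solution when disks may only move between adjacent
    pegs: shuttle the n-1 smaller disks across, step the largest
    disk one peg, and repeat. Takes 3**n - 1 moves in total."""
    if moves is None:
        moves = []
    if n > 0:
        adjacent_hanoi(n - 1, src, dst, moves)
        moves.append((src, 1))   # largest disk: src -> middle peg
        adjacent_hanoi(n - 1, dst, src, moves)
        moves.append((1, dst))   # largest disk: middle peg -> dst
        adjacent_hanoi(n - 1, src, dst, moves)
    return moves

def replay(moves, n):
    """Replay a move list, raising if a larger disk lands on a
    smaller one -- the trace that exposes the invisible bug."""
    pegs = [list(range(n, 0, -1)), [], []]  # disks stored bottom-to-top
    for a, b in moves:
        disk = pegs[a].pop()
        if pegs[b] and pegs[b][-1] < disk:
            raise ValueError(f"disk {disk} placed on smaller disk {pegs[b][-1]}")
        pegs[b].append(disk)
    return pegs
```

Replaying adjacent_hanoi(3) finishes with all three disks on the last peg after 26 moves; a solution like the one Opus produced would raise on its first illegal placement.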

What makes this frustrating is the confidence. Neither model hedged. Neither said “I’m not sure about this” or “you might want to verify.” They presented wrong answers with the same authoritative tone they use for everything else, which means you can’t calibrate your trust based on how the answer sounds. You have to verify everything yourself, which defeats half the purpose of having an assistant in the first place.

Verdict

So does working with AI agents actually improve productivity? For software engineers, ML engineers, anyone who writes code for a living?

Yes. With caveats.

If you want a quick proof-of-concept, some experimental models to test a hypothesis, and you don’t give a damn about code quality, AI agents work wonders. They’ll get you to “something running” faster than you could get there yourself, and sometimes that’s exactly what you need. Not every piece of code deserves careful architecture. Not every script needs to be maintainable. For throwaway work, these tools are a genuine force multiplier.

For low-hanging refactors, generating documentation, moving code around, and renaming things across a codebase, AI agents are reliable enough that I don’t think twice before delegating. The tedious mechanical work that used to eat up afternoons now takes minutes. That’s real value.

But for anything that requires genuine understanding of a codebase, anything that involves large-scale redesign or careful reasoning about edge cases, the experience is closer to supervising a new intern than collaborating with a senior engineer. The intern is enthusiastic and fast. The intern will confidently propose solutions that completely miss the point of your existing architecture. The intern will introduce subtle bugs while fixing obvious ones. And the intern will never tell you when they’re out of their depth, because they don’t know they’re out of their depth.

The mental model that works for me: AI agents are tools for amplifying your own understanding, not replacing it. They’re most valuable when you already know what good looks like and can recognize when the output falls short. They’re most dangerous when you’re learning something new and can’t distinguish confident nonsense from correct explanations.

This clip from ThePrimeTimeagen captures the experience perfectly:

https://www.youtube.com/shorts/kYb1TEYZXjg

Use them. They’re use­ful. Just don’t trust them.


  1. OpenCode still has some annoying input issues, but nothing deal-breaking. ↩︎

  2. A D language formatter. ↩︎

References

  • Graham, Ronald L., Donald E. Knuth, and Oren Patashnik. 1989. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley.
  • Petersen, T. Kyle. 2019. Inquiry-Based Enumerative Combinatorics. Springer.