Gemini Pro Can't Patch
In which we explore various patch formats and the struggles of getting LLMs to put characters in the exact right spot.
I recently explored porting a small C project to Rust. During that project, I noticed that gemini-2.5-pro was having a surprisingly hard time generating correctly formed patches. If I were using a cheap model, this would be expected: patching is quite precise, if you think about it. It requires your output to be near-perfect (you need to both generate correct code and reproduce the parts you're replacing exactly). But for a model with Pro right in the name: well, I expected a bit better. I wanted to use Gemini Pro because I've found it does well for one-shot script generation and it hits the best sweet spot of "cheap and good", and if I'm anything, it's cheap...
But it turns out that generating patches, especially generating multiple in a row, is really hard, because of the way patches stack on one another. Just like you or I would struggle to imagine exactly what a file looks like after multiple patches that touch the same location, so do LLMs. But before we can get there, we have to get the LLM to generate patches that are at least syntactically correct. Let's see how I managed to learn the obvious in the most circuitous way possible...
Before you object... Given I was trying to get an LLM to edit files (along with using other tools), yes, the correct course of action would have been to call out to an existing coding agent, like Claude Code, or Codex, or... I don't know, there are a bunch of them out there now. But I was determined to get something working with my weird tiny agent, due to a few mostly bad reasons:
- I had already gone down this path and I wanted it to succeed. I never held much with the Sunk Cost Fallacy anyway.
- Stubbornness (see 1)
- I wanted to have more control and visibility over the experience. The idea was that the LLM would be porting one symbol at a time from C to Rust, and I wanted that process to be as automatic as possible. The existing coding agents tend to want more approvals (yes, I know there are various PR-based agents which are more automated and run in their little sandboxes, but setting up GitHub CI and then evaluating all of those was a step too far for this project; maybe not a bad idea, though!) or were at least higher-variance than I wanted.
- Curiosity: if you can step away for a moment from the frustration of LLMs not doing what you want, it's interesting to observe what they're good and bad at.
Patching with patch
I started my journey with the laziest option, which was to have the LLM generate a unified diff suitable for the patch command. A unified diff looks like this:
--- foo.py 2025-06-25 19:34:04
+++ bar.py 2025-06-25 19:33:57
@@ -1,5 +1,7 @@
# from operator import is_
from datetime import datetime
+from pathlib import Path
+from typing import List, Optional, Protocol
from .md2html import markdown_to_html
from .post import BlogPost, create_post, load_post
We have our two files (for us it's always the same file), a line locator, and some source context lines surrounding our patch. Despite the warning signs (seriously, I know as well as anyone that LLMs aren't good at counting), I thought this was a smart idea for a few reasons:
- LLMs should know what a unified diff looks like
- I didn't have to write the patch tool, someone else wrote it for me
- The unified diff format has some flexibility to handle slight errors in the search text
This... turned out to be not so smart. I tried a few variations: I brought in patch-ng to move the patching into Python, tried disregarding whitespace, adjusted the match tolerance... but it wasn't to be. I tweaked my prompt to provide more context and more examples. I even tried a few different models, but it seemed like the general consensus (amongst the models) was that unified diff was a bad idea and I should feel bad.
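For reference, the lazy version of this is just shelling out to the patch command and leaning on its fuzz and whitespace flags. Here's a rough sketch of that approach (illustrative, not my exact code; patch-ng does the same job from inside Python):

```python
import subprocess
import tempfile

def apply_unified_diff(diff_text: str, workdir: str) -> bool:
    """Apply an LLM-generated unified diff with GNU patch, tolerating small mismatches."""
    with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
        f.write(diff_text)
        diff_path = f.name

    result = subprocess.run(
        [
            "patch",
            "-p0",                  # use the paths from the diff as-is, relative to workdir
            "--fuzz=3",             # let the context lines be a little bit off
            "--ignore-whitespace",  # forgive stray whitespace differences
            "-i", diff_path,
        ],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```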
In hindsight, the issues were staring me in the face: by default we're expecting the LLM to produce an awkward type of patch (a file against itself), with an absolute line locator, and before and after context lines that needed to match exactly. Yes, we can adjust the tolerances, search through the file etc, but we were starting from a bad place.
Trying Alternative Formats
That's fine, we've got other options.
OpenAI must know how to generate a patch, right? I am not always a smart person. And like a dummy falling into the world's most obvious trap, for my next attempt I thought: Codex is a real coding agent, surely they've got patching nailed down, right? I can just use their prompt, which uses the "well-documented" V4A diff format, and I'll be golden. Every model must know what "V4A" means, even though I have no idea and the only searches turn up obscure OpenAI references. Right?
It turns out that I'm not the first to observe that team A, known for cutthroat competition in a competitive field, might not try to optimize, dare I say, might even de-optimize their product for team B. (Did you know that for years the top-performing C++ compiler was produced by Intel, and that it was intentionally prevented from optimizing for AMD processors (even when they had the exact same features), for "compatibility reasons"?) Needless to say, the Google models aren't especially well optimized for the obscure format that OpenAI trains on internally, and this experiment didn't go so well. Codex doesn't even work with Gemini because its OpenAI compatibility layer isn't OpenAI compatible enough, which, yes, in hindsight, I should have seen as a warning sign.
Alternative Formats, Take 2
Okay, so trusting OpenAI was a bad idea. But we've got other patching options. I've used the Aider coding tool in the past, and found that most of the time the Aider patch format works reasonably well. Aider's diff format roughly follows the git merge conflict style:
<<<<<<< SEARCH
from flask import Flask
=======
import math
from flask import Flask
>>>>>>> REPLACE
This format is looking much better. We've gotten rid of the pointless filename and line number management, and while the lack of context might hurt us in repetitive text, this isn't typically too much of a problem with source code, and the LLM can always just use a bigger "search" text to combat that. And as expected, our patch success rate jumped immediately after this change, from maybe 50% to >90%. Or I should say, our first patch success rate jumped.
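Another nice property is that applying this format is pleasantly dumb to implement. Here's a simplified sketch of an apply function for SEARCH/REPLACE blocks (not my exact implementation, and it punts on malformed patches):

```python
SEARCH_MARKER = "<<<<<<< SEARCH"
DIVIDER = "======="
REPLACE_MARKER = ">>>>>>> REPLACE"

def apply_search_replace(content: str, patch: str) -> str:
    """Apply one or more SEARCH/REPLACE blocks to `content`, raising if a block doesn't match."""
    lines = patch.splitlines(keepends=True)
    i = 0
    while i < len(lines):
        if lines[i].strip() != SEARCH_MARKER:
            i += 1  # skip filenames or chatter between blocks
            continue
        # Collect the SEARCH half.
        i += 1
        search = []
        while lines[i].strip() != DIVIDER:
            search.append(lines[i])
            i += 1
        # Collect the REPLACE half.
        i += 1
        replace = []
        while lines[i].strip() != REPLACE_MARKER:
            replace.append(lines[i])
            i += 1
        i += 1
        search_text, replace_text = "".join(search), "".join(replace)
        if search_text not in content:
            raise ValueError(f"Search text not found in file: {search_text[:100]!r}")
        content = content.replace(search_text, replace_text, 1)
    return content
```

Note that the failure message echoes the first bit of the search text back to the model; when a patch misses, that's most of what the model has to go on.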
Aside: Escaping is Hard
I mentioned the success rate jumped to >90%, not 100%. What's with the stragglers? This is almost certainly down to an implementation detail. I decided to implement the patch functionality as a regular tool in my agent workflow, instead of special-casing file edits as part of the conversation stream.
That is, I could have told the agent to write patches directly into the conversation, e.g. it would generate something like:
Okay I'm going to edit `build.rs` to add the new feature:
build.rs
<<<<<<< SEARCH
fn build_lib_with_feature
=======
fn build_lib_with_moar_features
>>>>>>> REPLACE
But that would be ever-so-slightly more inconvenient for me, so instead I had the LLM generate an explicit tool call:
edit_code({ patch: "build.rs\n<<<<<<< SEARCH\n..." })
Then I didn't need to special-case the edit_code behavior versus the other tool calls, and it made my code ever-so-slightly prettier (and if you've seen my code, you know I need every win I can get). But convenience comes at a cost. And I'm not the first to notice that LLMs are worse at editing via JSON. Paul explains as well as I can why this is the case:
In some sense this shouldn't be surprising. Just look at the very simple JSON example above, with the escaped quotes \" and newlines \n mixed into the code. Imagine the additional complexity if the code itself contained quoted strings with their own escape sequences.
While all the LLM providers will now generate correct JSON by only sampling from tokens that match the schema, the LLM is still being forced to generate the correctly escaped code inside of the JSON.
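To make that concrete, here's a toy example of what a SEARCH/REPLACE block turns into once it has to live inside a JSON tool argument (the Rust content is made up, but the escaping is exactly what the model has to produce):

```python
import json

# A hypothetical patch whose code contains quotes and newlines of its own.
patch = '''rust/src/lib.rs
<<<<<<< SEARCH
    println!("building \\"{}\\"", name);
=======
    println!("building \\"{}\\" with features", name);
>>>>>>> REPLACE
'''

# The string the model must emit, token by token, inside the tool call:
print(json.dumps({"patch": patch}))
# -> {"patch": "rust/src/lib.rs\n<<<<<<< SEARCH\n    println!(\"building \\\"{}\\\"\", name); ...
```

A couple of layers of escaping deep, and the search text still has to match the file byte-for-byte.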
Patches don't stack
The workflow I had for my mini-agent was quite constrained. The agent was given the following tools:
- read-file
- replace-file
- patch-file
- run-tests
- search (basically grep returning some amount of context)
The idea is that the LLM would be given a target Rust file to edit, and pointed at the C source code it was supposed to port over to Rust. It would then read the Rust and C files, generate a patch to replace the stub implementation with a real implementation, run the tests, and we'd be golden.
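For the curious, the tools were nothing exotic: plain function-calling declarations in the standard OpenAI/litellm schema, roughly like this (a reconstruction for illustration; the real definitions had longer descriptions):

```python
# Hypothetical reconstruction of the tool declarations; the shape is the
# standard OpenAI function-calling schema that litellm passes through.
def tool(name: str, props: dict) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "parameters": {
                "type": "object",
                "properties": props,
                "required": list(props),
            },
        },
    }

TOOLS = [
    tool("read_file", {"path": {"type": "string"}}),
    tool("replace_file", {"path": {"type": "string"}, "content": {"type": "string"}}),
    tool("patch_file", {"patch": {"type": "string"}}),
    tool("run_tests", {}),
    tool("search", {"pattern": {"type": "string"}}),
]
```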
And what I kept finding was that the agent would do something dumb but not entirely unreasonable with the first patch, e.g. insert some duplicate symbol imports in the process of writing the implementation (or try to replace the file without reading it first, no matter how much you prompt it with "you must read files before you edit them"; there's a reason why Claude Code is hard-coded to reject edits from the LLM unless the LLM issues a read call first). And then it would try to fix its error, maybe introducing another error, and by the third patch it had lost the ability to generate correct patches. Eventually, after spending lots of my hard-earned tokens, the model would concede patching was too hard for it, and it would re-read and then re-write the whole file. (Why didn't I just force the model to always rewrite the file? Mostly because I wanted the approach to scale up to larger source files, and it's quite possible to hit your 8k output token budget in that case.) Most of the time the model would manage to get something working within a reasonable number of rounds, but the failure rate had less to do with "this is a really tricky implementation" and more to do with "did the model happen to get the first patch correct". What was going on? Claude Code doesn't seem to have this problem, why can't I get the model to just write a patch correctly?
Patching a patch of a patch is hard
You already know where I was going with this.
Imagine you're a poor, hard-working LLM, put into a restricted environment where you need to explicitly do all these expensive tool calls to get your information. Naturally the user will yell at you if you needlessly generate read_file commands to refresh your memory, so you want to try your best to generate just the patch calls you need. Let's follow along with Gemini in this conversation.
- Messages 1-7: We spend our first few rounds researching the code base and reading our C source files. We now know what the C code looks like for our symbol and we're ready to implement!
- Messages 8-9: We generate our obligatory attempt to edit a file without reading it... and this unsurprisingly fails:
Patch failed to apply.
<<<
Search text not found in file.
Search text (first 100 chars):
extern "C" {
pub fn ZopfliCalculateBitLengths(count: *const size_t, n: size_t, maxbits: c_int, b...
File content (first 200 chars):
Note our context window at this point has a mix of the start of the file and our imaginary copy from our attempted patch.
- Messages 10-13: We read the file, realize there's a bunch of stuff in there, and generate a correct patch.
- Messages 14-22: We write our fuzz test and module definition.
- Messages 22-23: Our compile fails, and we need to fix up our library imports.
- Messages 23-47: We spend 25 rounds of state-of-the-art AI trying to fix up our imports with patches like the one below, because we can't seem to get them in the right order, or we forget what we imported...
>>>>>>> REPLACE
rust/src/lib.rs
<<<<<<< SEARCH
pub mod squeeze;
pub mod zopfli_calculate_block_size_auto_type;
=======
pub mod squeeze;
>>>>>>> REPLACE
Eventually we hit our limit of 50 messages before successfully getting the files patched up. (Note that a second run of this with some better initial system prompts and a bit more automated feedback went much better, but this trend of patches being hard continued. Having a system prompt which told the LLM to abandon the patch approach if needed can also help.)
...
The underlying problem of course is that patches need to stack on each other. Each time the LLM makes a patch, it needs to track the effect of that patch in addition to the previous state of the file. Even worse, when it makes a mistake, it needs to remember to ignore the failed patches. If we have a partial success, then we need to remember which parts went correctly and which didn't etc. The difficulty compounds upon itself as we add more onto our stack. This is no different than if you asked me to provide a diff for a single version of a file versus providing one after mentally applying a few previous changes. If the diffs interacted at all, I wouldn't stand a chance.
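To make the failure mode concrete, here's the problem in miniature, reusing the apply_search_replace sketch from earlier (the file contents are made up):

```python
original = "pub fn deflate() {\n    unimplemented!()\n}\n"

# Patch 1: the model replaces the stub with a first implementation.
patch_1 = """<<<<<<< SEARCH
    unimplemented!()
=======
    let mut out = Vec::new();
    compress(&mut out);
>>>>>>> REPLACE
"""
current = apply_search_replace(original, patch_1)

# Patch 2: the model wants to tweak its own code, but writes the SEARCH text
# from memory and misremembers exactly what patch 1 left behind.
patch_2 = """<<<<<<< SEARCH
    let mut out = Vec::new();
    compress(&mut out, 9);
=======
    let mut out = Vec::with_capacity(1024);
    compress(&mut out, 9);
>>>>>>> REPLACE
"""
apply_search_replace(current, patch_2)  # raises: Search text not found in file
```

One misremembered argument and the whole patch bounces, and every bounce adds more junk for the model to keep straight.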
Resetting our context
Once we understand what's happening, the solution is pretty obvious: we just need to remind the LLM what the file actually looks like every so often. The LLMs will do this for themselves as well, e.g. in our previous conversation we see Gemini issue the following message:
I'll read `deflate.rs` one more time to see if the module is already there.
📖 Reading files:
rust/src/deflate.rs
We don't necessarily have to reset our whole context of the previous patches, but by re-reading the file, we reset the "patch depth" the model has to reference when making changes. It can still get confused by the past history, but it no longer has to "mentally apply" the patches.
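You can also automate the reminder instead of waiting for the model to think of it. A sketch of the idea (not what my tool originally did, just the obvious tweak): when a patch fails more than once, return the real file contents along with the error, so the model doesn't have to ask.

```python
def patch_file(path: str, patch: str, state: dict) -> str:
    """Apply a patch; after repeated failures, re-anchor the model on the real file."""
    content = open(path).read()
    try:
        new_content = apply_search_replace(content, patch)
    except ValueError as err:
        state["failures"] = state.get("failures", 0) + 1
        if state["failures"] >= 2:
            # Reset the "patch depth": show the model what the file actually
            # looks like instead of letting it keep patching from memory.
            state["failures"] = 0
            return f"Patch failed: {err}\n\nCurrent contents of {path}:\n{content}"
        return f"Patch failed: {err}"
    open(path, "w").write(new_content)
    state["failures"] = 0
    return "Patch applied."
```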
Claude Code Cheats
The only reason I started writing this was that I was playing with the new Gemini CLI Agent and found it had the exact same problems I was having. It would accumulate a bunch of changes to a file, start running into issues reasoning about it, and then, amusingly, would give up, git restore the file, and try again. Naturally the second time around doesn't go any better, and the model ends up trying again and again, ad infinitum. I came back an hour later to find my usage budget had been blown through, the agent had attempted to fall back to Gemini Flash, and all sorts of red warning boxes were popping up.
Now on the one hand, it made me feel vindicated that maybe my crappy agent toolbox wasn't so bad, but it also made me curious about something else...
If you've used Claude Code much, you'll notice that it doesn't seem to have this problem. It will have edit failures occasionally, and its "semantic model" of a file can get off, but this seems much rarer than what I pointed out. Naturally some of this comes down to better prompting and of course model fine-tuning, but if you open up a Claude conversation log you'll see this curious output from the "Edit" command:
Metadata
Parent UUID: 42d65362-0fc1-4bd1-9e27-f1e5f552c8da
Session ID: 700aa1f1-ef43-41fc-bfd9-5f35e75047fb
Message UUID: a0fda4f3-82b6-4568-85c8-c0abea298c6d
Timestamp: 2025-06-24T01:57:32.582Z
Version: 1.0.33
User Type: external
Is Sidechain: false
Type: user
CWD: /Users/power/code/portkit/libxml2
Tool Use ID: toolu_01Apb9ZNqsewePAG36kFNpnS
Tool Result
The file /Users/power/code/portkit/libxml2/rust/build.rs has been updated. Here's the result of running cat -n on a snippet of the edited file:
1→use std::env;
2→use std::fs;
3→use std::path::PathBuf;
4→use std::process::Command;
5→use std::collections::HashMap;
6→
That's right, every time Claude edits a piece of code, the result of the edit is pasted back for the agent to observe. This is quite expensive - if we make a lot of small edits to a file, we're inflating our token budget with these copies - but it effectively circumvents the patch history problem we've been discussing. It's clearly a good choice of response for a coding agent.
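Replicating this in your own patch tool is only a few lines: after a successful edit, hand back a cat -n-style window around the change instead of a bare "ok". Here's a sketch that mimics the behavior shown in the log above (not Claude's actual implementation):

```python
def numbered_snippet(content: str, center_line: int, context: int = 5) -> str:
    """Return a cat -n style view of the lines around center_line (1-indexed)."""
    lines = content.splitlines()
    start = max(0, center_line - 1 - context)
    end = min(len(lines), center_line + context)
    return "\n".join(f"{i + 1:>6}→{lines[i]}" for i in range(start, end))

def edit_file(path: str, search: str, replace: str) -> str:
    content = open(path).read()
    if search not in content:
        return "Edit failed: search text not found."
    line_of_change = content[: content.index(search)].count("\n") + 1
    new_content = content.replace(search, replace, 1)
    open(path, "w").write(new_content)
    # Echo the edited region back so the model's picture of the file stays fresh.
    return (
        f"The file {path} has been updated. Here's the result of running "
        f"`cat -n` on a snippet of the edited file:\n"
        f"{numbered_snippet(new_content, line_of_change)}"
    )
```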
Reading Conversations
I strongly suspect that the Gemini Agent will adopt this technique momentarily and you won't be able to observe this discrepancy in a week's time, but I thought it was an amusing window into the world of how we can make simple things harder for ourselves.
This journey also reinforced my love for the most rudimentary debugging tool when interacting with LLMs: just read the conversation log. In most of my LLM projects now I take the messages from litellm or whatever and write them into logs/litellm/{timestamp}.log (I'm sure your favorite agent library has some features for this; I don't have a favorite agent library, so this is what works for me). Given the JSON, it's easy to turn this into a nice markdown or whatever document and then you can just ... read it. Or if you're lazy, have an LLM read it for you. And if you can't understand exactly what you're supposed to do given the conversation log, then neither can the LLM. And you should adjust your prompts and tool calls until the conversation makes sense. It's the world's dumbest prompt optimization technique and yet I find it works really well to get you to a decent baseline.
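The logging itself is nothing clever. Mine is more or less the following (the paths and field names are just what I happen to use, and it assumes the messages are plain role/content/tool_calls dicts):

```python
import json
import time
from pathlib import Path

def dump_conversation(messages: list[dict]) -> Path:
    """Write the raw message list to logs/litellm/{timestamp}.log as JSON."""
    log_dir = Path("logs/litellm")
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{int(time.time())}.log"
    path.write_text(json.dumps(messages, indent=2, default=str))
    return path

def render_markdown(messages: list[dict]) -> str:
    """Turn the JSON log into something a human (or another LLM) can skim."""
    chunks = []
    for msg in messages:
        chunks.append(f"## {msg.get('role', 'unknown')}")
        if msg.get("content"):
            chunks.append(str(msg["content"]))
        for call in msg.get("tool_calls") or []:
            fn = call["function"]
            chunks.append(f"**tool call:** `{fn['name']}({fn['arguments']})`")
        chunks.append("")
    return "\n".join(chunks)
```

Nothing fancy, but it's enough to make the transcripts skimmable, which is the whole point.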