None of these tools perform particularly well and all lack context to actually p...

dakshgupta · 2026-01-26T18:51:18 1769453478

I agree that none perform _super_ well.

I would argue they go far beyond linters now, which was perhaps not true even nine months ago.

To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.

onedognight · 2026-01-26T19:00:08 1769454008

I fully agree. Claude’s review comments have been 50% useful, which is great. For comparison I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude’s good finds are orthogonal to those found by other human reviewers on our team. I.e. it points out things human reviewers miss consistently and v.v.

Sharlin · 2026-01-26T19:52:22 1769457142

TBH that sounds like TeamScale just has too verbose default settings. On the other hand, people generally find almost all of the lints in Clippy's [1] default set useful, but if you enable "pedantic" lints, the signal-to-noise ratio starts getting worse – those generally require a more fine-grained setup, disabling and enabling individual lints to suit your needs.

[1] https://doc.rust-lang.org/stable/clippy/

blibble · 2026-01-26T19:29:55 1769455795

> To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.

do you have a bot to do this too?

BlackFly · 2026-01-27T13:27:00 1769520420

For it to be evidence, you would need to know the number of Greptile comments made and how many of those comments were instead considered to be poor. You need to contrast false positive rate with true positive rate to simply plot a single point along a classifier curve. You would then need to contrast that with a control group of experts or a static linter which means you would need to modify the "conservativeness" of the classifier to produce multiple points along its ROC curve, then you could compare whether the classifier is better or worse than your control by comparing the ROC curves.

Sample number of true positives says more or less nothing on its own.

boredtofears · 2026-01-26T19:16:54 1769455014

That sounds more like confirmation that greptile is being included in a lot agentic coding loops than anything

johnsillings · 2026-01-26T21:28:07 1769462887

I like number of "great catches" as a measure of AI code review effectiveness

mulmboy · 2026-01-26T19:38:43 1769456323

People more often say that to save face by implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment

estimator7292 · 2026-01-26T22:20:58 1769466058

When mature, funtional adults say it, the read is "wow, I would have missed that, good job, you did better than me".

Reading embarrassment into that is extremely childish and disrespectful.

mulmboy · 2026-01-27T00:20:39 1769473239

What I'm saying is that a corporate or professional environment can make people communicate in weird ways due to various incentives. Reading into people's communication is an important skill in these kinds of environments, and looking superficially at their words can be misleading.

written-beyond · 2026-01-26T19:07:37 1769454457

I mean how far Rusts own clippy lint went before any LLMs was actually insane.

Clippy + Rusts type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly reduced the bar for bringing clippy quality linting to every language but at the cost of determinism.

tadfisher · 2026-01-26T19:03:14 1769454194

Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.

fragmede · 2026-01-26T19:33:07 1769455987

over 7 days does contextualize it some, though.

9,078 comments / 7 (days) / 8 (hours) = 162.107 though, so if human that person is making 162 comments an hour, 8 hours a day, 7 days a week?

shimman · 2026-01-27T13:37:07 1769521027

Bro stop trying to deflate the boosters, they got wares to sell and shares to dump.

shagie · 2026-01-26T19:06:24 1769454384

In some code that I was working on, I had

    // stuff
    obj.setSomeData(something);
    // fifteen lines of other code
    obj.setSomeData(something);
    // more stuff

The 'something' was a little bit more complex, but it was the same something with slightly different formatting.

My linter didn't catch the repeat call. When asking the AI chat for a review of the code changes it did correctly flag that there was a repeat call.

It also caught a repeat call in

    List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
    
    // ...

    var something = someFunc(objs);

    Thingy someFunc(List<Objs> param) {
        return param.stream().filter(o -> o.field.isPresent()). ...

Where one of the filter calls is unnecessary... and it caught that across a call boundary.

So, I'd say that AI code reviews are better than a linter. There's still things that it fusses about because it doesn't know the full context of the application and the tables that make certain guarantees about the data, or code conventions for the team (in particular the use of internal terms within naming conventions).

realusername · 2026-01-26T19:40:45 1769456445

I had a similar review by AI except my equivalent of setSomeData was stateful and needed to be there in both places, the AI just didn't understand any of it.

james_marks · 2026-01-26T19:45:37 1769456737

When this happens to me it makes me question my design.

If the AI doesn’t understand it, chances are it’s counter-intuitive. Of course not all LLM’s are equal, etc, etc.

wolletd · 2026-01-26T20:07:37 1769458057

Then again, I have a rough idea on how I could implement this check with some (language-dependent) accuracy in a linter. With LLM's I... just hope and pray?

realusername · 2026-01-26T23:53:30 1769471610

I'd agree with that but in the JS world, there's a lot of questionable library designs that are outside of my control.

frde_me · 2026-01-26T23:08:43 1769468923

My reaction in that case is that most other readers of the codebase would probably also assume this, and so it should be either made clearer that it's stateful, or it should be refactored to not be stateful

uoaei · 2026-01-26T19:36:24 1769456184

I'd say I see one anecdote, nothing to draw conclusions from.

fcarraldo · 2026-01-26T21:02:21 1769461341

Why isn’t `obj` immutable?

shagie · 2026-01-26T23:10:25 1769469025

Because 'obj' is an object that was generated by a json schema and pulled in as a dependency. The pojo generator was not set up to create immutable objects.

tayo42 · 2026-01-26T20:32:21 1769459541

Unit tests catch that kind of stuff

shagie · 2026-01-26T23:11:59 1769469119

The code works perfectly - there is no issue that a unit test could catch... unless you are spying on internally created objects to a method and verifying that certain functions are called some number of times for given data.

tayo42 · 2026-01-27T00:28:34 1769473714

Sure and you can do that

shagie · 2026-01-27T01:45:54 1769478354

Trying to write the easiest code that I could test... I don't think I can without writing an excessively brittle test that would break at the slightest implementation change.

So you've got this Java:

    public List<Integer> someCall() {
        return IntStream.range(1,10).boxed().toList();
    }

    public List<Integer> filterEvens(List<Integer> ints) {
        return ints.stream()
                .filter(i -> i % 2 == 0)
                .toList();
    }

    int aMethod() {
        List<Integer> data = someCall();
        return filterEvens(data.stream().filter(i -> i % 2 == 0).toList()).size();
    }

And I can mock the class and return a spied'ed List. But now I've got to have that spied List return a spied stream that checks to see if .filter(i -> i % 2 == 0) was called. But then someone comes and writes it later as .filter(i -> i % 2 != 1) and the test breaks. Or someone adds another call to sort them first, and the test breaks.

To that end, I'd be very curious to see the test code that verifies that when aMethod() is called that the List returned by SomeCall is not filtered twice.

What's more, it's not a useful test - "not filtered twice" isn't something that is observable. It's an implementation detail that could change with a refactoring.

Writing a test that verifies that filterEvens returns a list that only contains even numbers? That's a useful test.

Writing a test that verifies that aMethod returns back the size of the even numbers that someCall produced? That's a useful test.

Writing a test that tries to enforce a particular implementation between the {} of aMethod? That's not useful and incredibly brittle (assuming that it can be written).

nl · 2026-01-27T03:43:07 1769485387

You are correct and the objection is just completely invalid. There's no way anyone would or should write tests like this at the client level.

I think they are just arguing for the sake of arguing.

tayo42 · 2026-01-27T05:11:17 1769490677

You mention the tools you can use to make it happen.

I think we're at the point where you need concrete examples to talk about whether it's worth it or not. If you have functions that can't be called twice, then you have no other option to test details in the implementation like that.

Yeah there's a tradeoff between torturing your code to make everything about it testable and enforce certain behavior or keeping it simpler.

I have worked in multiple code bases where every function call had asserts on how many times it was called and what the args were.

shagie · 2026-01-27T14:14:15 1769523255

In functions that you write, that might be possible.

How would you assert that a given std::vector only was filtered by std::ranges::copy_if once? And how would you test that the code that was in the predicate for it wasn't duplicated?

How would you write a failing test for this function keeping the constraint that you are working with std::vector?

    std::vector<int> doThing(const std::vector<int>& nums) {
        std::vector<int> tmp1;
        std::vector<int> tmp2;
        std::ranges::copy_if(data,
                             std::back_inserter(tmp1),
                             [](int n) { return n % 2 == 0; }
        );
        std::ranges::copy_if(tmp1,
                             std::back_inserter(tmp2),
                             [](int n) { return n % 2 == 0; }
        );
        return tmp2;
    }

tayo42 · 2026-01-28T07:19:41 1769584781

I know how I would do it in python. This is built into the stdlibs testing library, with mocks.

Maybe dependency injection and function pointers for the copy if function. Then you can check the call counts in your tests. But idk the cpp eco system and what's available there to do it.

shagie · 2026-01-28T14:13:06 1769609586

The python code would be

    def some_call():
        return [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    
    def print_evens(nums):
        for n in filter(lambda n: n % 2 == 0, nums):
            print(n)
    
    def func():
        filtered = list(filter(lambda n: n % 2 == 0, some_call()))
        print_evens(filtered)
    
    if __name__ == "__main__":
        func()

How would you write a failing test that prevents the list from some_call() from having the same filter applied to it twice?

tayo42 · 2026-01-28T16:19:07 1769617147

You would use something like this

https://docs.python.org/3/library/unittest.mock.html#unittes...

Then you would make the mock filter with patch and test the `func` function

psuedo python code would be

   @patch(builtins.filter)
   def test_func_filter_calls(mock_filter):
     mock_filter.return_value = [2,4,6]
     func()
     mock_filter.assert_called_once_with([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

shagie · 2026-01-28T17:42:56 1769622176

That wouldn't fail though. It was called only once with [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The second time it was called with [2, 4, 6, 8, 10].

Likewise, if some_call returned [2, 4, 6, 8, 10] instead, it should only be called with [2, 4, 6, 8, 10] once then.

However, the purpose of this test then becomes questionable. Why are you testing implementation details rather than observable? Is there anything that you could observe that depended on the filter being called once or twice with the same filter function?

tayo42 · 2026-01-28T18:23:57 1769624637

Did you try it? If it doesn't work there's also called once if you scroll up on the doc

And as far as whether it's a good idea or not, I generally wouldn't, but was saying when it is important then you do have these tools available,llms aren't the first thing to check for these mistakes. It's up to the engineer to choose between trade offs for your scenario.

shagie · 2026-01-28T19:38:15 1769629095

Yes. The test passes. https://imgur.com/a/4qlTKlc

To try to monkey patch this in, you would need to also assert that it wasn't called with [2, 4, 6, 8, 10].

At which point, I would again ask "why are you testing that it _wasn't_ called with a given set of values?"

The comment at the root of this is "Unit tests catch that kind of stuff".

... But unit tests aren't for testing internals of implementation but rather observable aspects of a function.

Consider if the code was written so that it was

    def print_evens(nums):
        for n in nums:
            if n % 2 == 0:
                print(n)

instead (with the filter being used in func())

This isn't something that unit tests can (or should) identify. It would come out in a code review that there is redundant functionality in func and print_evens.

Using ChatGPT or another tool to assist in doing code reviews can be helpful (my original premise).

https://chatgpt.com/share/697a64a6-33c0-8011-a0f8-ca4fec74ab...

ChatGPT properly identifies the duplicated functionality (even though the code is using different idioms for doing the filtering for even numbers).

tayo42 · 2026-01-28T20:39:20 1769632760

The test pass because your patching print_even so the 2nd filter is never called.

Which I guess, idk maybe think through the testing more and your code more before jumping to conclusions about how things are?

Testing is one tool you have, and it can test the internal like this. Obviously there's a use for it if its in the Python stdlib

This is in mockito

https://stackoverflow.com/questions/39452438/mockito-how-to-...

this is in google testing library for cpp

https://google.github.io/googletest/gmock_for_dummies.html

> Specify your expectations on them (How many times will a method be called? With what arguments? What should it do? etc.).

If you never heard of this, I guess you learned something new? Im not a tutor though. I would read the docs more and experiment. Maybe chatgpt can help you with how tests can be written.

shagie · 2026-01-29T00:48:06 1769647686

With Mockito, I can mock the returned result of someCall().

However, it also means mocking list.stream() and mocking the Stream for stream.filter() and mock the call stream.toList() to return a new mocked object that has those mocks on it again.

I could catch the object passed in to printEven(...) but that has no history on it to see if filter was called on it before.

Trying to do the filter(...) call would be especially hard since you'd be parameterize it with a code block.

And all this returns to "is this a useful test?"

Testing should only be done on the observable parts of the function. Does printEven only print even numbers?

The tests that you are proposing are testing the implementation of those calls to work in a specific way. "It must call filter" - but if it's changed to a different filter or if it's changed to not use a filter but has the same functionality the code breaks.

Inefficient? Yes. Bad? Yes. Wrong - no. And not being wrong it isn't something that a unit test could validate without going unnecessarily into the implementation of the internals for the method. Internals changing while the contract remains the same is perfectly acceptable and shouldn't be breaking a unit test.

anthonypasq96 · 2026-01-26T20:54:54 1769460894

youre verifying std lib function call counts in unit tests? lmao.

tayo42 · 2026-01-26T21:07:21 1769461641

You can do that with mocks if it's important that something is only called once, or likely there's some unintended side effect of calling it twice and tests woukd catch the bug

anthonypasq96 · 2026-01-26T22:29:38 1769466578

i know you could do it, im asking why on earth you would feel its vital to verify stream.filter() was called twice in a function

noitpmeder · 2026-01-26T21:08:17 1769461697

You're not verifying the observable behavior of your application? lmao

shagie · 2026-01-26T23:25:38 1769469938

How would you suggest tests around:

    void func() {
        printEvens(someCall().stream().filter(n -> n % 2 == 0).toList());
    }

    void printEvens(List<Integer> nums) {
        nums.stream().filter(n -> n % 2 == 0).forEach(n -> System.out.println(n));
    }

The first filter is redundant in this example. Duplicate code checkers are checking for exactly matching lines.

I am unaware of any linter or static analyzer that would flag this.

What's more, unit tests to test the code for printEvens (there exists one) pass because they're working properly... and the unit test that calls the calling function passes because it is working properly too.

Alternatively, write the failing test for this code.

tayo42 · 2026-01-27T00:32:49 1769473969

Idk how exactly to do it in cpp becasue I'm not familiar with the tooling

You could write a test that makes sure the output of someCall is passed directly to printeven without being modified.

The example as you wrote is hard to test in general. It's probably not something you would write if your serious about testing.

shagie · 2026-01-27T00:59:34 1769475574

In C++, the code would look like:

    #include <vector>
    #include <iostream>
    #include <algorithm>

    std::vector<int> someCall()
    {
        return {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    }

    void printEvens(const std::vector<int>& nums)
    {
        std::ranges::for_each(nums, [](int n)
        {
            if (n % 2 == 0)
            {
                std::cout << n << '\n';
            }
        });
    }

    int main()
    {
        std::vector<int> data = someCall();
        std::vector<int> tmp;

        std::ranges::copy_if(data,
                             std::back_inserter(tmp),
                             [](int n) { return n % 2 == 0; }
        );
    
        printEvens(tmp);
        return 0;
    }

---

Nothing in there is wrong. There is no test that would fail short of going through the hassle of creating a new type that does some sort of introspection of its call stack to verify which function its being called in.

Likewise, identify if a linter or other static analysis tool could catch this issue.

Yes, this is a contrived example and it likely isn't idiomatic C++ (C++ isn't my 'native' language). The actual code in Java was more complex and had a lot more going on in other parts of the files. However, it should serve to show that there isn't a test for printEvens or someCall that would fail because it was filtered twice. Additionally, it should show that a linter or other static analysis wouldn't catch the problem (I would be rather impressed with one that did).

From ChatGPT a code review of the code: https://chatgpt.com/share/69780ce6-03e0-8011-a488-e9f3f8173f...

wtetzner · 2026-01-27T12:56:31 1769518591

> You could write a test that makes sure the output of someCall is passed directly to printeven without being modified.

But why would anyone ever do that? There's nothing incorrect about the code, it's just less efficient than it should be. There's no reason to limit calls to printEven to accept only output from someCall.

quietbritishjim · 2026-01-26T22:38:29 1769467109

A redundant filter() isn't observable (except in execution time).

You could pick it up if you were to explicitly track whether it's being called redundantly but it'd be very hard and by the time you'd thought of doing that you'd certainly have already manually checked the code for it.

anthonypasq96 · 2026-01-26T22:28:48 1769466528

what happened to not testing implementation details?

gherkinnn · 2026-01-26T19:41:33 1769456493

Opus 4.5 catches all sorts of things a linter would not, and with little manual prompting at that. Missing DB indexes, forgotten migration scenarios, inconsistencies with similar services, an overlooked edge case.

Now I'm getting a robot to review the branch at regular intervals and poking holes in my thinking. The trick is not to use an LLM as a confirmation machine.

It doesn't replace a human reviewer.

I don't see the point of paying for yet another CI integration doing LLM code review.

philipwhiuk · 2026-01-27T00:51:21 1769475081

Currently attempting to get GitLab Duo's review featured enabled as a 'second pair of eyes'. I agree 100% that it's not replacing a human review.

I would on the whole prefer a 'lint-style' tool to catch most stuff because they don't hallucinate.

But obviously they don't catch everything so an LLM-based review seems like an additional useful tool.

storystarling · 2026-01-26T22:03:47 1769465027

I came to the same conclusion and ended up wiring a custom pipeline with LangGraph and Celery. The markup on the SaaS options is hard to justify given the raw API costs. The main benefit of rolling it yourself seems to be the control over context retrieval—I can force it to look at specific Postgres schemas or related service definitions that a generic CI integration usually misses.

philipwhiuk · 2026-01-27T00:54:38 1769475278

Personally I'm hoping that once the bubble bursts and hardware improvement catches up, we start seeing reasonable prices for reasonable models on SaaS platforms that are not scary for SecOps.

Not guaranteed though of course.

roncesvalles · 2026-01-27T02:46:43 1769482003

Exactly. This is like buying a smoothie blender when you already have an all-purpose mixer-blender. This whole space is at best an open-source project, not a (multiple!) whole company.

It's very unlikely that any of these tools are getting better results than simply prompting verbatim "review these code changes" in your branch with the SOTA model du jour.

guelo · 2026-01-27T04:14:56 1769487296

All those llm wrapper companies make no sense.

ohyoutravel · 2026-01-26T20:14:25 1769458465

You’ve found the smoking gun!

justapassenger · 2026-01-26T20:39:46 1769459986

AI code review to me is similar to AI code itself. It's good (and constantly getting better) at dealing with mundane things, like - is the list reversed correctly? Are you dealing with pointers correctly? Do you have off by 1 issues?

Where they suck is high level problems like - is the code actually solving the business problem? Is it using right dependencies? Does it fit into broader design?

Which is expected for me and great help. I'm more happy as a human to spend less time checking if you're managing lifecycle of the pointer correctly and focus on ensuring that code is there to do what it needs to do.

The_Fox · 2026-01-26T20:31:42 1769459502

I installed CodeRabbit for our reviews in GitLab and am pretty happy with the results, especially considering the low price ($15/user/mo I think).

It regularly finds problems, including subtle but important problems that human reviewers struggle to find. And it can make pretty good suggestions for fixes.

It also regularly complains about things that are possible in theory but impossible in practice, so we've gotten used to just resolving those comments without any action. Maybe if we used types more effectively it would do that less.

We pay a lot more attention to what CodeRabbit says than what DeepSource said when use used it.

cbovis · 2026-01-26T19:59:18 1769457558

GH Copilot is definitely far better than just a linter. I don't have examples to hand but one thing that's stood out to me is its use of context outside the changes in the diff. It'll pull in context that typically isn't visible in the PR itself, the sort of things that only someone experienced in the code base with good recall would connect the dots on (e.g. this doesn't conform to typical patterns, or a version of this is already encapsulated in reusable code, or there's an existing constant that could be used here instead of the hardcoded value you have).

bartread · 2026-01-26T20:17:30 1769458650

I don't know that I fully agree with that. I use Copilot for AI code review - just because it's built in to GitHub and it's easy - and I'd say results are variable, but overall decent.

Like anything else AI you need to understand what you're doing, so you need to understand your code and the structure of your application or service or whatever because there are times it will say something that's just completely wide of the mark, or even the polar opposite of what's actually the case. And so you just ignore the crap and close the conversation in those situations.

At the same time, it does catch a lot of bugs and problems that fall into classes where more traditional linters really miss the mark. It can help fill holes in automated testing, spot security issues, etc., and it'll raise PRs for fixes that are generally decent. Sometimes not but, again, in these cases you just close them and move on.

I'd certainly say that an AI code review is better than no code review at all, so it's good for a startup where you might be the only developer or where there are only one or two of you and you don't cross over that much.

But the point I actually wanted to get to is this: I use Copilot because it's available as part of my GitHub subscription. Is it the best? I don't know. Does it add value with zero integration cost to me? Yes. And that, I suspect, is going to make it the default AI code review option for many GitHub subscribers.

That does leave me wondering how much of a future there is for AI code review as a product or service outside of the hosting platforms like GitHub and Gitlab, and I have to imagine that an absolutely savage consolidation is coming.

vimda · 2026-01-26T19:05:29 1769454329

Anecdotally, Claude Bug Bot has actually been super impressive in understanding non trivial changes. Like, today, it noted a race condition in a ~1000 line go change that go test -race didnt pick up. There are definitely issues though. For one, it's non deterministic, so you end up with half a dozen commits, with each run noting different issues. For a second, it tends to be quite in favour of premature optimisation. But over all, well worth it in my experience

ChadNauseam · 2026-01-27T08:08:57 1769501337

I haven't used the bug bot, but I like asking claude code to just review my PR in the command line. Yesterday it found a bug in a data structure I was implementing (it didn't support ZSTs properly). Of course, the fix it suggested was completely wrong, but what are ya gonna do. Still saved me from embarrassing myself before asking for a review

athrowaway3z · 2026-01-26T19:14:35 1769454875

> The SOTA isn't capable of using a code diff as a jumping off point.

Not a jumping off point, but I'm having pretty great results on a complicated fork on a big project with a `git diff main..fork > main.diff`, then load in the specs I keep, and tell it to review the diff in chunks while updating a ./review.md

It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.

storystarling · 2026-01-27T08:47:24 1769503644

I suspect this is primarily a unit economics problem. To get context beyond the diff you really need the full repository or a robust AST, but the token costs to load that state for every PR make the margins impossible right now.

victorbjorklund · 2026-01-26T21:53:18 1769464398

They 100% catch bugs in code I work on. Is it replacing human review fully? No, not yet. But it is a useful tool. Just like most of us wouldn’t do a code review without having tests, linters etc run first.

RetpolineDrama · 2026-01-26T22:31:10 1769466670

>The SOTA isn't capable of using a code diff as a jumping off point.

The low quality of HN comments has been blowing my mind.

I have quite literally been doing what you describe every working day for the last 6+ months.