Show HN: A light GPT-5 vs. Claude Code comparison

charlielabs.ai

by neom

Hi HN! Can’t believe I’ve been here over 12 years and this is my first Show HN.

I guess this is twofold. One: I'm doing another startup! Charlie is an agent for TypeScript teams, focused heavily on augmentation. :)

Two: Over the last week or so we put GPT-5 (through our Charlie Agent) head-to-head with Claude Code/Opus on 10 real TypeScript issues pulled from active OSS projects.

Our results

GPT-5 beat Claude Code on all 10 case-by-case comparisons.

Pull requests generated by GPT-5 resolved 29% more issues than o3.

PR review quality rose 5% versus o3.

Head-to-head case study

We measured testability, description, and overall quality across 10 head-to-head PRs. Testability measures how thoroughly a code change is exercised by meaningful, behavior-focused tests. It considers whether tests are present and aligned with the diff, whether they explore edge cases and real-world scenarios, and whether they avoid vacuous, misleading, or implementation-dependent patterns common in code generated by LLMs.

Description evaluates how clearly and accurately a pull request's title and summary convey the purpose, scope, and structure of the code change. It emphasizes technical correctness, relevance to the diff, and clarity for future readers, penalizing vague, verbose, or hallucinated explanations often produced by code-generating agents.

Quality assesses the substance and craftsmanship of the code change itself: whether it is correct, minimal, idiomatic, and free from hallucinated constructs. It emphasizes clarity, alignment with project norms, and logical integrity, while flagging agent-specific pitfalls like over-engineering, incoherent abstractions, or invented utilities.
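
For concreteness, the rubric boils down to three 0-1 scores per PR. A minimal sketch of the shape (illustrative TypeScript, not our actual harness code):

    // Illustrative only: simplified shape of the per-PR rubric scores.
    interface PrScores {
      testability: number; // 0-1: behavior-focused tests aligned with the diff
      description: number; // 0-1: accurate, non-hallucinated title + summary
      quality: number;     // 0-1: correct, minimal, idiomatic change
    }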

Testability: Charlie (0.69) vs Claude (0.55)
Description: Charlie (0.84) vs Claude (0.90)
Overall quality: Charlie (0.84) vs Claude (0.65)

Caveats

Single-shot runs; no human feedback loop. The quality score uses a secondary LLM reviewer (subjective, but transparent).
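
For the curious, the reviewer step is roughly shaped like this, using the PrScores shape above (a hedged sketch assuming the openai Node SDK; judgePr and the prompt are illustrative, not our production harness):

    // Hypothetical LLM-as-judge call: score one PR diff against the rubric.
    import OpenAI from "openai";

    const client = new OpenAI();

    async function judgePr(diff: string): Promise<PrScores> {
      const res = await client.chat.completions.create({
        model: "gpt-5", // placeholder; any strong reviewer model works
        messages: [
          {
            role: "system",
            content:
              "Score this PR diff from 0 to 1 on testability, description, " +
              "and quality. Reply with JSON: {testability, description, quality}.",
          },
          { role: "user", content: diff },
        ],
      });
      // Assumes the model returns bare JSON; a real harness should validate.
      return JSON.parse(res.choices[0].message.content ?? "{}") as PrScores;
    }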

Def looking for feedback on more evaluations we can run. Also, please do nit-pick the prompts, ideas, harness design, etc. Tell us if this bar (CI + types) is the right one, or what you'd track instead.

On a personal note: I've spent my career working on tools that help creators create, and I'm extremely passionate about making it easier for people to do more. I'm still somewhat uneasy about Gen AI, but I do believe the future is bright. Things are certainly going to change; I'd encourage you all to stay optimistic, builders.

Thanks for taking a look!