GPT-5.3 Codex vs Opus 4.6: We benchmarked both on our production Rails codebase — the results were surprising

We use and love both Claude Code and Codex CLI agents.

Public benchmarks like SWE-Bench don't tell you how a coding agent performs on your own codebase.

For example, ours is a Ruby on Rails codebase with Phlex components, Stimulus JS, and other idiosyncratic choices. SWE-Bench, meanwhile, is all Python.

So we built our own SWE-Bench!

Methodology:

  1. We selected PRs from our repo that represent great engineering work.
  2. An AI infers the original spec from each PR (the coding agents never see the solution).
  3. Each agent independently implements the spec.
  4. Three separate LLM evaluators (Claude Opus 4.5, GPT 5.2, Gemini 3 Pro) grade each implementation on correctness, completeness, and code quality, so no single model's bias dominates (see the sketch below).
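
To make step 4 concrete, here's a minimal Ruby sketch of the grading step. The rubric dimensions and evaluator names come from the list above; everything else (the `score_with` stub, the random stubbed scores, and the example spec/diff strings) is illustrative rather than our actual harness.

```ruby
# Minimal sketch of the grading step, not the production harness.
RUBRIC     = %i[correctness completeness code_quality].freeze
EVALUATORS = ["claude-opus-4.5", "gpt-5.2", "gemini-3-pro"].freeze

# Placeholder: in practice this sends the inferred spec plus the agent's diff
# to one evaluator model and parses a 0..1 score for each rubric dimension.
def score_with(evaluator, spec:, diff:)
  RUBRIC.to_h { |dim| [dim, rand.round(2)] } # stubbed scores for the sketch
end

def grade(spec:, diff:)
  per_model = EVALUATORS.map { |m| score_with(m, spec: spec, diff: diff) }
  # Average each dimension across the three evaluators, then average the
  # dimensions into a single quality score like the ones reported below.
  dims = RUBRIC.to_h do |dim|
    [dim, per_model.sum { |scores| scores[dim] } / per_model.size.to_f]
  end
  dims.merge(quality: (dims.values.sum / dims.size).round(2))
end

p grade(spec: "hypothetical spec text", diff: "hypothetical agent diff")
```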

(Chart: GPT-5.3 Codex vs Opus 4.6 benchmarked on our production Rails codebase.)

The headline numbers

  • GPT-5.3 Codex: ~0.70 quality score at under $1/ticket
  • Opus 4.6: ~0.61 quality score at ~$5/ticket

Codex is delivering better code at roughly 1/7th the price (assuming GPT-5.3 Codex API pricing matches GPT 5.2's). Opus 4.6 is a tiny improvement over 4.5, but underwhelming for what it costs.
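The back-of-the-envelope math, assuming ~$0.70/ticket for Codex (which is roughly what "under $1" and "1/7th the price" together imply) and ~$5/ticket for Opus:

```ruby
# Rough quality-per-dollar comparison; per-ticket costs are assumptions, not exact API bills.
codex = 0.70 / 0.70 # ≈ 1.00 quality points per dollar
opus  = 0.61 / 5.00 # ≈ 0.12 quality points per dollar
puts((codex / opus).round(1)) # ≈ 8.2x as much quality per dollar
```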

We tested other agents too (Sonnet 4.5, Gemini 3, Amp, etc.) — full results in the image.

Run this on your own codebase

It works with any stack: you pick PRs from your repos, select which agents to test, and get a quality-vs-cost breakdown specific to your code. The benchmark is in beta; reach out at team@superconductor.com if you want access.