GPT-5.3 Codex vs Opus 4.6: We benchmarked both on our production Rails codebase — the results were surprising

We use and love both Claude Code and Codex CLI agents.

Public benchmarks like SWE-Bench don't tell you how a coding agent performs on your own codebase.

For example, ours is a Ruby on Rails codebase with Phlex components, Stimulus JS, and other idiosyncratic choices. SWE-Bench, meanwhile, is all Python.

So we built our own SWE-Bench!

Methodology:

  1. We selected PRs from our repo that represent great engineering work.
  2. An AI infers the original spec from each PR (the coding agents never see the solution).
  3. Each agent independently implements the spec.
  4. Three separate LLM evaluators (Claude Opus 4.5, GPT 5.2, Gemini 3 Pro) grade each implementation on correctness, completeness, and code quality, so no single model's bias dominates (see the sketch below).
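
To make step 4 concrete, here's a minimal Ruby sketch of the grading step. The rubric dimensions and evaluator names come from the list above; everything else (the `score_with` stub, the random stubbed scores, and the example spec/diff strings) is illustrative rather than our actual harness.

```ruby
# Minimal sketch of the grading step, not the production harness.
RUBRIC     = %i[correctness completeness code_quality].freeze
EVALUATORS = ["claude-opus-4.5", "gpt-5.2", "gemini-3-pro"].freeze

# Placeholder: in practice this sends the inferred spec plus the agent's diff
# to one evaluator model and parses a 0..1 score for each rubric dimension.
def score_with(evaluator, spec:, diff:)
  RUBRIC.to_h { |dim| [dim, rand.round(2)] } # stubbed scores for the sketch
end

def grade(spec:, diff:)
  per_model = EVALUATORS.map { |m| score_with(m, spec: spec, diff: diff) }
  # Average each dimension across the three evaluators, then average the
  # dimensions into a single quality score like the ones reported below.
  dims = RUBRIC.to_h do |dim|
    [dim, per_model.sum { |scores| scores[dim] } / per_model.size.to_f]
  end
  dims.merge(quality: (dims.values.sum / dims.size).round(2))
end

p grade(spec: "hypothetical spec text", diff: "hypothetical agent diff")
```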

(Chart: GPT-5.3 Codex vs Opus 4.6 benchmarked on our production Rails codebase.)

The headline numbers

  • GPT-5.3 Codex: ~0.70 quality score at under $1/ticket
  • Opus 4.6: ~0.61 quality score at ~$5/ticket

Codex is delivering better code at roughly 1/7th the price (assuming GPT-5.3 Codex API pricing matches GPT 5.2's). Opus 4.6 is a tiny improvement over 4.5, but underwhelming for what it costs.
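The back-of-the-envelope math, assuming ~$0.70/ticket for Codex (which is roughly what "under $1" and "1/7th the price" together imply) and ~$5/ticket for Opus:

```ruby
# Rough quality-per-dollar comparison; per-ticket costs are assumptions, not exact API bills.
codex = 0.70 / 0.70 # ≈ 1.00 quality points per dollar
opus  = 0.61 / 5.00 # ≈ 0.12 quality points per dollar
puts((codex / opus).round(1)) # ≈ 8.2x as much quality per dollar
```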

We tested other agents too (Sonnet 4.5, Gemini 3, Amp, etc.) — full results in the image.

Run this on your own codebase

It works with any stack: you pick PRs from your repos, select which agents to test, and get a quality-vs-cost breakdown specific to your code. The benchmark is in beta; reach out at team@superconductor.com if you want access.