GPT-5.3 Codex vs Opus 4.6: We benchmarked both on our production Rails codebase — the results were surprising
We use and love both Claude Code and Codex CLI agents.
Public benchmarks like SWE-Bench don't tell you how a coding agent performs on your own codebase.
For example, ours is a Ruby on Rails codebase with Phlex components, Stimulus JS, and other idiosyncratic choices, while SWE-Bench is all Python.
So we built our own SWE-Bench!
Methodology:
- We selected PRs from our repo that represent great engineering work.
- An AI infers the original spec from each PR (the coding agents never see the solution).
- Each agent independently implements the spec.
- Three separate LLM evaluators (Claude Opus 4.5, GPT-5.2, Gemini 3 Pro) grade each implementation on correctness, completeness, and code quality, so no single model's bias dominates (a minimal sketch of this grading step follows the list).
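Here is a minimal sketch of that grading step, assuming each evaluator returns a 0-1 score per rubric dimension and the final quality score is the mean across evaluators and dimensions. The model names match the ones above, but the client call and rubric shape are illustrative, not our production harness.

```ruby
# Illustrative grading step (not the production harness): three evaluator
# models each score an implementation on three rubric dimensions, and the
# final quality score is averaged across evaluators and dimensions so no
# single model's bias dominates.
EVALUATORS = ["claude-opus-4.5", "gpt-5.2", "gemini-3-pro"].freeze
DIMENSIONS = %i[correctness completeness code_quality].freeze

# `ask_evaluator` is a stand-in for whatever API client you use; it should
# return a Hash like { correctness: 0.8, completeness: 0.7, code_quality: 0.9 }.
def score_implementation(spec, diff, ask_evaluator:)
  per_evaluator = EVALUATORS.map do |model|
    scores = ask_evaluator.call(model: model, spec: spec, diff: diff)
    scores.values_at(*DIMENSIONS).sum / DIMENSIONS.size.to_f
  end
  per_evaluator.sum / per_evaluator.size
end
```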

The headline numbers
- GPT-5.3 Codex: ~0.70 quality score at under $1/ticket
- Opus 4.6: ~0.61 quality score at ~$5/ticket
Codex is delivering better code at roughly 1/7th the price (assuming GPT-5.3 Codex API pricing matches GPT-5.2's). Opus 4.6 is a tiny improvement over 4.5, but underwhelming for what it costs.
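Reading "under $1/ticket" as roughly $0.70 (an assumed figure, not quoted above), the back-of-the-envelope math behind that ratio looks like this:

```ruby
# Back-of-the-envelope ratio behind the "roughly 1/7th" claim.
# The $0.70 figure is an assumed reading of "under $1/ticket".
codex_cost = 0.70 # $/ticket (assumed)
opus_cost  = 5.00 # $/ticket (~$5 from the results above)
puts (codex_cost / opus_cost).round(2) # => 0.14, i.e. about 1/7
```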
We tested other agents too (Sonnet 4.5, Gemini 3, Amp, etc.) — full results in the image.
Run this on your own codebase
It works with any stack: you pick PRs from your repos, select which agents to test, and get a quality-vs-cost breakdown specific to your code. The benchmark is in beta; reach out at team@superconductor.com if you want access.
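To make the workflow concrete, here is a purely hypothetical sketch of a run's inputs as a Ruby hash: pick PRs, pick agents, pick evaluators. None of the key names, PR numbers, or identifiers below are the beta's real interface; they exist only to illustrate the shape of the setup.

```ruby
# Hypothetical illustration only: the shape of a benchmark run's inputs.
# Key names, PR numbers, and agent identifiers are made up for this example;
# the beta's real configuration may look nothing like this.
benchmark_run = {
  repo: "your-org/your-rails-app",
  prs: [4312, 4388, 4401],            # PRs that represent work you're proud of
  agents: ["gpt-5.3-codex", "opus-4.6", "sonnet-4.5", "gemini-3"],
  evaluators: ["claude-opus-4.5", "gpt-5.2", "gemini-3-pro"],
  metrics: %i[quality_score cost_per_ticket]
}
pp benchmark_run
```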