Codex vs Claude Code: notes from running both in production
Seven background agents run LangWatch's internal operations: six on Claude Code, one on OpenAI Codex. This Codex vs Claude Code comparison comes from operating both daily: what differs, where each wins, and how we decide.
Rogerio Chaves · July 3, 2026 · Article
Seven background agents run LangWatch's internal operations. Six run on Claude Code: they triage Dependabot alerts, optimize our slowest ClickHouse queries, draft the weekly changelog, analyze customer activity, fix low-priority tech debt, and verify production deploys. The seventh, the agent that reviews the other agents' pull requests, runs on OpenAI Codex. The engineering team also drives both tools by hand every day.
So when someone asks us about Codex vs Claude Code, the answer comes from months of operating them side by side. This post writes that answer down as of July 2026: what each tool does differently, where each one earned its seat in our stack, and how we settle the which-is-better question when a model release reshuffles it every few months.
What each tool actually is
Claude Code is Anthropic's coding agent. The core of it is a terminal CLI: you describe a task, it reads the repo, edits files, runs commands, and loops until the task is done. When it hits something it lacks permission for, it stops and asks. You steer it with a CLAUDE.md file in the repo root. The extension points are hooks (shell commands that fire on lifecycle events like PreToolUse), skills (markdown playbooks it can invoke), subagents, plugins, and MCP servers. The same engine runs inside VS Code and JetBrains, a desktop app, the browser, Slack, and GitHub Actions. claude -p "do the thing" runs it headless for cron jobs and CI, and there is an Agent SDK if you want to build your own agent on the harness.
Codex is OpenAI's version of the same idea, and the architecture rhymes: a CLI (open source, written in Rust), an IDE extension, a desktop app, cloud tasks that run in OpenAI-managed containers, and a GitHub integration that reviews pull requests. Repo instructions live in AGENTS.md. The most distinctive design choice is the safety model. Claude Code leans on permission rules you configure; Codex leans on an OS-level sandbox (Seatbelt on macOS, Landlock and seccomp on Linux) plus approval modes that range from read-only to full access. codex exec is the headless mode, and skills, lifecycle hooks, subagents, and plugins all landed over late 2025 and 2026.
That last part matters because most Codex vs Claude Code comparisons you will find are frozen in mid-2025, when "Claude Code has hooks and Codex does not" was true. It is not true anymore. Both tools today speak MCP as client and server, both package repeatable workflows as skills, both run headless, and both review PRs on GitHub. The differences that remain are real, but they live in the defaults, the governance surface, and the pricing rather than in the feature checklist.
Codex vs Claude Code, side by side
| Dimension | Claude Code | OpenAI Codex |
|---|---|---|
| Harness | Closed-source CLI at the core; same engine in VS Code/JetBrains, desktop, web, Slack, GitHub review, Chrome | Open-source Rust CLI; IDE extension, desktop app, cloud tasks, GitHub review, Chrome extension |
| Models | Claude Opus and Sonnet 4.x lines; switchable per session or per subagent | GPT-5.x Codex line (GPT-5.5 at the time of writing); set in config.toml |
| Instructions and memory | CLAUDE.md per repo, plus auto memory that accumulates across sessions | AGENTS.md per repo, plus a memories feature |
| Context handling | Auto-compaction as the window fills; /compact on demand | Compaction, with PreCompact and PostCompact hooks to script around it |
| Hooks and extensibility | Hooks, skills, plugins, subagents, Agent SDK | Hooks, skills, plugins, subagents, workflows; Agents SDK integration |
| MCP support | Client and server; servers and their permission rules checked into settings | Client and server; stdio and HTTP servers in config.toml, MCP tool search on by default |
| Sandboxing and approvals | Permission rules (allow/deny, fail-closed) plus an opt-in filesystem and network sandbox | Kernel-level sandbox by default plus approval modes (read-only, auto, full access) |
| Enterprise controls | Managed settings pushed by MDM, OpenTelemetry export, analytics dashboard, Analytics and Compliance APIs | Cloud-managed requirements.toml policies, RBAC and SCIM, analytics dashboard, Analytics and Compliance APIs |
| Pricing model | Claude Pro/Max subscriptions ($20 to $200 a month) or Team/Enterprise seats; per-token API; available via Bedrock, Vertex, and Foundry | Included in ChatGPT plans from Free to Enterprise, metered as credits; per-token with an API key |
Everything in that table was checked against the two vendors' docs in early July 2026. These products ship weekly, so if you are reading this months later, click through before repeating a row in a buying decision.
Where Claude Code wins for us
Six of our seven agents ended up on Claude Code, and the reasons are mechanical. Each agent in the fleet is a skill, a markdown playbook checked into the repo, plus a schedule. A systemd timer on a shared EC2 box wakes each one with a prompt like "use the optimize-clickhouse-slow-queries skill to do today's run end to end." The ClickHouse optimizer then finds the two slowest production queries of the last 24 hours and opens one advisory PR per query. The tech-debt fixer picks stale low-priority tickets, re-checks that each ticket's premise still holds, and is capped at two PRs per run.
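A scheduled wake-up of this kind can be sketched in a few lines of Python. This is an illustration, not the fleet's actual launcher: `build_prompt`, `build_command`, and `run_agent` are hypothetical names, the one-hour timeout is an assumed value, and the only CLI form used is the `claude -p "<prompt>"` headless invocation described earlier.

```python
import shlex
import subprocess


def build_prompt(skill: str) -> str:
    """Return the wake-up prompt for one skill, in the shape quoted above."""
    return f"use the {skill} skill to do today's run end to end."


def build_command(skill: str) -> list[str]:
    """Headless invocation: claude -p "<prompt>"."""
    return ["claude", "-p", build_prompt(skill)]


def run_agent(skill: str, timeout_s: int = 3600, dry_run: bool = True) -> str:
    """Return the shell-quoted command; execute it only when dry_run is False."""
    command = build_command(skill)
    if not dry_run:
        # Assumed one-hour ceiling so a stuck session cannot run forever.
        subprocess.run(command, timeout=timeout_s, check=True)
    return shlex.join(command)


# Dry run: prints the command a timer unit would execute.
print(run_agent("optimize-clickhouse-slow-queries"))
```

Keeping the launcher this thin leaves the behavior in the skill file, so changing what an agent does means editing markdown in the repo rather than the scheduler.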
Every prompt ends with a variant of "never merge, never approve": a human makes every merge call. Claude Code holds this whole shape well. Headless sessions run unattended for an hour, skills are the packaging unit, and hooks carry the guardrails. OpenTelemetry export is there when we want to watch the fleet from somewhere other than its Slack channels.
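One way a hook can back up the "never merge, never approve" instruction is a PreToolUse script that inspects each shell command before it runs. The sketch below is hypothetical, not the hook this fleet uses. It assumes the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and that exit code 2 blocks the call and passes stderr back to the agent; verify both against the current Claude Code hooks documentation. The `gh` patterns and the `--hook` flag are illustrative choices.

```python
import json
import re
import sys

# Illustrative patterns for GitHub CLI calls an agent must never make.
FORBIDDEN = [
    re.compile(r"\bgh\s+pr\s+merge\b"),
    re.compile(r"\bgh\s+pr\s+review\b.*--approve\b"),
]


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message): 2 blocks the tool call, 0 allows it."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in FORBIDDEN:
        if pattern.search(command):
            return 2, f"Blocked by guardrail: agents never merge or approve ({command!r})"
    return 0, ""


def main() -> int:
    """Read one hook payload from stdin and report the decision."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code


# Only act as a hook when invoked with the (hypothetical) --hook flag.
if __name__ == "__main__" and "--hook" in sys.argv:
    sys.exit(main())
```

The prompt-level rule and the hook cover different failure modes: the prompt tells the model what not to attempt, and the hook stops the command if the model attempts it anyway.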
For hand-driven coding the margin is thinner, and it moves with model releases. What keeps Claude Code as the daily driver for most of the team is the customization already sunk into it. Our CLAUDE.md files, the shared skills, the MCP servers wired into checked-in settings: switching costs are the boring, decisive kind.
Where Codex wins for us
The PR reviewer runs on Codex, and it is event-driven rather than scheduled: it gets pinged when a pull request in any of three of our repos needs review, reads the diff, and comments. One property of this setup grew on us: the model reviewing Claude-written code is not from the family that wrote it. That is about the cheapest independent review you can buy.
Beyond our fleet, three things stand out. The sandbox defaults: even on full autonomy, Codex cannot write outside the workspace or reach the network unless you open it up. The enforcement sits in the kernel rather than in a rule file, which is a different class of guarantee when nobody is watching the terminal. The open-source CLI: when the harness does something surprising, the Rust source is on GitHub and you can read exactly what it did. And the bundling: Codex usage is included in the ChatGPT plans most companies already pay for, metered through credits, which makes the first month of experimentation an expense-report non-event.
Codex is also where we learned the unattended-cost lesson. We originally auto-dispatched the reviewer on every new PR and turned that off when the review spend outran the plan we were on; it now reviews when a human pings it. The same would have happened on any vendor: unattended loops need a budget owner.
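A budget owner can be made concrete with a small gate in front of automatic dispatch. The following is a minimal sketch under assumptions, not what runs in this fleet: `ReviewBudget` and `should_dispatch` are hypothetical names, amounts are in cents to avoid float rounding, and the per-review cost estimate is something the caller would have to supply.

```python
from dataclasses import dataclass


@dataclass
class ReviewBudget:
    owner: str               # the human accountable for this spend
    monthly_cap_cents: int   # hard ceiling for automatic reviews
    spent_cents: int = 0

    def has_headroom(self, estimated_cents: int) -> bool:
        """True if one more review of this size still fits under the cap."""
        return self.spent_cents + estimated_cents <= self.monthly_cap_cents

    def record(self, actual_cents: int) -> None:
        """Add the real cost of a finished review to the running total."""
        self.spent_cents += actual_cents


def should_dispatch(budget: ReviewBudget, estimated_cents: int, human_pinged: bool) -> bool:
    """Human pings always dispatch; automatic triggers need budget headroom."""
    return human_pinged or budget.has_headroom(estimated_cents)
```

Letting a human ping bypass the cap is a design choice in this sketch: the person who pings has taken responsibility for that spend, while the automatic path stops itself.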
How we decide between Codex and Claude Code
Because no agent in our fleet can merge its own work, a human reads every agent PR. Quality differences therefore show up as review friction rather than as a score: how often we rewrite a diff, how often a review comment misses the point, how often a run burns twenty minutes and produces nothing mergeable. Skills get tested before they ship using scenario simulations, full simulated conversations against the real agent with a judge grading the result. We wrote up how we test agent skills separately.
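Review friction can be tallied from a simple log of agent runs. This is a sketch with a hypothetical record shape: `AgentRun`, its fields, and `friction_by_tool` are assumptions for illustration, and the log itself would come from whatever a team already records during review.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentRun:
    tool: str             # which harness produced the run
    diff_rewritten: bool  # a human had to rewrite the diff
    mergeable: bool       # the run produced something that could be merged


def friction_by_tool(runs: list[AgentRun]) -> dict[str, dict[str, float]]:
    """Per tool: share of runs rewritten and share that produced nothing mergeable."""
    grouped: dict[str, list[AgentRun]] = defaultdict(list)
    for run in runs:
        grouped[run.tool].append(run)
    return {
        tool: {
            "rewrite_rate": sum(r.diff_rewritten for r in items) / len(items),
            "nothing_mergeable_rate": sum(not r.mergeable for r in items) / len(items),
        }
        for tool, items in grouped.items()
    }
```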
When gut feel is not enough, we trace sessions. Both harnesses export OpenTelemetry, so sessions, tokens, costs, and tool decisions from both can land in one backend where the argument gets numbers. The enterprise guide walks that path in detail. Cost per merged PR turns out to be the metric that survives pricing changes; cost per token does not.
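Cost per merged PR is simple arithmetic once session costs and merge outcomes sit in one place. The sketch below assumes a hypothetical input of `(tool, cost_cents, merged)` tuples, for example flattened from exported telemetry; the function name and the tuple shape are illustrative, not a schema either tool emits.

```python
from collections import defaultdict
from typing import Iterable, Optional


def cost_per_merged_pr(
    sessions: Iterable[tuple[str, int, bool]],
) -> dict[str, Optional[float]]:
    """Return dollars per merged PR for each tool, or None if nothing merged."""
    cost_cents: dict[str, int] = defaultdict(int)
    merged_count: dict[str, int] = defaultdict(int)
    for tool, cents, merged in sessions:
        # Sessions that produced nothing mergeable still count toward cost.
        cost_cents[tool] += cents
        merged_count[tool] += int(merged)
    return {
        tool: (cost_cents[tool] / merged_count[tool] / 100 if merged_count[tool] else None)
        for tool in cost_cents
    }
```

Because failed sessions stay in the numerator, a tool that is cheap per token but often produces nothing mergeable still shows up as expensive, which is the property that makes the metric hold up when per-token prices change.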
We also stopped reading third-party Codex vs Claude Code benchmarks to make this choice. Your repo, your skills, and your CI setup dominate whatever deltas those benchmarks measure. Both vendors also ship model updates faster than bloggers re-run their tests. If you want the comparison to mean anything, run both on two weeks of your own tickets and keep the transcripts.
Our split is six to one today. Nobody designed that ratio; it is just where each tool landed after months of PRs. A strong model release on either side will probably move a seat, and keeping the transcripts is what makes moving a seat cheap.
Frequently asked questions
- Is Claude Code better than Codex?
- For some workflows, and the ranking moves with model releases. Claude Code currently has the deeper customization surface across hooks, skills, subagents, and its Agent SDK, while Codex ships a stricter kernel-level sandbox and comes bundled with ChatGPT plans. Run both on two weeks of real tickets from your own repo and compare review friction and cost per merged PR; that beats any external benchmark.
- Can I use Codex and Claude Code together?
- Yes, and mixed fleets are practical. Both read plain markdown instruction files (CLAUDE.md and AGENTS.md), so one repo can serve both. LangWatch runs six background agents on Claude Code and has a Codex agent review their pull requests, which also keeps the reviewing model different from the authoring model.
- How do teams measure which coding agent performs better?
- By instrumenting sessions instead of trusting impressions. Both tools export OpenTelemetry metrics and events covering tokens, cost, sessions, and tool decisions, so both can stream into one observability backend and be compared on outcomes like cost per merged PR or rework rate. Each vendor's own analytics dashboard only sees its own tool.
- What are the cost differences between Codex and Claude Code?
- The structures are similar: subscription tiers that include agent usage (ChatGPT plans for Codex; Claude Pro, Max, Team, or Enterprise for Claude Code) plus a per-token API path for automation. Codex meters plan usage in credits; Claude plans use usage limits that reset on fixed windows. For unattended agents, track cost per completed task, since per-token prices change often on both sides.
- Do Codex and Claude Code both support MCP?
- Yes. Both act as MCP clients and both can run as MCP servers inside other tools. Claude Code configures servers through checked-in settings with permission rules, Codex through config.toml or the codex mcp command, and both vendors support org-level MCP allowlists for company rollouts.