Description
Context
I’m an everyday GPT-5 user for professional development. The model is excellent, but the Codex CLI chassis still lags behind the best-in-class experience developers are adopting elsewhere—particularly Claude Code CLI, which currently leads because of its embedded prompt engineering and orchestration.
Community references
Critique & context: https://www.youtube.com/watch?v=SOxmiupQm7w
NotebookLM summary:
The video analyzes the negative perception surrounding GPT-5's launch, arguing that the fault lies not with the model itself, which is considered "amazing," but with the user-interface layers on top of it, such as ChatGPT.com and Cursor. The presenter explains that early insider access used the same publicly available model, just optimized with custom system prompts and a different testing environment. He criticizes OpenAI's automatic router and flaws in Cursor's user experience, which led to frustration with slowness and a lack of visual feedback. The video also compares GPT-5 to other models, such as Grok-4 and Anthropic's models, noting that while GPT-5 may seem slower, it is superior in output quality and in its ability to work on complex tasks for long periods. Finally, the presenter discusses a LessWrong article on the "reverse DeepSeek moment," suggesting that the public has underestimated GPT-5, which could have negative implications for perceptions of AI progress and for policy decisions.
Practical demo showing why embedded prompt engineering wins: https://youtu.be/i0P56Pm1Q3U?si=sZQYNuyTFyg0AQVa
NotebookLM summary:
This video reveals that Claude Code's effectiveness in code generation isn't due to unique underlying models, but rather to sophisticated and detailed prompt engineering. The author "decoded" the system by intercepting API requests and discovering that Claude uses an extraordinarily long and detailed system prompt, as well as constant tool prompts throughout the message history to guide the model. The video also explores how Claude Code utilizes subagents, activated through detailed tool descriptions, and emphasizes the importance of XML formatting and tags within prompts for better model comprehension, highlighting that prompt tuning is specific to each model family and crucial to optimizing performance.
Why this matters now
A key differentiator of Claude Code CLI is the prompt engineering baked into the tool. That architectural choice is so strong that Anthropic can comfortably let their LLMs be used by other tools—the CLI still stands out and drives adoption. I want that market leadership to move to GPT-5 by giving Codex CLI the same (or better) foundations.
Problems to solve
No built-in prompt engineering layer
We need canonical, versioned, and testable prompt strategies inside the CLI—so teams don’t have to reinvent fragile prompt glue in every repo.
Missing sub-agent orchestration
Pipelines with specialized roles (planner → coder → tester → reviewer) dramatically reduce hallucinations and improve reproducibility.
Lack of a native LLM traffic proxy/inspector
Without transparent logging of prompts, tool calls, costs, and latency, teams can’t debug or govern usage at scale.
Extensibility and observability aren’t first-class
Devs need plugins, MCP integration, logs/traces/metrics, cost guardrails, and dry-run/commit-plan safety for PR workflows.
Are you interested in implementing this feature?
Proposal (high level)
A) First-class, embedded prompt engineering
- Canonical prompt packs for common dev tasks (plan, modify, test, review, refactor, migrate).
- Versioned templates with changelogs and A/B evaluation harnesses.
- Policy hooks (e.g., security/lint/test gates) enforced at the prompt layer.
- Optional prompt DSL or structured blocks to minimize prompt drift and improve reproducibility.
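To make this concrete, a versioned prompt pack could be a small manifest plus a template. The file layout, field names, and template below are purely illustrative assumptions, not an existing Codex CLI format:
```yaml
# prompts/coder/v2.0/pack.yaml — hypothetical prompt pack manifest
name: coder
version: 2.0.0
changelog: "Tighten diff-only output; remind agent of lint/test gates."
policy_hooks:            # gates enforced at the prompt layer
  - require: lint.pass
  - require: tests.pass
template: |
  You are the coder agent. Work only from the approved plan.
  Output unified diffs; never touch files outside the task scope.
  <task>{{task}}</task>
  <constraints>{{constraints}}</constraints>
```
Versioning the pack (with a changelog) is what would make A/B evaluation harnesses and reproducible runs practical.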
B) Sub-agents orchestration (opinionated defaults, configurable)
- Agents: `planner`, `coder`, `tester`, `reviewer`.
- Shared session memory and strict context-passing rules.
- Optional parallelization (e.g., tests and review in parallel after code gen).
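A minimal sketch of how such a pipeline might be declared, assuming a declarative syntax; the `context_from` and `parallel` keys are hypothetical:
```yaml
# Hypothetical pipeline declaration for sub-agent orchestration
pipeline:
  - stage: plan
    agent: planner
  - stage: implement
    agent: coder
    context_from: [plan]        # strict context-passing: only the approved plan is forwarded
  - stage: verify
    parallel:                   # tests and review run in parallel after code generation
      - agent: tester
        context_from: [implement]
      - agent: reviewer
        context_from: [plan, implement]
```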
C) Built-in LLM proxy/inspector
- `codex proxy start --record .codex/logs` acting as a transparent MITM for OpenAI/Anthropic/Ollama/etc.
- Structured JSONL output: timestamp, session, agent, prompt, tool calls, latency, tokens, cost, brief summaries.
- Redaction rules for secrets/compliance.
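For concreteness, a single recorded line could look like the following (wrapped here for readability; field names and values are illustrative, not a committed schema):
```json
{"ts": "2025-08-20T14:02:31Z", "session": "feat-auth-refactor", "agent": "coder",
 "provider": "openai", "model": "gpt-5",
 "prompt_summary": "Implement token refresh in auth middleware",
 "tool_calls": [{"name": "fs.write", "path": "src/auth/refresh.ts"}],
 "latency_ms": 5423, "tokens": {"input": 18250, "output": 2110},
 "cost_usd": 0.084, "redactions": ["OPENAI_API_KEY"]}
```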
D) Extensibility & MCP
- Plugins via `codex.yaml` with lifecycle hooks (`beforePlan`, `afterCode`, `onError`, etc.).
- MCP (Model Context Protocol) to attach external tool servers (DB query, secrets, CI, cloud).
- Task-based model routing: GPT-5 for planning/review, smaller/cheaper models for scaffolding/tests.
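As an example of the plugin surface, a repo could wire shell scripts into the lifecycle hooks; the `hooks` key and script paths below are hypothetical:
```yaml
# Hypothetical plugin block in codex.yaml using lifecycle hooks
plugins:
  - name: pr-guard
    hooks:
      beforePlan: ./scripts/check_issue_scope.sh   # block runs that exceed the issue's scope
      afterCode: ./scripts/quick_lint.sh           # cheap gate before handing off to the tester agent
      onError: ./scripts/notify_slack.sh           # surface failures to the team channel
```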
E) Configuration that’s declarative & reproducible
```yaml
# codex.yaml
project: my-repo

models:
  planner: openai:gpt-5
  coder: openai:gpt-5
  tester: openai:gpt-4o-mini
  reviewer: openai:gpt-5

prompts:
  planner: v1.3   # versioned canonical prompt pack
  coder: v2.0
  tester: v1.1
  reviewer: v1.0

agents:
  - name: planner
    system: "Plan stepwise tasks with risks and acceptance checks."
  - name: coder
    tools: [fs.write, git.diff, unit.run]
  - name: tester
    tools: [unit.run, lint.run, security.scan]
  - name: reviewer
    system: "Perform objective code review against defined criteria."

proxy:
  enabled: true
  record_dir: .codex/logs
  redact:
    - OPENAI_API_KEY

plugins:
  - name: mcp-tools
    config:
      servers:
        - url: "http://localhost:8765"
          caps: ["db.query", "secrets.get"]

policies:
  max_cost_usd: 2.50
  max_tokens: 200000
  require_green_tests: true
```
F) Observability & developer experience
- `codex logs --session <id>`: tail by agent/file/error.
- `codex trace open <id>`: timeline of spans (plan → code → test → review).
- `--dry-run` and `--commit-plan` with Git guards for safe changes.
- Metrics: cost/session, avg stage time, PR approval rate, test flakiness.
G) Governance & reproducibility
- Prompt versioning and optional seed to reduce variance.
- Session snapshots for bug repro.
- Policy enforcement (spend caps, token limits) per repo/environment.
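One way this could surface in configuration is per-environment overrides; the `environments` and `seed` keys below are hypothetical, shown only to illustrate the idea:
```yaml
# Hypothetical per-environment governance overrides in codex.yaml
environments:
  ci:
    seed: 42                    # fixed seed to reduce variance across reruns
    policies:
      max_cost_usd: 1.00
      require_green_tests: true
  local:
    policies:
      max_cost_usd: 5.00        # looser spend cap for interactive development
```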
Suggested user flows
Orchestrated run
```bash
codex init
codex run --agents planner,coder,tester,reviewer --project .
```
Debug with proxy
```bash
codex proxy start --record .codex/logs
codex run --session feat-auth-refactor
codex logs --session feat-auth-refactor --agent coder
```
Attach MCP tools
```bash
codex tools add mcp:http://localhost:8765 --caps db.query,secrets.get
codex run --use-tools mcp
```
Success metrics (measurable)
- ≥30% of simple changes land PRs with no human edits in a reference repo.
- ≥25% reduction in time from issue → initial PR on pilot projects.
- 100% of sessions produce structured logs/traces with cost tracking.
- `codex.yaml` adopted in ≥3 public community projects within 60 days.
Incremental roadmap
- Phase 1 (Foundation): proxy/inspector, `codex.yaml`, logs/traces, dry-run/commit-plan, policies.
- Phase 2 (Orchestration): sub-agents, model routing, prompt packs v1, UI for traces.
- Phase 3 (Ecosystem): official plugins/MCP servers, reference repos, migration guide.
Alternatives & trade-offs
- Relying on third-party proxies is possible, but a native proxy simplifies DX, unifies telemetry, and enables cost/policy controls.
- Leaving sub-agents to community plugins risks fragmentation; an official orchestration path keeps UX coherent and supportable.
Additional information
Request
Please keep this issue open for public discussion and triage. The plan is pragmatic and incremental, closely aligned with what developers already rely on in competing CLIs (especially embedded prompt engineering). I love using GPT-5—I just need a Codex CLI worthy of it, with transparency, extensibility, and predictable behavior for production.
Happy to contribute PoCs, examples, and usability testing aligned with the team’s priorities.