Description
Context
I’m an everyday GPT-5 user for professional development. The model is excellent, but the Codex CLI chassis still lags behind the best-in-class experience developers are adopting elsewhere—particularly Claude Code CLI, which currently leads because of its embedded prompt engineering and orchestration.
Community references
Critique & context: https://www.youtube.com/watch?v=SOxmiupQm7w
NotebookLM summary:
The video analyzes the negative perception surrounding GPT-5's launch, arguing that the fault lies not with the model itself, which is considered "amazing," but with the user-interface layers on top of it, such as ChatGPT.com and Cursor. The presenter explains that early insider access used the same publicly available model, just optimized with custom system prompts and a different testing environment. He criticizes OpenAI's automatic router and flaws in Cursor's user experience, which led to frustration with slowness and a lack of visual feedback. The video also compares GPT-5 to other models, such as Grok-4 and Anthropic's models, noting that while GPT-5 may seem slower, it is superior in output quality and in its ability to work on complex tasks for long periods. Finally, the presenter discusses a LessWrong article on the "reverse DeepSeek moment," suggesting that the public has underestimated GPT-5, which could have negative implications for perceptions of AI progress and for policy decisions.
Practical demo showing why embedded prompt engineering wins: https://youtu.be/i0P56Pm1Q3U?si=sZQYNuyTFyg0AQVa
NotebookLM summary:
This video reveals that Claude Code's effectiveness in code generation isn't due to unique underlying models, but rather to sophisticated and detailed prompt engineering. The author "decoded" the system by intercepting API requests and discovering that Claude uses an extraordinarily long and detailed system prompt, as well as constant tool prompts throughout the message history to guide the model. The video also explores how Claude Code utilizes subagents, activated through detailed tool descriptions, and emphasizes the importance of XML formatting and tags within prompts for better model comprehension, highlighting that prompt tuning is specific to each model family and crucial to optimizing performance.
Why this matters now
A key differentiator of Claude Code CLI is the prompt engineering baked into the tool. That architectural choice is so strong that Anthropic can comfortably let their LLMs be used by other tools—the CLI still stands out and drives adoption. I want that market leadership to move to GPT-5 by giving Codex CLI the same (or better) foundations.
Problems to solve
No built-in prompt engineering layer
We need canonical, versioned, and testable prompt strategies inside the CLI—so teams don’t have to reinvent fragile prompt glue in every repo.
Missing sub-agent orchestration
Pipelines with specialized roles (planner → coder → tester → reviewer) dramatically reduce hallucinations and improve reproducibility.
Lack of a native LLM traffic proxy/inspector
Without transparent logging of prompts, tool calls, costs, and latency, teams can’t debug or govern usage at scale.
Extensibility and observability aren’t first-class
Devs need plugins, MCP integration, logs/traces/metrics, cost guardrails, and dry-run/commit-plan safety for PR workflows.
Are you interested in implementing this feature?
Proposal (high level)
A) First-class, embedded prompt engineering
- Canonical prompt packs for common dev tasks (plan, modify, test, review, refactor, migrate).
- Versioned templates with changelogs and A/B evaluation harnesses.
- Policy hooks (e.g., security/lint/test gates) enforced at the prompt layer.
- Optional prompt DSL or structured blocks to minimize prompt drift and improve reproducibility.
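To make this concrete, a versioned prompt pack could be a small manifest plus a template. The file layout, field names, and template below are purely illustrative assumptions, not an existing Codex CLI format:
```yaml
# prompts/coder/v2.0/pack.yaml — hypothetical prompt pack manifest
name: coder
version: 2.0.0
changelog: "Tighten diff-only output; remind agent of lint/test gates."
policy_hooks:            # gates enforced at the prompt layer
  - require: lint.pass
  - require: tests.pass
template: |
  You are the coder agent. Work only from the approved plan.
  Output unified diffs; never touch files outside the task scope.
  <task>{{task}}</task>
  <constraints>{{constraints}}</constraints>
```
Versioning the pack (with a changelog) is what would make A/B evaluation harnesses and reproducible runs practical.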
B) Sub-agents orchestration (opinionated defaults, configurable)
- Agents: `planner`, `coder`, `tester`, `reviewer`.
- Shared session memory and strict context-passing rules.
- Optional parallelization (e.g., tests and review in parallel after code gen).
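A minimal sketch of how such a pipeline might be declared, assuming a declarative syntax; the `context_from` and `parallel` keys are hypothetical:
```yaml
# Hypothetical pipeline declaration for sub-agent orchestration
pipeline:
  - stage: plan
    agent: planner
  - stage: implement
    agent: coder
    context_from: [plan]        # strict context-passing: only the approved plan is forwarded
  - stage: verify
    parallel:                   # tests and review run in parallel after code generation
      - agent: tester
        context_from: [implement]
      - agent: reviewer
        context_from: [plan, implement]
```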
C) Built-in LLM proxy/inspector
- `codex proxy start --record .codex/logs` acting as a transparent MITM for OpenAI/Anthropic/Ollama/etc.
- Structured JSONL output: timestamp, session, agent, prompt, tool calls, latency, tokens, cost, brief summaries.
- Redaction rules for secrets/compliance.
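For concreteness, a single recorded line could look like the following (wrapped here for readability; field names and values are illustrative, not a committed schema):
```json
{"ts": "2025-08-20T14:02:31Z", "session": "feat-auth-refactor", "agent": "coder",
 "provider": "openai", "model": "gpt-5",
 "prompt_summary": "Implement token refresh in auth middleware",
 "tool_calls": [{"name": "fs.write", "path": "src/auth/refresh.ts"}],
 "latency_ms": 5423, "tokens": {"input": 18250, "output": 2110},
 "cost_usd": 0.084, "redactions": ["OPENAI_API_KEY"]}
```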
D) Extensibility & MCP
- Plugins via `codex.yaml` with lifecycle hooks (`beforePlan`, `afterCode`, `onError`, etc.).
- MCP (Model Context Protocol) to attach external tool servers (DB query, secrets, CI, cloud).
- Task-based model routing: GPT-5 for planning/review, smaller/cheaper models for scaffolding/tests.
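As an example of the plugin surface, a repo could wire shell scripts into the lifecycle hooks; the `hooks` key and script paths below are hypothetical:
```yaml
# Hypothetical plugin block in codex.yaml using lifecycle hooks
plugins:
  - name: pr-guard
    hooks:
      beforePlan: ./scripts/check_issue_scope.sh   # block runs that exceed the issue's scope
      afterCode: ./scripts/quick_lint.sh           # cheap gate before handing off to the tester agent
      onError: ./scripts/notify_slack.sh           # surface failures to the team channel
```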
E) Configuration that’s declarative & reproducible
```yaml
# codex.yaml
project: my-repo

models:
  planner: openai:gpt-5
  coder: openai:gpt-5
  tester: openai:gpt-4o-mini
  reviewer: openai:gpt-5

prompts:
  planner: v1.3   # versioned canonical prompt pack
  coder: v2.0
  tester: v1.1
  reviewer: v1.0

agents:
  - name: planner
    system: "Plan stepwise tasks with risks and acceptance checks."
  - name: coder
    tools: [fs.write, git.diff, unit.run]
  - name: tester
    tools: [unit.run, lint.run, security.scan]
  - name: reviewer
    system: "Perform objective code review against defined criteria."

proxy:
  enabled: true
  record_dir: .codex/logs
  redact:
    - OPENAI_API_KEY

plugins:
  - name: mcp-tools
    config:
      servers:
        - url: "http://localhost:8765"
          caps: ["db.query", "secrets.get"]

policies:
  max_cost_usd: 2.50
  max_tokens: 200000
  require_green_tests: true
```
F) Observability & developer experience
- `codex logs --session <id>`: tail by agent/file/error.
- `codex trace open <id>`: timeline of spans (plan → code → test → review).
- `--dry-run` and `--commit-plan` with Git guards for safe changes.
- Metrics: cost/session, avg stage time, PR approval rate, test flakiness.
G) Governance & reproducibility
- Prompt versioning and optional seed to reduce variance.
- Session snapshots for bug repro.
- Policy enforcement (spend caps, token limits) per repo/environment.
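One way this could surface in configuration is per-environment overrides; the `environments` and `seed` keys below are hypothetical, shown only to illustrate the idea:
```yaml
# Hypothetical per-environment governance overrides in codex.yaml
environments:
  ci:
    seed: 42                    # fixed seed to reduce variance across reruns
    policies:
      max_cost_usd: 1.00
      require_green_tests: true
  local:
    policies:
      max_cost_usd: 5.00        # looser spend cap for interactive development
```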
Suggested user flows
Orchestrated run
```bash
codex init
codex run --agents planner,coder,tester,reviewer --project .
```
Debug with proxy
```bash
codex proxy start --record .codex/logs
codex run --session feat-auth-refactor
codex logs --session feat-auth-refactor --agent coder
```
Attach MCP tools
```bash
codex tools add mcp:http://localhost:8765 --caps db.query,secrets.get
codex run --use-tools mcp
```
Success metrics (measurable)
- ≥30% of simple changes land PRs with no human edits in a reference repo.
- ≥25% reduction in time from issue → initial PR on pilot projects.
- 100% of sessions produce structured logs/traces with cost tracking.
- `codex.yaml` adopted in ≥3 public community projects within 60 days.
Incremental roadmap
- Phase 1 (Foundation): proxy/inspector, `codex.yaml`, logs/traces, dry-run/commit-plan, policies.
- Phase 2 (Orchestration): sub-agents, model routing, prompt packs v1, UI for traces.
- Phase 3 (Ecosystem): official plugins/MCP servers, reference repos, migration guide.
Alternatives & trade-offs
- Relying on third-party proxies is possible, but a native proxy simplifies DX, unifies telemetry, and enables cost/policy controls.
- Leaving sub-agents to community plugins risks fragmentation; an official orchestration path keeps UX coherent and supportable.
Additional information
Request
Please keep this issue open for public discussion and triage. The plan is pragmatic and incremental, closely aligned with what developers already rely on in competing CLIs (especially embedded prompt engineering). I love using GPT-5—I just need a Codex CLI worthy of it, with transparency, extensibility, and predictable behavior for production.
Happy to contribute PoCs, examples, and usability testing aligned with the team’s priorities.