Most AI tools work in turns. You prompt, it responds, you prompt again. You are always in the loop, which means the work stops when you leave. Codex /goal is different. You describe what done looks like, it runs for hours without you, and you come back to a pull request.
The feature, available in Codex CLI 0.128.0 and the Codex desktop app, maintains a persistent objective across turns. It plans, acts, tests its own work, and iterates until the success criteria you defined are met, or until it gets stuck and asks for help. Claire Vo of ChatPRD ran a 5-hour 45-minute unattended session on a real codebase, clearing hundreds of Sentry errors while she was doing something else. The agent paused itself, resumed, wrote its own notes, and finished. She came back to a clean error log.
Writing a good goal is harder than writing a good prompt. A prompt can be vague because you are there to course-correct. A goal needs to define done precisely enough that Codex can verify it without asking you. The OpenAI Codex cookbook describes six elements a strong goal needs: the desired outcome, how success gets verified, what must not regress, which resources Codex can use, how to choose among multiple possible next actions, and when to stop and report rather than guess. Writing all six takes practice. Start with these five tasks.
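One way to practice is to treat the six elements as fields you fill in before writing the goal paragraph. The sketch below is only an illustration: `GoalSpec` and its field names are hypothetical and are not part of Codex or the cookbook. It reports which elements a draft is still missing and refuses to render an incomplete goal.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class GoalSpec:
    """Hypothetical container for the six elements; not a Codex API."""

    outcome: str = ""
    verification: str = ""
    must_not_regress: str = ""
    allowed_resources: str = ""
    prioritization: str = ""
    stop_and_report: str = ""

    def missing(self) -> list[str]:
        """Return the names of elements that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Join the six elements into one goal paragraph, or fail loudly."""
        gaps = self.missing()
        if gaps:
            raise ValueError(f"goal is missing: {', '.join(gaps)}")
        return " ".join(getattr(self, f.name).strip() for f in fields(self))


# A draft that only covers three of the six elements.
draft = GoalSpec(
    outcome="Fix or diagnose the 15 oldest P2 Sentry errors tagged 'api'.",
    verification="Run the test suite after each change.",
    must_not_regress="Do not modify any database migration files.",
)
print(draft.missing())
```

Running it on the draft lists the three elements that still need a sentence each, which is a quick check to do before handing the text to /goal.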
1. Your error tracker backlog
Every small startup has a graveyard of unresolved Sentry errors or Datadog alerts that engineers keep skipping because they are not blocking anything critical. They accumulate over months. The noise makes the real signal invisible, and nobody wants to spend a sprint on cleanup. It is important work that never quite becomes urgent enough to schedule.
This is exactly what /goal is built for. The task is repetitive, bounded, and verifiable. Here is a goal that works: “Investigate the 15 oldest P2 Sentry errors tagged with the ‘api’ prefix. For each one, either fix it if the cause is clear and the change is isolated to a single function, or write a one-paragraph diagnosis explaining the root cause and what a fix would require. Run the test suite after each change. Do not modify any database migration files. Write a summary report at the end listing what was fixed, what was diagnosed-only, and the current test status.”
That last instruction matters: the summary report is what tells Codex when the goal is finished. Without it, the agent has no terminal condition and will try to keep improving.
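The terminal condition in that goal can be stated as a simple rule: every error in the batch ends up either fixed or diagnosed, and then the report is written. The sketch below models that rule under stated assumptions. `ErrorOutcome`, the status strings, and the issue IDs are hypothetical names for illustration and do not describe how Codex tracks its work internally.

```python
from __future__ import annotations

from dataclasses import dataclass

# The two end states the goal allows for each error.
TERMINAL_STATUSES = {"fixed", "diagnosed"}


@dataclass
class ErrorOutcome:
    issue_id: str              # hypothetical tracker ID
    status: str | None = None  # None while the error is still pending


def is_finished(outcomes: list[ErrorOutcome]) -> bool:
    """The batch is finished once every error has a terminal status."""
    return bool(outcomes) and all(o.status in TERMINAL_STATUSES for o in outcomes)


def summary_report(outcomes: list[ErrorOutcome], tests_passing: bool) -> str:
    """Build the three-part report the goal asks for."""
    fixed = [o.issue_id for o in outcomes if o.status == "fixed"]
    diagnosed = [o.issue_id for o in outcomes if o.status == "diagnosed"]
    return "\n".join([
        f"Fixed ({len(fixed)}): {', '.join(fixed) or 'none'}",
        f"Diagnosed only ({len(diagnosed)}): {', '.join(diagnosed) or 'none'}",
        f"Test suite: {'passing' if tests_passing else 'failing'}",
    ])


batch = [ErrorOutcome("API-1", "fixed"), ErrorOutcome("API-2")]
print(is_finished(batch))   # False: API-2 is still pending
batch[1].status = "diagnosed"
print(is_finished(batch))   # True: the report can now be written
print(summary_report(batch, tests_passing=True))
```

The point of the sketch is that "finished" is a property of the list, not a judgment call, which is what lets the run end without a human deciding.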
Claire Vo’s demo for Lenny’s Newsletter showed this running for 5 hours and 45 minutes. The error count dropped from hundreds to near zero. She was not watching.
2. Your inbox
This sounds less technical than clearing Sentry errors, and it is. Codex handles email well because the task is the same shape: a list of items, clear rules for each category, a verifiable completion state.
One stat from the same demo gets people’s attention: 3,900 emails reduced to 68 in under four hours. The goal Vo ran was specific about categories and actions. Newsletter domains: archive. Invoices and receipts: move to a dedicated folder. Emails needing a personal reply: draft a response, do not send. Anything with a legal or contractual angle: flag it and do not touch it.
The “do not touch” constraint is as important as the actions. Without it, Codex will decide that resolving legal emails is in scope. It is not. Telling the goal where to stop is part of telling it when it is done.
A working inbox triage goal follows the same pattern. Name the categories. Name the action for each. Name the things it should surface rather than handle. Vague instructions (“clean up my inbox”) will produce vague results because Codex will guess at what clean means and it will guess wrong.
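The name-the-category, name-the-action pattern maps naturally onto a lookup table. The sketch below writes the four rules from the demo as data. The category keys and the `action_for` helper are hypothetical, and the default of flagging anything unrecognized is an assumption added here to illustrate "surface rather than handle"; deciding which category an email belongs to is left to the agent.

```python
from enum import Enum


class Action(Enum):
    ARCHIVE = "archive"
    MOVE_TO_FOLDER = "move to a dedicated folder"
    DRAFT_REPLY = "draft a response, do not send"
    FLAG_ONLY = "flag it and do not touch it"


# One explicit action per named category (keys are illustrative).
RULES = {
    "newsletter": Action.ARCHIVE,
    "invoice_or_receipt": Action.MOVE_TO_FOLDER,
    "needs_personal_reply": Action.DRAFT_REPLY,
    "legal_or_contractual": Action.FLAG_ONLY,
}


def action_for(category: str) -> Action:
    """Look up the action; unknown categories are surfaced, never guessed at."""
    return RULES.get(category, Action.FLAG_ONLY)


print(action_for("newsletter").value)         # archive
print(action_for("something_unusual").value)  # flag it and do not touch it
```

Writing the rules this way makes gaps visible: any category you cannot place in the table is one the goal text has not addressed yet.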
3. A feature that has been sitting in your backlog
Small, well-scoped features are the ones that get deferred longest. Not because they are hard, but because engineers have larger work in progress and a one-day feature does not justify pulling someone off a sprint. It sits in the backlog for two months and a new quarter starts.
A Codex goal can ship these overnight. The key is what the OpenAI cookbook calls the verification surface: the concrete evidence that proves the feature is done. Without it, Codex has no way to stop.
A goal that works looks like this: “Add a CSV export button to the admin users table in /src/pages/admin/users. The button should appear in the top-right of the table toolbar. Clicking it should trigger a download of all columns currently visible in the table, using the filename ‘users-YYYY-MM-DD.csv’. The export should require the same admin role as the page itself. The feature is done when: the export button appears, clicking it triggers a download with the correct filename, the admin role check is in place, and the existing admin page tests still pass.”
Every clause after “The feature is done when” is the verification surface. It is the part most prompts skip. It is also the part that lets Codex run autonomously instead of asking you whether it is done yet.
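A clause belongs in the verification surface when it can be checked mechanically. As a small illustration, the filename clause from the goal above can be turned into a yes-or-no check. This is a sketch in Python rather than the project's own stack, and the function names are hypothetical; the only detail taken from the goal is the pattern 'users-YYYY-MM-DD.csv'.

```python
import re
from datetime import date

# users-YYYY-MM-DD.csv, with the date part captured for validation.
FILENAME_PATTERN = re.compile(r"users-([0-9]{4}-[0-9]{2}-[0-9]{2})\.csv")


def expected_export_filename(day: date) -> str:
    """Filename the export should use when triggered on the given day."""
    return f"users-{day.isoformat()}.csv"


def is_valid_export_filename(name: str) -> bool:
    """True only if the name has the right shape and a real calendar date."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        return False
    try:
        date.fromisoformat(match.group(1))
    except ValueError:
        return False  # e.g. February 30th
    return True


print(expected_export_filename(date(2026, 1, 15)))       # users-2026-01-15.csv
print(is_valid_export_filename("users-2026-01-15.csv"))  # True
print(is_valid_export_filename("users-2026-02-30.csv"))  # False
print(is_valid_export_filename("users.csv"))             # False
```

Clauses like "the button appears" or "the role check is in place" need a different kind of check, such as a UI or integration test, but the principle is the same: each one resolves to pass or fail.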
None of this promises a PR that is production-ready without review. Review it. But a working first attempt in the morning is faster than a feature sitting in a backlog through another sprint.
4. Test coverage for your critical paths
Most teams have a short list of features they are nervous to touch because there are no tests for them. The checkout flow. The billing integration. The user settings page. Adding tests after the fact is always the right thing to do and never quite gets scheduled. It requires understanding code that was written months ago, and that time cost lands on whoever draws the short straw.
Codex handles this well. The task is defined and verifiable: either the coverage number goes up or it does not. A goal that works looks like this: “Write a Jest test suite for the checkout flow in /src/flows/checkout. Cover the happy path, the case where payment fails, and the case where an invalid coupon code is applied. Target 80% branch coverage, measured by the existing coverage command in package.json. Do not change the implementation files, only add test files. The goal is done when the coverage command passes at 80% or above and no existing tests fail.”
The 80% target matters because it gives Codex a number to converge on. “Write good tests” does not converge. A specific percentage does. Codex will run the coverage command, see the gap, and add tests until the threshold is reached.
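The convergence idea can be shown with a few lines of arithmetic: read the current branch percentage, subtract it from the target, and the remainder is the work left. The sketch below assumes the project's coverage command writes a summary shaped like Istanbul's "json-summary" report, with a `total.branches.pct` field. That shape is an assumption about the project, not something the goal text guarantees, and `branch_gap` is a hypothetical helper.

```python
import json


def branch_gap(summary: dict, target: float = 80.0) -> float:
    """Percentage points of branch coverage still missing (0.0 when met)."""
    pct = float(summary["total"]["branches"]["pct"])
    return max(0.0, target - pct)


# Hypothetical report contents: 36 of 50 branches covered, i.e. 72%.
sample = json.loads(
    '{"total": {"branches": {"total": 50, "covered": 36, "pct": 72}}}'
)

print(branch_gap(sample))               # 8.0
print(branch_gap(sample, target=70.0))  # 0.0
```

A gap of zero is the stop signal. If the project uses Jest, its `coverageThreshold` configuration option can enforce the same number inside the coverage command itself, so the command fails until the threshold is met.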
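Review the tests before you merge. The coverage number is real once the command passes, but it only shows that the branches were executed, not that the assertions on them are meaningful.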
5. Updating stale documentation
Six months after a feature ships, the README still describes the old behavior. The API docs reference endpoint parameters that were renamed. The setup guide assumes a Node version that was upgraded twice. Nobody updates these because the work is tedious and does not appear in any metric until a new engineer joins and spends two days following instructions that no longer work.
A documentation goal works best when it has a comparison anchor. “Compare the current authentication endpoints in /src/routes/auth with the documentation in /docs/api/auth.md. Update the documentation to match what the code actually does. If you find a case where the code behavior is wrong and the documentation describes the intended behavior, note it as a discrepancy comment rather than changing the code. The goal is done when the documentation matches every endpoint, parameter, and response shape that the code currently produces.”
The “note it but do not change the code” clause is the constraint from the six-part framework: what must not regress. Without it, Codex might decide that fixing the code is in scope. It is not, and a goal that starts by updating docs and ends by modifying route behavior produces a PR that needs very careful review.
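The comparison anchor amounts to a diff between two descriptions of the same API. The sketch below shows the idea on hand-written data: each side maps an endpoint to its set of parameter names, and the function reports what is undocumented, what is stale, and what disagrees. The endpoints, parameters, and the `doc_drift` helper are all hypothetical, and extracting these mappings from real route files and markdown is the part the agent would do.

```python
from __future__ import annotations


def doc_drift(
    code: dict[str, set[str]], docs: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Compare endpoint -> parameter names as found in code and in docs."""
    return {
        # In the code but never mentioned in the docs.
        "undocumented": sorted(code.keys() - docs.keys()),
        # In the docs but no longer in the code.
        "stale": sorted(docs.keys() - code.keys()),
        # Present on both sides, but the parameter sets differ.
        "mismatched": sorted(
            e for e in code.keys() & docs.keys() if code[e] != docs[e]
        ),
    }


# Illustrative data only.
code_side = {
    "POST /login": {"email", "password"},
    "POST /logout": set(),
    "POST /refresh": {"refresh_token"},
}
docs_side = {
    "POST /login": {"username", "password"},
    "POST /logout": set(),
    "POST /reset": {"email"},
}

print(doc_drift(code_side, docs_side))
```

The documentation goal is done when all three lists are empty. A "mismatched" entry is where the discrepancy-comment rule applies, because the diff alone cannot tell you which side is wrong.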
Your engineers get back a PR they can approve in minutes rather than write from scratch.
What a goal needs to finish
Each task above follows the same pattern. An outcome you can state clearly. A verification step that Codex can run without asking you. Constraints that define what not to touch. A stop condition for cases where progress stalls.
Remove any of those and the goal either loops indefinitely or halts with a vague request for clarification. The agents that matter for small teams are not the ones that answer your questions during the day. They are the ones that close tickets while you are at dinner. Getting there requires specifying the work precisely enough that the agent knows when it is finished. That turns out to need the same rigor that good product management has always required. You just get to sleep while it runs.
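A stop condition for stalled progress can be as simple as "no improvement over the last few attempts". The sketch below expresses that rule for any metric where higher is better, such as a coverage percentage. The function name and the default of three attempts are assumptions chosen for illustration, not a documented Codex behavior.

```python
from __future__ import annotations


def should_stop_and_report(history: list[float], patience: int = 3) -> bool:
    """True when the last `patience` attempts failed to beat the earlier best."""
    if len(history) <= patience:
        return False  # not enough attempts yet to call it a stall
    best_before = max(history[:-patience])
    return max(history[-patience:]) <= best_before


# Coverage after each attempt: stuck at 65 for the last three tries.
print(should_stop_and_report([60, 65, 65, 64, 65]))  # True
# Still climbing, so keep going.
print(should_stop_and_report([60, 65, 70, 72]))      # False
```

In a goal, the same rule is written as a sentence, for example "if three consecutive attempts do not improve the result, stop and report what you tried".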
Start with the error tracker tonight. The success criteria are clearer than anything else on the list, the risk of a wrong change is low, and your engineers will notice the difference.
References
| Source | Author / Org | Year | Supports |
|---|---|---|---|
| Codex Goals: How to turn 4-hour tasks into set-it-and-forget-it workflows | Claire Vo, Lenny’s Newsletter | 2026 | 5h45m error log run, 3,900-to-68 inbox example |
| Using Goals in Codex | OpenAI | 2026 | Six-element framework for effective goals |
| OpenAI Codex /goal: The New Long-Horizon Mode | Kingy AI | 2026 | /goal persistence and five-stage autonomous loop |
| Using Goals in OpenAI Codex: Patterns and Case Studies | Chier Hu | 2026 | 5.5-hour unattended debug session case study |
| Codex Use Cases | OpenAI | 2026 | Non-engineer use cases including inbox and PRD drafting |