v1.17
Computer Use Remote, Visual Verification, & Platform Targeting
May 23, 2026
v1.17 turns Computer Use Remote into a fuller host desktop-control pipeline, with direct tool access, screenshot-backed verification, multimodal captures, platform-aware targeting, and clearer safety behavior when host permissions are missing.
🖥️ Host Desktop Control
- `computer_use_remote` is callable in live sessions — The model can now invoke the remote computer-use tool directly, while availability, trust mode, and re-arm enforcement remain runtime checks.
- Host desktop control is separated from Xpra — `computer_use_remote` is now the sole path for controlling the user's host desktop. `linux-desktop` targets only Agent Zero's internal Docker/Xpra desktop.
- Host-screen requests route more accurately — Host desktop queries rank ahead of the Xpra skill, while explicit Agent Zero Desktop requests continue to target the internal desktop.
👁️ Visual Verification
- Fresh screenshots are required after desktop actions — State-changing desktop actions are considered unverified until a new screenshot visibly confirms the result.
- Agents stop when screenshots are unavailable — If a verification screenshot cannot be captured, the agent must stop instead of guessing or continuing blindly.
- Screenshots return as multimodal results — Computer-use captures now arrive as real vision messages, allowing the model to inspect the screen visually after each action.
- Older captures are pruned — Stale capture payloads are removed to keep long desktop-control sessions from growing context without bound.
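
The verification and pruning rules above can be sketched roughly as follows. This is a minimal illustration, not Agent Zero's implementation: `run_verified`, `prune_captures`, `VerificationUnavailable`, the placeholder text, and the Chat Completions-style `image_url` part shape are all assumptions made for the example.

```python
from typing import Any, Callable, Optional

PLACEHOLDER = "[screenshot pruned]"  # hypothetical marker text


class VerificationUnavailable(RuntimeError):
    """Raised when no fresh screenshot could be captured after an action."""


def run_verified(action: Callable[[], None],
                 capture: Callable[[], Optional[bytes]]) -> bytes:
    """Run a state-changing action, then require a fresh screenshot.

    The action counts as unverified until a capture succeeds, so a missing
    capture stops the flow instead of letting the caller continue blindly.
    """
    action()
    shot = capture()
    if shot is None:
        raise VerificationUnavailable("no screenshot after action; stopping")
    return shot


def prune_captures(messages: list[dict[str, Any]],
                   keep_last: int = 2) -> list[dict[str, Any]]:
    """Return a copy of messages keeping only the newest keep_last image parts.

    Older image parts are replaced with a short text placeholder so that
    long sessions do not keep accumulating screenshot payloads.
    """
    remaining = keep_last
    pruned = []
    for message in reversed(messages):  # walk newest to oldest
        content = message.get("content")
        if not isinstance(content, list):
            pruned.append(message)  # plain-text message, nothing to prune
            continue
        parts = []
        for part in reversed(content):
            if part.get("type") == "image_url" and remaining > 0:
                remaining -= 1
                parts.append(part)
            elif part.get("type") == "image_url":
                parts.append({"type": "text", "text": PLACEHOLDER})
            else:
                parts.append(part)
        pruned.append({**message, "content": list(reversed(parts))})
    return list(reversed(pruned))
```

Raising an exception when the capture fails keeps the "stop instead of guessing" rule in one place, so callers cannot accidentally ignore it.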
🧭 Platform Targeting
- macOS Accessibility targeting — A dedicated macOS skill supports Accessibility structural targeting through `ax_snapshot` and `ax_action` when the backend reports those capabilities.
- Windows UIA targeting — Windows gains UI Automation guidance for window management, selector passthrough, and click-last workflows.
- Linux AT-SPI and Wayland targeting — Linux snapshots include compact structural tree outlines so agents can select semantic targets more reliably.
- Generic prompts stay backend-neutral — Backend-specific action details live in platform skills, while the generic layer focuses on capability discovery and skill loading.
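
One way to picture the split between the generic layer and the platform skills is a small capability-to-skill lookup. The sketch below is illustrative only: the skill names and the `uia_snapshot` and `atspi_snapshot` capability strings are hypothetical, and only `ax_snapshot` and `ax_action` come from these notes.

```python
# Ordered (required capabilities, skill name) pairs.
# All skill names here are hypothetical placeholders.
SKILL_REQUIREMENTS = [
    (frozenset({"ax_snapshot", "ax_action"}), "macos-accessibility"),
    (frozenset({"uia_snapshot"}), "windows-uia"),   # hypothetical capability
    (frozenset({"atspi_snapshot"}), "linux-atspi"),  # hypothetical capability
]


def select_skill(capabilities: set[str]) -> str:
    """Pick the first platform skill whose required capabilities are all reported.

    Falls back to a backend-neutral "generic" skill when nothing matches,
    so the generic layer never assumes a platform-specific action exists.
    """
    for required, skill in SKILL_REQUIREMENTS:
        if required <= capabilities:  # subset test
            return skill
    return "generic"
```

For example, a backend that reports `{"screenshot", "ax_snapshot", "ax_action"}` would load the macOS skill in this sketch, while one that reports only `ax_snapshot` would stay on the generic path.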
🔐 Permissions & Safety
- macOS approval denial handled cleanly — `COMPUTER_USE_APPROVAL_REQUIRED` now maps to the existing re-arm-required stop flow, preventing repeated retries or screenshot fallbacks before permission is granted.
- Window-hide guidance updated — Ubuntu, GNOME, and Wayland sessions now prefer `Super+H` over `Alt+F9`.
- Keystroke verification is clearer — Guidance now reminds agents that a sent key chord proves only that the keys were sent, not that the requested window action succeeded.
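
The approval-denial handling can be thought of as a classification step that runs before any retry logic. The following is a hedged sketch: the `Outcome` enum, `classify` function, and the choice to treat every other error code as retryable are assumptions for illustration, and only the `COMPUTER_USE_APPROVAL_REQUIRED` code is taken from these notes.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CONTINUE = "continue"
    RETRY = "retry"
    STOP_REARM_REQUIRED = "stop_rearm_required"


# Error codes that must end the flow until the user grants permission.
STOP_CODES = frozenset({"COMPUTER_USE_APPROVAL_REQUIRED"})


def classify(error_code: Optional[str]) -> Outcome:
    """Map a backend error code to the next step of the control loop.

    Approval-required errors go straight to the re-arm stop flow, so the
    caller never retries or falls back to a screenshot first.
    Treating all other codes as retryable is an assumption of this sketch.
    """
    if error_code is None:
        return Outcome.CONTINUE
    if error_code in STOP_CODES:
        return Outcome.STOP_REARM_REQUIRED
    return Outcome.RETRY
```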
🧠 Vision & Context
- Codex OAuth proxy preserves image inputs — Image content parts are converted to Responses API `input_image` parts instead of being flattened to text.
- Multimodal regression coverage added — Screenshot-bearing tool results now have coverage to protect the vision path.
- Screenshot token estimates fixed — Embedded base64 image data URLs are sanitized from prompt token estimates so screenshots no longer inflate context budgets.
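
The image-part conversion and the token-estimate fix can be illustrated with a short sketch. It assumes Chat Completions-style input parts (`{"type": "image_url", "image_url": {"url": ...}}`); the function names, the `[image]` placeholder, and the four-characters-per-token heuristic are assumptions for the example rather than the proxy's actual code.

```python
import re
from typing import Any

# Matches embedded base64 image data URLs such as data:image/png;base64,....
DATA_URL_RE = re.compile(r"data:image/[A-Za-z0-9.+-]+;base64,[A-Za-z0-9+/=]+")


def to_responses_part(part: dict[str, Any]) -> dict[str, Any]:
    """Convert one chat-style content part to a Responses API-style part.

    Image parts become input_image parts instead of being flattened to text.
    """
    if part.get("type") == "image_url":
        return {"type": "input_image", "image_url": part["image_url"]["url"]}
    if part.get("type") == "text":
        return {"type": "input_text", "text": part["text"]}
    return part  # leave unknown part types untouched


def sanitize_for_estimate(text: str) -> str:
    """Replace embedded base64 image data URLs with a short placeholder."""
    return DATA_URL_RE.sub("[image]", text)


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token; an assumed heuristic)."""
    return len(sanitize_for_estimate(text)) // 4
```

With this shape, a prompt containing a multi-kilobyte screenshot data URL is estimated from the surrounding text plus a short placeholder, not from the raw base64 payload.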