v1.17

Computer Use Remote, Visual Verification, & Platform Targeting

May 23, 2026

v1.17 turns Computer Use Remote into a fuller host desktop-control pipeline, with direct tool access, screenshot-backed verification, multimodal captures, platform-aware targeting, and clearer safety behavior when host permissions are missing.

🖥️ Host Desktop Control

  • computer_use_remote is callable in live sessions — The model can now invoke the remote computer-use tool directly, while availability, trust mode, and re-arm enforcement remain runtime checks.
  • Host desktop control is separated from Xpra — computer_use_remote is now the sole path for controlling the user's host desktop. linux-desktop targets only Agent Zero's internal Docker/Xpra desktop.
  • Host-screen requests route more accurately — Host desktop queries rank ahead of the Xpra skill, while explicit Agent Zero Desktop requests continue to target the internal desktop.

👁️ Visual Verification

  • Fresh screenshots are required after desktop actions — State-changing desktop actions are considered unverified until a new screenshot visibly confirms the result.
  • Agents stop when screenshots are unavailable — If a verification screenshot cannot be captured, the agent must stop instead of guessing or continuing blindly.
  • Screenshots return as multimodal results — Computer-use captures now arrive as real vision messages, allowing the model to inspect the screen visually after each action.
  • Older captures are pruned — Stale capture payloads are removed to keep long desktop-control sessions from growing context without bound.
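
The verification rules in this section can be pictured as a small loop: act, capture, stop if no capture is available, then drop older image payloads. The sketch below is illustrative only and is not Agent Zero's actual code; act_and_verify, Capture, VerificationUnavailable, and the keep_last parameter are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class VerificationUnavailable(RuntimeError):
    """Raised when no fresh screenshot could be captured after an action."""


@dataclass
class Capture:
    step: int
    image_bytes: Optional[bytes]  # None once the payload has been pruned


def act_and_verify(
    action: Callable[[], None],
    take_screenshot: Callable[[], Optional[bytes]],
    history: List[Capture],
    step: int,
    keep_last: int = 2,  # hypothetical retention window
) -> Capture:
    """Run a state-changing action, then require a fresh screenshot."""
    action()
    shot = take_screenshot()
    if not shot:
        # No screenshot means the action is unverified: stop, do not guess.
        raise VerificationUnavailable(f"no screenshot after step {step}")
    history.append(Capture(step=step, image_bytes=shot))

    # Prune payloads of older captures so context does not grow without bound.
    cutoff = max(len(history) - keep_last, 0)
    for capture in history[:cutoff]:
        capture.image_bytes = None
    return history[-1]
```

The caller still has to inspect the returned capture to confirm the result; a successful capture only makes verification possible, it does not prove the action worked.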

🧭 Platform Targeting

  • macOS Accessibility targeting — A dedicated macOS skill supports Accessibility structural targeting through ax_snapshot and ax_action when the backend reports those capabilities.
  • Windows UIA targeting — Windows gains UI Automation guidance for window management, selector passthrough, and click-last workflows.
  • Linux AT-SPI and Wayland targeting — Linux snapshots include compact structural tree outlines so agents can select semantic targets more reliably.
  • Generic prompts stay backend-neutral — Backend-specific action details live in platform skills, while the generic layer focuses on capability discovery and skill loading.
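
As a rough illustration of capability-gated targeting, the sketch below picks structural Accessibility targeting only when a backend reports both ax_snapshot and ax_action. The capability names come from the notes above; choose_targeting and its return values are hypothetical and do not describe the real skill loader.

```python
from typing import Iterable

# Capability names taken from the macOS Accessibility note above.
AX_CAPABILITIES = frozenset({"ax_snapshot", "ax_action"})


def choose_targeting(reported: Iterable[str]) -> str:
    """Return "accessibility" only when both AX capabilities are reported.

    The return strings are illustrative labels, not real configuration values.
    """
    if AX_CAPABILITIES.issubset(set(reported)):
        return "accessibility"
    return "screenshot"
```

For example, choose_targeting(["screenshot", "ax_snapshot"]) falls back to "screenshot" because ax_action is missing.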

🔐 Permissions & Safety

  • macOS approval denial handled cleanly — COMPUTER_USE_APPROVAL_REQUIRED now maps to the existing re-arm-required stop flow, preventing repeated retries or screenshot fallbacks before permission is granted.
  • Window-hide guidance updated — Ubuntu, GNOME, and Wayland sessions now prefer Super+H over Alt+F9.
  • Keystroke verification is clearer — Guidance now reminds agents that a sent key chord proves only that the keys were sent, not that the requested window action succeeded.
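
One way to picture the approval-denial mapping is a lookup from backend error codes to stop reasons, consulted before any retry. This is a minimal sketch under assumptions: only the COMPUTER_USE_APPROVAL_REQUIRED code comes from these notes, while StopReason, STOP_CODES, and should_retry are hypothetical names.

```python
from enum import Enum
from typing import Optional


class StopReason(Enum):
    REARM_REQUIRED = "rearm_required"  # hypothetical label for the stop flow


# Backend error codes that must halt the agent instead of triggering retries.
STOP_CODES = {
    "COMPUTER_USE_APPROVAL_REQUIRED": StopReason.REARM_REQUIRED,
}


def classify_backend_error(code: str) -> Optional[StopReason]:
    """Return the stop reason for a backend error code, or None if retryable."""
    return STOP_CODES.get(code)


def should_retry(code: str) -> bool:
    """Retries and screenshot fallbacks are allowed only for unmapped codes."""
    return classify_backend_error(code) is None
```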

🧠 Vision & Context

  • Codex OAuth proxy preserves image inputs — Image content parts are converted to Responses API input_image parts instead of being flattened to text.
  • Multimodal regression coverage added — Screenshot-bearing tool results now have coverage to protect the vision path.
  • Screenshot token estimates fixed — Embedded base64 image data URLs are sanitized from prompt token estimates so screenshots no longer inflate context budgets.
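
The two vision-path fixes above can be sketched as follows: chat-style content parts are mapped to Responses API part types, and base64 image data URLs are replaced by a short placeholder before estimating tokens. The function names, the placeholder text, and the four-characters-per-token heuristic are assumptions made for this example, not the project's actual implementation.

```python
import re

# Matches embedded base64 image data URLs such as data:image/png;base64,....
DATA_URL_RE = re.compile(r"data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+")


def to_responses_part(part: dict) -> dict:
    """Convert a chat-style content part to a Responses API input part."""
    if part.get("type") == "image_url":
        image = part["image_url"]
        url = image["url"] if isinstance(image, dict) else image
        return {"type": "input_image", "image_url": url}
    if part.get("type") == "text":
        return {"type": "input_text", "text": part["text"]}
    return part  # unknown part types pass through unchanged


def sanitize_for_estimate(text: str, placeholder: str = "[image]") -> str:
    """Replace embedded image data URLs with a short placeholder."""
    return DATA_URL_RE.sub(placeholder, text)


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token), after sanitizing."""
    return len(sanitize_for_estimate(text)) // 4
```

With this approach a multi-kilobyte screenshot payload counts the same as the seven-character placeholder when budgeting prompt text.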