Computer Use Agent

Context

Our new pipeline uses a CUA as its core tool. Since I'm about to work on this pipeline, I'm taking the chance to review the concept.

What I Learned

CUAs use the vision capabilities of multimodal models to interpret what’s happening on the screen, and they combine that with an AI agent framework that can plan tasks and reason out what to do next.

CUAs are mainly used in agent loop systems that carry out a task using screenshots, the DOM, and the user's instruction.
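To make the agent loop concrete for myself, here is a minimal sketch of how it could look. This is not taken from the docs: `run_agent_loop`, `scripted_model`, and `fake_execute` are hypothetical names, the model is replaced by a scripted stub so nothing touches the network, and the `tool_use` / `tool_result` block shapes are my assumption of what a Messages API style exchange looks like.

```python
from typing import Any, Callable

Message = dict[str, Any]


def run_agent_loop(
    call_model: Callable[[list[Message]], dict[str, Any]],
    execute_action: Callable[[dict[str, Any]], str],
    instruction: str,
    max_steps: int = 10,
) -> list[Message]:
    """Alternate between the model and the environment until the model stops asking for tools."""
    messages: list[Message] = [{"role": "user", "content": instruction}]
    for _ in range(max_steps):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})

        # Collect every action the model requested in this turn.
        tool_uses = [b for b in response["content"] if b["type"] == "tool_use"]
        if not tool_uses:
            break  # No more actions requested: the task is considered finished.

        # Run each action and send the outcome back as a tool_result block.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": execute_action(block["input"]),
            }
            for block in tool_uses
        ]
        messages.append({"role": "user", "content": results})
    return messages


def scripted_model(messages: list[Message]) -> dict[str, Any]:
    """Hypothetical stand-in for the real model: replays a fixed script of turns."""
    script = [
        [{"type": "tool_use", "id": "t1", "name": "computer",
          "input": {"action": "screenshot"}}],
        [{"type": "tool_use", "id": "t2", "name": "computer",
          "input": {"action": "left_click", "coordinate": [100, 200]}}],
        [{"type": "text", "text": "Done."}],
    ]
    # The number of assistant turns so far selects the next scripted turn.
    step = sum(1 for m in messages if m["role"] == "assistant")
    return {"content": script[step]}


def fake_execute(action: dict[str, Any]) -> str:
    """Hypothetical executor: a real one would drive the screen and return a screenshot."""
    return f"executed {action['action']}"


history = run_agent_loop(scripted_model, fake_execute,
                         "Save a picture of a cat to my desktop.")
for message in history:
    print(message["role"], message["content"])
```

In a real setup, `call_model` would send the request to the API and `execute_action` would perform the click, keystroke, or screenshot in the target environment. The `max_steps` cap is there so a confused agent cannot loop forever.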

Below is the example from the Claude docs on using the computer use tool.

Code

curl https://api.anthropic.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: computer-use-2025-11-24" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "tools": [
      {
        "type": "computer_20251124",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1
      },
      {
        "type": "text_editor_20250728",
        "name": "str_replace_based_edit_tool"
      },
      {
        "type": "bash_20250124",
        "name": "bash"
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Save a picture of a cat to my desktop."
      }
    ]
  }'

Note

I should try using it hands-on when I work on the next task.