Google Gemini

A multimodal AI model with Vision-Language-Action capabilities that perceives screens or videos and executes actions, such as mouse drags or function calls, to carry out computer-use tasks.

ai multimodal vision language automation

Similar Tools in Computer Use Agents

Agent Browser
An open-source CLI tool built in Rust that enables AI agents to directly control browsers via simple commands for hea...
AskUI Vision Agent
A GUI-focused visual AI agent that operates at the OS level, using screenshots to identify UI elements and control in...