Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

The University of Hong Kong, Salesforce Research

Abstract

GUI grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset to date, containing 4 million examples produced through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our new benchmark. Furthermore, we show that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving the OSWorld success rate from 5% to 27%. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All data, checkpoints, and code are open-sourced and available for future research.

Updates

  • [2025/05/12] Project website and code repository are now public

Results

Grounding Ability

We evaluate GUI grounding on several benchmarks: ScreenSpot-v2, the most commonly used benchmark; ScreenSpot-Pro, which focuses on high-resolution screenshots of professional software; and OSWorld-G, which we use to evaluate model performance on fine-grained and functional components. All numbers below are grounding accuracy (%).

| Model         | ScreenSpot-v2 | ScreenSpot-Pro | OSWorld-G |
|---------------|---------------|----------------|-----------|
| UI-TARS-7B    | 91.6          | 35.7           | 47.5      |
| Operator      | 70.5          | 36.6           | 40.6      |
| Qwen2.5-VL-3B | 80.9          | 25.9           | 27.3      |
| Qwen2.5-VL-7B | 88.8          | 27.6           | 31.4      |
| Jedi-3B       | 88.6          | 36.1           | 50.9      |
| Jedi-7B       | 91.7          | 39.5           | 54.1      |
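
For context, grounding benchmarks of this kind typically score a prediction as correct when the predicted click point falls inside the target element's ground-truth bounding box. Below is a minimal sketch of that metric; the sample keys ("pred", "bbox") are our own illustration, not any benchmark's official schema.

def is_hit(pred_xy, bbox):
    """pred_xy: (x, y) predicted click point; bbox: (x1, y1, x2, y2) in pixels."""
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(samples):
    """samples: list of dicts, each with a predicted point and a ground-truth box."""
    hits = sum(is_hit(s["pred"], s["bbox"]) for s in samples)
    return hits / len(samples)

# Example: a prediction at (105, 42) inside box (90, 30, 160, 60) counts as a hit.
print(grounding_accuracy([{"pred": (105, 42), "bbox": (90, 30, 160, 60)}]))  # 1.0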

Agentic Ability

We evaluate our approach on two computer-use benchmarks with online environments: OSWorld and WindowsAgentArena. We employ (1) GPT-4o as the high-level planner that processes user instructions, and (2) Jedi as the grounding model that converts the planner's low-level instructions into executable actions.
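
To make the division of labor concrete, here is a schematic sketch of the planner–grounder loop. The method names (plan_next_step, ground, env.step) are hypothetical placeholders, not the actual OSWorld or Jedi APIs.

def run_episode(instruction, env, planner, grounder, max_steps=15):
    """planner: e.g. GPT-4o; grounder: e.g. Jedi; env: an OSWorld-style environment."""
    obs = env.reset(instruction)
    for _ in range(max_steps):
        # 1. The planner turns the user instruction plus the current screenshot
        #    into a low-level natural-language step, e.g. "click the Bold button".
        step = planner.plan_next_step(instruction, obs.screenshot)
        if step == "DONE":
            break
        # 2. The grounding model maps that step to screen coordinates.
        x, y = grounder.ground(step, obs.screenshot)
        # 3. The environment executes the grounded action.
        obs = env.step(("click", x, y))
    return env.evaluate()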

| Model                          | OSWorld SR (%) | WindowsAgentArena SR (%) |
|--------------------------------|----------------|--------------------------|
| GPT-4o (15 steps)              | 5.0            | 9.4                      |
| UI-TARS-72B (50 steps)         | 22.7           | –                        |
| Operator (15 steps)            | 19.7           | –                        |
| Operator (50 steps)            | 32.6           | –                        |
| Claude 3.7 Sonnet (50 steps)   | 26.0           | –                        |
| GPT-4o + Aguvis-72B (15 steps) | 17.0           | –                        |
| GPT-4o + Jedi-3B (15 steps)^   | 22.4 ±0.33     | 29.1 ±0.57               |
| GPT-4o + Jedi-7B (100 steps)^  | 27.0 ±1.81     | 33.7 ±0.82               |

^Agentic results for Jedi are averaged over four runs. More detailed results are available in the paper.
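
Assuming the ± denotes the standard deviation across the four runs, the reported numbers reduce to a simple computation; the per-run values below are invented for illustration.

import statistics

runs = [26.1, 25.4, 28.9, 27.6]  # hypothetical per-run OSWorld success rates (%)
mean = statistics.mean(runs)
std = statistics.stdev(runs)     # sample standard deviation across runs
print(f"{mean:.1f} ±{std:.2f}")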

Computer Use Showcase

Task Instruction

I am peer-reviewing my friend's course outline. I think the last paragraph is redundant so I want to add strike-through on words in the last paragraph. Can you do this for me?

Example trajectories of the agent using GPT-4o + Jedi-7B completing tasks from OSWorld and WindowsAgentArena.
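
At the lowest level, each grounded step is executed as a desktop automation call; OSWorld exposes a pyautogui-style action space, so a selection-then-click step like the one in this task might reduce to something like the following. All coordinates are invented examples of what a grounding model might return.

import pyautogui

pyautogui.moveTo(312, 540)                # start of the last paragraph
pyautogui.dragTo(698, 602, duration=0.5)  # drag to select its text
pyautogui.click(431, 87)                  # grounded "Strikethrough" toolbar button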

Benchmark: OSWorld-G

We develop OSWorld-G, a benchmark comprising 564 finely annotated samples that systematically cover text matching, element recognition, layout understanding, and fine-grained manipulation, with annotations for the element types required to solve each task.

Accuracy (%) by category:

| Model          | Text Matching | Element Recognition | Layout Understanding | Fine-grained Manipulation | Refusal | Overall |
|----------------|---------------|---------------------|----------------------|---------------------------|---------|---------|
| UI-TARS-7B     | 60.2          | 51.8                | 54.9                 | 35.6                      | 0.0     | 47.5    |
| Gemini-2.5-Pro | 59.8          | 45.5                | 49.0                 | 33.6                      | 38.9    | 45.2    |
| Operator       | 51.3          | 42.4                | 46.6                 | 31.5                      | 0.0     | 40.6    |
| Jedi-3B        | 67.4          | 53.0                | 53.8                 | 44.3                      | 7.4     | 50.9    |
| Jedi-7B        | 65.9          | 55.5                | 57.7                 | 46.9                      | 7.4     | 54.1    |
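
For concreteness, an OSWorld-G sample can be pictured as a record like the one below. The field names are our own illustration, not the released schema.

sample = {
    "image": "screenshots/libreoffice_writer_001.png",
    "instruction": "Add strike-through to the words in the last paragraph.",
    "bbox": [412, 76, 448, 102],              # ground-truth element region (pixels)
    "capabilities": ["layout understanding",  # annotated abilities needed to solve it
                     "fine-grained manipulation"],
    "refusal": False,                         # True if the instruction is infeasible
}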

Dataset: Jedi

To enable robust GUI grounding, we construct Jedi, the largest multimodal dataset tailored to computer-use grounding scenarios, with 4 million newly synthesized examples produced through multi-perspective task decoupling.
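
A dataset of this form plugs directly into standard supervised fine-tuning for vision-language models. The sketch below shows one generic way to serialize a grounding example into a chat-style training pair; it is our illustration, not the exact Jedi training recipe.

def to_sft_pair(example):
    """example: dict with an image path, an instruction, and a target point (x, y)."""
    return {
        "messages": [
            {"role": "user",
             "content": [{"type": "image", "image": example["image"]},
                         {"type": "text", "text": example["instruction"]}]},
            # Target: the click point, serialized as plain-text coordinates.
            {"role": "assistant",
             "content": [{"type": "text",
                          "text": f"({example['x']}, {example['y']})"}]},
        ]
    }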

Acknowledgements

We thank Binyuan Hui, Weilu Xu, Dunjie Lu, Zhiyong Wu, Weiyun Wang, Eric Xin Wang, Yuhao Yang, Junlei Zhang, and Victor Zhong for their helpful feedback and discussions around this work.

BibTeX

If you find this work useful, please consider citing our paper:

@misc{xie2025scalingcomputerusegroundinguser,
      title={Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis}, 
      author={Tianbao Xie and Jiaqi Deng and Xiaochuan Li and Junlin Yang and Haoyuan Wu and Jixuan Chen and Wenjing Hu and Xinyuan Wang and Yuhui Xu and Zekun Wang and Yiheng Xu and Junli Wang and Doyen Sahoo and Tao Yu and Caiming Xiong},
      year={2025},
      eprint={2505.13227},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.13227}, 
}