Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

The University of Hong Kong, Salesforce Research

Abstract

GUI grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset to date, containing 4 million examples produced through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our new benchmark. Furthermore, we show that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving the OSWorld success rate from 5% to 27%. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All data, checkpoints, and code are open-sourced and available for future research.

Updates

  • [2025/05/12] Project website and code repository are now public

Results

Grounding Ability

We evaluate GUI grounding on several benchmarks: ScreenSpot-v2, the most commonly used benchmark; ScreenSpot-Pro, which focuses on high-resolution screenshots of professional software; and OSWorld-G, which we use to evaluate model performance on fine-grained and functional components. All numbers below are grounding accuracy (%).

| Model         | ScreenSpot-v2 | ScreenSpot-Pro | OSWorld-G |
|---------------|---------------|----------------|-----------|
| UI-TARS-7B    | 91.6          | 35.7           | 47.5      |
| Operator      | 70.5          | 36.6           | 40.6      |
| Qwen2.5-VL-3B | 80.9          | 25.9           | 27.3      |
| Qwen2.5-VL-7B | 88.8          | 27.6           | 31.4      |
| Jedi-3B       | 88.6          | 36.1           | 50.9      |
| Jedi-7B       | 91.7          | 39.5           | 54.1      |
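
For context, grounding benchmarks of this kind typically score a prediction as correct when the predicted click point falls inside the target element's ground-truth bounding box. Below is a minimal sketch of that metric; the sample keys ("pred", "bbox") are our own illustration, not any benchmark's official schema.

def is_hit(pred_xy, bbox):
    """pred_xy: (x, y) predicted click point; bbox: (x1, y1, x2, y2) in pixels."""
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(samples):
    """samples: list of dicts, each with a predicted point and a ground-truth box."""
    hits = sum(is_hit(s["pred"], s["bbox"]) for s in samples)
    return hits / len(samples)

# Example: a prediction at (105, 42) inside box (90, 30, 160, 60) counts as a hit.
print(grounding_accuracy([{"pred": (105, 42), "bbox": (90, 30, 160, 60)}]))  # 1.0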

Agentic Ability

We evaluate our approach on two computer-use benchmarks with online environments: OSWorld and WindowsAgentArena. We employ (1) GPT-4o as the high-level planner that processes user instructions, and (2) Jedi as the grounding model that converts the planner's low-level instructions into executable actions.
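
To make the division of labor concrete, here is a schematic sketch of the planner–grounder loop. The method names (plan_next_step, ground, env.step) are hypothetical placeholders, not the actual OSWorld or Jedi APIs.

def run_episode(instruction, env, planner, grounder, max_steps=15):
    """planner: e.g. GPT-4o; grounder: e.g. Jedi; env: an OSWorld-style environment."""
    obs = env.reset(instruction)
    for _ in range(max_steps):
        # 1. The planner turns the user instruction plus the current screenshot
        #    into a low-level natural-language step, e.g. "click the Bold button".
        step = planner.plan_next_step(instruction, obs.screenshot)
        if step == "DONE":
            break
        # 2. The grounding model maps that step to screen coordinates.
        x, y = grounder.ground(step, obs.screenshot)
        # 3. The environment executes the grounded action.
        obs = env.step(("click", x, y))
    return env.evaluate()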

| Model                          | OSWorld SR (%) | WindowsAgentArena SR (%) |
|--------------------------------|----------------|--------------------------|
| GPT-4o (15 steps)              | 5.0            | 9.4                      |
| UI-TARS-72B (50 steps)         | 22.7           | –                        |
| Operator (15 steps)            | 19.7           | –                        |
| Operator (50 steps)            | 32.6           | –                        |
| Claude 3.7 Sonnet (50 steps)   | 26.0           | –                        |
| GPT-4o + Aguvis-72B (15 steps) | 17.0           | –                        |
| GPT-4o + Jedi-3B (15 steps)^   | 22.4 ±0.33     | 29.1 ±0.57               |
| GPT-4o + Jedi-7B (100 steps)^  | 27.0 ±1.81     | 33.7 ±0.82               |

^Agentic results for Jedi are averaged over four runs. More detailed results are available in the paper.
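
Assuming the ± denotes the standard deviation across the four runs, the reported numbers reduce to a simple computation; the per-run values below are invented for illustration.

import statistics

runs = [26.1, 25.4, 28.9, 27.6]  # hypothetical per-run OSWorld success rates (%)
mean = statistics.mean(runs)
std = statistics.stdev(runs)     # sample standard deviation across runs
print(f"{mean:.1f} ±{std:.2f}")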

Computer Use Showcase

Task Instruction

I am peer-reviewing my friend's course outline. I think the last paragraph is redundant so I want to add strike-through on words in the last paragraph. Can you do this for me?

Example trajectories of the agent using GPT-4o + Jedi-7B completing tasks from OSWorld and WindowsAgentArena.
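
At the lowest level, each grounded step is executed as a desktop automation call; OSWorld exposes a pyautogui-style action space, so a selection-then-click step like the one in this task might reduce to something like the following. All coordinates are invented examples of what a grounding model might return.

import pyautogui

pyautogui.moveTo(312, 540)                # start of the last paragraph
pyautogui.dragTo(698, 602, duration=0.5)  # drag to select its text
pyautogui.click(431, 87)                  # grounded "Strikethrough" toolbar button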

Benchmark: OSWorld-G

We develop OSWorld-G, a benchmark comprising 564 finely annotated samples that systematically cover text matching, element recognition, layout understanding, and fine-grained manipulation, with annotations for the element types required to solve each task.

Accuracy (%) by category:

| Model          | Text Matching | Element Recognition | Layout Understanding | Fine-grained Manipulation | Refusal | Overall |
|----------------|---------------|---------------------|----------------------|---------------------------|---------|---------|
| UI-TARS-7B     | 60.2          | 51.8                | 54.9                 | 35.6                      | 0.0     | 47.5    |
| Gemini-2.5-Pro | 59.8          | 45.5                | 49.0                 | 33.6                      | 38.9    | 45.2    |
| Operator       | 51.3          | 42.4                | 46.6                 | 31.5                      | 0.0     | 40.6    |
| Jedi-3B        | 67.4          | 53.0                | 53.8                 | 44.3                      | 7.4     | 50.9    |
| Jedi-7B        | 65.9          | 55.5                | 57.7                 | 46.9                      | 7.4     | 54.1    |
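
For concreteness, an OSWorld-G sample can be pictured as a record like the one below. The field names are our own illustration, not the released schema.

sample = {
    "image": "screenshots/libreoffice_writer_001.png",
    "instruction": "Add strike-through to the words in the last paragraph.",
    "bbox": [412, 76, 448, 102],              # ground-truth element region (pixels)
    "capabilities": ["layout understanding",  # annotated abilities needed to solve it
                     "fine-grained manipulation"],
    "refusal": False,                         # True if the instruction is infeasible
}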

Dataset: Jedi

To enable robust GUI grounding, we construct Jedi, the largest multimodal dataset tailored to computer-use grounding scenarios, with 4 million newly synthesized examples produced through multi-perspective task decoupling.
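
A dataset of this form plugs directly into standard supervised fine-tuning for vision-language models. The sketch below shows one generic way to serialize a grounding example into a chat-style training pair; it is our illustration, not the exact Jedi training recipe.

def to_sft_pair(example):
    """example: dict with an image path, an instruction, and a target point (x, y)."""
    return {
        "messages": [
            {"role": "user",
             "content": [{"type": "image", "image": example["image"]},
                         {"type": "text", "text": example["instruction"]}]},
            # Target: the click point, serialized as plain-text coordinates.
            {"role": "assistant",
             "content": [{"type": "text",
                          "text": f"({example['x']}, {example['y']})"}]},
        ]
    }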

Acknowledgements

We thank Binyuan Hui, Weilu Xu, Dunjie Lu, Zhiyong Wu, Weiyun Wang, Eric Xin Wang, Yuhao Yang, Junlei Zhang, and Victor Zhong for their helpful feedback and discussions around this work.

BibTeX

If you find this work useful, please consider citing our paper:

@misc{xie2025scalingcomputerusegroundinguser,
      title={Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis}, 
      author={Tianbao Xie and Jiaqi Deng and Xiaochuan Li and Junlin Yang and Haoyuan Wu and Jixuan Chen and Wenjing Hu and Xinyuan Wang and Yuhui Xu and Zekun Wang and Yiheng Xu and Junli Wang and Doyen Sahoo and Tao Yu and Caiming Xiong},
      year={2025},
      eprint={2505.13227},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.13227}, 
}