**Yiming Wang$^{2, \star, \spadesuit}$ Da Yin$^{1, \star, \spadesuit, \heartsuit}$ Yuedong Cui$^{1, \star}$ Ruichen Zheng$^{1, \star}$ Zhiqian Li$^1$**

**Zongyu Lin$^1$ Di Wu$^1$ Xueqing Wu$^1$ Chenchen Ye$^1$ Yu Zhou$^1$ Kai-Wei Chang$^1$**

$^1$UCLA $^2$Harvard University $^\star$Co-First Authors $^\spadesuit$Co-Lead$_{\textrm{Alphabetical Order}}$ $^\heartsuit$Equal Advising

Last Updated on Oct 16, 2025 | 📄: arXiv | GitHub | 🤗: Hugging Face

We welcome questions and feedback, and are open to discussion and collaboration! Please feel free to contact us at [email protected] and [email protected].


TL;DR

Can we train UI agents with only a small amount of experience in real environments, or even none at all?


Figure 1: Performance highlights of UI-Simulator and UI-Simulator-Grow, which it empowers. In particular, UI-Simulator can outperform the same data collection pipeline run on real-world environments, and UI-Simulator-Grow exhibits a more rapid scaling trend than UI-Simulator.

Why can LLMs simulate the digital world?

We observe that most digital UI environments, including web, mobile, and computer, can be represented as structured textual accessibility trees. Because LLMs are pre-trained on front-end code and procedural knowledge, they are well suited to serve as backbone models that synthesize plausible UI states and the state transitions triggered by user actions.
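To make this concrete, below is a minimal sketch of how a UI state can be serialized as a textual accessibility tree. The node fields and helper names here are illustrative only, not the exact format used by UI-Simulator.

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """A simplified accessibility-tree node: a role, a name, and children."""
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)

def serialize(node: AXNode, depth: int = 0) -> str:
    """Render the tree as indented text, one node per line, similar in spirit
    to the textual accessibility trees that UI agents consume."""
    line = "  " * depth + f"[{node.role}] {node.name}".rstrip()
    return "\n".join([line] + [serialize(c, depth + 1) for c in node.children])

# Toy web page state: a search form with a button and a cart link.
page = AXNode("RootWebArea", "Shopping Site", [
    AXNode("searchbox", "Search for products"),
    AXNode("button", "Search"),
    AXNode("link", "Cart (0 items)"),
])
print(serialize(page))
```

Because the whole state is plain structured text like this, a strong LLM can both read it and generate a plausible next version of it.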


How can LLMs simulate UI states and transitions?

(Retrieval-Free) Simulation

Each state transition follows a multi-step pipeline that guides the world simulator to anticipate the outcome of an action, infer a coherent and diverse next state, and render it into a structured format, as sketched below.
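The sketch below illustrates this multi-step loop with one LLM call per step. The function name, prompt wording, and `llm` interface are hypothetical placeholders, not the paper's actual prompts or code.

```python
# Hypothetical sketch of one retrieval-free simulation step.
# `llm` stands for any text-in, text-out chat-completion call.

def simulate_transition(llm, current_state: str, action: str) -> str:
    # Step 1: anticipate the outcome of the user action in natural language.
    outcome = llm(
        f"Current UI state (accessibility tree):\n{current_state}\n"
        f"User action: {action}\n"
        "Describe what should happen after this action."
    )
    # Step 2: infer a coherent, diverse next state conditioned on that outcome.
    next_state_draft = llm(
        f"Outcome description: {outcome}\n"
        "Draft the content of the resulting UI page (new elements, text, layout)."
    )
    # Step 3: render the draft into the structured accessibility-tree format.
    next_state = llm(
        f"Page draft:\n{next_state_draft}\n"
        "Rewrite this page draft as a structured textual accessibility tree."
    )
    return next_state
```

Splitting the transition into anticipate, infer, and render stages keeps each prompt focused, which is what lets the simulator produce next states that are both coherent with the action and diverse across rollouts.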